34 points by riyaneel 2 days ago | 5 comments
riyaneel 2 days ago
I am the author of this library. The goal was to reach RAM-speed communication between independent processes (C++, Rust, Python, Go, Java, Node.js) without any serialization overhead or kernel involvement on the hot path.

I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M RTT/sec on a standard CPU (i7-12650H).

Here are the primary architectural choices that make this possible:

- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.

- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.

- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).

- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.

The core is C++23, and it exposes a C ABI to bind the other languages.

I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.

zekrioca 4 hours ago
Why report p50 and not p95?
riyaneel 3 hours ago
Tail latency is reported as well: p99.9 is 122 ns.
yc-kraln 25 minutes ago
How do you handle noisy neighbors?
riyaneel 16 minutes ago
Tachyon is lock-free and uses strict cache-line alignment to avoid false sharing (cache-line ping-pong under the MESI protocol), but it relies on the environment for isolation. You still need core pinning and CPU isolation for true hardware determinism.
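
For reference, core pinning on Linux is a few lines of pthread affinity code (a generic sketch, nothing Tachyon-specific; the core id is arbitrary, and real isolation also wants kernel-side settings like isolcpus/nohz_full):

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core so the spinning consumer
// (or producer) never migrates and keeps its cache lines hot.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (!pin_to_core(2)) return 1;  // core 2 chosen arbitrarily here
    // ... run the hot loop on this pinned core ...
    return 0;
}
```
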
BobbyTables2 3 hours ago
Would be interesting to see performance comparisons between this and the alternatives considered like eventfd.

Sure, the “hot path” is probably very fast for all, but what about the slow path?

riyaneel 2 hours ago
eventfd always pays a syscall on both sides (~200-400 ns) regardless of load. Tachyon's slow path only kicks in under genuine starvation: the consumer spins first, then falls back to FUTEX_WAIT, and the producer skips FUTEX_WAKE entirely if the consumer is still spinning. At sustainable rates the slow path never activates.
mananaysiempre 1 hour ago
> eventfd always pays a syscall on both sides (~200-400ns) regardless of load.

It’s fairly standard to make the waiting side spin a bit after processing some data, and only issue another wait syscall if no more data arrives during the spin period.

(For instance, io_uring, which does this kind of IPC with a kernel thread on the receiving side, literally lets you configure how long said kernel thread should spin[1].)

[1] https://unixism.net/loti/tutorial/sq_poll.html

riyaneel 30 minutes ago
Fair point. The difference is narrower than I implied: with a futex the producer can inspect consumer_sleeping directly in shared memory and skip the FUTEX_WAKE entirely while the consumer is still spinning. With eventfd you need a write() regardless, or you add shared state to gate it, which is essentially rebuilding futex. Same idea, slightly less clean.
JSR_FDED 5 hours ago
What would need to change when the hardware changes?
riyaneel 3 hours ago
Nothing should need to change: the code follows hardware principles (cache coherence, locality, ...) rather than software abstractions. That doesn't mean it targets dedicated hardware; it's designed for modern CPUs in general.
Fire-Dragon-DoL 3 hours ago
Wow, congrats!
riyaneel 2 hours ago
Thanks!
Fire-Dragon-DoL 2 hours ago
I will be discussing this at work on Monday, will let you know what they think.

I wouldn't be surprised if somebody develops a cross-language framework with this.

riyaneel 2 hours ago
Would love to hear the feedback