Hyper-DERP: Engineer Achieves Tailscale DERP Relay Throughput with Half the CPU Cores
Key Takeaways
- Hyper-DERP matches Tailscale's DERP relay throughput with approximately 50% fewer CPU cores through kernel-level encryption and io_uring optimization
- Replacing Go's userspace TLS handling with kernel-managed kTLS eliminates expensive context switches and garbage collection overhead in the data plane
- A shard-per-core, share-nothing architecture using io_uring for batch I/O operations reduces syscall overhead from hundreds to one per batch, dramatically improving performance at scale
Summary
A systems engineer has developed Hyper-DERP, a high-performance relay implementation that matches Tailscale's DERP (Designated Encrypted Relay for Packets) throughput while using significantly fewer CPU cores. The project began when the engineer, who works on industrial IR camera systems, examined Tailscale's relay architecture during interview preparation and decided to optimize it by replacing the Go-based data plane with C and leveraging kernel-level encryption via kTLS (kernel Transport Layer Security).
The key innovation in Hyper-DERP is the shift from userspace encryption handling to kernel-managed TLS session keys, combined with io_uring for efficient I/O operations instead of traditional epoll polling. The architecture employs a shard-per-core, share-nothing design pattern similar to the Seastar framework, eliminating context switches and lock contention in the packet-forwarding path. Each core gets a dedicated io_uring instance with no shared state between shards, and cross-shard communication is handled through lock-free single-producer/single-consumer (SPSC) rings.
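A lock-free SPSC ring of the kind described above is small enough to sketch. This is a generic illustration of the technique, not code from Hyper-DERP: because exactly one shard produces and one consumes, each index has a single writer, and acquire/release atomics suffice with no locks or compare-and-swap loops:

```c
/* Minimal single-producer/single-consumer ring in the spirit of the
 * cross-shard channels described above; names and sizes are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 1024 /* power of two, so masking replaces modulo */

struct spsc_ring {
    _Atomic size_t head; /* advanced only by the consumer */
    _Atomic size_t tail; /* advanced only by the producer */
    void *slots[RING_SLOTS];
};

/* Producer side: returns false when the ring is full. */
static bool spsc_push(struct spsc_ring *r, void *msg)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false; /* full */
    r->slots[tail & (RING_SLOTS - 1)] = msg;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return NULL; /* empty */
    void *msg = r->slots[head & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return msg;
}
```

Because the ring never blocks, a shard can drain its peers' rings opportunistically between io_uring completions, keeping the forwarding path free of mutexes.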
Benchmarking on GCP c4-highcpu VMs with rigorous methodology (4,903 test runs with 95% confidence intervals) demonstrates that Hyper-DERP achieves parity with Tailscale's v1.96.4 derper while using approximately half the CPU resources. The implementation uses slab allocators and frame pools to eliminate malloc calls in the critical forwarding path and can potentially offload TLS operations to smart NICs for even greater performance.
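The slab-allocator/frame-pool idea mentioned above can be illustrated with a fixed-size pool that does one upfront allocation and recycles buffers through a free list; the frame size, pool size, and function names here are assumptions for illustration, not details from Hyper-DERP:

```c
/* Sketch of a frame pool that recycles packet buffers via a free list,
 * so the forwarding path never calls malloc()/free(). Single-threaded
 * by design: in a share-nothing layout each shard owns its own pool. */
#include <stddef.h>
#include <stdlib.h>

#define FRAME_SIZE  2048 /* room for one relay frame plus headers */
#define POOL_FRAMES 4096

struct frame {
    struct frame *next;            /* free-list link, unused while owned */
    unsigned char data[FRAME_SIZE];
};

struct frame_pool {
    struct frame *free_head;
    struct frame *backing;         /* one upfront allocation */
};

static int pool_init(struct frame_pool *p)
{
    p->backing = malloc(sizeof(struct frame) * POOL_FRAMES);
    if (!p->backing)
        return -1;
    p->free_head = NULL;
    for (size_t i = 0; i < POOL_FRAMES; i++) { /* thread the free list */
        p->backing[i].next = p->free_head;
        p->free_head = &p->backing[i];
    }
    return 0;
}

/* O(1), allocation-free: pop a frame from the free list. */
static struct frame *frame_get(struct frame_pool *p)
{
    struct frame *f = p->free_head;
    if (f)
        p->free_head = f->next;
    return f;
}

/* O(1): push the frame back for reuse. */
static void frame_put(struct frame_pool *p, struct frame *f)
{
    f->next = p->free_head;
    p->free_head = f;
}
```

LIFO reuse also tends to hand back a buffer that is still warm in cache, which matters on a path processing millions of frames per second.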
Editorial Opinion
This work represents excellent systems engineering that challenges the assumption that modern, well-engineered Go services are inherently performant. While Tailscale's DERP is production-hardened and thoughtfully designed, Hyper-DERP demonstrates that addressing fundamental architectural inefficiencies—particularly the overhead of userspace encryption and context-switch-heavy polling models—can yield substantial performance improvements. The shift to kernel TLS and io_uring shows how modern Linux primitives can dramatically reduce CPU overhead in I/O-bound relay services. However, the comparison would benefit from examining latency characteristics, memory footprint, and maintainability trade-offs alongside the throughput metrics.