Hyper-DERP: Engineer Achieves Tailscale DERP Relay Throughput with Half the CPU Cores
Key Takeaways
- Hyper-DERP matches Tailscale's DERP relay throughput with approximately 50% fewer CPU cores through kernel-level encryption and io_uring optimization
- Replacing Go's userspace TLS handling with kernel-managed kTLS eliminates expensive context switches and garbage collection overhead in the data plane
- A shard-per-core, share-nothing architecture using io_uring for batch I/O operations reduces syscall overhead from hundreds to one per batch, dramatically improving performance at scale
Summary
A systems engineer has developed Hyper-DERP, a high-performance relay implementation that matches Tailscale's DERP (Designated Encrypted Relay for Packets) throughput while using significantly fewer CPU cores. The project began when the engineer, who works on industrial IR camera systems, examined Tailscale's relay architecture during interview preparation and decided to optimize it by replacing the Go-based data plane with C and leveraging kernel-level encryption via kTLS (kernel Transport Layer Security).
The key innovation in Hyper-DERP is the shift from userspace encryption handling to kernel-managed TLS session keys, combined with io_uring for efficient I/O operations instead of traditional epoll polling. The architecture employs a shard-per-core, share-nothing design pattern similar to the Seastar framework, eliminating context switches and lock contention in the packet-forwarding path. Each core gets a dedicated io_uring instance with no shared state between shards, and cross-shard communication is handled through lock-free single-producer/single-consumer (SPSC) rings.
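A lock-free SPSC ring of the kind described above is small enough to sketch. This is a generic illustration of the technique, not code from Hyper-DERP: because exactly one shard produces and one consumes, each index has a single writer, and acquire/release atomics suffice with no locks or compare-and-swap loops:

```c
/* Minimal single-producer/single-consumer ring in the spirit of the
 * cross-shard channels described above; names and sizes are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 1024 /* power of two, so masking replaces modulo */

struct spsc_ring {
    _Atomic size_t head; /* advanced only by the consumer */
    _Atomic size_t tail; /* advanced only by the producer */
    void *slots[RING_SLOTS];
};

/* Producer side: returns false when the ring is full. */
static bool spsc_push(struct spsc_ring *r, void *msg)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false; /* full */
    r->slots[tail & (RING_SLOTS - 1)] = msg;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return NULL; /* empty */
    void *msg = r->slots[head & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return msg;
}
```

Because the ring never blocks, a shard can drain its peers' rings opportunistically between io_uring completions, keeping the forwarding path free of mutexes.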
Benchmarking on GCP c4-highcpu VMs with rigorous methodology (4,903 test runs with 95% confidence intervals) demonstrates that Hyper-DERP achieves parity with Tailscale's v1.96.4 derper while using approximately half the CPU resources. The implementation uses slab allocators and frame pools to eliminate malloc calls in the critical forwarding path and can potentially offload TLS operations to smart NICs for even greater performance.
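The slab-allocator/frame-pool idea mentioned above can be illustrated with a fixed-size pool that does one upfront allocation and recycles buffers through a free list; the frame size, pool size, and function names here are assumptions for illustration, not details from Hyper-DERP:

```c
/* Sketch of a frame pool that recycles packet buffers via a free list,
 * so the forwarding path never calls malloc()/free(). Single-threaded
 * by design: in a share-nothing layout each shard owns its own pool. */
#include <stddef.h>
#include <stdlib.h>

#define FRAME_SIZE  2048 /* room for one relay frame plus headers */
#define POOL_FRAMES 4096

struct frame {
    struct frame *next;            /* free-list link, unused while owned */
    unsigned char data[FRAME_SIZE];
};

struct frame_pool {
    struct frame *free_head;
    struct frame *backing;         /* one upfront allocation */
};

static int pool_init(struct frame_pool *p)
{
    p->backing = malloc(sizeof(struct frame) * POOL_FRAMES);
    if (!p->backing)
        return -1;
    p->free_head = NULL;
    for (size_t i = 0; i < POOL_FRAMES; i++) { /* thread the free list */
        p->backing[i].next = p->free_head;
        p->free_head = &p->backing[i];
    }
    return 0;
}

/* O(1), allocation-free: pop a frame from the free list. */
static struct frame *frame_get(struct frame_pool *p)
{
    struct frame *f = p->free_head;
    if (f)
        p->free_head = f->next;
    return f;
}

/* O(1): push the frame back for reuse. */
static void frame_put(struct frame_pool *p, struct frame *f)
{
    f->next = p->free_head;
    p->free_head = f;
}
```

LIFO reuse also tends to hand back a buffer that is still warm in cache, which matters on a path processing millions of frames per second.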
Editorial Opinion
This work represents excellent systems engineering that challenges the assumption that modern, well-engineered Go services are inherently performant. While Tailscale's DERP is production-hardened and thoughtfully designed, Hyper-DERP demonstrates that addressing fundamental architectural inefficiencies—particularly the overhead of userspace encryption and context-switch-heavy polling models—can yield substantial performance improvements. The shift to kernel TLS and io_uring shows how modern Linux primitives can dramatically reduce CPU overhead in I/O-bound relay services. However, the comparison would benefit from examining latency characteristics, memory footprint, and maintainability trade-offs alongside the throughput metrics.