Turbo charging the TLS handshake with Sandwich

by Jason Goertzen. Posted on Feb 26, 2024

“An image for a blog post named “TurboTLS implementation in Sandwich”.” by DALL·E 3

Welcome back to Cryptography Caffe! In this blog post, I’ll be talking about how Thomas Bailleux and I did some fancy transport protocol manipulation using Sandwich’s (github) tunnel abstraction to reduce TLS handshake latency. You may have read about TurboTLS on our blog before.

In this post, I’ll walk you through some new Sandwich IO objects for the experimental protocol TurboTLS that are available to play with in Sandwich! I’ll start with how we implemented the TurboTLS objects, and finish with how you can use them. Let’s dive into how we achieved this using the IO abstractions provided by Sandwich (SandwichIO and Listener).

Sandwich’s IO

Before I go into how we implemented TurboTLS, we need to talk about how Sandwich provides a separation between the cryptographic and IO/transport layers of the TLS handshake. In order to use Sandwich for TLS, you must create a SandwichIO and then give that SandwichIO to a new TunnelIO object. We support defining both SandwichIOs and TunnelIOs in all of our language bindings. In Rust we have traits, in Python we have classes, in Go we have interfaces, and in C we have structs. Since the IO is implemented in Rust, I will focus on the Rust traits in this post.

Sandwich’s IO trait definition:

// Defined at the root of the sandwich crate, i.e. sandwich::IO.
pub trait IO: std::io::Read + std::io::Write {}
impl<T> IO for T where T: std::io::Read + std::io::Write {}

Sandwich’s IO trait depends on the Rust standard library’s Read and Write traits, which means any Rust type that implements std::io::Read and std::io::Write can be used with Sandwich with no extra effort!
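To make the blanket implementation concrete, here is a minimal sketch. The IO trait below is a local stand-in that mirrors the definition above so the snippet compiles on its own, and Loopback and echo_through are hypothetical names invented for illustration; they are not part of Sandwich.

```rust
use std::collections::VecDeque;
use std::io::{self, Read, Write};

// Local stand-in mirroring the trait shown above (assumption: in real code
// you would use the trait exported by the sandwich crate instead).
pub trait IO: Read + Write {}
impl<T> IO for T where T: Read + Write {}

/// Hypothetical in-memory transport: bytes written can be read back later.
#[derive(Default)]
pub struct Loopback {
    buf: VecDeque<u8>,
}

impl Read for Loopback {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Copy as many buffered bytes as fit into `out`.
        let n = out.len().min(self.buf.len());
        for (slot, byte) in out.iter_mut().zip(self.buf.drain(..n)) {
            *slot = byte;
        }
        Ok(n)
    }
}

impl Write for Loopback {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buf.extend(data.iter().copied());
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Accepts anything that satisfies IO. Loopback qualifies purely through
/// the blanket impl; no explicit `impl IO for Loopback` is needed.
pub fn echo_through(io: &mut dyn IO, msg: &[u8]) -> io::Result<Vec<u8>> {
    io.write_all(msg)?;
    let mut out = vec![0u8; msg.len()];
    io.read_exact(&mut out)?;
    Ok(out)
}
```

Passing `&mut Loopback::default()` to echo_through compiles without any extra glue, which is the "no extra effort" property described above.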

Sandwich’s Tunnel IO definition:

// Defined in the sandwich::tunnel module, i.e. sandwich::tunnel::IO.
pub trait IO: sandwich::IO {
    fn set_state(&mut self, _state: sandwich::pb::HandshakeState);
}

In order to use Sandwich IOs with tunnels, we require that the tunnel::IO trait be implemented. We found that it can be beneficial for experimental protocols to know what stage of the handshake they are in. set_state notifies the IO of what state the handshake is in prior to every read and write call. For IOs that do not require this extra knowledge, we provide a default implementation of set_state which is essentially a no-op. But there are cases where an IO object may want to behave differently while the handshake is still in progress versus when it is completed. I bet you are thinking: “That sounds crazy! Why would you want to do something like that!?” Well, let’s get to it!
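As a rough illustration of why set_state is useful, the sketch below routes reads and writes to one of two transports depending on the last state it was told about. Everything here is a stand-in: HandshakeState is a simplified two-variant enum (the real sandwich::pb::HandshakeState may have different variants), TunnelIO mirrors the trait above with a no-op default, and StateSwitchingIO is a hypothetical type.

```rust
use std::io::{self, Read, Write};

// Simplified stand-in for sandwich::pb::HandshakeState (variant names assumed).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum HandshakeState {
    InProgress,
    Done,
}

// Local mirror of the tunnel IO trait, with a no-op default for set_state.
pub trait TunnelIO: Read + Write {
    fn set_state(&mut self, _state: HandshakeState) {}
}

/// Hypothetical IO that uses `handshake` while the handshake is in progress
/// and `record` once it is done.
pub struct StateSwitchingIO<H, R> {
    pub handshake: H,
    pub record: R,
    state: HandshakeState,
}

impl<H, R> StateSwitchingIO<H, R> {
    pub fn new(handshake: H, record: R) -> Self {
        Self { handshake, record, state: HandshakeState::InProgress }
    }
}

impl<H: Read, R: Read> Read for StateSwitchingIO<H, R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.state {
            HandshakeState::InProgress => self.handshake.read(buf),
            HandshakeState::Done => self.record.read(buf),
        }
    }
}

impl<H: Write, R: Write> Write for StateSwitchingIO<H, R> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        match self.state {
            HandshakeState::InProgress => self.handshake.write(buf),
            HandshakeState::Done => self.record.write(buf),
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        self.handshake.flush()?;
        self.record.flush()
    }
}

impl<H: Read + Write, R: Read + Write> TunnelIO for StateSwitchingIO<H, R> {
    // Override the no-op default: remember the state so that the next
    // read or write picks the matching transport.
    fn set_state(&mut self, state: HandshakeState) {
        self.state = state;
    }
}
```

With two `std::io::Cursor<Vec<u8>>` values as the transports, bytes written before `set_state(HandshakeState::Done)` land in the first cursor and bytes written afterwards land in the second.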

Turbo-charging TLS

Let’s talk about reducing TLS latency by using TurboTLS. If you would like the nitty-gritty details, check out our blog post on TurboTLS. For those who just want a summary, read on…

The most common way to communicate using TLS involves three core steps:

Do a round trip to establish a TCP connection, and wait for the TCP handshake to complete.
Do a round trip to perform the TLS handshake over the established TCP connection to derive an authenticated shared secret.
Use the authenticated shared secret to encrypt application data, and transmit it over its TCP connection.

In general, these steps are sequential. You must wait for each step to complete before you can move on to the next one. But what if we could attempt to perform the TLS handshake at the same time as we are performing the TCP handshake? This would eliminate a round trip of communication, effectively halving the latency of TLS connection establishment.

The core idea of TurboTLS is this: attempt to perform the TLS handshake with a fast but unreliable protocol and a slow but reliable protocol, both in parallel. If the fast one succeeds, you save a round trip. If it doesn’t, you fall back to the slow but reliable one. In the best case you save a round trip; in the worst case, you add almost no latency. In practice we use UDP and TCP for TurboTLS: UDP for the handshake and TCP after the handshake (the so-called record layer).

A natural question is then: why not try QUIC first and TLS as a fallback? As noted in the original TurboTLS blog post, there are multiple reasons. The easiest to explain is that TLS can be converted into TurboTLS and vice versa by transparent proxies, making the migration possible even for applications that won’t (at first or ever) modify their code to move to QUIC. TurboTLS is just about changing the way TLS is transported, which is the main subject of this post!

Before I dive into the implementation of TurboTLS in Sandwich, I want to quickly talk about how the handshake UDP packets are formatted. As we need a way to associate the handshake that was performed over UDP with a specific TCP connection, all UDP handshake messages contain a connection ID. Once the TLS handshake is complete, the connection ID is sent from the client to the server over the established TCP connection. The server can then identify which shared secret to use with this TCP connection. Another issue is that UDP does not guarantee the order in which packets will arrive, let alone that they will be delivered at all. Therefore, we include a sequence number in all UDP handshake packets so we can maintain order and detect losses on the server side.
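To picture the framing described above, here is a minimal sketch of a datagram that carries a connection ID and a sequence number in front of the TLS payload. The layout is an assumption made for illustration only: the 16-byte connection ID, the 2-byte big-endian sequence number, and the HandshakeDatagram name are all hypothetical and are not taken from the TurboTLS wire format.

```rust
// Assumed field sizes, chosen only for this sketch.
pub const CONN_ID_LEN: usize = 16;
pub const HEADER_LEN: usize = CONN_ID_LEN + 2;

/// Hypothetical handshake datagram: connection ID, sequence number, payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HandshakeDatagram {
    pub conn_id: [u8; CONN_ID_LEN],
    pub seq: u16,
    pub payload: Vec<u8>,
}

impl HandshakeDatagram {
    /// Serialize as: conn_id || seq (big-endian) || payload.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(HEADER_LEN + self.payload.len());
        out.extend_from_slice(&self.conn_id);
        out.extend_from_slice(&self.seq.to_be_bytes());
        out.extend_from_slice(&self.payload);
        out
    }

    /// Parse a datagram; returns None if it is too short to hold the header.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < HEADER_LEN {
            return None;
        }
        let (id, rest) = bytes.split_at(CONN_ID_LEN);
        let (seq, payload) = rest.split_at(2);
        let mut conn_id = [0u8; CONN_ID_LEN];
        conn_id.copy_from_slice(id);
        Some(Self {
            conn_id,
            seq: u16::from_be_bytes([seq[0], seq[1]]),
            payload: payload.to_vec(),
        })
    }
}
```

The connection ID lets the server tie UDP packets to the TCP connection that shows up later, and the sequence number lets the receiver put packets back in order and notice gaps.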

Defining TurboTLS Sandwich IO objects

As mentioned earlier, if we want to use Sandwich to implement TurboTLS, we’ll need to define some IO objects that conform to the Sandwich IO interface. However, there are a few things we need to consider on the server side of communication. We’ll want to define a fixed publicly known UDP port to communicate with. However, since UDP is connectionless, we can’t simply listen and accept on that port and let the operating system handle delivery for us. We also don’t want to define several fixed ports to use because how would the client know which one to use? What would happen if all the ports are used? So we will need to use a single UDP port for all Turbo connections, and then route them to the correct server Turbo IO object.

In order to facilitate having a shared UDP port, we defined a TurboListener object which is responsible for listening and accepting new Turbo connections. When listen is called on the TurboListener, two threads are spawned: one that handles new TCP connection requests until it receives the connection ID, and one to handle the shared UDP port.

When accept is called on the TurboListener, and a new Turbo connection is received, it will return a shiny new Turbo IO which can be used with sandwich to communicate with the requesting client.

So we need to define three key objects: a client Turbo IO object, a server Turbo IO object, and a Turbo Listener.

Client IO.

The implementation details for the client TurboTLS IO are located in client.rs. Between the client and server sides of Turbo communication, the client has much less to manage. The client only has to worry about a single connection across its TCP and UDP ports, so we don’t need to worry about sharing and routing between ports. That makes the implementation very straightforward! Depending on the Sandwich tunnel state, we will either read from/write to our TCP socket, or read from/write to our UDP socket. One catch with reading is that, because UDP does not guarantee packet order, we need to insert each UDP packet into our datagram_stream and then read from the stream rather than processing the UDP packet directly. The other catch comes from the fact that TurboTLS may require the client to send extra dummy packets (for what is called request-based fragmentation, as explained in the original blog post), which requires some additional code to handle.
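The reordering step can be sketched as a small buffer that only releases bytes once every earlier sequence number has arrived. This is an illustrative guess at the idea behind datagram_stream, not the code in client.rs: the DatagramStream type, its fields, and the choice to return WouldBlock when nothing is ready are all assumptions.

```rust
use std::collections::{BTreeMap, VecDeque};
use std::io::{self, Read};

/// Hypothetical reordering buffer for handshake datagrams.
#[derive(Default)]
pub struct DatagramStream {
    next_seq: u16,                   // next sequence number we may release
    pending: BTreeMap<u16, Vec<u8>>, // out-of-order payloads waiting their turn
    ready: VecDeque<u8>,             // contiguous bytes ready to be read
}

impl DatagramStream {
    /// Insert a payload; release it (and any successors) once it is in order.
    /// This sketch does not handle sequence-number wraparound.
    pub fn insert(&mut self, seq: u16, payload: Vec<u8>) {
        if seq < self.next_seq {
            return; // already delivered: treat as a duplicate
        }
        self.pending.entry(seq).or_insert(payload);
        while let Some(chunk) = self.pending.remove(&self.next_seq) {
            self.ready.extend(chunk);
            self.next_seq = self.next_seq.wrapping_add(1);
        }
    }
}

impl Read for DatagramStream {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.ready.is_empty() {
            // Nothing in order yet; the caller should try again later.
            return Err(io::ErrorKind::WouldBlock.into());
        }
        let n = buf.len().min(self.ready.len());
        for (slot, byte) in buf.iter_mut().zip(self.ready.drain(..n)) {
            *slot = byte;
        }
        Ok(n)
    }
}
```

If packet 1 arrives before packet 0, a read reports WouldBlock until packet 0 is inserted, at which point both payloads become readable in order.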

Server IO.

Now you might be thinking that since the server has a lot more to do than a client does, its implementation in server.rs will be quite complex. Well, the good news is that it is not at all. In fact, the server’s read and write functions are almost exactly the same as the client’s read and write! The main difference is that the server does not read from the shared UDP socket directly. It only reads from its datagram_stream. So how do datagrams end up in this stream? The engine will put any UDP datagram meant for this TurboTLS connection into the stream.

Another difference is that rather than having a TCP socket, the server IO has a FutureTCPLink. This is because the server IO will not have a TCP connection assigned to it until the tombstone (the message carrying the connection ID that the client sends over TCP) has been received. Similar to the datagram_stream, this is managed by the engine and will become readable/writable once the engine assigns the TCP link to this server IO.

The final major difference is that every time write or read is called, the respond_all function is called. This is how the server side of request-based fragmentation is handled. If we have received n UDP packets from the client, we can send up to n UDP packets back. This means that there may be instances where a call to write buffers the outgoing message rather than actually sending it across the wire. respond_all() essentially checks whether there are any UDP packets that still need to be sent, and whether enough UDP packets have been received from the client. If so, it will send as many UDP packets as it can.
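The bookkeeping behind this rule can be sketched as a counter of packets received and sent plus an outbox of buffered datagrams. This is a simplified illustration under assumptions, not the code in server.rs: ResponseBudget, on_client_packet, and queue are hypothetical names, and the respond_all here takes a closure in place of a real UDP socket.

```rust
use std::collections::VecDeque;

/// Hypothetical server-side state for request-based fragmentation.
#[derive(Default)]
pub struct ResponseBudget {
    received: usize,            // UDP packets received from the client
    sent: usize,                // UDP packets already sent back
    outbox: VecDeque<Vec<u8>>,  // datagrams buffered by write()
}

impl ResponseBudget {
    /// Record that one more UDP packet arrived from the client.
    pub fn on_client_packet(&mut self) {
        self.received += 1;
    }

    /// Buffer an outgoing datagram instead of sending it immediately.
    pub fn queue(&mut self, datagram: Vec<u8>) {
        self.outbox.push_back(datagram);
    }

    /// Send as many buffered datagrams as the budget allows, never letting
    /// `sent` exceed `received`. Returns how many were sent in this call.
    pub fn respond_all<F: FnMut(&[u8])>(&mut self, mut send: F) -> usize {
        let mut count = 0;
        while self.sent < self.received {
            match self.outbox.pop_front() {
                Some(datagram) => {
                    send(&datagram);
                    self.sent += 1;
                    count += 1;
                }
                None => break, // nothing left to send
            }
        }
        count
    }
}
```

With three datagrams queued and only one client packet seen so far, respond_all sends exactly one datagram; the other two go out once two more client packets (for example the dummy packets mentioned earlier) have arrived.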

The Engine and Listener

Now this is a somewhat more complex file. engine.rs contains both the TurboTLS listener and the backend responsible for routing UDP messages from the shared UDP port and for assigning TCP connections to server IO objects once the tombstone is received. As previously mentioned, when a Turbo Listener has listen() called on it, it will spawn two threads: one responsible for routing UDP packets from the shared UDP socket to the appropriate server IO’s datagram_stream, and one responsible for listening for new TCP connections and eventually associating them with the appropriate server IO’s FutureTCPLink once the tombstone is received.

Although there is a lot to process in engine.rs, it’s actually quite easy to step through. When accept is called, if there is a new Turbo connection waiting to be received, the server IO is popped off the queue and can then be used to create a Sandwich tunnel. The server IO is created when a UDP packet arrives with a connection ID we haven’t seen before. When a new TCP connection comes in, it’s added to our poll list. Once there is something to read on that TCP connection, it has to be the tombstone, so we read the connection ID and give the TCP connection to the appropriate server IO.
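The routing logic can be pictured with a single-threaded sketch built around a map keyed by connection ID. It is deliberately simplified and rests on assumptions: the real engine runs on two threads and hands out server IO objects, whereas the hypothetical Router below is called directly, accept returns just the connection ID, and a plain u32 stands in for the TCP stream handle.

```rust
use std::collections::{HashMap, VecDeque};

// Assumed connection ID size, chosen only for this sketch.
pub type ConnId = [u8; 16];

/// Hypothetical per-connection state held on the server side.
#[derive(Default, Debug)]
pub struct ServerConn {
    pub datagrams: VecDeque<Vec<u8>>, // UDP payloads routed to this connection
    pub tcp_link: Option<u32>,        // placeholder for the TCP stream, set by the tombstone
}

/// Hypothetical router for the shared UDP port and incoming tombstones.
#[derive(Default)]
pub struct Router {
    conns: HashMap<ConnId, ServerConn>,
    pending_accept: VecDeque<ConnId>,
}

impl Router {
    /// Route a UDP payload; an unseen connection ID creates a new connection
    /// and queues it for accept().
    pub fn on_udp(&mut self, id: ConnId, payload: Vec<u8>) {
        if !self.conns.contains_key(&id) {
            self.pending_accept.push_back(id);
        }
        self.conns.entry(id).or_default().datagrams.push_back(payload);
    }

    /// Attach a TCP link once its tombstone names a known connection ID.
    /// Returns false if the connection ID is unknown.
    pub fn on_tombstone(&mut self, id: ConnId, tcp: u32) -> bool {
        match self.conns.get_mut(&id) {
            Some(conn) => {
                conn.tcp_link = Some(tcp);
                true
            }
            None => false,
        }
    }

    /// Pop the next connection that has not been handed out yet.
    pub fn accept(&mut self) -> Option<ConnId> {
        self.pending_accept.pop_front()
    }

    /// Look up the state for a connection ID.
    pub fn conn(&self, id: &ConnId) -> Option<&ServerConn> {
        self.conns.get(id)
    }
}
```

Two UDP packets with the same connection ID produce a single entry on the accept queue, and a tombstone for an unknown connection ID is rejected rather than creating a new connection.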

Using Turbo transport objects with Sandwich

So how do we use this with Sandwich? Good news! We are shipping the IO objects as part of the Sandwich experimental module! This module will contain any new features or protocols which may not be ready for production, but which we want people to be able to play with easily! As part of that, we have included some helper functions similar to the TCP helper functions we already ship.

In an attempt to prevent people from accidentally using TurboTLS in their production environment, we require that Sandwich be compiled with the turbo feature enabled. If you are using our C language bindings, then you’ll have to specifically include turbo.h in your project as well.

Looking for some examples to get started? Check out our test files for:

Conclusion

So that sums up the walkthrough of the TurboTLS Sandwich IO objects. Although we think TurboTLS is really cool, I hope this blog post demonstrates, more than anything, the power and flexibility of the Sandwich IO abstraction. We were able to use two streams of communication, wrap and unwrap TurboTLS packets, and perform somewhat complex backend logic without having to worry about compromising the security TLS has to offer. This greatly reduces the burden of designing new transport protocols. A networking expert can now test new ideas on how to deliver TLS without having to know a single thing about cryptography or cryptography libraries. I’m really excited to see what people are able to do with both TurboTLS and the general Sandwich IO abstraction.

Thanks for reading!