This design is easier to misuse, because it allows the caller to modify
the contents of the slice after queueing it, but it avoids an extra
allocation + memmove per incoming packet.
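Roughly, the difference is as follows. This is only a sketch of the two designs, using a hypothetical chan-based queue rather than the actual Snowflake types:

// Before: copy the caller's slice so that later modification cannot affect
// the queued packet. Costs one allocation and one memmove per packet.
func queueIncomingCopy(queue chan []byte, p []byte) {
	buf := make([]byte, len(p))
	copy(buf, p)
	queue <- buf
}

// After: queue the caller's slice directly. No allocation or copy, but the
// caller must not modify p after the call returns.
func queueIncomingNoCopy(queue chan []byte, p []byte) {
	queue <- p
}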
Before:
$ go test -bench='Benchmark(QueueIncoming|WriteTo)' -benchtime=2s -benchmem
BenchmarkQueueIncoming-4    7001494    342.4 ns/op    1024 B/op    2 allocs/op
BenchmarkWriteTo-4          3777459    627 ns/op      1024 B/op    2 allocs/op
After:
$ go test -bench='Benchmark(QueueIncoming|WriteTo)' -benchtime=2s -benchmem
BenchmarkQueueIncoming-4    13361600    170.1 ns/op    512 B/op    1 allocs/op
BenchmarkWriteTo-4           6702324    373 ns/op      512 B/op    1 allocs/op
Despite the benchmark results, the change in QueueIncoming turns out not
to have an effect in practice. It appears that the compiler had already
been optimizing out the allocation and copy in QueueIncoming.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40187
The WriteTo change, on the other hand, in practice reduces the frequency
of garbage collection.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40199
Avoid the use of defer in this function. It is only a tiny amount faster,
but this function is called frequently.
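For illustration, assuming (the message does not say) that the removed defer guarded a mutex unlock, the difference looks roughly like this; the type and field names are hypothetical:

import "sync"

type clientMap struct {
	mu     sync.Mutex
	queues map[string]chan []byte // hypothetical per-client send queues
}

// With defer: the deferred call adds a small, fixed overhead.
func (m *clientMap) sendQueueDefer(id string) chan []byte {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.queues[id]
}

// Without defer: unlock explicitly before returning. The saving is tiny,
// but this runs for every packet.
func (m *clientMap) sendQueue(id string) chan []byte {
	m.mu.Lock()
	q := m.queues[id]
	m.mu.Unlock()
	return q
}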
Before:
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15901834 151 ns/op
After:
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15859948 147 ns/op
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40177
This should increase the maximum amount of in-flight data and hopefully
improve the performance of Snowflake, especially for clients geographically
distant from proxies and the server.
Introduce a WaitGroup and a done channel to ensure that both the read and
write goroutines for turbotunnel connections terminate when the connection
is closed.
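A minimal sketch of the pattern, with illustrative names rather than the actual Snowflake identifiers: closing the done channel tells both goroutines to stop, and the WaitGroup lets Close wait until they have actually exited.

import "sync"

type tunnelConn struct {
	done     chan struct{}
	outgoing chan []byte
	wg       sync.WaitGroup
}

func newTunnelConn() *tunnelConn {
	c := &tunnelConn{
		done:     make(chan struct{}),
		outgoing: make(chan []byte, 32),
	}
	c.wg.Add(2)
	go c.readLoop()
	go c.writeLoop()
	return c
}

func (c *tunnelConn) readLoop() {
	defer c.wg.Done()
	for {
		select {
		case <-c.done:
			return
		default:
			// Read the next packet from the WebSocket here.
		}
	}
}

func (c *tunnelConn) writeLoop() {
	defer c.wg.Done()
	for {
		select {
		case <-c.done:
			return
		case p := <-c.outgoing:
			_ = p // Write p to the WebSocket here.
		}
	}
}

func (c *tunnelConn) Close() error {
	close(c.done)
	c.wg.Wait() // Returns only after both goroutines have exited.
	return nil
}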
The client opts into turbotunnel mode by sending a magic token at the
beginning of each WebSocket connection (before sending even the
ClientID). The token is just a random byte string I generated. The
server peeks at the token and, if it matches, uses turbotunnel mode.
Otherwise, it unreads the token and continues in the old
one-session-per-WebSocket mode.
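A rough sketch of the peek-and-dispatch idea; magicToken, handleTurbotunnel, and handleOneSession are hypothetical stand-ins, and the real code may put the token back in a different way. Using bufio.Reader.Peek, the token is never consumed unless it matches, so the legacy path sees the stream unchanged.

import (
	"bufio"
	"bytes"
	"io"
)

// magicToken stands in for the real token, whose actual value is a random
// byte string defined in the Snowflake source.
var magicToken = []byte("token-placeholder")

func handleStream(conn io.ReadWriter) error {
	br := bufio.NewReader(conn)
	// Peek looks at the first bytes without consuming them, so nothing
	// needs to be put back if the token does not match.
	tok, err := br.Peek(len(magicToken))
	if err == nil && bytes.Equal(tok, magicToken) {
		br.Discard(len(magicToken)) // consume the token
		return handleTurbotunnel(br, conn)
	}
	// Old one-session-per-WebSocket mode; the peeked bytes are still in br.
	return handleOneSession(br, conn)
}

// Stubs for illustration only.
func handleTurbotunnel(r io.Reader, w io.Writer) error { return nil }
func handleOneSession(r io.Reader, w io.Writer) error  { return nil }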