An update the the kcp-go library removes the guarantee that all data
written to a KCP connection will be flushed before the connection is
closed. Moving the sleep call has no impact on the integrity of the
tests, and gives the connection time to flush data before the connection
is closed.
See https://github.com/xtaci/kcp-go/issues/273
With these not being closed, they were continuing to consume resources
after the return of the test function, which was affecting the later
BenchmarkSendQueue.
Before:
```
snowflake/common/turbotunnel$ go test -bench BenchmarkSendQueue -v
=== RUN TestQueueIncomingOversize
--- PASS: TestQueueIncomingOversize (0.00s)
=== RUN TestWriteToOversize
--- PASS: TestWriteToOversize (0.00s)
=== RUN TestRestoreMTU
--- PASS: TestRestoreMTU (0.00s)
=== RUN TestRestoreCap
--- PASS: TestRestoreCap (0.00s)
=== RUN TestQueuePacketConnWriteToKCP
--- PASS: TestQueuePacketConnWriteToKCP (1.01s)
goos: linux
goarch: amd64
pkg: gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkSendQueue
BenchmarkSendQueue-4 8519708 136.0 ns/op
PASS
ok gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel 3.481s
```
After:
```
snowflake/common/turbotunnel$ go test -bench BenchmarkSendQueue -v
=== RUN TestQueueIncomingOversize
--- PASS: TestQueueIncomingOversize (0.00s)
=== RUN TestWriteToOversize
--- PASS: TestWriteToOversize (0.00s)
=== RUN TestRestoreMTU
--- PASS: TestRestoreMTU (0.00s)
=== RUN TestRestoreCap
--- PASS: TestRestoreCap (0.00s)
=== RUN TestQueuePacketConnWriteToKCP
--- PASS: TestQueuePacketConnWriteToKCP (1.02s)
goos: linux
goarch: amd64
pkg: gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkSendQueue
BenchmarkSendQueue-4 11620237 105.7 ns/op
PASS
ok gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel 3.244s
```
The noise-generating goroutine was meant to stop when the parent
function returned and closed the `done` channel. The `break` in the loop
was wrongly exiting only from the `select`, not from the `for`.
This was the cause of banchmark anomalies in
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40260#note_2885832.
The noise-generating loop from the test was continuing to run while the
benchmarks were running.
This design is easier to misuse, because it allows the caller to modify
the contents of the slice after queueing it, but it avoids an extra
allocation + memmove per incoming packet.
Before:
$ go test -bench='Benchmark(QueueIncoming|WriteTo)' -benchtime=2s -benchmem
BenchmarkQueueIncoming-4 7001494 342.4 ns/op 1024 B/op 2 allocs/op
BenchmarkWriteTo-4 3777459 627 ns/op 1024 B/op 2 allocs/op
After:
$ go test -bench=BenchmarkWriteTo -benchtime 2s -benchmem
BenchmarkQueueIncoming-4 13361600 170.1 ns/op 512 B/op 1 allocs/op
BenchmarkWriteTo-4 6702324 373 ns/op 512 B/op 1 allocs/op
Despite the benchmark results, the change in QueueIncoming turns out not
to have an effect in practice. It appears that the compiler had already
been optimizing out the allocation and copy in QueueIncoming.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40187
The WriteTo change, on the other hand, in practice reduces the frequency
of garbage collection.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40199