Because these were not closed, they continued to consume resources
after the test function returned, which affected the later
BenchmarkSendQueue.
Before:
```
snowflake/common/turbotunnel$ go test -bench BenchmarkSendQueue -v
=== RUN TestQueueIncomingOversize
--- PASS: TestQueueIncomingOversize (0.00s)
=== RUN TestWriteToOversize
--- PASS: TestWriteToOversize (0.00s)
=== RUN TestRestoreMTU
--- PASS: TestRestoreMTU (0.00s)
=== RUN TestRestoreCap
--- PASS: TestRestoreCap (0.00s)
=== RUN TestQueuePacketConnWriteToKCP
--- PASS: TestQueuePacketConnWriteToKCP (1.01s)
goos: linux
goarch: amd64
pkg: gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkSendQueue
BenchmarkSendQueue-4 8519708 136.0 ns/op
PASS
ok gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel 3.481s
```
After:
```
snowflake/common/turbotunnel$ go test -bench BenchmarkSendQueue -v
=== RUN TestQueueIncomingOversize
--- PASS: TestQueueIncomingOversize (0.00s)
=== RUN TestWriteToOversize
--- PASS: TestWriteToOversize (0.00s)
=== RUN TestRestoreMTU
--- PASS: TestRestoreMTU (0.00s)
=== RUN TestRestoreCap
--- PASS: TestRestoreCap (0.00s)
=== RUN TestQueuePacketConnWriteToKCP
--- PASS: TestQueuePacketConnWriteToKCP (1.02s)
goos: linux
goarch: amd64
pkg: gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkSendQueue
BenchmarkSendQueue-4 11620237 105.7 ns/op
PASS
ok gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/v2/common/turbotunnel 3.244s
```
The scan loop was silently stopping at the
"recordingStart":"2022-09-23T17:06:59.680537075Z" line, the first line
whose length (66873) exceeds bufio.MaxScanTokenSize (65536). Now
distinctcounter exits with an error status instead of reporting
partial results.
```
$ ./distinctcounter -from 2023-01-01T00:00:00Z -to 2023-01-10T00:00:00Z -in metrics-ip-salted.jsonl
2023/04/20 13:54:11 unable to count:bufio.Scanner: token too long
```
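The fix is the standard bufio.Scanner idiom: check Scanner.Err after
the loop, since Scan returns false both at EOF and on error. A minimal
sketch, assuming the fix is that error check (the surrounding code is
illustrative, not the actual distinctcounter source):
```
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// ... process scanner.Text() ...
	}
	// Scan returns false on error as well as at EOF. Without this
	// check, a line longer than bufio.MaxScanTokenSize (65536 bytes
	// by default) silently ends the loop and the program reports
	// partial results.
	if err := scanner.Err(); err != nil {
		log.Fatalf("unable to count:%v", err)
	}
}
```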
The noise-generating goroutine was meant to stop when the parent
function returned and closed the `done` channel, but the `break` in
the loop was wrongly exiting only the `select`, not the `for` loop.
This was the cause of the benchmark anomalies in
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40260#note_2885832.
The noise-generating loop from the test was continuing to run while the
benchmarks were running.
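In Go, a bare `break` inside a `select` terminates only the `select`;
exiting the enclosing loop needs a labeled break or a `return`. A
minimal sketch of the bug and the fix (names are illustrative, not the
actual test code):
```
package main

import "time"

// noise stands in for the test's noise generator.
func noise(done <-chan struct{}) {
loop:
	for {
		select {
		case <-done:
			// A bare `break` here would terminate only the select,
			// and the for loop would keep running after `done` is
			// closed.
			break loop
		default:
		}
		// ... send a noise packet ...
		time.Sleep(time.Millisecond)
	}
}

func main() {
	done := make(chan struct{})
	go noise(done)
	time.Sleep(10 * time.Millisecond)
	close(done) // the goroutine now actually stops
	time.Sleep(10 * time.Millisecond)
}
```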
This design is easier to misuse, because it allows the caller to modify
the contents of the slice after queueing it, but it avoids an extra
allocation + memmove per incoming packet.
Before:
```
$ go test -bench='Benchmark(QueueIncoming|WriteTo)' -benchtime=2s -benchmem
BenchmarkQueueIncoming-4 7001494 342.4 ns/op 1024 B/op 2 allocs/op
BenchmarkWriteTo-4 3777459 627 ns/op 1024 B/op 2 allocs/op
```
After:
```
$ go test -bench='Benchmark(QueueIncoming|WriteTo)' -benchtime=2s -benchmem
BenchmarkQueueIncoming-4 13361600 170.1 ns/op 512 B/op 1 allocs/op
BenchmarkWriteTo-4 6702324 373 ns/op 512 B/op 1 allocs/op
```
Despite the benchmark results, the change in QueueIncoming turns out not
to have an effect in practice. It appears that the compiler had already
been optimizing out the allocation and copy in QueueIncoming.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40187
The WriteTo change, on the other hand, in practice reduces the frequency
of garbage collection.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40199
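A sketch of the trade-off described above, assuming a channel-backed
queue (the type and method names are illustrative, not the actual
turbotunnel API):
```
package turbotunnel

import "net"

type packet struct {
	buf  []byte
	addr net.Addr
}

type queuePacketConn struct {
	recvQueue chan packet
}

// Copying design: safe, because the caller may freely reuse p after
// the call, at the cost of an allocation and a memmove per packet.
func (c *queuePacketConn) queueIncomingCopy(p []byte, addr net.Addr) {
	buf := make([]byte, len(p))
	copy(buf, p)
	c.recvQueue <- packet{buf, addr}
}

// Zero-copy design: the queue takes ownership of p. Easier to misuse,
// because the caller must not modify p after the call.
func (c *queuePacketConn) queueIncoming(p []byte, addr net.Addr) {
	c.recvQueue <- packet{p, addr}
}
```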
This change avoids `defer`. It is only a tiny amount faster, but this
function is frequently called.
Before:
```
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15901834 151 ns/op
```
After:
```
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15859948 147 ns/op
```
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40177
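A sketch of the pattern, assuming the deferred call being avoided is a
mutex unlock in the send-queue hot path (the type and names are
illustrative, not the actual code):
```
package turbotunnel

import "sync"

type sendQueue struct {
	mu sync.Mutex
	q  [][]byte
}

// push unlocks explicitly rather than with `defer s.mu.Unlock()`.
// The defer costs a few extra nanoseconds per call, which only
// matters because this function is called so frequently.
func (s *sendQueue) push(p []byte) {
	s.mu.Lock()
	s.q = append(s.q, p)
	s.mu.Unlock()
}
```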
I had thought to set a buffer size of 2048, half the websocket package
default of 4096. But it turns out when you don't set a buffer size, the
websocket package reuses the HTTP server's read/write buffers, which
empirically already have a size of 2048.
```
$ go test -bench=BenchmarkUpgradeBufferSize -benchmem -benchtime=5s
BenchmarkUpgradeBufferSize/0-4 25669 234566 ns/op 32604 B/op 113 allocs/op
BenchmarkUpgradeBufferSize/128-4 24739 238283 ns/op 24325 B/op 117 allocs/op
BenchmarkUpgradeBufferSize/1024-4 25352 238885 ns/op 28087 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/2048-4 22660 234890 ns/op 32444 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/4096-4 25668 232591 ns/op 41672 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/8192-4 24908 240755 ns/op 59103 B/op 116 allocs/op
```
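In gorilla/websocket, a zero buffer size in the Upgrader means the
upgraded connection reuses the buffers the HTTP server allocated for
the request. A minimal server sketch (handler name and address are
illustrative):
```
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// ReadBufferSize and WriteBufferSize are left at zero, so Upgrade
// reuses the buffers allocated by the HTTP server (empirically 2048
// bytes each) instead of allocating its own.
var upgrader = websocket.Upgrader{}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()
	// ... serve the connection ...
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```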
Otherwise the buffers are reallocated on every iteration, which
surprised me: I thought the compiler would do this transformation
itself.
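A sketch of the pattern (illustrative, not the actual code): the
buffer is allocated once, before the loop.
```
package ws

import "io"

// copyLoop reads from src and writes to dst. The point: buf is
// allocated once, before the loop. With the make inside the loop, a
// new buffer would be allocated on every iteration; the compiler does
// not hoist the allocation out automatically.
func copyLoop(dst io.Writer, src io.Reader) error {
	buf := make([]byte, 2048)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err != nil {
			return err
		}
	}
}
```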
Now there is just one allocation per client←server read (one
messageReader) and two allocations per server←client read (one
messageReader and one messageWriter).
```
$ go test -bench=BenchmarkReadWrite -benchmem -benchtime=5s
BenchmarkReadWrite/c←s_150-4 481054 12849 ns/op 11.67 MB/s 8 B/op 1 allocs/op
BenchmarkReadWrite/s←c_150-4 421809 14095 ns/op 10.64 MB/s 56 B/op 2 allocs/op
BenchmarkReadWrite/c←s_3000-4 208564 28003 ns/op 107.13 MB/s 16 B/op 2 allocs/op
BenchmarkReadWrite/s←c_3000-4 186320 30576 ns/op 98.12 MB/s 112 B/op 4 allocs/op
```
In the client←server direction, this hits a fast path that avoids
allocating a messageWriter.
https://github.com/gorilla/websocket/blob/v1.5.0/conn.go#L760
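Illustratively, the fast path is hit by writing whole messages with
WriteMessage on an uncompressed server-side connection (a sketch, not
the actual change):
```
package ws

import "github.com/gorilla/websocket"

// writeFrame sends one binary message with WriteMessage. On the
// server side, with write compression disabled (the default),
// gorilla/websocket v1.5.0 takes its "fast path with no allocations
// and single frame" here, rather than allocating a messageWriter via
// NextWriter.
func writeFrame(conn *websocket.Conn, buf []byte) error {
	return conn.WriteMessage(websocket.BinaryMessage, buf)
}
```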
Cuts the number of allocations in half in the client←server direction:
```
$ go test -bench=BenchmarkReadWrite -benchmem -benchtime=5s
BenchmarkReadWrite/c←s_150-4 597511 13358 ns/op 11.23 MB/s 33709 B/op 2 allocs/op
BenchmarkReadWrite/s←c_150-4 474176 13756 ns/op 10.90 MB/s 34968 B/op 4 allocs/op
BenchmarkReadWrite/c←s_3000-4 156488 36290 ns/op 82.67 MB/s 68673 B/op 5 allocs/op
BenchmarkReadWrite/s←c_3000-4 190897 34719 ns/op 86.41 MB/s 69730 B/op 8 allocs/op
```