Forward the WriteTo method to the inner smux.Stream. This prevents
io.Copy in the top-level proxy function from allocating a buffer per
client.
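As a sketch (assuming a wrapper type, here called clientConn, around
the inner stream; these are not necessarily the real names), the
forwarding looks like this:

import (
    "io"

    "github.com/xtaci/smux"
)

// clientConn is an illustrative wrapper around a *smux.Stream.
type clientConn struct {
    stream *smux.Stream
}

// WriteTo forwards to the inner stream. Because clientConn now
// implements io.WriterTo, io.Copy calls this method instead of
// allocating its own 32 KB copy buffer for each client.
func (c *clientConn) WriteTo(w io.Writer) (int64, error) {
    return c.stream.WriteTo(w)
}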
The smux.Stream WriteTo method returns io.EOF on success, contrary to
the contract of io.Copy that says it should return nil. Ignore io.EOF in
the proxy loop to avoid a log message.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40177
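The check described above could look something like this in the proxy
loop (a sketch with illustrative names, not the actual snowflake code):

import (
    "io"
    "log"
    "net"

    "github.com/xtaci/smux"
)

func proxy(client net.Conn, stream *smux.Stream) {
    go func() {
        // io.Copy uses the stream's WriteTo here; treat the io.EOF it
        // returns on success as a normal close, not an error worth
        // logging.
        if _, err := io.Copy(client, stream); err != nil && err != io.EOF {
            log.Printf("copy stream to client: %v", err)
        }
    }()
    if _, err := io.Copy(stream, client); err != nil && err != io.EOF {
        log.Printf("copy client to stream: %v", err)
    }
}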
Rather than use defer: it is only a tiny amount faster, but this
function is frequently called.
Before:
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15901834 151 ns/op
After:
$ go test -bench=BenchmarkSendQueue -benchtime=2s
BenchmarkSendQueue-4 15859948 147 ns/op
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40177
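The shape of the change, as a rough sketch (the type and field names
here are made up, not the actual send queue code):

import "sync"

type sendQueue struct {
    lock    sync.Mutex
    packets [][]byte
}

// enqueue unlocks explicitly rather than with defer. The saving is
// only a few nanoseconds per call, but this runs once per queued
// packet.
func (q *sendQueue) enqueue(p []byte) {
    q.lock.Lock()
    q.packets = append(q.packets, p)
    q.lock.Unlock() // rather than: defer q.lock.Unlock()
}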
We use a call to test -z together with go fmt, because go fmt does not
exit with a non-zero status on its own (a non-zero exit status is what
triggers a CI test failure). However, we lose useful debugging output
from the go fmt call, because test -z swallows it. This adds very
verbose formatting output to the CI test.
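The check in question is presumably a pattern along these lines (not
the exact CI script):

$ test -z "$(go fmt ./...)"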
I had thought to set a buffer size of 2048, half the websocket
package's default of 4096. But it turns out that when you don't set a
buffer size, the websocket package reuses the HTTP server's read/write
buffers, which empirically already have a size of 2048.
$ go test -bench=BenchmarkUpgradeBufferSize -benchmem -benchtime=5s
BenchmarkUpgradeBufferSize/0-4 25669 234566 ns/op 32604 B/op 113 allocs/op
BenchmarkUpgradeBufferSize/128-4 24739 238283 ns/op 24325 B/op 117 allocs/op
BenchmarkUpgradeBufferSize/1024-4 25352 238885 ns/op 28087 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/2048-4 22660 234890 ns/op 32444 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/4096-4 25668 232591 ns/op 41672 B/op 116 allocs/op
BenchmarkUpgradeBufferSize/8192-4 24908 240755 ns/op 59103 B/op 116 allocs/op
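For reference, leaving the sizes unset amounts to this (a sketch, not
the exact server code):

import (
    "net/http"

    "github.com/gorilla/websocket"
)

// ReadBufferSize and WriteBufferSize are left at their zero values, so
// the Upgrader reuses the buffers already allocated by the HTTP server
// instead of allocating new ones for each connection.
var upgrader = websocket.Upgrader{}

func wsHandler(w http.ResponseWriter, r *http.Request) {
    conn, err := upgrader.Upgrade(w, r, nil)
    if err != nil {
        return
    }
    defer conn.Close()
    // ... use conn ...
}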
Otherwise the buffers are re-allocated on every iteration, which is a
surprise to me. I thought the compiler would do this transformation
itself.
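A minimal sketch of the pattern (an illustrative benchmark loop, not
the real one): allocate the buffers once before the loop and reuse
them inside it.

import "testing"

func BenchmarkCopyLoop(b *testing.B) {
    // Allocate once, outside the loop. Moving these make calls inside
    // the loop costs a fresh heap allocation on every iteration; the
    // compiler does not hoist them out on its own.
    writeBuf := make([]byte, 3000)
    readBuf := make([]byte, 3000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // ... one write from writeBuf and one read into readBuf ...
        _ = writeBuf
        _ = readBuf
    }
}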
Now there is just one allocation per client←server read (one
messageReader) and two allocations per server←client read (one
messageReader and one messageWriter).
$ go test -bench=BenchmarkReadWrite -benchmem -benchtime=5s
BenchmarkReadWrite/c←s_150-4 481054 12849 ns/op 11.67 MB/s 8 B/op 1 allocs/op
BenchmarkReadWrite/s←c_150-4 421809 14095 ns/op 10.64 MB/s 56 B/op 2 allocs/op
BenchmarkReadWrite/c←s_3000-4 208564 28003 ns/op 107.13 MB/s 16 B/op 2 allocs/op
BenchmarkReadWrite/s←c_3000-4 186320 30576 ns/op 98.12 MB/s 112 B/op 4 allocs/op
In the client←server direction, this hits a fast path that avoids
allocating a messageWriter.
https://github.com/gorilla/websocket/blob/v1.5.0/conn.go#L760
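A sketch of the write side that reaches that fast path (the wrapper
names are illustrative): writing each outgoing chunk with
Conn.WriteMessage, rather than opening a NextWriter per message, lets
the package skip the messageWriter allocation on server connections
with compression disabled.

import "github.com/gorilla/websocket"

// wsConn is an illustrative wrapper around a *websocket.Conn.
type wsConn struct {
    ws *websocket.Conn
}

// Write sends p as a single binary message. WriteMessage takes the
// no-allocation fast path linked above when the connection is the
// server side and write compression is disabled.
func (c *wsConn) Write(p []byte) (int, error) {
    if err := c.ws.WriteMessage(websocket.BinaryMessage, p); err != nil {
        return 0, err
    }
    return len(p), nil
}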
Cuts the number of allocations in half in the client←server direction:
$ go test -bench=BenchmarkReadWrite -benchmem -benchtime=5s
BenchmarkReadWrite/c←s_150-4 597511 13358 ns/op 11.23 MB/s 33709 B/op 2 allocs/op
BenchmarkReadWrite/s←c_150-4 474176 13756 ns/op 10.90 MB/s 34968 B/op 4 allocs/op
BenchmarkReadWrite/c←s_3000-4 156488 36290 ns/op 82.67 MB/s 68673 B/op 5 allocs/op
BenchmarkReadWrite/s←c_3000-4 190897 34719 ns/op 86.41 MB/s 69730 B/op 8 allocs/op