Commit graph

194 commits

Author SHA1 Message Date
Cecylia Bocovich
766988b77b
Add proxy type label to ProxyPollTotal metric
Annotate ProxyPollTotal prometheus metrics with the proxy type so that
we can track counts of proxies that are matched and that answer by
implementation. This will help us catch bugs by implementation or
deployment.
2025-09-09 18:18:42 -04:00
Cecylia Bocovich
d08efc34c3
Add prometheus metric for proxy answer counts
This adds a prometheus metric that tracks snowflake proxy answers. If
the client has not timed out before the proxy responds with an answer,
the proxy type is recorded along with a status of "success". If the
client has timed out, the type is left blank and the status is recorded
as "timeout".

The goal of these metrics is to help us determine how many proxies fail
to respond and to help narrow down which proxy implementations are
causing client timeouts.
2025-09-09 12:41:27 -04:00
Cecylia Bocovich
f777a93b29
Add failing concurrency test
This test fails when run with
  go test -v -race
due to data races in calls to prometheus update functions.
2025-08-20 16:37:31 -04:00
Cecylia Bocovich
c8b0b31601
Clear map of seen proxy IP addresses
We were not previously clearing the map we keep of seen IP addresses,
which resulted in our unique proxy IP counts representing churn rather
than unique IP counts per day, except during broker process restarts.
2025-08-20 09:35:32 -04:00
Cecylia Bocovich
43a35655ad
Add failing test for cleared IP map
We are not clearing the map of seen IP addresses when metrics are
printed, resulting in lower than expected unique IP address counts for
daily metrics.

See https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40472
2025-08-20 09:35:32 -04:00
David Fifield
fd42bcea8a Comment typo. 2025-08-19 14:50:23 +00:00
David Fifield
74c39cc8e9
Bin country stats counts before sorting, not after.
This avoids an information leak where, if two countries have the same
count but are not in alphabetical order, you know the first one had a
larger count than the second one before binning.

Example: AA=25,BB=27

Without binning, these should sort descending by count:
BB=27,AA=25

But with binning, the counts are the same, so it should sort ascending
by country code:
AA=32,BB=32

Before this change, BB would sort before AA even after binning, which
lets you infer that the count of BB was greater than the count of AA
*before* binning:
BB=32,AA=32
2025-08-19 09:58:51 -04:00
David Fifield
7a003c9bb1
Add a failing test for ordering of country stats that bin to the same value. 2025-08-19 09:58:50 -04:00
David Fifield
9946c0f1d8
Add a test for formatAndClearCountryStats with binned=true. 2025-08-19 09:58:50 -04:00
David Fifield
bd04cd7752
Refactor TestFormatAndClearCountryStats. 2025-08-19 09:58:50 -04:00
David Fifield
cc0a33faea
Move formatAndClearCountryStats test into a new metrics_test.go.
This is more unit test–y. We don't need a full broker instantiation for
testing this function, unlike other tests in snowflake-broker_test.go.
2025-08-19 09:58:50 -04:00
David Fifield
ec39237e69 Add a test that formatAndClearCountryStats clears the map. 2025-08-15 19:24:58 +00:00
David Fifield
ed3bd99df6 Rename displayCountryStats to formatAndClearCountryStats.
The old name did not make it clear that the function has the side effect
of clearing the map.
2025-08-15 19:24:58 +00:00
David Fifield
75daf2210f Refactor displayCountryStats.
Move the record types closer to where they are used.

Use a strings.Builder rather than repeatedly concatenating strings
(which creates garbage).

Use the value that m.Range already provides us, don't look it up again
with LoadAndDelete.

Add documentation comments.
2025-08-15 19:24:58 +00:00
David Fifield
6e0e5f9137 Express records.Less more clearly. 2025-08-15 19:24:58 +00:00
David Fifield
fed11184c7 Have records.Less express the order we want directly.
The ordering is descending by count, then ascending by cc. Express that
directly, rather than specifying the opposite ordering and using
sort.Reverse.
2025-08-15 19:24:58 +00:00
David Fifield
b058b10a94 Express binCount using integer operations.
No need to bring a float64 into this.
2025-08-15 19:24:58 +00:00
Cecylia Bocovich
70974640ab
Defer SQS client IP extraction to ClientOffers
Now that both SQS and AMP cache are pulling remote addresses from the
SDP, avoid duplicate decodings of the ClientPollRequest by extracting
the remote addr in ClientOffers.
2025-08-14 14:13:47 -04:00
Cecylia Bocovich
0bbcb1eca4
Add test for AMP cache geolocation 2025-08-14 14:13:47 -04:00
Cecylia Bocovich
31f879aad5
Pull client IP from SDP for AMP cache rendezvous
The remote address for AMP cache rendezvous is always geolocated to the
AMP cache server address. For more accurate metrics on where this
rendezvous method is used and working, we can pull the remote address
directly from the client SDP sent in the poll request.
2025-08-14 14:13:47 -04:00
Cecylia Bocovich
58b1d48e54
Increment prometheus proxy_total count once per IP
This fixes a regression from !574 that did not check whether the IP was
unique before incrementing the counter.

Closes https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40470
2025-07-10 10:41:26 -04:00
Cecylia Bocovich
1d73e14f34
Rename metrics update functions
This changes the metrics update functions to UpdateProxyStats and
UpdateClientStats, which is more accurate and clear than the previous
CountryStats and RendezvousStats names.
2025-06-24 13:12:10 -04:00
Cecylia Bocovich
78cf8e68b2
Simplify broker metrics and remove mutexes
This is a large change to how the snowflake broker metrics are
implemented. This change removes all uses of mutexes from the metrics
implementation in favor of atomic operations on counters stored in
sync.Map.

There is a small change to the actual metrics output. We used to count
the same proxy ip multiple times in our snowflake-ips-total and
snowflake-ips country stats if the same proxy ip address polled more
than once with different proxy types. This was an overcounting of the
number of unique proxy IP addresses that is now fixed.

If a unique proxy ip polls with more than one proxy type or nat type,
these polls will still be counted once for each proxy type or nat type
in our proxy type and nat type specific stats (e.g.,
snowflake-ips-nat-restricted and snowflake-ips-nat-unrestricted).
2025-06-24 13:12:10 -04:00
David Fifield
64c7a26475 Comment typo. 2025-06-19 15:39:24 +00:00
Cecylia Bocovich
08239cca2a
Remove broker log messages for invalid SDP and SQS cleanup 2025-03-27 15:34:09 -04:00
Cecylia Bocovich
dd5fb03c49
Remove default relay pattern option from broker
This was only useful to us when we first implemented the feature, to be
able to support proxies that hadn't yet updated, when we had a single
Snowflake bridge. Now that we have multiple bridges, it is unecessary as
proxies that don't send their accepted relay pattern are rejected
anyway.
2025-03-26 13:32:30 -04:00
Cecylia Bocovich
c0ac0186f1
Remove bad relay pattern log message
We already count proxies rejected for their supported relay URL in
snowflake metrics and these messages are filling up our broker logs.
2025-03-26 13:32:30 -04:00
Cecylia Bocovich
8343bbc336
Add context with timeout for client requests
Client timeouts are currently counted from when the client is matched
with a proxy. Instead, count client timeouts from the moment when the
request is received.

Closes #40449
2025-03-26 13:30:59 -04:00
WofWca
1923803124 fix: potential race conditions with non-local err
Some of the changes do not appear to have a potential race condition,
so there it is purely a refactor,
while in others (e.g. in broker.go and in proxy/lib/snowflake.go)
we do use the same variable from multiple threads / functions.
2025-03-12 00:47:07 +04:00
Cecylia Bocovich
57dc276e48
Update broker metrics to count matches, denials, and timeouts
Our metrics were undercounting client polls by missing the case where
clients are matched with a snowflake but receive a timeout before the
snowflake responds with its answer. This change adds a new metric,
called client-snowflake-timeout-count, to the 24 hour broker stats and a
new "timeout" status label for prometheus metrics.
2025-03-11 12:36:27 -04:00
Cecylia Bocovich
9e619a3654
Remove metrics race condition in sqs test
To test that the broker responds with a proxy answer if available, have
only one valid client offer to ensure metrics will always be in the
first multiple of 8.
2025-03-04 10:37:37 -05:00
Cecylia Bocovich
80374c6d93
Move nonblocking AddSnowflake out of goroutine in sqs test
This fixes a race condition in tests where sometimes snowflake matching
happens before enough snowflakes get added to the heap.
2025-03-04 10:37:37 -05:00
Cecylia Bocovich
63613cc50a
Fix minor data race in Snowflake broker metrics 2025-02-20 09:39:11 -05:00
Cecylia Bocovich
1180d11a66
Remove data races from sqs tests
Our SQS tests were not concurrency safe and we hadn't noticed until now
because we were processing incoming SQS queue messages sequentially
rather than in parallel.

This fix removes the log output checks, which were prone to error
anyway, and relies instead on gomock's expected function calls and
strategic use of the context cancel function for each test.
2025-02-20 09:39:11 -05:00
Cecylia Bocovich
2250bc86f6
Process and read broker SQS messages more quickly
We're losing a lot of messages from the broker SQS queue because they
are exceeding their maximum lifetime before being read and processed by
the broker. This change speeds up that process by increasing the size of
messagesChn and processing the messages within a go routine.
2025-02-20 09:37:18 -05:00
WofWca
cb0fb02cd5
fix(proxy): not answering before client timeout
This is related to
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40230.

The initial MR that closed that issue,
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/391,
was not semantically correct, because `DataChannelTimeout`
starts after the client has already received the answer.

After
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/498#note_3156256
the code became not only semantically incorrect,
but also functionally incorrect because now if this timeout is hit
by the proxy, the client is guaranteed to be gone already.
This commit fixes it, by lowering the timeout.

This addresses a suggestion in
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40447.

This also closes
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40381
and supersedes
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/415.
2025-02-12 10:17:08 -05:00
Cecylia Bocovich
4a1e075ee0
Lower broker ClientTimeout to 5 seconds
Matches the observed timeout for CDN77, based on user reports.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40446
2025-02-04 15:41:35 -05:00
Neel Chauhan
8792771cdc
broker and proxy must not reject client offers with no ICE candidates
Fixes #40371. Partially reverts !141.
2024-10-17 15:46:02 -04:00
WofWca
71828580bb fix(broker): empty pattern if bridge-list is empty
i.e. if no bridge list file is provided, the relay pattern
would not get set.

AFAIK this is not a breaking change because the broker
can't be used as a library, unlike client and server.
2024-09-21 15:11:37 +00:00
WofWca
ec9476e5ab
Better error msg on bad fingerprint 2024-09-04 10:47:08 -04:00
meskio
0804d8651f
Merge remote-tracking branch 'gitlab/mr/362' 2024-08-22 13:35:53 +02:00
WofWca
677146c9d5 add test_bridgeList.txt file
As an example for the `bridge-list-path` parameter
2024-08-21 20:50:59 +04:00
WofWca
103278d6fa
docs(broker): clarify allowed-relay-pattern
Specify that the broker will reject proxies
whose AcceptedRelayPattern is more restrictive than this,
and not less restrictive.

The parameter was introduced here
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/87
> The proxy sends its allowed URL pattern to the broker.
> The broker rejects proxies that are too restrictive.
2024-08-20 12:43:31 +01:00
meskio
f64f234eeb
New ptuitl/safeprom doesn't have Rounded in the type names
This version fixes the test issue of double registering metrics.

* Closes: #40367
2024-07-11 17:45:57 +02:00
meskio
a9df5dd71a
Use ptutil for safelog and prometheus rounded metrics
* Related: #40354
2024-05-09 16:24:33 +02:00
Michael Pu
b512e242e8 Implement better client IP per rendezvous method tracking for clients
Implement better client IP per rendezvous method tracking for clients

Add tests for added code, fix existing tests

chore(deps): update module github.com/miekg/dns to v1.1.58

Implement better client IP tracking for http and ampcache

Add tests for added code, fix existing tests

Implement GetCandidateAddrs from SDP

Add getting client IP for SQS

Bug fixes

Bug fix for tests
2024-03-09 13:36:25 -05:00
am3o
acce1f1fd9
refactor: change deprecated "io/ioutil" package to recommended "io" package 2024-02-17 12:47:22 +01:00
Michael Pu
5f5cbe6431
Prune metrics that are reported for rendezvous
Signed-off-by: Cecylia Bocovich <cohosh@torproject.org>
2024-01-31 14:34:32 -05:00
Anthony Chang
dbecefa7d2
Move RendezvousMethod field to messages.Arg 2024-01-31 14:34:29 -05:00
Michael Pu
26ceb6e20d
Add metrics for tracking rendezvous method
Update tests for metrics

Add rendezvous_method to Prometheus metrics

Update broker spec docs with rendezvous method metrics

Bug fix
2024-01-31 14:34:29 -05:00