Annotate the ProxyPollTotal prometheus metric with the proxy type so
that we can track, by implementation, the counts of proxies that are
matched and that answer. This will help us catch bugs specific to a
proxy implementation or deployment.
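Roughly, with prometheus/client_golang (the label names and helper here
are assumptions, and the broker's real counters may additionally round
their values):

    import "github.com/prometheus/client_golang/prometheus"

    // Hypothetical declaration: one counter, labeled by proxy type
    // and by what happened to the poll.
    var ProxyPollTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "snowflake",
            Name:      "proxy_poll_total",
            Help:      "Proxy polls, by proxy type and status",
        },
        []string{"type", "status"},
    )

    func recordProxyPoll(proxyType, status string) {
        ProxyPollTotal.With(prometheus.Labels{
            "type":   proxyType, // e.g. "webext" or "standalone"
            "status": status,    // e.g. "matched"
        }).Inc()
    }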
This adds a prometheus metric that tracks snowflake proxy answers. If
the client has not timed out before the proxy responds with an answer,
the proxy type is recorded along with a status of "success". If the
client has timed out, the type is left blank and the status is recorded
as "timeout".
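A hedged sketch of that answer path; the channel, parameter, and metric
names are illustrative, not the broker's actual API:

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    var proxyAnswerTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "snowflake_proxy_answer_total"},
        []string{"type", "status"},
    )

    // awaitAnswer waits for the matched proxy's answer, recording
    // "success" with its type, or "timeout" with a blank type if the
    // client gives up first.
    func awaitAnswer(answers <-chan string, proxyType string, timeout time.Duration) (string, bool) {
        select {
        case answer := <-answers:
            proxyAnswerTotal.With(prometheus.Labels{"type": proxyType, "status": "success"}).Inc()
            return answer, true
        case <-time.After(timeout):
            proxyAnswerTotal.With(prometheus.Labels{"type": "", "status": "timeout"}).Inc()
            return "", false
        }
    }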
The goal of these metrics is to help us determine how many proxies fail
to respond and to help narrow down which proxy implementations are
causing client timeouts.
We were not previously clearing the map we keep of seen IP addresses,
so our unique proxy IP counts represented cumulative churn since the
last broker process restart rather than unique IP counts per day.
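For illustration, a minimal sketch of the fix, assuming a map of seen
proxy IPs that is flushed along with the daily stats (the field and
method names are made up):

    import (
        "fmt"
        "io"
        "sync"
    )

    type metrics struct {
        lock     sync.Mutex
        proxyIPs map[string]bool // proxy IPs seen in the current 24h window
    }

    func (m *metrics) printAndReset(w io.Writer) {
        m.lock.Lock()
        defer m.lock.Unlock()
        fmt.Fprintf(w, "snowflake-ips-total %d\n", len(m.proxyIPs))
        // The bug: without this reset, the map only ever grew, so the
        // "daily" count was really churn since the last restart.
        m.proxyIPs = make(map[string]bool)
    }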
This avoids an information leak: if two countries have the same binned
count but are not listed in alphabetical order, you can infer that the
first had a larger count than the second before binning.
Example: AA=25,BB=27
Without binning, these should sort descending by count:
BB=27,AA=25
But with binning, the counts are the same, so it should sort ascending
by country code:
AA=32,BB=32
Before this change, BB would sort before AA even after binning, which
lets you infer that the count of BB was greater than the count of AA
*before* binning:
BB=32,AA=32
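A sketch of the intended comparison, applied to the already-binned
counts:

    import "sort"

    type countryCount struct {
        Country string
        Count   int // binned (rounded up to a multiple of 8)
    }

    // Sort descending by binned count; break ties ascending by country
    // code, so equal bins reveal nothing about the pre-binning order.
    func sortCounts(cc []countryCount) {
        sort.Slice(cc, func(i, j int) bool {
            if cc[i].Count != cc[j].Count {
                return cc[i].Count > cc[j].Count
            }
            return cc[i].Country < cc[j].Country
        })
    }

With the AA/BB example above, this yields AA=32,BB=32.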
Move the record types closer to where they are used.
Use a strings.Builder rather than repeatedly concatenating strings
(which creates garbage).
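E.g. (illustrative, with countryCount as in the sorting sketch above):

    import (
        "fmt"
        "strings"
    )

    // Build "AA=32,BB=32,..." in a single buffer instead of allocating
    // a new string on every += concatenation.
    func formatCounts(cc []countryCount) string {
        var b strings.Builder
        for _, c := range cc {
            fmt.Fprintf(&b, "%s=%d,", c.Country, c.Count)
        }
        return strings.TrimSuffix(b.String(), ",")
    }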
Use the value that m.Range already provides us, don't look it up again
with LoadAndDelete.
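Sketch of the pattern, assuming counters stored in a sync.Map as
*atomic.Uint64 values:

    import (
        "sync"
        "sync/atomic"
    )

    func drain(m *sync.Map, emit func(key string, count uint64)) {
        m.Range(func(key, value any) bool {
            // Range already hands us the value; fetching it again
            // with LoadAndDelete is redundant.
            emit(key.(string), value.(*atomic.Uint64).Load())
            m.Delete(key)
            return true
        })
    }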
Add documentation comments.
Now that both SQS and AMP cache pull remote addresses from the SDP,
avoid decoding the ClientPollRequest twice by extracting the remote
address in ClientOffers.
The remote address for AMP cache rendezvous is always geolocated to the
AMP cache server address. For more accurate metrics on where this
rendezvous method is used and working, we can pull the remote address
directly from the client SDP sent in the poll request.
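A hypothetical sketch of the idea; DecodeClientPollRequest, the Offer
field, and remoteAddrFromSDP stand in for the broker's actual decoding
of the poll body:

    // Decode the poll request once in ClientOffers and derive the
    // remote address there, instead of re-decoding the
    // ClientPollRequest inside each rendezvous handler.
    func clientOffers(body []byte) error {
        req, err := messages.DecodeClientPollRequest(body)
        if err != nil {
            return err
        }
        // remoteAddrFromSDP is a made-up helper; see the
        // GetCandidateAddrs sketch at the end of this log.
        remoteAddr := remoteAddrFromSDP(req.Offer)
        // ... match with a proxy and update client stats using
        // remoteAddr instead of the HTTP / AMP cache peer address ...
        _ = remoteAddr
        return nil
    }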
This renames the metrics update functions to UpdateProxyStats and
UpdateClientStats, names that are more accurate and clearer than the
previous CountryStats and RendezvousStats.
This is a large change to how the snowflake broker metrics are
implemented. This change removes all uses of mutexes from the metrics
implementation in favor of atomic operations on counters stored in
sync.Map.
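The core pattern is roughly:

    import (
        "sync"
        "sync/atomic"
    )

    type counters struct{ m sync.Map }

    // Inc atomically increments the counter for key, creating it on
    // first use; no mutex is required.
    func (c *counters) Inc(key string) {
        v, _ := c.m.LoadOrStore(key, new(atomic.Uint64))
        v.(*atomic.Uint64).Add(1)
    }

    func (c *counters) Get(key string) uint64 {
        if v, ok := c.m.Load(key); ok {
            return v.(*atomic.Uint64).Load()
        }
        return 0
    }

After the first increment of a key, the hot path is just a map load
plus one atomic add.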
There is a small change to the actual metrics output. We used to count
the same proxy IP multiple times in our snowflake-ips-total and
snowflake-ips country stats if the same proxy IP address polled more
than once with different proxy types. This overcounting of the number
of unique proxy IP addresses is now fixed.
If a unique proxy IP polls with more than one proxy type or NAT type,
those polls will still be counted once for each proxy type or NAT type
in our per-proxy-type and per-NAT-type stats (e.g.,
snowflake-ips-nat-restricted and snowflake-ips-nat-unrestricted).
This was only useful to us when we first implemented the feature, to
support proxies that hadn't yet updated, back when we had a single
Snowflake bridge. Now that we have multiple bridges, it is unnecessary:
proxies that don't send their accepted relay pattern are rejected
anyway.
Client timeouts are currently counted from when the client is matched
with a proxy. Instead, count client timeouts from the moment when the
request is received.
Closes #40449
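Sketched, with illustrative names:

    import "time"

    func handlePoll(answers <-chan string, clientTimeout time.Duration) (string, bool) {
        startTime := time.Now() // request received
        // ... match the client with a proxy; this can take a while ...
        select {
        case answer := <-answers:
            return answer, true
        case <-time.After(clientTimeout - time.Since(startTime)):
            // The budget is measured from request receipt, so time
            // spent waiting for a match counts against the timeout.
            return "", false
        }
    }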
Some of the changes do not involve a potential race condition, so there
the change is purely a refactor, while in others (e.g., in broker.go
and in proxy/lib/snowflake.go) the same variable really is used from
multiple threads / functions.
Our metrics were undercounting client polls by missing the case where
clients are matched with a snowflake but receive a timeout before the
snowflake responds with its answer. This change adds a new metric,
called client-snowflake-timeout-count, to the 24-hour broker stats and a
new "timeout" status label for prometheus metrics.
To test that the broker responds with a proxy answer when one is
available, use only one valid client offer so that the binned metrics
always fall within the first multiple of 8.
Our SQS tests were not concurrency-safe, and we hadn't noticed until
now because we were processing incoming SQS queue messages sequentially
rather than in parallel.
This fix removes the log output checks, which were prone to error
anyway, and relies instead on gomock's expected function calls and
strategic use of the context cancel function for each test.
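The shape of the fix, with a made-up queue interface; MockQueueClient
and runPollLoop are hypothetical stand-ins for the generated mocks and
the code under test:

    import (
        "context"
        "testing"

        "github.com/golang/mock/gomock"
    )

    func TestPollLoop(t *testing.T) {
        ctrl := gomock.NewController(t)
        ctx, cancel := context.WithCancel(context.Background())
        client := NewMockQueueClient(ctrl) // hypothetical generated mock
        client.EXPECT().
            ReceiveMessage(gomock.Any()).
            DoAndReturn(func(ctx context.Context) ([]string, error) {
                cancel() // end the loop deterministically after one call
                return []string{"poll"}, nil
            }).
            MinTimes(1)
        runPollLoop(ctx, client) // returns once ctx is canceled
    }

The expectation replaces fragile log-output assertions, and the cancel
call gives each test a deterministic end point.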
We're losing a lot of messages from the broker SQS queue because they
exceed their maximum lifetime before being read and processed by the
broker. This change speeds up that process by increasing the size of
messagesChn and processing the messages within a goroutine.
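Roughly (the buffer size and names are illustrative):

    import "context"

    func pollQueue(ctx context.Context, receive func() string, handle func(string)) {
        messagesChn := make(chan string, 20) // larger buffer than before
        go func() {
            for msg := range messagesChn {
                go handle(msg) // don't block the reader on slow handling
            }
        }()
        for ctx.Err() == nil {
            messagesChn <- receive()
        }
        close(messagesChn)
    }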
I.e., if no bridge list file is provided, the relay pattern would not
get set. AFAIK this is not a breaking change, because the broker can't
be used as a library, unlike the client and server.
Implement better client IP per rendezvous method tracking for clients
Add tests for added code, fix existing tests
chore(deps): update module github.com/miekg/dns to v1.1.58
Implement better client IP tracking for http and ampcache
Add tests for added code, fix existing tests
Implement GetCandidateAddrs from SDP (see the sketch after this list)
Add getting client IP for SQS
Bug fixes
Bug fix for tests
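For reference, a hedged sketch of what extracting candidate addresses
from an SDP offer can look like; the real GetCandidateAddrs may use a
proper SDP parser and additional filtering:

    import (
        "net"
        "strings"
    )

    func GetCandidateAddrs(sdp string) []net.IP {
        var addrs []net.IP
        for _, line := range strings.Split(sdp, "\n") {
            line = strings.TrimSpace(line)
            // a=candidate:<foundation> <component> <transport>
            //   <priority> <connection-address> <port> typ <type> ...
            if !strings.HasPrefix(line, "a=candidate:") {
                continue
            }
            fields := strings.Fields(line)
            if len(fields) >= 5 {
                if ip := net.ParseIP(fields[4]); ip != nil {
                    addrs = append(addrs, ip)
                }
            }
        }
        return addrs
    }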