Annotate ProxyPollTotal prometheus metrics with the proxy type so that
we can track counts of proxies that are matched and that answer by
implementation. This will help us catch bugs by implementation or
deployment.
This adds a prometheus metric that tracks snowflake proxy answers. If
the client has not timed out before the proxy responds with an answer,
the proxy type is recorded along with a status of "success". If the
client has timed out, the type is left blank and the status is recorded
as "timeout".
The goal of these metrics is to help us determine how many proxies fail
to respond and to help narrow down which proxy implementations are
causing client timeouts.
We were not previously clearing the map we keep of seen IP addresses,
which resulted in our unique proxy IP counts representing churn rather
than unique IP counts per day, except during broker process restarts.
This avoids an information leak where, if two countries have the same
count but are not in alphabetical order, you know the first one had a
larger count than the second one before binning.
Example: AA=25,BB=27
Without binning, these should sort descending by count:
BB=27,AA=25
But with binning, the counts are the same, so it should sort ascending
by country code:
AA=32,BB=32
Before this change, BB would sort before AA even after binning, which
lets you infer that the count of BB was greater than the count of AA
*before* binning:
BB=32,AA=32
Move the record types closer to where they are used.
Use a strings.Builder rather than repeatedly concatenating strings
(which creates garbage).
Use the value that m.Range already provides us, don't look it up again
with LoadAndDelete.
Add documentation comments.
This changes the metrics update functions to UpdateProxyStats and
UpdateClientStats, which is more accurate and clear than the previous
CountryStats and RendezvousStats names.
This is a large change to how the snowflake broker metrics are
implemented. This change removes all uses of mutexes from the metrics
implementation in favor of atomic operations on counters stored in
sync.Map.
There is a small change to the actual metrics output. We used to count
the same proxy ip multiple times in our snowflake-ips-total and
snowflake-ips country stats if the same proxy ip address polled more
than once with different proxy types. This was an overcounting of the
number of unique proxy IP addresses that is now fixed.
If a unique proxy ip polls with more than one proxy type or nat type,
these polls will still be counted once for each proxy type or nat type
in our proxy type and nat type specific stats (e.g.,
snowflake-ips-nat-restricted and snowflake-ips-nat-unrestricted).
Our metrics were undercounting client polls by missing the case where
clients are matched with a snowflake but receive a timeout before the
snowflake responds with its answer. This change adds a new metric,
called client-snowflake-timeout-count, to the 24 hour broker stats and a
new "timeout" status label for prometheus metrics.
Implement better client IP per rendezvous method tracking for clients
Add tests for added code, fix existing tests
chore(deps): update module github.com/miekg/dns to v1.1.58
Implement better client IP tracking for http and ampcache
Add tests for added code, fix existing tests
Implement GetCandidateAddrs from SDP
Add getting client IP for SQS
Bug fixes
Bug fix for tests
Follow up to 160ae2d
Analysis by @dcf,
> I don't think the sync.Once around logMetrics is necessary anymore.
Its original purpose was to inhibit logging on later file handles of
metrics.log, if there were more than one opened. See 171c55a9 and #29734
(comment 2593039) "Making a singleton *Metrics variable causes problems
with how Convey does tests. It shouldn't be called more than once, but
for now I'm using sync.Once on the logging at least so it's explicit."
Commit ba4fe1a7 changed it so that metrics.log is opened in main, used
to create a *log.Logger, and that same instance of *log.Logger is passed
to both NewMetrics and NewBrokerContext. It's safe to share the same
*log.Logger across multiple BrokerContext.
Doesn't seem like it needs to exist outside of the metrics struct.
Also, the call to logMetrics is moved to the constructor. A metrics
instance is only created when a BrokerContext is created, which only
happens at startup. The sync of only doing that once is left for
documentation purposes, since it doesn't hurt, but also seems redundant.
The default prometheus registry exports data that may be useful for
side-channel attacks. This removes all of the default metrics and makes
sure we are only reporting snowflake metrics from the broker.
This change adds a prometheus exporter for our existing snowflake broker
metrics. Current values for the metrics can be fetched by sending a GET
request to /prometheus.
We currently don't sort the snowflake-ips metrics:
snowflake-ips CA=1,DE=1,AR=1,NL=1,FR=1,GB=2,US=4,CH=1
To facilitate eyeballing our metrics, this patch sorts snowflake-ips by
value. If the value is identical, we sort by string, i.e.:
snowflake-ips US=4,GB=2,AR=1,CA=1,CH=1,DE=1,FR=1,NL=1
This patch fixes tpo/anti-censorship/pluggable-transports/snowflake#40011
As we now partition proxies by NAT type, our stats are more useful if they
capture how many proxies of each type we have, and information on
whether we have enough proxies of the right NAT type for our clients.
This change adds proxy counts by NAT type and binned counts of denied clients by NAT type.
Added another lock to the metrics struct to synchronize accesses to the
broker stats. There's a possible race condition if stats are updated at
the same time they are being logged.
Many of our log messages were being used to generate metrics, but are
now being aggregated and logged to a separate metrics log file and so we
don't need them in the regular logs anymore.
This addresses the goal of ticket #30830, to remove unecessary messages
and keep broker logs for debugging purposes.