Grimhilde
Usecase
At dotsrc, mirror statistics are currently generated by the mirror_stats Go program. While this approach works decently, its granularity could be improved, and the very large JSON blobs it generates are taxing to work with.
Grimhilde is a proposed replacement, currently in development.
Planning & Structure
By using the log_format directive in our nginx configuration, the access log output can be customized to include only the data we need, in machine-readable form rather than the default human-readable form. Different logging facilities can also be configured, including logging over the syslog protocol. Grimhilde emulates a syslog server by opening a local UDP socket, enabling fast and reliable communication between nginx and Grimhilde.
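As a rough illustration of the receiving side, the sketch below binds a UDP socket and strips the syslog header from each datagram. It is not Grimhilde's actual code: the function names `syslog_payload` and `serve` are hypothetical, and it assumes nginx sends classic `<priority>timestamp hostname tag: message` datagrams with the tag `nginx`.

```rust
use std::net::UdpSocket;

/// Return the log line that follows the "tag: " marker of a syslog datagram.
/// Assumes a `<priority>` header at the start; returns None otherwise.
fn syslog_payload<'a>(datagram: &'a str, tag: &str) -> Option<&'a str> {
    // The datagram must begin with a numeric priority such as "<190>".
    let rest = datagram.strip_prefix('<')?;
    let (priority, rest) = rest.split_once('>')?;
    priority.parse::<u8>().ok()?;

    // Everything after "tag: " is the line produced by log_format.
    let marker = format!("{}: ", tag);
    let start = rest.find(&marker)? + marker.len();
    Some(&rest[start..])
}

/// Receive datagrams on `addr` forever and print each payload.
/// A real server would parse and store the payload instead of printing it.
fn serve(addr: &str) -> std::io::Result<()> {
    let socket = UdpSocket::bind(addr)?;
    let mut buf = [0u8; 65535]; // large enough for any UDP datagram
    loop {
        let (len, _sender) = socket.recv_from(&mut buf)?;
        let datagram = String::from_utf8_lossy(&buf[..len]);
        if let Some(payload) = syslog_payload(&datagram, "nginx") {
            println!("{}", payload);
        }
    }
}
```

Calling `serve("127.0.0.1:5140")` (the port is an arbitrary example) would pair with an nginx `access_log syslog:server=127.0.0.1:5140` directive pointing at the same address.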
Each request processed by nginx is sent to Grimhilde, where its details are stored in a PostgreSQL database. As the DotSrc mirror processes millions of requests each day, the amount of data we need to store would quickly become untenable if every request were stored in full. The proposed way around this is to assign each unique piece of data in every request (request path, referrer, hostname, etc.) an ID and store it only once, replacing any subsequent identical data with references to the initial copy.
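The deduplication idea can be sketched in memory as follows. This is only an illustration under assumptions: the `Interner` type is hypothetical and stands in for what would be a lookup table per field in the database, with request rows holding the returned IDs instead of the full strings.

```rust
use std::collections::HashMap;

/// Maps each distinct string to a small integer ID, storing the string once.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    values: Vec<String>,
}

impl Interner {
    /// Return the existing ID for `value`, or assign the next free one.
    fn intern(&mut self, value: &str) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.values.len() as u32;
        self.values.push(value.to_owned());
        self.ids.insert(value.to_owned(), id);
        id
    }

    /// Look up the original string for an ID, if it exists.
    fn resolve(&self, id: u32) -> Option<&str> {
        self.values.get(id as usize).map(String::as_str)
    }
}
```

With one such table per field, a request that repeats an already seen path, referrer, or hostname costs only a few integers of storage.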
While this will massively decrease the amount of data stored, the database will still grow large over time. The proposed solution is to limit how long the data is stored, and to use the API to generate static views of interest before purging data older than, for example, three months. This will allow staff members to generate advanced reports on real-time data to see and respond to problematic trends, while also retaining interesting historical usage data to be published publicly for years to come.
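A minimal in-memory sketch of the "summarise, then purge" step is shown below. The `Request` struct, the day-number representation of time, and the choice of per-day, per-path counts as the static view are all assumptions made for illustration; the real implementation would run against the database.

```rust
use std::collections::BTreeMap;

/// One stored request; `day` is a day number (e.g. days since some epoch).
struct Request {
    day: u32,
    path_id: u32,
}

/// Fold every request older than `cutoff_day` into per-(day, path) counts,
/// then remove those requests from the live data set.
fn summarise_and_purge(
    requests: &mut Vec<Request>,
    cutoff_day: u32,
    summary: &mut BTreeMap<(u32, u32), u64>,
) {
    for r in requests.iter().filter(|r| r.day < cutoff_day) {
        *summary.entry((r.day, r.path_id)).or_insert(0) += 1;
    }
    // Keep only the requests that are still inside the retention window.
    requests.retain(|r| r.day >= cutoff_day);
}
```

The summary stays small because it grows with the number of distinct days and paths rather than with the number of requests.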
Todo
- Write tests
- Implement the GraphQL API for fetching statistics
- Find an appropriate solution for generating reports based on data from the API