Grimhilde

Nginx mirror log parsing and statistics collation

Use case

At DotSrc, mirror statistics are currently generated by the mirror_stats Go program. While this approach works reasonably well, it offers limited granularity, and working with the very large JSON blobs it generates can be taxing.

Grimhilde is a proposed replacement, currently in development.

Planning & Structure

By using the log_format directive in our nginx configuration, the access log output can be customized to include only the data we need, in a machine-readable form rather than the default human-readable one. nginx can also log to different facilities, including over the syslog protocol. Grimhilde emulates a syslog server by listening on a local UDP socket, enabling fast, low-overhead communication between nginx and Grimhilde.
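As a rough sketch of the receiving side, Grimhilde only needs to bind a UDP socket and read the syslog lines nginx forwards to it; nginx would be pointed at that socket with something along the lines of `access_log syslog:server=127.0.0.1:8514 grimhilde;`, where `grimhilde` names a custom log_format (the port and format name here are illustrative, not the project's actual configuration):

```rust
use std::net::UdpSocket;
use std::str;

fn main() -> std::io::Result<()> {
    // Bind the local UDP socket that nginx's syslog output is pointed at.
    // The address and port are examples, not the project's actual defaults.
    let socket = UdpSocket::bind("127.0.0.1:8514")?;
    let mut buf = [0u8; 8192];

    loop {
        // Each datagram carries one syslog message, i.e. one access log entry.
        let (len, _src) = socket.recv_from(&mut buf)?;
        if let Ok(line) = str::from_utf8(&buf[..len]) {
            // In the real ingest pipeline this line would be parsed into its
            // fields and handed off to the database layer; here we just print it.
            println!("{line}");
        }
    }
}
```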

Each request processed by nginx is sent to Grimhilde, where its details are stored in a PostgreSQL database. As the DotSrc mirror serves millions of requests each day, the amount of data would quickly become untenable if every entry were stored verbatim. The proposed way around this is to assign each unique piece of data in a request (request path, referrer, hostname, etc.) an ID, store it only once, and replace any subsequent identical data with a reference to that initial copy.
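The core of that scheme is a lookup table per kind of value: the first time a path, referrer, or hostname is seen it is assigned a new ID, and every later occurrence is stored as that ID. The in-memory sketch below illustrates the idea; in the actual database this would more likely be an `INSERT ... ON CONFLICT ... RETURNING id` upsert against a dedicated table, and all names here are illustrative only:

```rust
use std::collections::HashMap;

/// Maps each distinct string to a stable numeric ID, handing out a new ID
/// the first time a value is seen. Illustrative only; the real project would
/// keep these mappings in PostgreSQL rather than in memory.
struct Interner {
    ids: HashMap<String, i64>,
    next_id: i64,
}

impl Interner {
    fn new() -> Self {
        Interner { ids: HashMap::new(), next_id: 1 }
    }

    fn intern(&mut self, value: &str) -> i64 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(value.to_owned(), id);
        id
    }
}

fn main() {
    let mut paths = Interner::new();
    // Two requests for the same file only cost one stored copy of the path.
    let a = paths.intern("/debian/dists/stable/Release");
    let b = paths.intern("/debian/dists/stable/Release");
    assert_eq!(a, b);
    // A request row then only needs to carry the small integer IDs.
    println!("path id: {a}");
}
```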

While this will massively decrease the amount of data stored, the database will still grow large over time. The proposed solution is to limit how long raw data is retained, and to use the API to generate static views of interest before purging data older than, e.g., three months. This will allow staff members to generate detailed reports on real-time data in order to spot and respond to problematic trends, while still retaining interesting historical usage data that can be published publicly for years to come.
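A minimal sketch of the purge step is shown below, assuming sqlx is used as the PostgreSQL client; the crate choice, table and column names, connection string, and retention interval are all assumptions for illustration, not the project's actual schema:

```rust
use sqlx::postgres::PgPoolOptions;

// Requires sqlx with the "runtime-tokio" and "postgres" features, plus tokio.
#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .max_connections(1)
        .connect("postgres://grimhilde@localhost/grimhilde")
        .await?;

    // Hypothetical schema: a `requests` table with a `received_at` timestamp.
    // Static views/reports would be generated from this data *before* the purge.
    let purged =
        sqlx::query("DELETE FROM requests WHERE received_at < now() - interval '3 months'")
            .execute(&pool)
            .await?;

    println!("purged {} old request rows", purged.rows_affected());
    Ok(())
}
```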

Todo

  • Write tests
  • Implement the GraphQL API for fetching statistics
  • Find an appropriate solution for generating reports based on data from the API