Grimhilde

Nginx mirror log parsing and statistics collation.

Use case

At DotSrc, mirror statistics are currently generated by the mirror_stats Go program. While this approach works decently, it offers limited granularity, and working with the very large JSON blobs it generates can be taxing.

Grimhilde is a proposed replacement, currently in development.

Planning & Structure

By using the log_format directive in our nginx configuration, the access log output can be customized to include only the data we need, in a machine-readable form rather than the default human-readable one. nginx can also log to different facilities, including over the syslog protocol. Grimhilde emulates a syslog server by opening a local UDP socket, enabling fast, low-overhead communication between nginx and Grimhilde.
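For illustration, the nginx side could look like this; the format string, field set, and port are assumptions for the sketch, not the project's actual configuration:

```nginx
# Machine-readable, pipe-separated access log (illustrative field set).
log_format grimhilde '$msec|$status|$body_bytes_sent|$request_method|$request_uri|$http_referer|$http_user_agent';

# Ship each entry as a syslog message to the local UDP socket Grimhilde opens.
access_log syslog:server=127.0.0.1:8514 grimhilde;
```

On the receiving side, a minimal sketch of the listener using only the standard library (a real syslog datagram also carries a priority/header prefix before the log line, which the parser would strip):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // The port is hypothetical; it just has to match nginx's syslog target.
    let socket = UdpSocket::bind("127.0.0.1:8514")?;
    let mut buf = [0u8; 4096];
    loop {
        let (len, _src) = socket.recv_from(&mut buf)?;
        // One datagram per request: a syslog message wrapping one log line.
        let entry = String::from_utf8_lossy(&buf[..len]);
        println!("{entry}");
    }
}
```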

Each request processed by nginx is sent to Grimhilde, where its details are stored in a PostgreSQL database. As the DotSrc mirror processes millions of requests each day, storing every request verbatim would quickly become untenable. The proposed way around this is to assign each unique piece of data in every request (request path, referrer, hostname, etc.) an ID and store it only once, replacing any subsequent identical data with a reference to the initial copy.
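A sketch of that lookup, using the synchronous postgres crate for brevity (the table and column names are hypothetical, not the project's actual schema):

```rust
use postgres::{Client, Error};

// Hypothetical dimension table:
//   CREATE TABLE paths (id serial PRIMARY KEY, path text UNIQUE NOT NULL);

/// Return the ID for `path`, inserting it the first time it is seen.
fn path_id(client: &mut Client, path: &str) -> Result<i32, Error> {
    // The no-op DO UPDATE makes RETURNING yield the existing row's id
    // when the path is already present.
    let row = client.query_one(
        "INSERT INTO paths (path) VALUES ($1)
         ON CONFLICT (path) DO UPDATE SET path = EXCLUDED.path
         RETURNING id",
        &[&path],
    )?;
    Ok(row.get(0))
}
```

The request row then stores the returned ID instead of the path string itself.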

While this massively reduces the amount of data stored, the database will still grow large over time. The proposed solution is to limit how long raw data is retained, and to use the API to generate static views of interest before purging data older than e.g. three months. This allows staff members to generate detailed reports on real-time data, so they can spot and respond to problematic trends, while still retaining interesting historical usage data to be published publicly for years to come.
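The purge itself could be a periodic job along these lines (a minimal sketch; the table name, timestamp column, and cutoff are illustrative):

```rust
use postgres::{Client, Error};

/// Hypothetical retention pass: drop request rows older than three months.
/// Dimension rows (paths, referrers, ...) that lose their last reference
/// could be garbage-collected in a follow-up query.
fn purge_old_requests(client: &mut Client) -> Result<u64, Error> {
    client.execute(
        "DELETE FROM requests WHERE received_at < now() - interval '3 months'",
        &[],
    )
}
```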

Todo

  • Write tests
  • Implement the GraphQL API for fetching statistics
  • Find an appropriate solution for generating reports based on data from the API