Formalize planning document, update DB logic
parent bb740067f1
commit 3ca1cec8bb
2 changed files with 91 additions and 18 deletions

README.md
@@ -0,0 +1,31 @@

# Grimhilde

## Usecase

At dotsrc, mirror statistics are currently generated by the [mirror_stats](https://gitlab.com/dotSRC/mirror_stats) Go program. While this approach works decently, it could do with improved granularity, and working with the very large JSON blobs it generates can be taxing.

Grimhilde is a proposed replacement, currently in development.

## Planning & Structure

By using the `log_format` directive in our nginx configuration, the access log output can be customized to include only the data we need, in machine-readable form rather than the default human-readable form. Different logging facilities can also be configured, including logging over the syslog protocol. Grimhilde emulates a syslog server by opening a local UDP socket, enabling fast and reliable communication between nginx and Grimhilde.
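
As a rough illustration of the idea (the field selection, separator, and syslog target below are placeholders, not dotsrc's actual configuration), the relevant nginx directives could look something like this:

```nginx
# Emit only the fields we care about, separated by a character that is easy
# to split on, and ship each access-log line to a local syslog listener over UDP.
log_format grimhilde '$time_iso8601|$request_method|$uri|$status|$body_bytes_sent|$http_referer|$host';

server {
    # ... existing mirror configuration ...
    access_log syslog:server=127.0.0.1:514 grimhilde;
}
```

The separator and field list here are arbitrary; the real format would be chosen to match whatever Grimhilde's parser expects on the other end of the socket.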

Each request processed by nginx is sent to Grimhilde, where its details are stored in a PostgreSQL database. As the DotSrc mirror processes millions of requests each day, the amount of data we need to store would quickly become untenable if stored traditionally. The proposed way around this is to assign each unique piece of data in every request (request path, referrer, hostname, etc.) an ID and only store it once, replacing any subsequent identical data with references to the initial copy.
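
One possible shape of that deduplication step, sketched in Go against a hypothetical lookup table per field (the table layout, column names, and driver are assumptions, not Grimhilde's actual schema):

```go
package store

import (
	"database/sql"

	_ "github.com/lib/pq" // assumed PostgreSQL driver
)

// idFor returns the ID of value in the given lookup table (e.g. "paths",
// "referrers", "hostnames"), inserting it the first time it is seen. The
// ON CONFLICT clause turns a repeated insert into a cheap update that still
// returns the existing row's ID, so every caller gets the same reference.
// The table name is interpolated directly and must come from a fixed,
// trusted list, never from request data.
func idFor(db *sql.DB, table, value string) (int64, error) {
	var id int64
	query := `INSERT INTO ` + table + ` (value) VALUES ($1)
	          ON CONFLICT (value) DO UPDATE SET value = EXCLUDED.value
	          RETURNING id`
	err := db.QueryRow(query, value).Scan(&id)
	return id, err
}
```

The per-request row would then carry only these integer IDs plus small scalar fields such as timestamp, status, and bytes sent, which is where the bulk of the space saving comes from.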

While this will massively decrease the amount of data stored, the database will still grow large with time. The proposed solution to this is to limit the length of time the data is stored, and to use the API to generate static views of interest prior to purging data older than e.g. three months. This will allow staff members to generate advanced reports on real-time data to see and respond to problematic trends, while also retaining interesting historical usage data to be published publicly for years to come.
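
Continuing the assumptions of the sketch above (a `requests` table whose rows carry a `received_at` timestamp), the periodic purge itself could be a single statement; the three-month cutoff is only the example figure used here:

```go
// purgeOldRequests deletes request rows older than the given retention
// window, e.g. "3 months". It assumes any static views of interest have
// already been generated through the API before it runs.
func purgeOldRequests(db *sql.DB, retention string) error {
	_, err := db.Exec(
		`DELETE FROM requests WHERE received_at < now() - $1::interval`,
		retention)
	return err
}
```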

## Todo

* [ ] Write tests
* [ ] Implement the GraphQL API for fetching statistics
* [ ] Find an appropriate solution for generating reports based on data from the API