New project: Platform Problem Monitoring
Introducing Platform Problem Monitoring
I’d like to share a new open-source project I’ve been working on: Platform Problem Monitoring. It helps DevOps and platform engineering teams see what’s going wrong in their systems without having to search through logs manually.
The Problem
If you’re running a complex platform with many components, you’re likely already collecting logs in an ELK (Elasticsearch, Logstash, Kibana) stack. But there’s a fundamental challenge with this approach: the sheer volume of data makes it difficult to separate signal from noise.
The typical monitoring workflow looks something like this:
- Something breaks or behaves oddly
- Someone notices (hopefully)
- Engineers actively search logs for clues
- Time passes while piecing together what happened
This reactive approach puts the burden on engineers to constantly check dashboards or wait for threshold-based alerts that often come too late.
A Different Approach
What if you could receive a concise, automatically generated email report that summarizes emerging problems, shows how existing issues are trending, and highlights which problems have disappeared since the last check?
That’s exactly what Platform Problem Monitoring does:
- It automatically analyzes your existing Elasticsearch logs at regular intervals
- It normalizes log messages by replacing dynamic parts (UUIDs, timestamps, etc.) with placeholders to reveal patterns
- It identifies trends by comparing current problems with previous runs
- It generates well-designed reports that deliver a clear picture of your platform’s health directly to your inbox
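To make the normalization idea concrete, here is a minimal sketch that uses plain regular expressions. It is only an illustration: the project itself relies on the drain3 library for this step, and the placeholder names (`<UUID>`, `<TIMESTAMP>`, `<NUM>`) and the `normalize` function are assumptions made up for this example.

```python
import re

# Hypothetical placeholder rules, for illustration only.
# Order matters: specific patterns run before the generic number rule.
PLACEHOLDER_PATTERNS = [
    (
        re.compile(
            r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
            r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
        ),
        "<UUID>",
    ),
    (
        re.compile(
            r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
            r"(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
        ),
        "<TIMESTAMP>",
    ),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def normalize(message: str) -> str:
    """Replace dynamic parts of a log message with fixed placeholders."""
    for pattern, placeholder in PLACEHOLDER_PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


first = normalize(
    "2024-05-01T12:34:56Z request 123e4567-e89b-12d3-a456-426614174000 "
    "failed after 3 retries"
)
second = normalize(
    "2024-05-02T08:00:01Z request 9b2c1d3e-0f4a-4b5c-8d6e-7f8091a2b3c4 "
    "failed after 12 retries"
)
print(first)
print(first == second)
```

Both messages collapse into the same pattern, so they can be counted as two occurrences of one problem rather than as two unrelated log lines.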
Here’s what the report looks like:
[Screenshot: example email report]
How It Works
Under the hood, Platform Problem Monitoring runs 12 distinct processing steps that handle everything from querying Elasticsearch to sending the final report. At a high level:
- It downloads problem-related log messages from your ELK stack
- Using the excellent drain3 library, it intelligently normalizes messages to identify patterns (e.g., “error for user j.doe@example.com: wrong password” becomes “error for user <*>: wrong password”)
- It compares these patterns with previous runs to track changes
- It generates a well-designed, easily scannable HTML email report with links to Kibana for deeper inspection
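The comparison with previous runs can be sketched roughly as follows. This is a simplified illustration and not the project’s actual code: it assumes each run can be reduced to a mapping from normalized pattern to occurrence count, and the names `Trend` and `compare_runs` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Trend:
    """Differences between two runs, keyed by normalized pattern."""

    new: Dict[str, int]                     # pattern -> current count
    disappeared: Dict[str, int]             # pattern -> previous count
    increased: Dict[str, Tuple[int, int]]   # pattern -> (previous, current)
    decreased: Dict[str, Tuple[int, int]]   # pattern -> (previous, current)


def compare_runs(previous: Dict[str, int], current: Dict[str, int]) -> Trend:
    """Classify each pattern as new, disappeared, increased, or decreased."""
    new = {p: c for p, c in current.items() if p not in previous}
    disappeared = {p: c for p, c in previous.items() if p not in current}
    increased = {}
    decreased = {}
    # Patterns present in both runs: record only those whose count changed.
    for pattern in previous.keys() & current.keys():
        before, after = previous[pattern], current[pattern]
        if after > before:
            increased[pattern] = (before, after)
        elif after < before:
            decreased[pattern] = (before, after)
    return Trend(new, disappeared, increased, decreased)


previous_run = {
    "error for user <*>: wrong password": 5,
    "timeout calling <*>": 2,
    "disk usage above <*> percent": 7,
}
current_run = {
    "error for user <*>: wrong password": 9,
    "disk usage above <*> percent": 7,
    "connection refused by <*>": 1,
}
trend = compare_runs(previous_run, current_run)
print(trend)
```

Patterns whose count did not change are left out of the result here; a real report might still list them, for example sorted by count.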
The application is designed as a command-line tool that can be triggered manually or scheduled with cron, making it very flexible. It’s self-contained and requires minimal setup.
Is This Tool Right For Me?
If your team:
- Already has an ELK stack collecting logs
- Wants to reduce the cognitive load of monitoring
- Prefers receiving regular summaries rather than constant alerts
- Needs to understand patterns and trends in platform issues
…then Platform Problem Monitoring might be exactly what you need.
Getting Started
The setup is straightforward:
git clone https://github.com/dx-tooling/platform-problem-monitoring-core.git
cd platform-problem-monitoring-core
python3 -m venv venv
source venv/bin/activate
pip3 install -e .
Then create a configuration file pointing to your Elasticsearch server, S3 bucket (for state storage), and SMTP server, and you’re ready to go.
See the project README for more details.
The Technical Side
For those interested in the implementation details: the project is built with Python 3 and has a modular design in which each step of the process can be executed independently. This keeps the code maintainable and testable.
The code adheres to strict quality standards enforced by tools like mypy, Black, and Ruff. The entire process is designed to be efficient even with large volumes of data, using streaming and pagination techniques when interacting with Elasticsearch.
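As a rough illustration of the pagination idea, the sketch below streams hits page by page instead of loading everything at once. It is modeled on Elasticsearch’s `search_after` mechanism, where the sort values of the last hit on one page are passed along to request the next page. Whether the project uses exactly this mechanism is an assumption, and `fetch_page`, `iter_hits`, and `fake_fetch_page` are hypothetical names; in real code, `fetch_page` would wrap a call to your Elasticsearch client.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Hit = Dict[str, Any]
# Hypothetical page fetcher: receives the page size and the sort values of
# the last hit seen so far (None for the first page), returns a list of hits.
FetchPage = Callable[[int, Optional[List[Any]]], List[Hit]]


def iter_hits(fetch_page: FetchPage, page_size: int = 1000) -> Iterator[Hit]:
    """Yield hits one at a time, holding only one page in memory."""
    search_after: Optional[List[Any]] = None
    while True:
        hits = fetch_page(page_size, search_after)
        if not hits:
            return
        yield from hits
        if len(hits) < page_size:
            return  # a short page means there is nothing left to fetch
        # Continue after the sort values of the last hit on this page.
        search_after = hits[-1]["sort"]


def fake_fetch_page(size: int, search_after: Optional[List[Any]]) -> List[Hit]:
    """In-memory stand-in for a real Elasticsearch query (5 documents)."""
    start = 0 if search_after is None else search_after[0] + 1
    return [
        {"_source": {"message": f"log {i}"}, "sort": [i]}
        for i in range(start, min(start + size, 5))
    ]


messages = [
    hit["_source"]["message"]
    for hit in iter_hits(fake_fetch_page, page_size=2)
]
print(messages)
```

Because `iter_hits` is a generator, downstream steps can process messages as they arrive, which keeps memory usage bounded by the page size rather than by the total number of log messages.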
Open Source
Platform Problem Monitoring is available under the MIT License, and I welcome contributions from the community. Whether you’re interested in enhancing the email template, improving the normalization logic, or adding support for additional data sources, there’s plenty of room for collaboration.
You can find the project on GitHub at dx-tooling/platform-problem-monitoring-core.
Conclusion
In my experience, the best tools are those that quietly do their job without requiring constant attention. Platform Problem Monitoring aims to be exactly that: a reliable companion that helps you understand your platform’s health without the need to actively hunt for problems.
By delivering concise reports directly to your inbox, it aims to shift platform monitoring from a reactive routine to a more proactive one.
Give it a try, and let me know what you think!