DevOps
Monitoring
13 September 2024
12 minutes

How to Systematically and Effectively Build Monitoring

Exploring how to systematically and effectively build monitoring: from goals and tasks to metric collection methodologies.

Evgeny Gurin

Full-stack developer and DevOps engineer with 6 years of experience

Monitoring is an integral part of the DevOps field that enables companies to track and evaluate the state of software products.

Building systematic monitoring is not just about implementing Prometheus or Grafana. It is a holistic process that covers defining goals, choosing metric collection methodologies, setting tasks, developing alerting processes, and integrating monitoring into team culture.

Here I’ll try to outline the main points and ideas for setting up monitoring. This article doesn’t cover specific tools—that will be a separate topic.

Every minute of application downtime, a network failure, or inefficient infrastructure operation leads to losses—financial, reputational, and organizational. If you want colleagues to respect you—set it up wisely :)

What is Monitoring

Monitoring is a continuous process of collecting and analyzing data about the performance of systems, applications, infrastructure, and business processes. Its goal is to detect failures and threats in time, understand change dynamics, and prevent critical problems.

It is a key part of DevOps and SRE, helping ensure availability, reliability, and predictability of services.

Monitoring Steps

Monitoring and processing its results usually include several stages:

  1. Data Collection — recording metrics, logs, traces, and business indicators.
  2. Analysis — aggregation, trend and anomaly detection, comparison with baselines.
  3. State Evaluation — determining whether the system meets objectives (normal, warning, incident).
  4. Reliability Assurance — automatic reactions, alerts, and actions to maintain availability.
  5. Validation — verifying recovery, analyzing incidents, and adjusting monitoring.

It is a closed cycle that not only detects problems but also increases overall system resilience.

Metric — a numerical indicator reflecting the state of a system (for example, CPU load, API response time, or number of successful transactions).
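
For illustration, here is a minimal sketch of how such a metric could be exposed for a Prometheus-style scrape, using the Python prometheus_client and psutil libraries; the metric name, port, and update interval are arbitrary choices for the example.

```python
import time

import psutil
from prometheus_client import Gauge, start_http_server

# Hypothetical metric name; in practice you would follow your own naming scheme.
cpu_load = Gauge("host_cpu_load_percent", "Current CPU utilization in percent")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the collector to scrape
    while True:
        cpu_load.set(psutil.cpu_percent(interval=None))
        time.sleep(15)       # update roughly in step with a typical scrape interval
```

Prometheus (or any compatible collector) would then scrape http://<host>:8000/metrics on its own schedule and store the resulting time series.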

Goals of Monitoring

Ensuring High Availability

High Availability (HA) is the property of a system or service to remain available and functional as much as possible, even when failures or component outages occur.

To achieve high availability, a number of system design principles and methods are applied, all aimed at keeping the system continuously operational and accessible.

From a business perspective, the main parameters are:

  • minimizing the risk of system failures
  • keeping the service available to users during unexpected outages or hardware failures

Monitoring is the first step toward high availability: it is what makes it possible to notice problems at all and take the necessary measures.

High Availability Components

Below is a breakdown of several areas that together ensure continuous system operation.

  1. System Design
  • Architectural decisions — choosing an architecture that supports scalability and fault tolerance. This is usually the responsibility of developers, not DevOps engineers.
  • Redundancy mechanisms — duplicating critical components: servers, databases, networking equipment. Most fault tolerance mechanisms boil down to ensuring scalability and eliminating single points of failure.
  • Rollback capability — if a problem occurs, there must be a way to roll back to the previous version. From experience, not every system even has this option.
  2. Risk Management
  • Failure analysis — identifying vulnerabilities and failure scenarios. This requires analytical work, identifying weak points and analyzing collected metrics. For example, if there are consistently hundreds of stuck DB transactions, something may be wrong.
  • Mitigation strategies — measures to reduce the impact of incidents (e.g., geographic distribution, data replication). If an incident occurs, you need a plan to restore service quickly.
  3. Reliability Assurance Methods
  • Monitoring strategies — continuous system observation for rapid problem detection.
  • Maintenance protocols — regulations for updates, backups, and testing to prevent downtime. At this stage, it’s worth writing detailed instructions and checklists.
  4. Failure Handling. Unfortunately, failures will occur eventually
  • Failover strategies — mechanisms like load balancers and automatic switchover. Or manual intervention, depending on setup.
  • Recovery protocols — instructions for restoring the system after a failure (Disaster Recovery Plan).
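
To make the failover idea above a bit more concrete, here is a rough sketch of a health-check watcher, assuming hypothetical primary and standby endpoints; in a real setup the "switch" would update a load balancer, DNS record, or service registry rather than a local variable.

```python
import time

import requests

PRIMARY = "https://primary.example.com/health"   # hypothetical endpoints
STANDBY = "https://standby.example.com/health"
FAIL_THRESHOLD = 3   # consecutive failed checks before we fail over

def healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def watch():
    active, failures = PRIMARY, 0
    while True:
        if healthy(active):
            failures = 0
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD and active == PRIMARY:
                active = STANDBY   # simplistic failover decision
                print("Primary unhealthy, switching traffic to standby")
        time.sleep(5)
```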

Key Metrics for Quality Evaluation

1. Time to Detect (TTD)

The time between when a failure occurs and when it is detected.

Ideally, the system reports the issue before users notice.

2. Time to Mitigate/Troubleshoot (TTM)

The time it takes for the team to identify the root cause and apply a temporary or permanent fix that reduces the impact.

How to minimize TTM:

  • clear instructions (playbooks, runbooks) for common incidents;
  • automated mitigation (self-healing, auto-scaling, automatic failover);
  • isolating faulty components (circuit breaker, feature flag, partial shutdowns; see the sketch below);
  • streamlined escalation procedures (who is responsible for what).

Ideally, mitigation takes seconds or minutes thanks to automation, not hours.
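
As a sketch of the "isolating faulty components" point from the list above, here is a minimal circuit breaker in Python. The thresholds are illustrative; production systems usually get this behavior from a library or a service mesh rather than hand-rolled code.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a while."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        half_open = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency temporarily disabled")
            half_open = True  # cooldown passed, allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a flaky downstream service in breaker.call(...) keeps one failing dependency from dragging the whole request path down while it recovers.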

3. Time to Recover/Repair (TTR)

The time required to return the system to normal working state after a failure.

This is a direct metric by which the business evaluates IT reliability—the longer recovery takes, the higher the downtime and losses.

How to minimize TTR:

  • backups and replication (databases, storage, configs);
  • Disaster Recovery Plans (DRP) and regular drills;
  • automatic rollback of failed releases;
  • Infrastructure as Code (Terraform, Ansible, Kubernetes) for quick redeployment;
  • use of hot/warm standbys (standby servers, clusters).

Ideally, users don’t even notice the switchover to backup systems (Recovery = seconds).

4. Relationship between TTD, TTM, and TTR

  • TTD measures how quickly we detect issues.
  • TTM measures how effectively we mitigate them.
  • TTR measures how quickly full recovery occurs.

Taken together, these intervals determine MTTR (Mean Time to Recovery), which reflects overall system resilience.
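
One way to make these definitions operational is to compute them straight from incident timestamps. The records below are purely hypothetical, and TTR is measured from the moment of failure to full recovery, which is one common convention.

```python
from datetime import datetime

# Hypothetical incident log: (failure occurred, detected, mitigated, recovered)
incidents = [
    ("2024-09-01 10:00", "2024-09-01 10:04", "2024-09-01 10:20", "2024-09-01 10:45"),
    ("2024-09-07 02:10", "2024-09-07 02:30", "2024-09-07 03:00", "2024-09-07 03:40"),
]

FMT = "%Y-%m-%d %H:%M"

def minutes(start: str, end: str) -> float:
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

ttd = [minutes(fail, det) for fail, det, mit, rec in incidents]  # failure -> detected
ttm = [minutes(det, mit) for fail, det, mit, rec in incidents]   # detected -> mitigated
ttr = [minutes(fail, rec) for fail, det, mit, rec in incidents]  # failure -> recovered

print(f"Mean TTD: {sum(ttd) / len(ttd):.0f} min")
print(f"Mean TTM: {sum(ttm) / len(ttm):.0f} min")
print(f"MTTR:     {sum(ttr) / len(ttr):.0f} min")
```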

Tracking Failures and Incidents

Monitoring allows you to see incidents in real time rather than after the fact. This reduces MTTR and increases system resilience.

The most important part of this task is configuring alerting properly, to avoid overwhelming teams with noise—more on that later.

Evaluating Experimental Results

When introducing new features or architectural changes, monitoring helps evaluate whether system behavior has improved. For example, whether API latency decreased after DB optimization or cache introduction.

The goal is to test hypotheses during product development. For example:

  • will a new DB configuration improve performance,
  • will a new API version increase stability,
  • will redesigning the “Buy” button improve conversion.

Monitoring here helps to:

  1. Measure impact of changes on system or business metrics;
  2. Detect negative effects (e.g., one module’s speedup caused load on another);
  3. Compare results with baselines (before vs after);
  4. Decide — keep changes, refine, or roll back.

Example Metrics for Experiment Evaluation

  1. Technical
    • service response times (latency p95/p99);
    • error frequency (HTTP 5xx, SQL errors);
    • resource usage (CPU, RAM, disk IO).
  2. Business
    • conversions (signup, purchase, button click);
    • user retention;
    • average order value or revenue per user.
  3. Operational
    • MTTR (recovery time after failures);
    • number of incidents during experiment;
    • team workload (manual interventions needed).

A/B Testing as Part of Monitoring

One key use of monitoring for experiments is A/B testing — a hypothesis testing method where users are randomly split:

  • Group A — control (old version of system/feature);
  • Group B — experimental (new version).

Examples:

  • new caching → test if API latency decreases;
  • new cart UI → measure purchase completion rate;
  • alternative load balancing algorithm → evaluate service stability.
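
When conversion is the metric being compared, the raw counts collected by monitoring for groups A and B can be checked with a standard two-proportion z-test. A minimal sketch, with purely illustrative numbers:

```python
from math import erf, sqrt

def conversion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test for an A/B conversion experiment."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Hypothetical experiment: 10,000 users per group
p_a, p_b, z, p = conversion_z_test(conv_a=210, n_a=10_000, conv_b=260, n_b=10_000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p:.3f}")
```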

How Monitoring Integrates with Experiments

  1. Dashboards for comparison: real-time visualization of key metrics for groups A and B (e.g., Grafana, DataDog).
  2. Alerts during experiments: if errors spike or conversion drops in group B, the experiment is stopped automatically.
  3. Post-experiment analysis: results are preserved, showing which version performed better and what side effects occurred.

Benefits of Monitoring in Experiments

  • helps safely introduce new features;
  • reduces risk of harmful changes;
  • provides transparency: proves changes are beneficial;
  • accelerates fact-based decision making.

Monitoring Tasks: Process Improvement

Reducing Downtime

With automated alerts and accurate metrics, teams can respond quickly to incidents, reducing both MTTR and overall downtime.

Early Problem Detection in Testing

Monitoring should be implemented not just in production but also in test environments. This helps detect release issues (e.g., memory leaks, misconfigurations, scaling problems) before production deployment.

It is also useful during load testing. A standard load scenario with metrics collection lets you track historical changes and assess performance trends.
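
A minimal version of such a scenario might simply replay a fixed number of requests against a test endpoint and record latency percentiles; the URL and sample size below are placeholders, and error handling and warm-up are omitted for brevity.

```python
import statistics
import time

import requests

URL = "https://staging.example.com/api/health"  # hypothetical test endpoint
SAMPLES = 200

latencies_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    requests.get(URL, timeout=5)
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50 = statistics.median(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile
print(f"p50 = {p50:.0f} ms, p95 = {p95:.0f} ms, max = {max(latencies_ms):.0f} ms")
```

Storing these numbers for every release build is what lets you see performance trends instead of isolated measurements.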

Improving Security and Performance

Anomaly analysis helps detect not only bugs but also suspicious activity (e.g., sudden traffic spikes to a port).

Automation for Collaboration

Monitoring integrates with CI/CD and incident management systems. This allows tickets to be created automatically, the right experts to be pulled in, and data to be linked between tools.

Increasing Visibility and Transparency

Monitoring makes the system “transparent” for both engineers and managers. DevOps teams get detailed metrics, while business sees summary dashboards.

This is especially relevant for incident analysis. Checking monitoring during incidents is one of the main ways to find root causes.

What and How to Monitor?

In general, the choice of metrics is wide, and it makes sense to select them based on your needs.

Infrastructure Monitoring

Covers hardware, VMs, and other equipment. The most important basics include:

  • Device online status
  • CPU, RAM, network load
  • Disk space
  • VM start/stop
  • RAID status
  • Virtual resource usage
  • Disk IO

Application Monitoring

Here we monitor deployed applications to understand how they behave.

Attention should be given both to your own product and to infrastructure software such as databases.

Web Applications:

  • API response time (avg/max)
  • Request handling time (avg/max)
  • Errors (HTTP 4xx, 5xx)
  • Critical feature logic
  • User activity (signup, login, etc.)
  • Request count, RPS (Requests Per Second), aggregated per minute/hour

Databases:

  • Active connections and transactions
  • Row counts
  • Memory usage
  • Table locks
  • Read/write ratio

Caches:

  • Hit/miss ratio
  • Record TTL
  • Throughput (ops/sec)

Network Monitoring

  • Bandwidth
  • Latency and packet loss
  • Load balancer status
  • Regional availability: useful for distributed systems, which require availability checks from each region

Business Monitoring

These are the metrics closest to the business: they link user activity to company revenue and conversion.

  • Orders, signups
  • Transaction processing time
  • User abandonment rate
  • Funnel completion rates:
    • signup
    • form fill
    • payment
  • Service usage (frequency, types)

DevOps Teams

Metrics tracking team performance and process quality.

  • Environment deployment time
  • Lead time from task to release
  • Release frequency
  • Incident response time
  • Deployment error count
  • Rollback time
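
Lead time and release frequency, for example, can be derived directly from release records. A small sketch with hypothetical timestamps and a fixed observation window:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Hypothetical release records: (task created, released to production)
releases = [
    ("2024-09-02 09:00", "2024-09-04 16:00"),
    ("2024-09-05 11:00", "2024-09-06 18:30"),
    ("2024-09-09 10:00", "2024-09-13 12:00"),
]

lead_times = [
    datetime.strptime(done, FMT) - datetime.strptime(created, FMT)
    for created, done in releases
]
OBSERVATION_DAYS = 14  # assumed length of the reporting window

print("Mean lead time:", sum(lead_times, timedelta()) / len(lead_times))
print("Releases per day:", len(releases) / OBSERVATION_DAYS)
```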

Events

The most comprehensive metric type. Even if you’ve implemented everything else, you may still not know why things happen.

For example, conversion from signup to payment = 1%. So only 1 out of 100 users pays.

If this requires 30 screens and dozens of clicks, then the payment flow is broken and must be redesigned!

To understand causal links, you need event metrics, generated by users and system components.

Examples of such events:

  • Screen view
  • Click
  • Scroll to end
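
Given a raw event stream, funnel conversion can be computed step by step, which is exactly the causal picture plain system metrics do not give. The events below are made up for illustration.

```python
# Hypothetical raw event stream: (user_id, event_name)
events = [
    (1, "signup"), (1, "form_fill"), (1, "payment"),
    (2, "signup"), (2, "form_fill"),
    (3, "signup"),
    (4, "signup"), (4, "form_fill"), (4, "payment"),
]

funnel = ["signup", "form_fill", "payment"]
users_per_step = {step: {uid for uid, name in events if name == step} for step in funnel}

for prev, step in zip(funnel, funnel[1:]):
    # count only users who also completed the previous step
    converted = len(users_per_step[step] & users_per_step[prev])
    total = len(users_per_step[prev])
    print(f"{prev} -> {step}: {converted}/{total} ({converted / total:.0%})")
```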

Metric Collection Methodologies

RED (Rate, Errors, Duration)

The RED methodology is focused on monitoring user-facing services and applications. Especially useful for microservices and web apps.

Main metrics:

  1. Rate
    how many requests per second the system receives;
    Example: 500 RPS.
  2. Errors
    how many requests failed;
    Example: 1% of requests return 500 Internal Server Error.
  3. Duration
    latency: how fast the system responds;
    Example: p95 response time = 300 ms.

RED quickly answers three questions: is the service up, how often does it fail, and how fast does it respond.
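
A typical way to instrument a service for RED is one counter for requests, one for errors, and a histogram for duration. Here is a sketch using the Python prometheus_client; the metric names and the simulated handler are placeholders.

```python
import random
import time

from prometheus_client import Counter, Histogram

# Hypothetical metric names for a single endpoint of one service.
REQUESTS = Counter("api_requests_total", "Total requests received")        # Rate
ERRORS = Counter("api_request_errors_total", "Requests that failed")       # Errors
LATENCY = Histogram("api_request_duration_seconds", "Request duration")    # Duration

def handle_request():
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        # real handler logic would go here; we simulate an occasional failure
        if random.random() < 0.01:
            raise RuntimeError("simulated backend error")
        return "ok"
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)
```

On the Prometheus side, for example, rate would come from rate(api_requests_total[5m]), the error ratio from the two counters, and p95 duration from the histogram buckets.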

USE (Utilization, Saturation, Errors)

The USE methodology applies to low-level resource monitoring — servers, hardware, system components.

Main metrics:

  1. Utilization
    percentage of time the resource is busy.
    Example: CPU 85%, disk 70%.
  2. Saturation
    resource overload, task queues.
    Example: 200 pending disk writes.
  3. Errors
    hardware/system errors.
    Example: network failures, disk read errors, packet drops.

USE helps diagnose bottlenecks and optimize infrastructure: finding overload points and weak spots.
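
On a single host, a rough USE snapshot can be pulled with psutil: CPU busy time for utilization and load average per core as a saturation proxy. Errors are only partially available this way (NIC error counters); disk and hardware errors come from other sources. This sketch assumes a Unix-like system.

```python
import os

import psutil

def use_snapshot() -> dict:
    """Rough USE snapshot for one host (Unix-like system assumed)."""
    cpu_util = psutil.cpu_percent(interval=1)   # Utilization: % of time the CPU is busy
    load1, _, _ = os.getloadavg()               # Saturation proxy: runnable tasks
    saturation = load1 / psutil.cpu_count()     # per core; > 1.0 means queued work
    mem = psutil.virtual_memory()
    net = psutil.net_io_counters()
    return {
        "cpu_utilization_pct": cpu_util,
        "cpu_saturation": round(saturation, 2),
        "memory_utilization_pct": mem.percent,
        "nic_errors": net.errin + net.errout,   # Errors: NIC receive/send errors
        # Disk or hardware errors would come from SMART data or kernel logs,
        # which psutil does not cover.
    }

print(use_snapshot())
```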

LTES (Latency, Traffic, Errors, Saturation)

The LTES methodology combines RED and USE into a broader, more universal model; the same four signals are known in Google's SRE practice as the Four Golden Signals.

Main metrics:

  1. Latency
    system response times.
    Example: median = 150 ms, p99 = 1.2s.
  2. Traffic
    system load.
    Example: 10,000 HTTP req/min, 1 Gbps network traffic.
  3. Errors
    failed request count/ratio.
    Example: 0.5% return 502 Bad Gateway.
  4. Saturation
    system nearing limits.
    Example: 90% DB connection pool used.

LTES is a universal model, combining application and infrastructure monitoring.

Choosing a Methodology

In short:

  • For infrastructure → USE
  • For apps & APIs → RED
  • For mixed cases → LTES
  • In practice, methods are often combined and adapted.

Detailed Factors:

  1. System type
  • Infrastructure (servers, disks, network, CPU, memory) → USE.
  • Apps, APIs, microservices → RED.
  • Hybrid (infra + apps + business) → LTES.
  2. Monitoring goal
  • Find bottlenecks → USE.
  • Control service quality for users → RED.
  • Comprehensive DevOps/SRE monitoring → LTES.

Conclusion

Effective monitoring is the foundation of system visibility. It unites infrastructure, apps, and business metrics into one system that not only reacts to failures but predicts them.

A systematic approach to monitoring is built on three principles:

  1. Transparency — everyone sees what’s happening.
  2. Predictability — metrics reveal trends and prevent failures.
  3. Integration — monitoring is part of CI/CD and incident management.

Alerting and failure notification are no less important, but those will be covered in a separate article.
