Chapter 7: Metrics, Measurement, and Success Criteria
Synopsis
Metrics, measurement, and success criteria form the backbone of effective product management. In the absence of clearly defined metrics, product teams risk navigating without direction, relying on subjective judgments or intuition rather than evidence-based decisions.
For Technical Product Managers (TPMs), whose responsibilities bridge business strategy and technical execution, metrics are not only tools for tracking progress but also mechanisms for building trust with stakeholders, aligning cross-functional teams, and ensuring that products deliver value. This chapter explores why metrics matter, how they should be chosen, and what they reveal about the health of infrastructure and AI products.
Metrics are powerful because they transform abstract goals into tangible outcomes. An organizational vision such as “delivering the most reliable cloud platform” or “building responsible AI systems” remains aspirational until it is translated into measurable indicators like uptime percentages, latency improvements, fairness scores, or compliance adherence. These metrics create a shared language across stakeholders. Executives can see progress toward strategic goals, engineers can understand where their work creates impact, and customers can experience reliability and trust. Without metrics, these groups risk talking past one another, each interpreting success differently. Metrics provide the common ground on which alignment is built.
For TPMs working in infrastructure, measurement focuses heavily on reliability, scalability, and performance. Metrics such as mean time to recovery (MTTR), error rates, or throughput are essential to understanding whether systems meet the expectations of customers and businesses alike. Infrastructure work often goes unnoticed when it functions well, but failures can cause immediate and widespread disruption. Measurement ensures that invisible work receives recognition and prioritization by tying it to outcomes that stakeholders value, such as customer retention, reduced downtime costs, or compliance with service-level agreements. In this way, metrics elevate technical priorities into strategic conversations.
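MTTR, for instance, is simply the average time from detection to resolution across incidents. A minimal sketch, using a hypothetical incident log (the timestamps below are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 45)),   # 45 min
    (datetime(2024, 3, 8, 14, 10), datetime(2024, 3, 8, 14, 40)),  # 30 min
    (datetime(2024, 3, 20, 2, 5), datetime(2024, 3, 20, 3, 5)),   # 60 min
]

def mean_time_to_recovery(incidents):
    """MTTR = total recovery time / number of incidents."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

print(mean_time_to_recovery(incidents))  # prints 0:45:00
```

In practice the log would come from an incident-management tool rather than hard-coded values, and teams often track the distribution of recovery times, not just the mean, since a single long outage can dominate the average.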
In AI, measurement is even more complex. Traditional metrics like accuracy or precision are no longer sufficient to define success. AI systems must also be evaluated on fairness, interpretability, robustness, and ethical alignment. A model that achieves high accuracy but systematically disadvantages one demographic group cannot be considered successful. TPMs must therefore broaden the scope of success criteria, ensuring that AI products are judged not only on technical performance but also on social responsibility and long-term trustworthiness. Metrics such as equal opportunity rates, calibration scores, or user trust indicators become central to the measurement framework.
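To make a fairness criterion like equal opportunity concrete: it asks whether the true positive rate is comparable across demographic groups. A minimal sketch, with illustrative function names and toy data (not a standard library API):

```python
def true_positive_rate(y_true, y_pred):
    """TPR = correctly predicted positives / actual positives."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        return 0.0
    return sum(p for _, p in positives) / len(positives)

def equal_opportunity_gap(y_true, y_pred, groups):
    """Absolute TPR difference between two groups; smaller is fairer."""
    rates = []
    for g in sorted(set(groups)):
        yt = [t for t, grp in zip(y_true, groups) if grp == g]
        yp = [p for p, grp in zip(y_pred, groups) if grp == g]
        rates.append(true_positive_rate(yt, yp))
    return abs(rates[0] - rates[1])

# Toy example: the model finds 3 of 4 positives in group A
# but only 1 of 4 in group B, despite similar accuracy overall.
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(equal_opportunity_gap(y_true, y_pred, groups))  # prints 0.5
```

A gap near zero suggests the model offers similar opportunity to both groups; a large gap, as here, is a signal to investigate even when headline accuracy looks acceptable.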
Infrastructure KPIs: uptime, latency, throughput, error rates
1. Uptime and Latency as Indicators of Reliability
Uptime is the most recognized KPI in infrastructure because it directly reflects system availability. Expressed as a percentage, uptime communicates how consistently a service remains operational over a given period. Organizations often strive for “five nines” availability, meaning 99.999% uptime, which translates to just a few minutes of downtime per year. Latency complements uptime: it measures how long the system takes to respond to a request, and a service can be technically available yet frustratingly slow. Because averages hide the worst experiences, teams typically track percentile latencies such as p95 or p99 rather than the mean. For customers, uptime and latency together are the difference between trust and frustration; even small outages or slowdowns can result in financial loss, reputational damage, and decreased user loyalty. Technical Product Managers (TPMs) must treat uptime not as a vanity metric but as a critical measure of reliability that influences customer retention and business credibility.
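The link between an availability percentage and its downtime budget is simple arithmetic, and worth internalizing. A small sketch that converts an uptime target into allowed downtime per year:

```python
def annual_downtime_minutes(availability_pct):
    """Convert an availability target (%) into allowed downtime per year, in minutes."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {annual_downtime_minutes(target):.2f} min/year")
# 99.9%   -> ~525 minutes (~8.8 hours)
# 99.99%  -> ~52.6 minutes
# 99.999% -> ~5.3 minutes
```

Each additional “nine” cuts the downtime budget by a factor of ten, which is why moving from four to five nines typically requires qualitatively different engineering investments, not incremental ones.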
2. Throughput as a Measure of Scalability
Throughput measures how much work a system can manage within a given timeframe, often expressed in transactions per second, requests per minute, or data processed per hour. It is a vital KPI for understanding scalability, especially in infrastructure products supporting global user bases or high-volume AI workloads. High throughput indicates that systems can support growing demand, while low throughput signals bottlenecks that could constrain business growth.
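Throughput is typically estimated by counting completed operations over a trailing time window. A minimal sketch of such a rolling counter (the class name and window size are illustrative):

```python
import time
from collections import deque

class ThroughputMeter:
    """Estimates throughput as events completed per second over a rolling window."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # timestamps of recorded events

    def record(self, timestamp=None):
        """Record one completed event (e.g., a served request)."""
        self.events.append(timestamp if timestamp is not None else time.monotonic())

    def rate(self, now=None):
        """Events per second over the trailing window."""
        now = now if now is not None else time.monotonic()
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events) / self.window
```

Production systems usually export such counters to a metrics backend rather than computing them in-process, but the underlying idea, a count over a window, is the same.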
3. Error Rates as Indicators of Quality and Stability
Error rates measure the frequency of failed operations within a system, such as unsuccessful API calls, transaction failures, or incorrect outputs. While uptime, latency, and throughput capture availability and scalability, error rates provide insights into the quality and stability of infrastructure. High error rates erode customer trust, increase operational costs, and signal deeper issues in design or execution. For AI systems, error rates can manifest as misclassifications, failed predictions, or corrupted outputs, which can be particularly damaging in sensitive applications like healthcare or finance.
Table 7.1: Error Rates as Indicators of Quality and Stability

| Error Type             | Indicator of                                           | Organizational Impact                                          |
|------------------------|--------------------------------------------------------|----------------------------------------------------------------|
| System downtime errors | Stability of infrastructure and resilience of systems  | Loss of availability, SLA breaches, customer dissatisfaction   |
| Application bugs       | Quality of software development and testing practices  | Reduced productivity, increased rework, reputational harm      |
| User errors            | Effectiveness of training, UX design, and process clarity | Delays, inefficiencies, increased support costs             |
| Security breaches      | Strength of security measures and monitoring           | Regulatory penalties, loss of data, severe reputational damage |
Monitoring error rates allows TPMs and engineering teams to identify systemic issues early and prioritize fixes. Error rates can also reveal hidden inefficiencies: for example, a spike in failed API requests may indicate poor load handling or inadequate error-handling mechanisms. By segmenting error rates across user groups, geographies, or services, teams can pinpoint where failures are most harmful. Importantly, TPMs must frame error rates not only as technical indicators but also as customer experience metrics since errors often translate directly into frustration or lost revenue.
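Segmenting error rates is straightforward once requests are tagged with a dimension such as region or service. A minimal sketch, using a hypothetical request log (the segment names and data are illustrative):

```python
from collections import defaultdict

# Hypothetical request log: (segment, succeeded) pairs.
requests = [
    ("us-east", True), ("us-east", True), ("us-east", False),
    ("eu-west", True), ("eu-west", False), ("eu-west", False),
]

def error_rates_by_segment(log):
    """Failed requests / total requests, computed per segment."""
    totals, failures = defaultdict(int), defaultdict(int)
    for segment, ok in log:
        totals[segment] += 1
        if not ok:
            failures[segment] += 1
    return {s: failures[s] / totals[s] for s in totals}

print(error_rates_by_segment(requests))
# us-east fails 1 of 3 requests; eu-west fails 2 of 3,
# a disparity an aggregate error rate of 50% would hide.
```

The same grouping logic applies to any dimension: user tier, API endpoint, or client version. The point is that an acceptable global error rate can conceal a segment where failures are concentrated.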
Reducing error rates requires investment in automated testing, observability tools, and robust incident response processes. It also demands cultural practices that prioritize quality, such as blameless postmortems and continuous improvement. TPMs must advocate for these practices and ensure that error rate reduction is built into roadmaps alongside feature delivery. By tracking error rates consistently and transparently, organizations demonstrate a commitment to stability, reliability, and customer trust.
