How to Monitor Your Test Suite Health Over Time

test-monitoring · health-score · ci-cd · testing

Your CI pipeline runs tests on every commit. Each run produces a result: pass or fail. But if you only look at the result of the latest run, you're missing the bigger picture.

A single test run is a snapshot. To understand whether your test suite is getting better or worse, you need to look at trends over time. That's what test suite health monitoring gives you.

Why Per-Run Reporting Isn't Enough

Most teams have a CI pipeline that runs tests and reports a binary result: green or red. This tells you whether the current commit broke something, but it can't answer the questions that actually matter for long-term test reliability:

  • Is our flaky test rate increasing? A test that fails once a week might not trigger alarms, but ten tests each failing once a week means your suite is unreliable.
  • Are our tests getting slower? If test duration creeps up 5% per month, you won't notice day-to-day, but in a little over a year your pipeline takes twice as long.
  • Which tests fail most often? Without historical data, you can't distinguish between a test that fails due to a real bug and a flaky test that fails randomly.
  • Are we fixing problems or ignoring them? Trends reveal whether your team is improving test quality or letting it degrade.
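
Answering questions like these requires per-test history, not just per-run results. As a minimal sketch of the kind of data you'd keep, assuming a hypothetical RunRecord shape (a map of test name to status, which is not something your CI produces out of the box):

```typescript
// Sketch: count failures per test across historical runs.
// RunRecord is an assumed shape, not a real CI API.
type RunRecord = Record<string, "pass" | "fail">;

function failureCounts(runs: RunRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const run of runs) {
    for (const [test, status] of Object.entries(run)) {
      if (status === "fail") {
        counts.set(test, (counts.get(test) ?? 0) + 1);
      }
    }
  }
  return counts;
}
```

With this history in hand, "which tests fail most often" becomes a sort over the map rather than a guess.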

What "Test Health" Actually Means

Test suite health is a composite assessment of your test infrastructure's reliability. It typically covers four dimensions:

1. Pass Rate

The percentage of tests that pass across recent runs. A single failure isn't concerning, but a declining pass rate indicates systemic problems.

Warning signs:

  • Pass rate dropping below 95% on the main branch
  • Pass rate volatility (swinging between 85% and 100%)
  • Different pass rates on the same commit across re-runs

2. Test Duration

How long your tests take to execute, tracked over time. Slow tests directly impact developer productivity and CI costs.

Warning signs:

  • Total suite duration increasing month-over-month
  • Individual tests that have gotten 10x slower since they were written
  • High variance in execution time (suggests resource contention or flakiness)

3. Flaky Test Rate

The percentage of tests that produce inconsistent results — passing on one run and failing on the next for the same code. This is often the most important metric because flaky tests destroy trust in the entire suite.

Warning signs:

  • Flaky rate above 5% of total tests
  • Increasing number of tests marked as "known flaky" or quarantined
  • Developers routinely re-running CI to get a green build

4. Failure Clustering

Patterns in which tests fail together. If the same group of tests always fails on Monday mornings, you might have a test environment issue. If certain tests only fail in CI but never locally, you have an environment parity problem.
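
One simple way to surface these patterns is to score how often two tests fail in the same runs. A sketch using Jaccard similarity over each test's set of failing run IDs (the data shape is an assumption; a high score hints at a shared root cause):

```typescript
// Sketch: co-failure score for two tests, given the set of run IDs in
// which each one failed. Jaccard similarity: |intersection| / |union|.
function coFailureScore(failRunsA: Set<number>, failRunsB: Set<number>): number {
  const inter = [...failRunsA].filter((r) => failRunsB.has(r)).length;
  const union = new Set([...failRunsA, ...failRunsB]).size;
  return union === 0 ? 0 : inter / union;
}
```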

Manual Approaches to Test Health Monitoring

Before reaching for tools, understand what manual monitoring looks like — and why teams usually outgrow it:

Spreadsheet Tracking

Export test results from CI, paste into a spreadsheet, build charts. This works for very small projects but doesn't scale:

  • Pros: Free, no setup, full control over metrics
  • Cons: Manual effort every time, data goes stale, no alerting, no per-test granularity

CI Dashboard Review

Most CI platforms show a history of pipeline runs. You can scroll through recent builds and look for patterns:

  • Pros: Already available, no extra tools
  • Cons: No aggregate metrics, can't see per-test trends, limited to what the CI shows, hard to compare across branches

Custom Scripts

Write scripts that parse JUnit XML or JSON test reports and compute metrics:

interface TestRun {
  tests: number;    // total tests in the run
  passed: number;   // tests that passed
  duration: number; // suite duration in seconds
}

interface HealthMetrics {
  passRate: number;
  flakyRate: number;
  avgDuration: number;
  totalTests: number;
}

function computeHealth(runs: TestRun[]): HealthMetrics {
  const totalTests = runs.reduce((sum, r) => sum + r.tests, 0);
  const totalPassed = runs.reduce((sum, r) => sum + r.passed, 0);
  const passRate = totalTests > 0 ? totalPassed / totalTests : 0;

  // Flaky: tests that changed status between consecutive runs on the
  // same code (detectStatusFlips needs per-test results to do this).
  // Divide by the distinct test count, not the sum across runs.
  const flakyTests = detectStatusFlips(runs);
  const distinctTests = runs.length > 0 ? runs[runs.length - 1].tests : 0;
  const flakyRate = distinctTests > 0 ? flakyTests.length / distinctTests : 0;

  const avgDuration =
    runs.length > 0
      ? runs.reduce((sum, r) => sum + r.duration, 0) / runs.length
      : 0;

  return { passRate, flakyRate, avgDuration, totalTests };
}

  • Pros: Customizable, exact metrics you want
  • Cons: Maintenance burden, storage infrastructure, you build everything from scratch
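
The detectStatusFlips helper referenced in the snippet above is where most of the real work lives. It needs per-test statuses, which the aggregate TestRun shape doesn't carry; one possible sketch, with an assumed per-run statuses record:

```typescript
// Hypothetical sketch of a status-flip detector. A test is flagged as
// flaky if its status differs between two consecutive runs of the same
// commit; flips across different commits may be legitimate code changes.
interface RunStatuses {
  commit: string;
  statuses: Record<string, "pass" | "fail">;
}

function detectStatusFlips(runs: RunStatuses[]): string[] {
  const flaky = new Set<string>();
  for (let i = 1; i < runs.length; i++) {
    if (runs[i].commit !== runs[i - 1].commit) continue; // only re-runs
    for (const [test, status] of Object.entries(runs[i].statuses)) {
      const prev = runs[i - 1].statuses[test];
      if (prev !== undefined && prev !== status) flaky.add(test);
    }
  }
  return [...flaky];
}
```

Restricting the comparison to re-runs of the same commit is the design choice that separates flakiness from honest regressions.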

Key Metrics to Track

If you're setting up test health monitoring, start with these five metrics:

1. Rolling Pass Rate (7-day window)

Track the pass rate across the last 7 days of main-branch runs. This smooths out individual failures and shows the true reliability trend.

  • Healthy: > 95%
  • Concerning: 90-95%
  • Action needed: < 90%
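
A rolling window is just a time filter plus an aggregate. A sketch, assuming a RunSummary shape with a millisecond timestamp (an illustrative shape, not a real API):

```typescript
// Sketch: 7-day rolling pass rate over main-branch run summaries.
interface RunSummary {
  timestamp: number; // ms since epoch
  tests: number;
  passed: number;
}

function rollingPassRate(runs: RunSummary[], now: number, windowDays = 7): number {
  const cutoff = now - windowDays * 24 * 60 * 60 * 1000;
  const recent = runs.filter((r) => r.timestamp >= cutoff);
  const tests = recent.reduce((s, r) => s + r.tests, 0);
  const passed = recent.reduce((s, r) => s + r.passed, 0);
  return tests > 0 ? passed / tests : 1; // no recent runs: nothing to flag
}
```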

2. Flaky Test Count

The number of distinct tests that have flipped between pass and fail on the same code in the last 30 days. One or two flaky tests are normal. Ten or more means you have a systemic problem.

3. P95 Suite Duration

The 95th percentile execution time for your full test suite. Using P95 instead of average avoids being skewed by outlier runs while still capturing real slowness.
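
P95 has several competing definitions; a common one is the nearest-rank method, sketched here:

```typescript
// Sketch: P95 via the nearest-rank method (one convention among several
// percentile definitions). Input: per-run suite durations.
function p95(durations: number[]): number {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return sorted[rank - 1];
}
```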

4. New Failures per Week

How many previously-passing tests started failing this week. A sudden spike means a breaking change slipped through. A steady trickle means tests are gradually decaying.
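
Computing this is a set difference between two weekly snapshots of failing test names (the snapshot inputs are assumed, not something a CI platform hands you directly):

```typescript
// Sketch: tests failing this week that were not failing last week.
function newFailures(
  lastWeekFailing: Set<string>,
  thisWeekFailing: Set<string>
): string[] {
  return [...thisWeekFailing].filter((t) => !lastWeekFailing.has(t));
}
```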

5. Mean Time to Fix

How long a failing test stays broken before someone fixes it. Tests that stay red for days or weeks are effectively dead — they provide no value and should be fixed or removed.
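
Given timestamps for when each test first broke and when it next passed again (an assumed data shape), the metric is a straightforward average:

```typescript
// Sketch: mean time to fix in hours, from per-test break/fix timestamps.
interface FixRecord {
  failedAt: number; // ms since epoch when the test first failed
  fixedAt: number;  // ms since epoch when it next passed
}

function meanTimeToFixHours(records: FixRecord[]): number {
  if (records.length === 0) return 0;
  const totalMs = records.reduce((s, r) => s + (r.fixedAt - r.failedAt), 0);
  return totalMs / records.length / (60 * 60 * 1000);
}
```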

Automated Monitoring vs. Manual Checks

| Aspect | Manual | Automated |
| --- | --- | --- |
| Frequency | Weekly at best | Every CI run |
| Coverage | Spot checks | Every test, every run |
| Alerting | None | Threshold-based |
| Historical data | Limited | Complete |
| Per-test trends | Impractical | Built-in |
| Team overhead | High | Near-zero after setup |

The pattern is clear: manual monitoring works for small projects with few tests. Once you have more than a few hundred tests running multiple times per day, automated monitoring becomes essential.

Setting Up Automated Test Health Monitoring

Automated monitoring collects test results from every CI run, stores them, and computes health metrics over time. Here's what a typical setup looks like:

Step 1: Produce Machine-Readable Test Reports

Configure your test framework to output structured results. JUnit XML is the most universal format:

For Jest:

npx jest --reporters=default --reporters=jest-junit

For pytest:

pytest --junitxml=test-results/results.xml

For Vitest:

// vitest.config.ts
export default defineConfig({
  test: {
    reporters: ["default", "junit"],
    outputFile: { junit: "test-results/results.xml" },
  },
});

Step 2: Send Results to a Monitoring Service

After tests run, send the report to a service that stores and analyzes it. With TestGlance, this is a single step in your CI workflow:

# .github/workflows/ci.yml
- name: Run tests
  run: npm test
 
- name: Report test results
  uses: testglance/action@v1
  if: always()
  with:
    api-key: ${{ secrets.TESTGLANCE_API_KEY }}

The if: always() is important — you want to capture results from failing runs too, not just passing ones.

Step 3: Review Health Dashboards

Once results are flowing in, review the health dashboard regularly:

  • Daily: Glance at the health score to catch sudden drops
  • Weekly: Review flaky test trends and duration changes
  • Monthly: Assess overall test quality trajectory and plan improvements

Step 4: Set Up Alerts

Configure alerts for critical thresholds:

  • Health score drops below 80
  • A new flaky test is detected
  • Suite duration exceeds a target (e.g., 10 minutes)
  • Pass rate drops below 90% on main branch
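
Evaluating thresholds like these is simple to sketch. The HealthSnapshot shape and the threshold values below mirror the list above and are illustrative, not a real TestGlance API:

```typescript
// Sketch: threshold-based alert evaluation over a health snapshot.
interface HealthSnapshot {
  healthScore: number;      // 0-100 composite score
  newFlakyTests: string[];  // flaky tests first seen in this window
  suiteDurationMin: number; // latest suite duration in minutes
  mainPassRate: number;     // 0-1, main branch
}

function alertsFor(s: HealthSnapshot): string[] {
  const alerts: string[] = [];
  if (s.healthScore < 80) alerts.push("health score below 80");
  if (s.newFlakyTests.length > 0)
    alerts.push(`new flaky tests: ${s.newFlakyTests.join(", ")}`);
  if (s.suiteDurationMin > 10) alerts.push("suite duration over 10 minutes");
  if (s.mainPassRate < 0.9) alerts.push("main pass rate below 90%");
  return alerts;
}
```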

Common Anti-Patterns

Ignoring Flaky Tests

The worst thing you can do with flaky tests is nothing. Teams often accept flakiness as "just how CI works," but every flaky test makes the entire suite less trustworthy. Address them systematically: quarantine, investigate, fix, or remove.

Optimizing for Speed Without Measuring

"Our tests are too slow" is a common complaint, but without duration tracking you don't know which tests are slow, whether they're getting slower, or whether your optimizations actually helped. Measure before and after.

Testing Only on the Main Branch

If you only monitor health on the main branch, you catch problems after they're merged. Running health checks on PR branches too gives you earlier signal — you can see if a PR is adding flaky tests before it lands.

Vanity Pass Rates

A 100% pass rate doesn't mean your tests are good. It might mean you've deleted or skipped all the hard tests. Track the total test count alongside pass rate to catch this.
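
One cheap guard is to compare pass rate and test count together between two points in time; an improving rate paired with a shrinking suite deserves scrutiny. A minimal sketch (the Snapshot shape is an assumption):

```typescript
// Sketch: flag a suspicious "improvement" where pass rate rose while the
// total test count fell.
interface Snapshot {
  passRate: number;   // 0-1
  totalTests: number;
}

function looksLikeVanityImprovement(before: Snapshot, after: Snapshot): boolean {
  return after.passRate > before.passRate && after.totalTests < before.totalTests;
}
```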

FAQ

What's the difference between a test health score and a pass rate?

A pass rate is a single metric: the percentage of tests that passed in a given run. A health score is a composite metric that factors in pass rate trends over time, flaky test rate, duration trends, and other signals. A suite could have a 98% pass rate on the latest run but a declining health score because that rate has been dropping from 100% over the past month.

Should I quarantine flaky tests or fix them immediately?

Both, ideally. Quarantine provides immediate relief by stopping flaky tests from blocking deployments, but it's not a long-term solution. Treat quarantined tests as technical debt with a deadline — investigate the root cause and either fix the test or delete it if it's not providing value. A quarantine without a fix deadline is just a graveyard.

How do I convince my team to invest in test health monitoring?

Start by quantifying the cost of the current state. Track how many CI re-runs happen per week due to flaky tests, how long developers wait for slow test suites, and how many real bugs slip through because people learned to ignore test failures. These concrete numbers make the case far better than abstract quality arguments.

Next Steps

Test suite health monitoring is a practice, not a one-time setup. Start with the basics — pass rate and flaky detection — and expand as your suite grows.

Set up automated monitoring with TestGlance to track your test suite health score across every CI run. Catch declining reliability before it impacts your team.