What Are Flaky Tests? A Practical Guide to Detection and Fixes
A flaky test is a test that sometimes passes and sometimes fails without any changes to the code being tested. Run the same test ten times on the same commit, and it might pass nine times and fail once — or fail three times and pass seven.
Flaky tests are one of the most frustrating and costly problems in software testing. They erode trust in your test suite, waste CI resources, and slow down deployments. This guide covers why they happen, how to detect them, and what to do about them.
Why Flaky Tests Matter
The cost of flaky tests compounds over time. What starts as an occasional annoyance becomes a serious drag on team productivity.
Eroded Trust
When developers see tests fail randomly, they start ignoring failures — including real ones. "Oh, that test is just flaky" becomes the default response to any red build. This is the most dangerous consequence: your test suite exists to catch bugs, and flaky tests train people to stop paying attention.
Wasted CI Resources
Every flaky failure triggers a re-run. If your pipeline takes 10 minutes and you re-run it 5 times per day due to flaky tests, that's nearly an hour of wasted compute daily. Multiply across a team and it adds up fast.
Slower Deployments
Teams with flaky suites develop a pattern: merge, wait for CI, see a failure, re-run, wait again. Deployments that should take minutes stretch into hours. Some teams stop deploying on Fridays entirely because they don't want to deal with flaky failures before the weekend.
Hidden Bugs
Perhaps worst of all, flaky tests can mask real bugs. If a test fails due to an actual regression but gets dismissed as "probably flaky," the bug ships to production. The test did its job — the team just couldn't trust it.
Common Causes of Flaky Tests
Understanding why tests flake helps you fix and prevent them. Here are the most common causes, with concrete examples.
1. Timing and Race Conditions
Tests that depend on precise timing are inherently fragile. Anything that sleeps with setTimeout or expects an operation to complete within a fixed window can flake when the CI runner is under load:
// Flaky: assumes the animation completes in exactly 100ms
test("button shows loading state", async () => {
  fireEvent.click(submitButton);
  await new Promise((resolve) => setTimeout(resolve, 100));
  expect(screen.getByRole("progressbar")).toBeInTheDocument();
});

// Fixed: wait for the element to appear, with a reasonable timeout
test("button shows loading state", async () => {
  fireEvent.click(submitButton);
  await waitFor(() => {
    expect(screen.getByRole("progressbar")).toBeInTheDocument();
  });
});

The fixed version doesn't assume how long the operation takes. It polls until the condition is true or the timeout expires.
2. Shared Mutable State
Tests that read from or write to shared state can interfere with each other. This is especially common with database tests, global variables, and singletons:
// Flaky: tests share the same database table
let userId: string;

beforeAll(async () => {
  const user = await db.users.create({ name: "Test User" });
  userId = user.id;
});

test("updates user name", async () => {
  await db.users.update(userId, { name: "Updated" });
  const user = await db.users.findById(userId);
  expect(user.name).toBe("Updated");
});

test("reads original user name", async () => {
  const user = await db.users.findById(userId);
  // Flaky: fails if "updates user name" runs first
  expect(user.name).toBe("Test User");
});

// Fixed: each test creates its own data
test("updates user name", async () => {
  const user = await db.users.create({ name: "Test User" });
  await db.users.update(user.id, { name: "Updated" });
  const updated = await db.users.findById(user.id);
  expect(updated.name).toBe("Updated");
});

test("reads user name", async () => {
  const user = await db.users.create({ name: "Original" });
  const found = await db.users.findById(user.id);
  expect(found.name).toBe("Original");
});

3. External Dependencies
Tests that make real network calls, read from the file system, or depend on third-party APIs introduce non-determinism. The external service might be slow, down, or return different data:
// Flaky: depends on a real API being available and fast
test("fetches user profile from GitHub", async () => {
  const profile = await fetch("https://api.github.com/users/octocat");
  const data = await profile.json();
  expect(data.login).toBe("octocat");
});

External dependencies should be mocked in unit tests. Integration tests that intentionally hit real services need retry logic and should be tagged separately so they don't block the fast feedback loop.
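A deterministic version, sketched with Vitest: vi.stubGlobal replaces the global fetch for the test, and the canned payload is invented for illustration.

// Fixed: the test never touches the network, so GitHub's availability can't affect it
import { afterEach, expect, test, vi } from "vitest";

afterEach(() => {
  vi.unstubAllGlobals();
});

test("fetches user profile from GitHub", async () => {
  // Stub fetch with a canned response shaped like the real API's payload
  vi.stubGlobal(
    "fetch",
    vi.fn().mockResolvedValue({ json: async () => ({ login: "octocat" }) })
  );

  const profile = await fetch("https://api.github.com/users/octocat");
  const data = await profile.json();
  expect(data.login).toBe("octocat");
});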
4. Environment Differences
Tests pass locally but fail in CI, or vice versa. Common culprits:
- Timezone differences: new Date().toLocaleDateString() produces different output on different machines
- Locale settings: Number formatting, string sorting, and collation depend on locale
- File system ordering: fs.readdirSync() doesn't guarantee alphabetical order on all operating systems
- Available resources: CI runners may have less memory or CPU than development machines
// Flaky: depends on timezone
test("formats today's date", () => {
  const formatted = formatDate(new Date());
  expect(formatted).toBe("March 20, 2026");
});

// Fixed: use a fixed date
test("formats a specific date", () => {
  const date = new Date("2026-03-20T00:00:00Z");
  const formatted = formatDate(date);
  expect(formatted).toBe("March 20, 2026");
});

5. Order-Dependent Tests
Tests that only pass when run in a specific order have implicit dependencies. This often surfaces when running tests in parallel or when a test runner randomizes execution order:
// These tests have a hidden dependency
test("creates a project", async () => {
  await createProject({ name: "My Project" });
  // Side effect: sets global currentProject
});

test("adds a member to the project", async () => {
  // Flaky: depends on "creates a project" running first
  await addMember(currentProject.id, { email: "[email protected]" });
  expect(currentProject.members).toHaveLength(1);
});

Each test should set up its own preconditions. If test B depends on state created by test A, test B will break whenever test A is skipped, filtered out, or runs in a different process.
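A sketch of an order-independent version, reusing the hypothetical createProject and addMember helpers; getMembers is an assumed read helper added for illustration.

// Fixed: the test creates its own project instead of relying on a global
// set as a side effect of another test
test("adds a member to the project", async () => {
  const project = await createProject({ name: "My Project" });
  await addMember(project.id, { email: "[email protected]" });

  // getMembers is a hypothetical helper that re-reads the project's member list
  const members = await getMembers(project.id);
  expect(members).toHaveLength(1);
});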
Detection Strategies
Manual Detection
The simplest approach: re-run your CI pipeline multiple times on the same commit. If any tests change status between runs, they're flaky.
Pros: No tooling needed, and a test that flips status between runs is definitively flaky. Cons: Time-consuming, doesn't scale, and it only catches flakiness that happens to manifest during your re-runs.
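If you want to script the re-runs, here is a rough Node sketch; it assumes your suite runs with npm test.

// Run the suite several times on the same commit and report which runs failed
import { spawnSync } from "node:child_process";

const totalRuns = 10;
const failedRuns: number[] = [];

for (let run = 1; run <= totalRuns; run++) {
  const result = spawnSync("npm", ["test"], { stdio: "inherit", shell: true });
  if (result.status !== 0) failedRuns.push(run);
}

console.log(
  failedRuns.length === 0
    ? `All ${totalRuns} runs passed`
    : `Runs ${failedRuns.join(", ")} failed out of ${totalRuns}; likely flaky tests`
);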
Statistical Detection
Track every test's pass/fail status across CI runs and look for tests that flip:
def detect_flaky_tests(runs: list[TestRun]) -> list[str]:
    """Find tests that changed status between runs on the same code."""
    test_history: dict[str, list[str]] = {}
    for run in runs:
        for test in run.tests:
            key = f"{test.suite}::{test.name}"
            test_history.setdefault(key, []).append(test.status)

    flaky = []
    for test_name, statuses in test_history.items():
        unique_statuses = set(statuses)
        if "passed" in unique_statuses and "failed" in unique_statuses:
            flaky.append(test_name)

    return flaky

This catches tests that aren't consistently broken (those would be caught by normal CI) but aren't consistently passing either.
Automated Detection with Monitoring
The most effective approach combines statistical detection with continuous monitoring. Services like TestGlance analyze your test reports across every CI run and automatically flag tests with inconsistent results.
The key advantage of automated detection: it catches flaky tests as soon as they appear, before they've had time to erode team trust. A test that started flaking yesterday is much easier to fix than one that's been flaking for six months.
Fixing Flaky Tests
1. Isolate Test State
The single most effective fix for flaky tests: make each test independent.
- Use beforeEach instead of beforeAll for test data setup, as sketched after this list
- Create fresh database records per test instead of sharing fixtures
- Reset global state and singletons between tests
- Use unique identifiers (UUIDs, timestamps) to avoid collisions
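A minimal sketch of the first two points, assuming Vitest and the same hypothetical db helper used earlier:

// Each test gets its own freshly created user; nothing is shared across tests
import { randomUUID } from "node:crypto";
import { afterEach, beforeEach } from "vitest";

let user: { id: string; name: string };

beforeEach(async () => {
  // A unique name avoids collisions when tests run in parallel against one database
  user = await db.users.create({ name: `Test User ${randomUUID()}` });
});

afterEach(async () => {
  // db.users.delete is assumed to exist alongside the create/update helpers above
  await db.users.delete(user.id);
});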
2. Replace Timing with Events
Never use setTimeout or sleep to wait for async operations. Use event-driven waiting instead:
- DOM testing: Use waitFor, findBy* queries, or waitForElementToBeRemoved
- API testing: Use await on the actual operation, not on a timer
- Playwright / E2E: Use page.waitForSelector or expect(locator).toBeVisible() instead of page.waitForTimeout (see the sketch after this list)
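For the Playwright case, a rough sketch; the route and button label are invented.

// Wait for an observable UI state instead of a fixed delay
import { expect, test } from "@playwright/test";

test("shows a confirmation after saving", async ({ page }) => {
  await page.goto("/settings"); // hypothetical route
  await page.getByRole("button", { name: "Save" }).click();

  // Flaky alternative: await page.waitForTimeout(500);
  await expect(page.getByText("Saved")).toBeVisible();
});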
3. Mock External Dependencies
In unit tests, mock all external calls. In integration tests, use containers or test doubles:
// Mock external API in tests
vi.mock("@/lib/github", () => ({
  fetchRepo: vi.fn().mockResolvedValue({
    name: "test-repo",
    stars: 42,
  }),
}));

4. Quarantine, Don't Ignore
When you find a flaky test that can't be fixed immediately:
- Quarantine it: Move it to a separate test suite that doesn't block CI
- Track it: Create a ticket with a deadline for investigation
- Investigate: Look at the failure pattern — does it only fail in CI? Only on certain days? Only when tests run in parallel?
- Fix or delete: Either fix the root cause or delete the test and write a deterministic replacement
The worst option is @skip or xit without a plan. Skipped tests with no deadline become permanent dead code.
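One way to implement the quarantine step, sketched for Jest: give flaky specs a filename suffix and exclude that suffix from the blocking CI job. The naming convention is just an example.

// jest.config.ts — the blocking CI job ignores quarantined specs
import type { Config } from "jest";

const config: Config = {
  // Anything named *.quarantine.test.ts runs in a separate, non-blocking job
  testPathIgnorePatterns: ["/node_modules/", "\\.quarantine\\.test\\.ts$"],
};

export default config;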
5. Retry as a Last Resort
Automatic retries can mask flakiness and make it harder to track. If you do use retries:
- Limit to 1-2 retries, not unlimited
- Log every retry so you can track which tests need them
- Treat retried tests as flaky in your metrics, not as passing
- Set a deadline to fix any test that needs retries
// If you must retry, track it
test("unreliable integration test", async () => {
  // retry: 2 — tracked as flaky, ticket TEST-456 to fix
  const result = await callExternalService();
  expect(result.status).toBe("ok");
}, { retry: 2 });

Prevention Best Practices
Write Deterministic Tests from the Start
- Use fixed seeds for random data generation
- Use frozen clocks (vi.useFakeTimers()) for time-dependent logic, as sketched after this list
- Use fixed dates instead of new Date()
- Sort arrays before comparing if order doesn't matter
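A small sketch of the frozen-clock approach, assuming Vitest and the hypothetical formatDate helper from the environment example above:

// Freeze the clock so "today" is the same date on every run and every machine
import { afterEach, beforeEach, expect, test, vi } from "vitest";

beforeEach(() => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date("2026-03-20T00:00:00Z"));
});

afterEach(() => {
  vi.useRealTimers();
});

test("formats today's date", () => {
  // formatDate is the same hypothetical helper as in the timezone example
  expect(formatDate(new Date())).toBe("March 20, 2026");
});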
Run Tests in Random Order
Most test runners support randomizing test execution order. This surfaces order-dependent tests early:
# Jest
jest --randomize
# pytest (requires a randomization plugin)
pytest -p random

Run Tests in Parallel
Parallel execution reveals shared-state issues that sequential execution hides. If a test only passes when run alone, it has a hidden dependency.
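As a sketch for Jest, which already runs test files in parallel worker processes: pinning the worker count keeps local runs and CI exercising the same degree of parallelism. The value 4 is arbitrary.

// jest.config.ts
import type { Config } from "jest";

const config: Config = {
  // Defaults to roughly "CPU cores minus one"; pin it so local and CI behave alike
  maxWorkers: 4,
};

export default config;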
Monitor Continuously
Don't wait for flaky tests to become a problem. Track test health metrics from day one. A flaky test caught on day one takes minutes to fix. The same test caught after six months of working around it takes hours.
FAQ
How many flaky tests are too many?
Any flaky test is one too many, but pragmatically: if more than 2-3% of your tests are flaky, your team is probably re-running CI regularly and starting to ignore failures. At 5%+, the test suite is actively harmful — it's training developers to not trust test results.
Can AI help detect flaky tests?
AI and statistical methods can identify flaky tests faster than manual approaches. By analyzing test results across many CI runs, algorithms can detect patterns like "this test fails 8% of the time on the same code" or "this test only fails when run in parallel with test X." The detection part is well-suited to automation; the fixing part still requires human judgment about the root cause.
What's the difference between a flaky test and an intermittent bug?
A flaky test produces inconsistent results due to problems in the test code itself (timing, shared state, environment). An intermittent bug produces inconsistent results due to problems in the production code (race condition, resource leak). The distinction matters because the fix is different: flaky tests are fixed by improving test isolation and determinism, while intermittent bugs require fixing the application code.
Next Steps
Start by identifying your current flaky tests. If you don't have automated detection, run your full test suite 5 times on the same commit and compare results. Then prioritize fixes by impact — the tests that fail most frequently and block the most deployments should be fixed first.
Set up automated flaky test detection with TestGlance to catch new flaky tests the moment they appear. Track your suite's health score and test retry rate over time to measure improvement.