What Are Flaky Tests? A Practical Guide to Detection and Fixes
A flaky test is a test that sometimes passes and sometimes fails without any changes to the code being tested. Run the same test ten times on the same commit, and it might pass nine times and fail once — or fail three times and pass seven.
Flaky tests are one of the most frustrating and costly problems in software testing. They erode trust in your test suite, waste CI resources, and slow down deployments. This guide covers why they happen, how to detect them, and what to do about them.
Why Flaky Tests Matter
The cost of flaky tests compounds over time. What starts as an occasional annoyance becomes a serious drag on team productivity.
Eroded Trust
When developers see tests fail randomly, they start ignoring failures — including real ones. "Oh, that test is just flaky" becomes the default response to any red build. This is the most dangerous consequence: your test suite exists to catch bugs, and flaky tests train people to stop paying attention.
Wasted CI Resources
Every flaky failure triggers a re-run. If your pipeline takes 10 minutes and you re-run it 5 times per day due to flaky tests, that's nearly an hour of wasted compute daily. Multiply across a team and it adds up fast.
Slower Deployments
Teams with flaky suites develop a pattern: merge, wait for CI, see a failure, re-run, wait again. Deployments that should take minutes stretch into hours. Some teams stop deploying on Fridays entirely because they don't want to deal with flaky failures before the weekend.
Hidden Bugs
Perhaps worst of all, flaky tests can mask real bugs. If a test fails due to an actual regression but gets dismissed as "probably flaky," the bug ships to production. The test did its job — the team just couldn't trust it.
Common Causes of Flaky Tests
Understanding why tests flake helps you fix and prevent them. Here are the most common causes, with concrete examples.
1. Timing and Race Conditions
Tests that depend on precise timing are inherently fragile. Anything that sleeps with setTimeout or expects an operation to complete within a fixed window can flake when the CI runner is under load:
// Flaky: assumes the animation completes in exactly 100ms
test("button shows loading state", async () => {
  fireEvent.click(submitButton);
  await new Promise((resolve) => setTimeout(resolve, 100));
  expect(screen.getByRole("progressbar")).toBeInTheDocument();
});

// Fixed: wait for the element to appear, with a reasonable timeout
test("button shows loading state", async () => {
  fireEvent.click(submitButton);
  await waitFor(() => {
    expect(screen.getByRole("progressbar")).toBeInTheDocument();
  });
});

The fixed version doesn't assume how long the operation takes. It polls until the condition is true or the timeout expires.
2. Shared Mutable State
Tests that read from or write to shared state can interfere with each other. This is especially common with database tests, global variables, and singletons:
// Flaky: tests share the same database table
let userId: string;

beforeAll(async () => {
  const user = await db.users.create({ name: "Test User" });
  userId = user.id;
});

test("updates user name", async () => {
  await db.users.update(userId, { name: "Updated" });
  const user = await db.users.findById(userId);
  expect(user.name).toBe("Updated");
});

test("reads original user name", async () => {
  const user = await db.users.findById(userId);
  // Flaky: fails if "updates user name" runs first
  expect(user.name).toBe("Test User");
});

// Fixed: each test creates its own data
test("updates user name", async () => {
  const user = await db.users.create({ name: "Test User" });
  await db.users.update(user.id, { name: "Updated" });
  const updated = await db.users.findById(user.id);
  expect(updated.name).toBe("Updated");
});

test("reads user name", async () => {
  const user = await db.users.create({ name: "Original" });
  const found = await db.users.findById(user.id);
  expect(found.name).toBe("Original");
});

3. External Dependencies
Tests that make real network calls, read from the file system, or depend on third-party APIs introduce non-determinism. The external service might be slow, down, or return different data:
// Flaky: depends on a real API being available and fast
test("fetches user profile from GitHub", async () => {
  const profile = await fetch("https://api.github.com/users/octocat");
  const data = await profile.json();
  expect(data.login).toBe("octocat");
});

External dependencies should be mocked in unit tests. Integration tests that intentionally hit real services need retry logic and should be tagged separately so they don't block the fast feedback loop.
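A deterministic version, sketched with Vitest: vi.stubGlobal replaces the global fetch for the test, and the canned payload is invented for illustration.

// Fixed: the test never touches the network, so GitHub's availability can't affect it
import { afterEach, expect, test, vi } from "vitest";

afterEach(() => {
  vi.unstubAllGlobals();
});

test("fetches user profile from GitHub", async () => {
  // Stub fetch with a canned response shaped like the real API's payload
  vi.stubGlobal(
    "fetch",
    vi.fn().mockResolvedValue({ json: async () => ({ login: "octocat" }) })
  );

  const profile = await fetch("https://api.github.com/users/octocat");
  const data = await profile.json();
  expect(data.login).toBe("octocat");
});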
4. Environment Differences
Tests pass locally but fail in CI, or vice versa. Common culprits:
- Timezone differences: new Date().toLocaleDateString() produces different output on different machines
- Locale settings: Number formatting, string sorting, and collation depend on locale
- File system ordering: fs.readdirSync() doesn't guarantee alphabetical order on all operating systems
- Available resources: CI runners may have less memory or CPU than development machines
// Flaky: depends on timezone
test("formats today's date", () => {
  const formatted = formatDate(new Date());
  expect(formatted).toBe("March 20, 2026");
});

// Fixed: use a fixed date
test("formats a specific date", () => {
  const date = new Date("2026-03-20T00:00:00Z");
  const formatted = formatDate(date);
  expect(formatted).toBe("March 20, 2026");
});

5. Order-Dependent Tests
Tests that only pass when run in a specific order have implicit dependencies. This often surfaces when running tests in parallel or when a test runner randomizes execution order:
// These tests have a hidden dependency
test("creates a project", async () => {
  await createProject({ name: "My Project" });
  // Side effect: sets global currentProject
});

test("adds a member to the project", async () => {
  // Flaky: depends on "creates a project" running first
  await addMember(currentProject.id, { email: "[email protected]" });
  expect(currentProject.members).toHaveLength(1);
});

Each test should set up its own preconditions. If test B depends on state created by test A, test B will break whenever test A is skipped, filtered out, or runs in a different process.
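A sketch of an order-independent version, reusing the hypothetical createProject and addMember helpers; getMembers is an assumed read helper added for illustration.

// Fixed: the test creates its own project instead of relying on a global
// set as a side effect of another test
test("adds a member to the project", async () => {
  const project = await createProject({ name: "My Project" });
  await addMember(project.id, { email: "[email protected]" });

  // getMembers is a hypothetical helper that re-reads the project's member list
  const members = await getMembers(project.id);
  expect(members).toHaveLength(1);
});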
Detection Strategies
Manual Detection
The simplest approach: re-run your CI pipeline multiple times on the same commit. If any tests change status between runs, they're flaky.
Pros: No tooling needed, and a test that flips status between runs is definitively flaky. Cons: Time-consuming, doesn't scale, and it only catches flakiness that happens to manifest during your re-runs.
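If you want to script the re-runs, here is a rough Node sketch; it assumes your suite runs with npm test.

// Run the suite several times on the same commit and report which runs failed
import { spawnSync } from "node:child_process";

const totalRuns = 10;
const failedRuns: number[] = [];

for (let run = 1; run <= totalRuns; run++) {
  const result = spawnSync("npm", ["test"], { stdio: "inherit", shell: true });
  if (result.status !== 0) failedRuns.push(run);
}

console.log(
  failedRuns.length === 0
    ? `All ${totalRuns} runs passed`
    : `Runs ${failedRuns.join(", ")} failed out of ${totalRuns}; likely flaky tests`
);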
Statistical Detection
Track every test's pass/fail status across CI runs and look for tests that flip:
def detect_flaky_tests(runs: list[TestRun]) -> list[str]:
    """Find tests that changed status between runs on the same code."""
    test_history: dict[str, list[str]] = {}
    for run in runs:
        for test in run.tests:
            key = f"{test.suite}::{test.name}"
            test_history.setdefault(key, []).append(test.status)

    flaky = []
    for test_name, statuses in test_history.items():
        unique_statuses = set(statuses)
        if "passed" in unique_statuses and "failed" in unique_statuses:
            flaky.append(test_name)

    return flaky

This catches tests that aren't consistently broken (those would be caught by normal CI) but aren't consistently passing either.
Automated Detection with Monitoring
The most effective approach combines statistical detection with continuous monitoring. Services like TestGlance analyze your test reports across every CI run and automatically flag tests with inconsistent results.
The key advantage of automated detection: it catches flaky tests as soon as they appear, before they've had time to erode team trust. A test that started flaking yesterday is much easier to fix than one that's been flaking for six months.
Fixing Flaky Tests
1. Isolate Test State
The single most effective fix for flaky tests: make each test independent.
- Use beforeEach instead of beforeAll for test data setup, as sketched after this list
- Create fresh database records per test instead of sharing fixtures
- Reset global state and singletons between tests
- Use unique identifiers (UUIDs, timestamps) to avoid collisions
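A minimal sketch of the first two points, assuming Vitest and the same hypothetical db helper used earlier:

// Each test gets its own freshly created user; nothing is shared across tests
import { randomUUID } from "node:crypto";
import { afterEach, beforeEach } from "vitest";

let user: { id: string; name: string };

beforeEach(async () => {
  // A unique name avoids collisions when tests run in parallel against one database
  user = await db.users.create({ name: `Test User ${randomUUID()}` });
});

afterEach(async () => {
  // db.users.delete is assumed to exist alongside the create/update helpers above
  await db.users.delete(user.id);
});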
2. Replace Timing with Events
Never use setTimeout or sleep to wait for async operations. Use event-driven waiting instead:
- DOM testing: Use waitFor, findBy* queries, or waitForElementToBeRemoved
- API testing: Use await on the actual operation, not on a timer
- Playwright / E2E: Use page.waitForSelector or expect(locator).toBeVisible() instead of page.waitForTimeout (see the sketch after this list)
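For the Playwright case, a rough sketch; the route and button label are invented.

// Wait for an observable UI state instead of a fixed delay
import { expect, test } from "@playwright/test";

test("shows a confirmation after saving", async ({ page }) => {
  await page.goto("/settings"); // hypothetical route
  await page.getByRole("button", { name: "Save" }).click();

  // Flaky alternative: await page.waitForTimeout(500);
  await expect(page.getByText("Saved")).toBeVisible();
});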
3. Mock External Dependencies
In unit tests, mock all external calls. In integration tests, use containers or test doubles:
// Mock external API in tests
vi.mock("@/lib/github", () => ({
  fetchRepo: vi.fn().mockResolvedValue({
    name: "test-repo",
    stars: 42,
  }),
}));

4. Quarantine, Don't Ignore
When you find a flaky test that can't be fixed immediately:
- Quarantine it: Move it to a separate test suite that doesn't block CI
- Track it: Create a ticket with a deadline for investigation
- Investigate: Look at the failure pattern — does it only fail in CI? Only on certain days? Only when tests run in parallel?
- Fix or delete: Either fix the root cause or delete the test and write a deterministic replacement
The worst option is @skip or xit without a plan. Skipped tests with no deadline become permanent dead code.
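One way to implement the quarantine step, sketched for Jest: give flaky specs a filename suffix and exclude that suffix from the blocking CI job. The naming convention is just an example.

// jest.config.ts — the blocking CI job ignores quarantined specs
import type { Config } from "jest";

const config: Config = {
  // Anything named *.quarantine.test.ts runs in a separate, non-blocking job
  testPathIgnorePatterns: ["/node_modules/", "\\.quarantine\\.test\\.ts$"],
};

export default config;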
5. Retry as a Last Resort
Automatic retries can mask flakiness and make it harder to track. If you do use retries:
- Limit to 1-2 retries, not unlimited
- Log every retry so you can track which tests need them
- Treat retried tests as flaky in your metrics, not as passing
- Set a deadline to fix any test that needs retries
// If you must retry, track it
test("unreliable integration test", async () => {
  // retry: 2 — tracked as flaky, ticket TEST-456 to fix
  const result = await callExternalService();
  expect(result.status).toBe("ok");
}, { retry: 2 });

Prevention Best Practices
Write Deterministic Tests from the Start
- Use fixed seeds for random data generation
- Use frozen clocks (vi.useFakeTimers()) for time-dependent logic, as sketched after this list
- Use fixed dates instead of new Date()
- Sort arrays before comparing if order doesn't matter
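A small sketch of the frozen-clock approach, assuming Vitest and the hypothetical formatDate helper from the environment example above:

// Freeze the clock so "today" is the same date on every run and every machine
import { afterEach, beforeEach, expect, test, vi } from "vitest";

beforeEach(() => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date("2026-03-20T00:00:00Z"));
});

afterEach(() => {
  vi.useRealTimers();
});

test("formats today's date", () => {
  // formatDate is the same hypothetical helper as in the timezone example
  expect(formatDate(new Date())).toBe("March 20, 2026");
});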
Run Tests in Random Order
Most test runners support randomizing test execution order. This surfaces order-dependent tests early:
# Jest
jest --randomize
# pytest (requires a randomization plugin)
pytest -p random

Run Tests in Parallel
Parallel execution reveals shared-state issues that sequential execution hides. If a test only passes when run alone, it has a hidden dependency.
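As a sketch for Jest, which already runs test files in parallel worker processes: pinning the worker count keeps local runs and CI exercising the same degree of parallelism. The value 4 is arbitrary.

// jest.config.ts
import type { Config } from "jest";

const config: Config = {
  // Defaults to roughly "CPU cores minus one"; pin it so local and CI behave alike
  maxWorkers: 4,
};

export default config;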
Monitor Continuously
Don't wait for flaky tests to become a problem. Track test health metrics from day one. A flaky test caught on day one takes minutes to fix. The same test caught after six months of working around it takes hours.
FAQ
How many flaky tests are too many?
Any flaky test is one too many, but pragmatically: if more than 2-3% of your tests are flaky, your team is probably re-running CI regularly and starting to ignore failures. At 5%+, the test suite is actively harmful — it's training developers to not trust test results.
Can AI help detect flaky tests?
AI and statistical methods can identify flaky tests faster than manual approaches. By analyzing test results across many CI runs, algorithms can detect patterns like "this test fails 8% of the time on the same code" or "this test only fails when run in parallel with test X." The detection part is well-suited to automation; the fixing part still requires human judgment about the root cause.
What's the difference between a flaky test and an intermittent bug?
A flaky test produces inconsistent results due to problems in the test code itself (timing, shared state, environment). An intermittent bug produces inconsistent results due to problems in the production code (race condition, resource leak). The distinction matters because the fix is different: flaky tests are fixed by improving test isolation and determinism, while intermittent bugs require fixing the application code.
Next Steps
Start by identifying your current flaky tests. If you don't have automated detection, run your full test suite 5 times on the same commit and compare results. Then prioritize fixes by impact — the tests that fail most frequently and block the most deployments should be fixed first.
Set up automated flaky test detection with TestGlance to catch new flaky tests the moment they appear. Track your suite's health score and test retry rate over time to measure improvement.