A Primer on the Eval Proc

Software Testing & Evaluation — from the smallest unit test to a full user journey.


Contents
1. The Big Idea
2. The Test Pyramid
3. The Layers
4. Two Cousins
5. Vocabulary
6. CI Pipeline
7. The Bug-Fix Loop
8. Key Takeaways

"The eval proc" is the process by which you prove your software actually works, before and after it reaches a user. This primer builds it up layer by layer.

1. The Big Idea

Every test answers one question: "Given some input, did the system do the right thing?"

Every test, from the tiniest unit test to a full browser test, has the same three-step shape, called Arrange / Act / Assert (AAA):

flowchart LR
    A["ARRANGE
set up the world"] --> B["ACT
run the thing"]
    B --> C["ASSERT
check the result"]
    C --> D{"PASS / FAIL"}
    style A fill:#1e3a5f,stroke:#6ea8fe,color:#fff
    style B fill:#5f4420,stroke:#f0a84a,color:#fff
    style C fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style D fill:#5a1f3a,stroke:#e06aa0,color:#fff

Learn to see this atom everywhere.
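
As a minimal sketch of the AAA shape in Python: the function `apply_discount` and its numbers are made up for illustration and are not part of this primer's system. The point is only where the three steps sit inside a test.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: take a whole-number percentage off a price."""
    return price_cents - (price_cents * percent) // 100


def test_apply_discount_takes_ten_percent_off():
    # ARRANGE: set up the world
    price_cents = 2000
    percent = 10

    # ACT: run the thing
    result = apply_discount(price_cents, percent)

    # ASSERT: check the result
    assert result == 1800


# A runner such as pytest would discover the test by name; calling it
# directly keeps this sketch self-contained.
test_apply_discount_takes_ten_percent_off()
```

If the assertion holds, the call returns silently (PASS); if not, Python raises AssertionError (FAIL).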

2. The Test Pyramid — Your Map of the Layers

This is the single most important mental model. Tests live in layers: you want many cheap, fast tests at the bottom and few expensive, realistic tests at the top.

flowchart TB
    E["E2E — whole app, as a user
~seconds · few (tens)"]
    I["INTEGRATION — parts wired together
~100ms · some (hundreds)"]
    U["UNIT — one function in isolation
~ms · many (thousands)"]
    E --> I --> U
    style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style I fill:#5f5020,stroke:#f0c84a,color:#fff
    style U fill:#1f4d2e,stroke:#5cc97e,color:#fff

Why a pyramid and not a square? The higher you go, the more realistic the test (good) but also the slower, flakier, and harder to debug it gets (bad). So push as much testing as low as possible, and reserve the top for the few critical user journeys.

Layer	Scope	Speed	Count
Unit	one function/class	~ms	many (1000s)
Integration	a few parts together	~100ms	some (100s)
E2E	whole app, as a user	~seconds	few (10s)

3. The Layers in Detail

Layer 1 — Unit tests

Scope: one small piece (a function, a class) in isolation.
When used: the workhorse — logic, calculations, edge cases, branching, error handling.
Trade-off: blazing fast and pinpoint-accurate when they fail, but they prove nothing about how pieces fit together.

flowchart LR
    IN["in: $100, NY"] --> F["calc_tax()"]
    F --> OUT["assert out == $108.875"]
    style F fill:#1f4d2e,stroke:#5cc97e,color:#fff
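
The diagram above could look roughly like this as a Python unit test. This is a sketch under assumptions: the rate table and the error behaviour for unknown regions are invented for illustration, and the 8.875% rate is simply the one implied by the diagram's $100 in, $108.875 out.

```python
from decimal import Decimal

# Assumed rate table, for illustration only.
TAX_RATES = {"NY": Decimal("0.08875")}


def calc_tax(amount: Decimal, region: str) -> Decimal:
    """Return the amount with tax added; reject regions we have no rate for."""
    if region not in TAX_RATES:
        raise ValueError(f"unknown region: {region}")
    return amount * (1 + TAX_RATES[region])


def test_calc_tax_ny():
    # The happy path from the diagram.
    assert calc_tax(Decimal("100"), "NY") == Decimal("108.875")


def test_calc_tax_unknown_region():
    # An edge case: error handling is part of the unit's contract too.
    try:
        calc_tax(Decimal("100"), "??")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unknown region")


test_calc_tax_ny()
test_calc_tax_unknown_region()
```

Decimal is used here so the equality check compares exact decimal values rather than binary floating-point approximations.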

Layer 2 — Integration tests

Scope: several real pieces working together (your code + a real database, or two services talking).
When used: to catch bugs that live between components — wrong SQL, bad API contract, serialization mismatches.
Trade-off: more realistic, slower, needs more setup (a test DB, etc.).

flowchart LR
    API["API"] --> SVC["Service"] --> DB[("Database")]
    DB --> CHK["assert row exists in orders"]
    style API fill:#5f5020,stroke:#f0c84a,color:#fff
    style SVC fill:#5f5020,stroke:#f0c84a,color:#fff
    style DB fill:#5f5020,stroke:#f0c84a,color:#fff
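
A small sketch of the Service-to-Database half of that diagram, using Python's built-in sqlite3 with an in-memory database as the "real" DB. The `orders` schema and the `place_order` function are hypothetical; a real project would point the same test at its own schema and test database.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the (assumed) orders table in a fresh database."""
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "total_cents INTEGER NOT NULL)"
    )


def place_order(conn: sqlite3.Connection, customer: str, total_cents: int) -> int:
    """Hypothetical service function: store an order and return its id."""
    cursor = conn.execute(
        "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
        (customer, total_cents),
    )
    conn.commit()
    return cursor.lastrowid


def test_place_order_persists_row():
    # ARRANGE: a real (in-memory) database with the schema applied
    conn = sqlite3.connect(":memory:")
    create_schema(conn)

    # ACT: go through the service code, which runs real SQL
    order_id = place_order(conn, "alice", 10888)

    # ASSERT: the row exists in orders
    row = conn.execute(
        "SELECT customer, total_cents FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    assert row == ("alice", 10888)
    conn.close()


test_place_order_persists_row()
```

A typo in the SQL or a mismatch between the code and the schema would fail here, which is exactly the kind of between-components bug a unit test with a mocked database cannot see.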

Layer 3 — End-to-End (E2E)

Scope: the entire product, driven exactly as a human user would (real browser, real clicks, real server).
When used: critical journeys ("can a user log in and check out?") and reproducing reported bugs the way the user hit them.
Trade-off: maximum confidence, but slowest, flakiest, and hardest to debug — keep these few.

flowchart LR
    U(["User"]) -->|clicks| BR["Browser"]
    BR --> FE["Frontend"] --> API2["API"] --> SVC2["Service"] --> DB2[("DB")]
    DB2 --> RES["assert 'Order confirmed'"]
    style U fill:#5a1f1f,stroke:#ef6a6a,color:#fff

This is the layer that the "reproduce a bug E2E first" rule in section 7 refers to: see it break as a user would, so your fix targets the real problem.
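
A true E2E test drives a real browser, which needs a browser-automation tool and is beyond a self-contained snippet. As a rough stand-in, the sketch below uses only the Python standard library: it starts a tiny real HTTP server, sends a real request from the outside, and asserts on what comes back. The `/checkout` route, the handler, and the in-memory `ORDERS` list are all assumptions made up for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ORDERS = []  # stand-in for the database behind the app


class ShopHandler(BaseHTTPRequestHandler):
    """Hypothetical app: accepts a checkout request and confirms the order."""

    def do_POST(self):
        if self.path != "/checkout":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        ORDERS.append(json.loads(self.rfile.read(length)))
        body = b"Order confirmed"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def run_checkout_journey() -> str:
    """Start the app, act as an outside client, return what the 'user' sees."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), ShopHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        request = urllib.request.Request(
            f"http://127.0.0.1:{port}/checkout",
            data=json.dumps({"item": "book"}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Ignore any proxy settings so the request goes straight to localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode()
    finally:
        server.shutdown()
        server.server_close()


def test_user_can_check_out():
    ORDERS.clear()
    assert "Order confirmed" in run_checkout_journey()
    assert ORDERS == [{"item": "book"}]


test_user_can_check_out()
```

Even this toy version shows the trade-off: it needs a network port, a background thread, and cleanup, all of which are places a unit test never has to fail.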

4. Two Cousins You Will Hear About

These are not pyramid layers, but they sit alongside it.

Type	What it checks	When to use
Visual / snapshot	Did the UI change visually vs a saved baseline image?	When pixel-level fidelity matters, or to catch unintended layout drift
Manual / exploratory	A human pokes around looking for "off" things	Before automation exists, or for judgment calls

flowchart LR
    B["baseline.png"] --> CMP{"pixel diff"}
    C["current.png"] --> CMP
    CMP -->|"diff == 0"| P["PASS"]
    CMP -->|"diff > 0"| F["FAIL - flagged for review"]
    style P fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style F fill:#5a1f1f,stroke:#ef6a6a,color:#fff
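
The comparison in that diagram can be sketched in a few lines of plain Python. Images are modelled here as nested lists of RGB tuples, which is an assumption for illustration; a real visual-testing tool would load actual image files and usually allows a small tolerance rather than demanding a diff of exactly zero.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def count_diff_pixels(baseline: Image, current: Image) -> int:
    """Count pixels that differ between two images of the same size."""
    same_height = len(baseline) == len(current)
    same_widths = all(len(b) == len(c) for b, c in zip(baseline, current))
    if not (same_height and same_widths):
        raise ValueError("images differ in size")
    return sum(
        pixel_b != pixel_c
        for row_b, row_c in zip(baseline, current)
        for pixel_b, pixel_c in zip(row_b, row_c)
    )


def snapshot_verdict(baseline: Image, current: Image) -> str:
    """Mirror the diagram: zero differing pixels passes, anything else is flagged."""
    if count_diff_pixels(baseline, current) == 0:
        return "PASS"
    return "FAIL - flagged for review"


baseline = [[(0, 0, 0), (255, 255, 255)]]
drifted = [[(0, 0, 0), (254, 255, 255)]]
print(snapshot_verdict(baseline, baseline))  # PASS
print(snapshot_verdict(baseline, drifted))   # FAIL - flagged for review
```

Note that a FAIL here means "a human should look", not "there is a bug": an intentional redesign also changes pixels, and the fix is then to update the baseline.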

5. Supporting Concepts (the vocabulary)

Fixture — the prepared "world" a test runs against (seed data, a fresh DB, a logged-in user). The reusable Arrange step.
Mock / Stub / Fake — a pretend version of a real dependency, so a unit test stays isolated and fast.
Assertion — the check that decides pass/fail.
Coverage — what percent of your code a suite exercises. A useful signal, but 100% coverage does not mean bug-free. Do not worship it.
Flakiness — a test that passes and fails randomly with no code change. Poison: fix or delete it.
Regression test — a test added after fixing a bug, so that exact bug can never silently return.
CI (Continuous Integration) — a robot that runs your whole suite automatically on every change.
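
Flakiness often comes from something the test does not control, such as the wall clock, random numbers, or test ordering. One common repair, sketched below with a made-up `greeting` function, is to pass the uncontrolled value in as an argument so the test can pin it.

```python
from datetime import datetime


def greeting(now: datetime) -> str:
    """Hypothetical function: the current time is passed in, not read inside.

    A version that called datetime.now() internally would make any test of it
    pass in the morning and fail in the afternoon with no code change.
    """
    return "Good morning" if now.hour < 12 else "Good afternoon"


def test_greeting_is_deterministic():
    # The test controls the clock, so the result never depends on when it runs.
    assert greeting(datetime(2024, 1, 1, 9, 0)) == "Good morning"
    assert greeting(datetime(2024, 1, 1, 15, 0)) == "Good afternoon"


test_greeting_is_deterministic()
```

Production code would call `greeting(datetime.now())`; only the test substitutes a fixed time.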

flowchart LR
    subgraph Real
      CODE1["code"] --> PAY["real payment API
slow, charges money"]
    end
    subgraph Mock
      CODE2["code"] --> FAKE["fake that says OK
instant, free, controllable"]
    end
    style PAY fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style FAKE fill:#1f4d2e,stroke:#5cc97e,color:#fff
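
The right-hand side of that diagram, as a minimal hand-written fake. The `checkout` function and the gateway's `charge` method are hypothetical names chosen for this sketch; they do not correspond to any real payment library.

```python
class FakePaymentGateway:
    """Pretend payment API: instant, free, and fully controllable by the test."""

    def __init__(self, succeed: bool = True):
        self.succeed = succeed
        self.charges = []  # records every call so tests can inspect it

    def charge(self, amount_cents: int) -> bool:
        self.charges.append(amount_cents)
        return self.succeed


def checkout(gateway, amount_cents: int) -> str:
    """Hypothetical code under test; it only needs an object with charge()."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    if gateway.charge(amount_cents):
        return "Order confirmed"
    return "Payment declined"


def test_checkout_confirms_when_payment_succeeds():
    gateway = FakePaymentGateway(succeed=True)
    assert checkout(gateway, 1500) == "Order confirmed"
    assert gateway.charges == [1500]  # charged exactly once, for the right amount


def test_checkout_reports_declined_payment():
    gateway = FakePaymentGateway(succeed=False)
    assert checkout(gateway, 1500) == "Payment declined"


test_checkout_confirms_when_payment_succeeds()
test_checkout_reports_declined_payment()
```

The declined-payment case is the real payoff: with the real API you would struggle to trigger that path on demand, while the fake produces it with one constructor argument.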

6. How It All Runs Together (CI Pipeline)

In a real project the layers run fastest-first, so failures surface cheaply.

flowchart LR
    G["git push"] --> L["Lint"] --> U["Unit"] --> I["Integration"] --> E["E2E"] --> D["Deploy"]
    style L fill:#1e3a5f,stroke:#6ea8fe,color:#fff
    style U fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style I fill:#5f5020,stroke:#f0c84a,color:#fff
    style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style D fill:#3d1f5a,stroke:#b06ae0,color:#fff

If something basic breaks, the pipeline fails fast and the change never reaches the slow stages.
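
The fail-fast behaviour can be sketched as a loop that stops at the first failing stage. This is a toy model, not how any particular CI service is configured: each stage is represented by a plain function returning True or False, where a real pipeline would run commands.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure; return a log of what ran."""
    log = []
    for name, run in stages:
        passed = run()
        log.append(f"{name}: {'ok' if passed else 'FAILED'}")
        if not passed:
            break  # later, slower stages never start
    return log


# Stand-in stages, ordered fastest-first as in the diagram.
stages = [
    ("lint", lambda: True),
    ("unit", lambda: False),  # pretend a unit test broke
    ("integration", lambda: True),
    ("e2e", lambda: True),
    ("deploy", lambda: True),
]

print(run_pipeline(stages))  # ['lint: ok', 'unit: FAILED']
```

Because the broken unit stage stops the run, the integration, E2E, and deploy stages cost nothing on this change.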

7. The Bug-Fix Loop (the practical eval proc)

This maps directly onto the discipline of "reproduce first, then fix".

flowchart TD
    R["1. REPRODUCE
drive the app E2E, see it fail"] --> RED["2. RED
write a test that FAILS for this bug"]
    RED --> FIX["3. FIX
change the code"]
    FIX --> GREEN["4. GREEN
same test now PASSES = real fix"]
    GREEN --> GUARD["5. GUARD
test stays forever as a regression guard"]
    style R fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style RED fill:#5f5020,stroke:#f0c84a,color:#fff
    style GREEN fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style GUARD fill:#3d1f5a,stroke:#b06ae0,color:#fff

The discipline: never skip steps 1 and 2. A fix without a failing-then-passing test is just a hopeful guess.
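
Steps 2 to 4 of the loop, in miniature. The bug is invented for illustration: a cart total that ignores quantity. Both versions of the function are kept side by side so the same test can be shown going from red to green; in a real codebase there would be one function, edited in place.

```python
def cart_total_buggy(items):
    """Before the fix: sums prices but forgets to multiply by quantity."""
    return sum(price for price, quantity in items)


def cart_total_fixed(items):
    """After the fix: each price is multiplied by its quantity."""
    return sum(price * quantity for price, quantity in items)


def regression_check(cart_total) -> bool:
    """The one test that captures the bug: two items at 500 plus one at 250."""
    return cart_total([(500, 2), (250, 1)]) == 1250


print(regression_check(cart_total_buggy))  # False: RED, the test fails for this bug
print(regression_check(cart_total_fixed))  # True: GREEN, the same test now passes
```

Seeing the check fail first is what proves it actually detects the bug; a test that was green before the fix tells you nothing about the fix.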

8. Key Takeaways (memorize these five)

Every test is Arrange → Act → Assert.
The pyramid: many unit, some integration, few E2E.
Lower = faster, cheaper, more stable; higher = more realistic, slower, flakier. Push tests as low as they will go.
For bugs: reproduce first, write a failing test, then fix.
Flaky tests and ignored failures are worse than no tests — they erode trust in the whole suite.