A Primer on the Eval Proc

Software Testing & Evaluation — from the smallest unit test to a full user journey.

teaching primer
Contents
  1. The Big Idea
  2. The Test Pyramid
  3. The Layers
  4. Two Cousins
  5. Vocabulary
  6. CI Pipeline
  7. The Bug-Fix Loop
  8. Key Takeaways

"The eval proc" is the process by which you prove your software actually works, before and after it reaches a user. This primer builds it up layer by layer.

1. The Big Idea

Every test answers one question: "Given some input, did the system do the right thing?"

Every test, from the tiniest unit test to a full browser test, has the same three-step shape, called Arrange / Act / Assert (AAA):

flowchart LR
    A["ARRANGE<br/>set up the world"] --> B["ACT<br/>run the thing"]
    B --> C["ASSERT<br/>check the result"]
    C --> D{"PASS / FAIL"}
    style A fill:#1e3a5f,stroke:#6ea8fe,color:#fff
    style B fill:#5f4420,stroke:#f0a84a,color:#fff
    style C fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style D fill:#5a1f3a,stroke:#e06aa0,color:#fff

Learn to see this atom everywhere.
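
To make the three steps concrete, here is a minimal sketch in Python. The `ShoppingCart` class and the test name are hypothetical, invented only for illustration; the test is a plain function using `assert`, in the style that runners such as pytest collect.

```python
class ShoppingCart:
    """Hypothetical class under test, invented for this example."""

    def __init__(self) -> None:
        self._prices_cents = []

    def add(self, price_cents: int) -> None:
        self._prices_cents.append(price_cents)

    def total(self) -> int:
        return sum(self._prices_cents)


def test_total_sums_item_prices():
    # ARRANGE: set up the world
    cart = ShoppingCart()
    cart.add(250)
    cart.add(199)

    # ACT: run the thing
    result = cart.total()

    # ASSERT: check the result
    assert result == 449
```

If the assertion holds, the test passes silently; if it does not, Python raises `AssertionError` and the test fails.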

2. The Test Pyramid — Your Map of the Layers

This is the single most important mental model. Tests live in layers: you want many cheap, fast tests at the bottom and few expensive, realistic tests at the top.

flowchart TB
    E["E2E: whole app, as a user<br/>~seconds · few (tens)"]
    I["INTEGRATION: parts wired together<br/>~100ms · some (hundreds)"]
    U["UNIT: one function in isolation<br/>~ms · many (thousands)"]
    E --> I --> U
    style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style I fill:#5f5020,stroke:#f0c84a,color:#fff
    style U fill:#1f4d2e,stroke:#5cc97e,color:#fff

Why a pyramid and not a square? The higher you go, the more realistic the test (good) but also the slower, flakier, and harder to debug it gets (bad). So push as much testing as low as possible, and reserve the top for the few critical user journeys.

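| Layer       | Scope                | Speed    | Count        |
|-------------|----------------------|----------|--------------|
| Unit        | one function/class   | ~ms      | many (1000s) |
| Integration | a few parts together | ~100ms   | some (100s)  |
| E2E         | whole app, as a user | ~seconds | few (10s)    |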

3. The Layers in Detail

Layer 1 — Unit tests

Scope: one small piece (a function, a class) in isolation.
When used: the workhorse — logic, calculations, edge cases, branching, error handling.
Trade-off: blazing fast and pinpoint-accurate when they fail, but they prove nothing about how pieces fit together.

flowchart LR
    IN["in: $100, NY"] --> F["calc_tax()"]
    F --> OUT["assert out == $108.875"]
    style F fill:#1f4d2e,stroke:#5cc97e,color:#fff
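
The diagram above can be sketched as a runnable unit test. Only the name `calc_tax` and the $100 / NY / $108.875 figures come from the diagram; the signature, the `TAX_RATES` table, and the error handling are assumptions made for illustration. `Decimal` is used so the money comparison is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Assumed rate table; the 8.875% rate for "NY" is inferred from the diagram
# ($100 in, $108.875 out) and is not an authoritative tax rate.
TAX_RATES = {"NY": Decimal("0.08875")}


def calc_tax(amount: Decimal, state: str) -> Decimal:
    """Return the amount including tax (assumed meaning of calc_tax)."""
    if state not in TAX_RATES:
        raise ValueError(f"unknown state: {state}")
    return amount * (1 + TAX_RATES[state])


def test_calc_tax_ny():
    # One input, one function, one exact expected output.
    assert calc_tax(Decimal("100"), "NY") == Decimal("108.875")


def test_calc_tax_unknown_state():
    # Edge case: error handling is part of the unit's contract too.
    try:
        calc_tax(Decimal("100"), "ZZ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unknown state")
```

Neither test touches a database, a network, or another module, which is what keeps this layer fast and its failures easy to locate.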

Layer 2 — Integration tests

Scope: several real pieces working together (your code + a real database, or two services talking).
When used: to catch bugs that live between components — wrong SQL, bad API contract, serialization mismatches.
Trade-off: more realistic, slower, needs more setup (a test DB, etc.).

flowchart LR
    API["API"] --> SVC["Service"] --> DB[("Database")]
    DB --> CHK["assert row exists in orders"]
    style API fill:#5f5020,stroke:#f0c84a,color:#fff
    style SVC fill:#5f5020,stroke:#f0c84a,color:#fff
    style DB fill:#5f5020,stroke:#f0c84a,color:#fff
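
As a rough illustration, the sketch below wires a service function to a real SQL engine using Python's built-in `sqlite3` module with an in-memory database. The `orders` table name comes from the diagram; its columns and the `place_order` function are hypothetical, and a real project would more likely run against a test instance of its production database.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Assumed schema for the example.
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total_cents INTEGER NOT NULL)"
    )


def place_order(conn: sqlite3.Connection, customer: str, total_cents: int) -> int:
    """Hypothetical service function: validate, then write a row."""
    if total_cents <= 0:
        raise ValueError("total must be positive")
    cursor = conn.execute(
        "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
        (customer, total_cents),
    )
    conn.commit()
    return cursor.lastrowid


def test_place_order_writes_row():
    # ARRANGE: a real SQL engine, but a throwaway in-memory database
    conn = sqlite3.connect(":memory:")
    create_schema(conn)

    # ACT: go through the service, which runs real SQL
    order_id = place_order(conn, "alice", 1999)

    # ASSERT: the row exists in orders
    row = conn.execute(
        "SELECT customer, total_cents FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    assert row == ("alice", 1999)
    conn.close()
```

A typo in the `INSERT` statement or a column mismatch would fail this test, whereas a unit test that mocked the database would not notice.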

Layer 3 — End-to-End (E2E)

Scope: the entire product, driven exactly as a human user would (real browser, real clicks, real server).
When used: critical journeys ("can a user log in and check out?") and reproducing reported bugs the way the user hit them.
Trade-off: maximum confidence, but slowest, flakiest, and hardest to debug — keep these few.

flowchart LR
    U(["User"]) -->|clicks| BR["Browser"]
    BR --> FE["Frontend"] --> API2["API"] --> SVC2["Service"] --> DB2[("DB")]
    DB2 --> RES["assert 'Order confirmed'"]
    style U fill:#5a1f1f,stroke:#ef6a6a,color:#fff

This is the layer that the "reproduce a bug E2E first" rule refers to: see the app break the way the user saw it, so your fix targets the real problem.

4. Two Cousins You Will Hear About

These are not pyramid layers, but they sit alongside it.

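| Type                 | What it checks                                        | When                                               |
|----------------------|-------------------------------------------------------|----------------------------------------------------|
| Visual / snapshot    | Did the UI change visually vs a saved baseline image? | Pixel-perfection, catching unintended layout drift |
| Manual / exploratory | A human pokes around looking for "off" things         | Before automation exists, or for judgment calls    |

flowchart LR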
    B["baseline.png"] --> CMP{"pixel diff"}
    C["current.png"] --> CMP
    CMP -->|"diff == 0"| P["PASS"]
    CMP -->|"diff > 0"| F["FAIL - flagged for review"]
    style P fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style F fill:#5a1f1f,stroke:#ef6a6a,color:#fff
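
The comparison in the diagram can be sketched in a few lines. This is a simplified stand-in, not how any particular snapshot tool works: images are modeled as lists of rows of `(r, g, b)` tuples rather than decoded PNG files, and the function names are hypothetical. Real tools often also allow a small tolerance instead of requiring a diff of exactly zero.

```python
def pixel_diff(baseline, current) -> int:
    """Count the pixels that differ between two same-sized images."""
    same_shape = len(baseline) == len(current) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(baseline, current)
    )
    if not same_shape:
        raise ValueError("images differ in size")
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(baseline, current)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def check_snapshot(baseline, current) -> str:
    # Mirrors the diagram: diff == 0 passes, anything else is flagged.
    return "PASS" if pixel_diff(baseline, current) == 0 else "FAIL - flagged for review"


WHITE, RED = (255, 255, 255), (255, 0, 0)
baseline_img = [[WHITE, WHITE], [WHITE, WHITE]]
drifted_img = [[WHITE, RED], [WHITE, WHITE]]  # one pixel changed
```

Here `check_snapshot(baseline_img, baseline_img)` passes, while `check_snapshot(baseline_img, drifted_img)` is flagged because one pixel differs.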

5. Supporting Concepts (the vocabulary)

flowchart LR
    subgraph Real
      CODE1["code"] --> PAY["real payment API<br/>slow, charges money"]
    end
    subgraph Mock
      CODE2["code"] --> FAKE["fake that says OK<br/>instant, free, controllable"]
    end
    style PAY fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style FAKE fill:#1f4d2e,stroke:#5cc97e,color:#fff
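
The Real-versus-Mock contrast might look like this in code, using `Mock` from Python's standard `unittest.mock` module. The `checkout` function, the `charge` method, and the `{"status": ...}` response shape are all hypothetical; a real payment client will have its own interface.

```python
from unittest.mock import Mock


def checkout(payment_client, amount_cents: int) -> str:
    """Hypothetical code under test: charge, then report the outcome."""
    response = payment_client.charge(amount_cents)
    return "Order confirmed" if response["status"] == "ok" else "Payment failed"


def test_checkout_confirms_on_successful_charge():
    # The fake says OK: instant, free, and fully under the test's control.
    fake_payments = Mock()
    fake_payments.charge.return_value = {"status": "ok"}

    assert checkout(fake_payments, 1999) == "Order confirmed"
    # The mock also records how it was called.
    fake_payments.charge.assert_called_once_with(1999)


def test_checkout_reports_declined_charge():
    # Controllable: a decline is hard to trigger on demand with a real API.
    fake_payments = Mock()
    fake_payments.charge.return_value = {"status": "declined"}

    assert checkout(fake_payments, 1999) == "Payment failed"
```

Because `checkout` receives the client as an argument, the test can pass in the fake without patching anything; the price is that the test says nothing about whether the real API behaves the same way.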

6. How It All Runs Together (CI Pipeline)

In a real project the layers run fastest-first, so failures surface cheaply.

flowchart LR
    G["git push"] --> L["Lint"] --> U["Unit"] --> I["Integration"] --> E["E2E"] --> D["Deploy"]
    style L fill:#1e3a5f,stroke:#6ea8fe,color:#fff
    style U fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style I fill:#5f5020,stroke:#f0c84a,color:#fff
    style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style D fill:#3d1f5a,stroke:#b06ae0,color:#fff

If something basic breaks, it fails fast and never reaches the slow stages.
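
The fail-fast ordering can be modeled with a small runner. This is only an illustration of the control flow, not a real CI configuration: the stage names follow the diagram, each stage is a stand-in callable returning `True` for success, and `run_pipeline` is a hypothetical helper.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually ran.
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            break  # fail fast: later, slower stages never start
    return ran


stages = [
    ("lint", lambda: True),
    ("unit", lambda: False),  # simulate a failing unit test
    ("integration", lambda: True),
    ("e2e", lambda: True),
    ("deploy", lambda: True),
]
```

With the simulated unit failure, `run_pipeline(stages)` runs only `lint` and `unit`; the integration, E2E, and deploy stages are never started.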

7. The Bug-Fix Loop (the practical eval proc)

This maps directly onto the discipline of "reproduce first, then fix".

flowchart TD
    R["1. REPRODUCE<br/>drive the app E2E, see it fail"] --> RED["2. RED<br/>write a test that FAILS for this bug"]
    RED --> FIX["3. FIX<br/>change the code"]
    FIX --> GREEN["4. GREEN<br/>same test now PASSES = real fix"]
    GREEN --> GUARD["5. GUARD<br/>test stays forever as a regression guard"]
    style R fill:#5a1f1f,stroke:#ef6a6a,color:#fff
    style RED fill:#5f5020,stroke:#f0c84a,color:#fff
    style GREEN fill:#1f4d2e,stroke:#5cc97e,color:#fff
    style GUARD fill:#3d1f5a,stroke:#b06ae0,color:#fff

The discipline: never skip steps 1 and 2. A fix without a failing-then-passing test is just a hopeful guess.
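
Steps 2 through 5 can be seen in miniature below. The bug is invented for illustration: a hypothetical `average` function that crashes on an empty list. Both the pre-fix and post-fix versions are kept side by side so that the same test can be shown failing, then passing; treating an empty list as `0.0` is an assumed requirement.

```python
def average_buggy(values):
    """Version before the fix: divides by zero on an empty list."""
    return sum(values) / len(values)


def average_fixed(values):
    """Version after the fix: an empty list averages to 0.0 (assumed)."""
    return sum(values) / len(values) if values else 0.0


def regression_test(average) -> None:
    """Written in step 2 (RED); kept as the guard in step 5."""
    assert average([]) == 0.0      # the reported bug
    assert average([2, 4]) == 3.0  # existing behavior still works


def passes(test, implementation) -> bool:
    """Small helper: True if the test runs without raising."""
    try:
        test(implementation)
        return True
    except Exception:
        return False
```

`passes(regression_test, average_buggy)` is `False` (RED) and `passes(regression_test, average_fixed)` is `True` (GREEN). Seeing the test fail first is what shows that it detects this bug at all.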

8. Key Takeaways (memorize these five)

  1. Every test is Arrange → Act → Assert.
  2. The pyramid: many unit, some integration, few E2E.
  3. Lower = faster, cheaper, more stable; higher = more realistic, slower, flakier. Push tests as low as they will go.
  4. For bugs: reproduce first, write a failing test, then fix.
  5. Flaky tests and ignored failures are worse than no tests — they erode trust in the whole suite.