Software Testing & Evaluation — from the smallest unit test to a full user journey.
Teaching primer
"The eval proc" is the process by which you prove your software actually works, before and after it reaches a user. This primer builds it up layer by layer.
Every test answers one question: "Given some input, did the system do the right thing?"
Every test, from the tiniest unit test to a full browser test, has the same three-step shape, called Arrange / Act / Assert (AAA):
flowchart LR
A["ARRANGE
set up the world"] --> B["ACT
run the thing"]
B --> C["ASSERT
check the result"]
C --> D{"PASS / FAIL"}
style A fill:#1e3a5f,stroke:#6ea8fe,color:#fff
style B fill:#5f4420,stroke:#f0a84a,color:#fff
style C fill:#1f4d2e,stroke:#5cc97e,color:#fff
style D fill:#5a1f3a,stroke:#e06aa0,color:#fff
Learn to see this atom everywhere.
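As a minimal sketch of the three steps in code, the test below marks each phase with a comment. The function `apply_discount` and its integer-cents convention are hypothetical, invented only to give the test something to exercise.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: take `percent` off a price in cents."""
    return price_cents * (100 - percent) // 100


def test_apply_discount_takes_ten_percent_off():
    # ARRANGE: set up the world
    price_cents = 20000
    percent = 10

    # ACT: run the thing
    result = apply_discount(price_cents, percent)

    # ASSERT: check the result
    assert result == 18000
```

A plain `assert` inside a function named `test_*` is the shape that runners such as pytest collect, but the function can also be called directly.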
The test pyramid is the single most important mental model. Tests live in layers: you want many cheap, fast tests at the bottom and few expensive, realistic tests at the top.
flowchart TB
E["E2E — whole app, as a user
~seconds · few (tens)"]
I["INTEGRATION — parts wired together
~100ms · some (hundreds)"]
U["UNIT — one function in isolation
~ms · many (thousands)"]
E --> I --> U
style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
style I fill:#5f5020,stroke:#f0c84a,color:#fff
style U fill:#1f4d2e,stroke:#5cc97e,color:#fff
Why a pyramid and not a square? The higher you go, the more realistic the test (good) but also the slower, flakier, and harder to debug it gets (bad). So push as much testing as low as possible, and reserve the top for the few critical user journeys.
| Layer | Scope | Speed | Count |
|---|---|---|---|
| Unit | one function/class | ~ms | many (1000s) |
| Integration | a few parts together | ~100ms | some (100s) |
| E2E | whole app, as a user | ~seconds | few (10s) |
Unit test scope: one small piece (a function, a class) in isolation.
When used: the workhorse — logic, calculations, edge cases, branching, error handling.
Trade-off: blazing fast and pinpoint-accurate when they fail, but they prove nothing about how pieces fit together.
flowchart LR
IN["in: $100, NY"] --> F["calc_tax()"]
F --> OUT["assert out == $108.875"]
style F fill:#1f4d2e,stroke:#5cc97e,color:#fff
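The diagram above could look roughly like this as a unit test. This is a sketch under assumptions: only the name `calc_tax` and the `$100, NY -> $108.875` example come from the diagram, while the signature, the `TAX_RATES` table, and the choice to return the total including tax are illustrative guesses. `Decimal` is used so the comparison is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

# Assumed rate table, for illustration only.
TAX_RATES = {"NY": Decimal("0.08875")}


def calc_tax(amount: Decimal, state: str) -> Decimal:
    """Return the amount including tax for the given state (assumed behavior)."""
    return amount * (1 + TAX_RATES[state])


def test_calc_tax_ny():
    # in: $100, NY  ->  out: $108.875
    assert calc_tax(Decimal("100"), "NY") == Decimal("108.875")


def test_calc_tax_unknown_state():
    # Edge case: a state missing from the table should fail loudly.
    try:
        calc_tax(Decimal("100"), "ZZ")
    except KeyError:
        return
    raise AssertionError("expected KeyError for an unknown state")
```

Both tests touch nothing but the function itself, which is what keeps them fast and makes a failure point at one place.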
Integration test scope: several real pieces working together (your code + a real database, or two services talking).
When used: to catch bugs that live between components — wrong SQL, bad API contract, serialization mismatches.
Trade-off: more realistic, slower, needs more setup (a test DB, etc.).
flowchart LR
API["API"] --> SVC["Service"] --> DB[("Database")]
DB --> CHK["assert row exists in orders"]
style API fill:#5f5020,stroke:#f0c84a,color:#fff
style SVC fill:#5f5020,stroke:#f0c84a,color:#fff
style DB fill:#5f5020,stroke:#f0c84a,color:#fff
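A small sketch of the same idea, assuming an in-memory SQLite database stands in for the test DB. The `orders` table name comes from the diagram; the schema and the `place_order` service function are hypothetical. The point is that real SQL runs against a real engine, so a wrong column name or a bad query would fail here even though a unit test with a fake would pass.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema for the orders table.
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, item TEXT NOT NULL, qty INTEGER NOT NULL)"
    )


def place_order(conn: sqlite3.Connection, item: str, qty: int) -> int:
    """Hypothetical service function: validate, then persist an order."""
    if qty <= 0:
        raise ValueError("qty must be positive")
    cur = conn.execute(
        "INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty)
    )
    conn.commit()
    return cur.lastrowid


def test_place_order_writes_a_row():
    # Arrange: a fresh database with the schema applied
    conn = sqlite3.connect(":memory:")
    create_schema(conn)

    # Act: go through the service, not straight to SQL
    order_id = place_order(conn, "widget", 3)

    # Assert: the row exists in orders
    row = conn.execute(
        "SELECT item, qty FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    assert row == ("widget", 3)
    conn.close()
```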
E2E test scope: the entire product, driven exactly as a human user would (real browser, real clicks, real server).
When used: critical journeys ("can a user log in and check out?") and reproducing reported bugs the way the user hit them.
Trade-off: maximum confidence, but slowest, flakiest, and hardest to debug — keep these few.
flowchart LR
U(["User"]) -->|clicks| BR["Browser"]
BR --> FE["Frontend"] --> API2["API"] --> SVC2["Service"] --> DB2[("DB")]
DB2 --> RES["assert 'Order confirmed'"]
style U fill:#5a1f1f,stroke:#ef6a6a,color:#fff
This is the layer the "reproduce a bug E2E first" rule is about: see it break as a user would, so your fix targets the real problem.
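The sketch below approximates the shape of such a journey using only the standard library: it starts a real HTTP server, sends a real request over the network stack, and asserts on what comes back. It is deliberately not a full E2E test, because no browser is involved; a real one would drive a browser through a tool such as Playwright or Selenium. The `/checkout` route and `ShopHandler` are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ShopHandler(BaseHTTPRequestHandler):
    """Hypothetical app: confirms any order posted to /checkout."""

    def do_POST(self):
        if self.path != "/checkout":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        body = b"Order confirmed"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def run_checkout_journey() -> str:
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), ShopHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        request = urllib.request.Request(
            f"http://127.0.0.1:{port}/checkout",
            data=b"item=widget",
            method="POST",
        )
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode()
    finally:
        server.shutdown()
        server.server_close()


def test_user_can_check_out():
    assert "Order confirmed" in run_checkout_journey()
```

Even this small version shows why the top layer is slower and flakier: it depends on threads, sockets, and timeouts that a unit test never touches.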
Two other kinds of testing are not pyramid layers, but they sit alongside the pyramid.
| Type | What it checks | When |
|---|---|---|
| Visual / snapshot | Did the UI change visually vs a saved baseline image? | Pixel-perfection, catching unintended layout drift |
| Manual / exploratory | A human pokes around looking for "off" things | Before automation exists, or for judgment calls |
flowchart LR
B["baseline.png"] --> CMP{"pixel diff"}
C["current.png"] --> CMP
CMP -->|"diff == 0"| P["PASS"]
CMP -->|"diff > 0"| F["FAIL - flagged for review"]
style P fill:#1f4d2e,stroke:#5cc97e,color:#fff
style F fill:#5a1f1f,stroke:#ef6a6a,color:#fff
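The comparison in the diagram can be sketched without any imaging library by treating an image as a grid of RGB tuples. The names `pixel_diff` and `snapshot_verdict` are hypothetical, and a real tool would load `baseline.png` and `current.png` from disk rather than take lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def pixel_diff(baseline: Image, current: Image) -> int:
    """Count the pixels that differ between two same-sized images."""
    same_shape = len(baseline) == len(current) and all(
        len(a) == len(b) for a, b in zip(baseline, current)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return sum(
        p != q
        for row_a, row_b in zip(baseline, current)
        for p, q in zip(row_a, row_b)
    )


def snapshot_verdict(baseline: Image, current: Image) -> str:
    # Mirrors the diagram: diff == 0 passes, anything else is flagged.
    if pixel_diff(baseline, current) == 0:
        return "PASS"
    return "FAIL - flagged for review"
```

The strict `diff == 0` rule follows the diagram. Real snapshot tools often allow a small tolerance instead, since rendering differences such as anti-aliasing can change a handful of pixels without any visible change.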
flowchart LR
subgraph Real
CODE1["code"] --> PAY["real payment API
slow, charges money"]
end
subgraph Mock
CODE2["code"] --> FAKE["fake that says OK
instant, free, controllable"]
end
style PAY fill:#5a1f1f,stroke:#ef6a6a,color:#fff
style FAKE fill:#1f4d2e,stroke:#5cc97e,color:#fff
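The Real vs Mock contrast above can be sketched with `unittest.mock.Mock` from the standard library. The `checkout` function, the `gateway.charge` method, and the response shape are hypothetical; the point is only that the test swaps the slow, money-charging dependency for a fake it fully controls.

```python
from unittest.mock import Mock


def checkout(cart_total_cents: int, gateway) -> str:
    """Hypothetical code under test: charge the card, then report the outcome."""
    response = gateway.charge(amount_cents=cart_total_cents)
    if response["status"] == "ok":
        return "Order confirmed"
    return "Payment failed"


def test_checkout_confirms_when_payment_succeeds():
    # Arrange: a fake gateway that says OK, instantly and for free
    gateway = Mock()
    gateway.charge.return_value = {"status": "ok"}

    # Act
    result = checkout(4999, gateway)

    # Assert: the right outcome, and the gateway was asked for the right amount
    assert result == "Order confirmed"
    gateway.charge.assert_called_once_with(amount_cents=4999)


def test_checkout_reports_a_declined_payment():
    # Controllable: the fake can simulate failures that are hard to trigger for real
    gateway = Mock()
    gateway.charge.return_value = {"status": "declined"}

    assert checkout(4999, gateway) == "Payment failed"
```

The cost is the one the unit-test trade-off already names: a mock proves nothing about whether the real payment API behaves the way the fake does.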
In a real project the layers run fastest-first, so failures surface cheaply.
flowchart LR
G["git push"] --> L["Lint"] --> U["Unit"] --> I["Integration"] --> E["E2E"] --> D["Deploy"]
style L fill:#1e3a5f,stroke:#6ea8fe,color:#fff
style U fill:#1f4d2e,stroke:#5cc97e,color:#fff
style I fill:#5f5020,stroke:#f0c84a,color:#fff
style E fill:#5a1f1f,stroke:#ef6a6a,color:#fff
style D fill:#3d1f5a,stroke:#b06ae0,color:#fff
If something basic breaks, it fails fast and never reaches the slow stages.
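The fail-fast behavior can be illustrated with a toy runner. This is not how any particular CI system is configured; `run_pipeline` and the stage callables are hypothetical stand-ins for real lint and test commands.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually ran.
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            break  # later, slower stages never start
    return ran


# Stages ordered fastest-first; here the unit stage is made to fail.
stages = [
    ("Lint", lambda: True),
    ("Unit", lambda: False),
    ("Integration", lambda: True),
    ("E2E", lambda: True),
]
executed = run_pipeline(stages)  # ["Lint", "Unit"]
```

Because the unit stage fails, the integration and E2E stages are never started, which is where the time savings come from.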
The bug-fix loop below maps directly onto the discipline of "reproduce first, then fix".
flowchart TD
R["1. REPRODUCE
drive the app E2E, see it fail"] --> RED["2. RED
write a test that FAILS for this bug"]
RED --> FIX["3. FIX
change the code"]
FIX --> GREEN["4. GREEN
same test now PASSES = real fix"]
GREEN --> GUARD["5. GUARD
test stays forever as a regression guard"]
style R fill:#5a1f1f,stroke:#ef6a6a,color:#fff
style RED fill:#5f5020,stroke:#f0c84a,color:#fff
style GREEN fill:#1f4d2e,stroke:#5cc97e,color:#fff
style GUARD fill:#3d1f5a,stroke:#b06ae0,color:#fff
The discipline: never skip steps 1 and 2. A fix without a failing-then-passing test is just a hopeful guess.
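Steps 2 to 5 can be seen in miniature below. The bug is invented for illustration: a hypothetical `slugify` helper that turns repeated spaces into repeated hyphens. The buggy and fixed versions are kept side by side only so the same test can be shown failing and then passing; in a real codebase the fix would replace the old code.

```python
def slugify_buggy(title: str) -> str:
    """Version with the reported bug: each space becomes its own hyphen."""
    return title.strip().lower().replace(" ", "-")


def slugify_fixed(title: str) -> str:
    """Fix: split on any run of whitespace, then join with single hyphens."""
    return "-".join(title.lower().split())


def regression_test(slugify) -> bool:
    """The test written for this bug; it stays in the suite as a guard."""
    return slugify("Hello   World") == "hello-world"


red = regression_test(slugify_buggy)    # False: the test fails on the buggy code
green = regression_test(slugify_fixed)  # True: the same test passes after the fix
```

Seeing the test fail first is what shows it actually detects the bug. A test that was only ever green might pass for reasons unrelated to the fix.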