Evaluation System
Once Pytifex finds a disagreement — code where type checkers produce different verdicts — it needs to determine which checker is right. This page explains the multi-tiered evaluation system that establishes ground truth.
Design Philosophy
The key insight: runtime behavior is the ultimate ground truth. If code raises TypeError at runtime, any type checker that reported “OK” is definitively wrong. The evaluation system prioritizes runtime evidence over static analysis and falls back to PEP specification matching only when runtime testing is inconclusive.
All tiers always run and contribute evidence. The final verdict is determined by combining findings from all tiers, weighted by confidence.
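The page does not pin down the exact combination rule, but one minimal sketch of "all tiers contribute, weighted by confidence" might look like this (the `Finding` dataclass and `combine` helper are assumptions for illustration, not Pytifex's actual API):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    tier: int          # which phase produced the evidence
    verdict: str       # "CORRECT", "INCORRECT", or "UNCERTAIN"
    confidence: float

def combine(findings):
    """Pick the decisive (non-UNCERTAIN) finding with the highest
    confidence; fall back to UNCERTAIN when nothing is decisive."""
    decisive = [f for f in findings if f.verdict != "UNCERTAIN"]
    if not decisive:
        return Finding(tier=-1, verdict="UNCERTAIN", confidence=0.0)
    return max(decisive, key=lambda f: f.confidence)

# Runtime evidence (tier 1, confidence 1.0) outranks a PEP match (tier 3, 0.80):
best = combine([
    Finding(tier=3, verdict="CORRECT", confidence=0.80),
    Finding(tier=1, verdict="INCORRECT", confidence=1.00),
])
print(best.verdict)  # -> INCORRECT
```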
Evaluation Tiers
┌──────────────────────────────────────────────────────────────┐
│ Phase 0: Oracle — AST-based PEP violation detection │
│ Confidence: 0.85–0.95 │
├──────────────────────────────────────────────────────────────┤
│ Phase 1: Runtime Crash Detection │
│ Confidence: 0.95–1.0 │
├──────────────────────────────────────────────────────────────┤
│ Phase 2: Hypothesis Property-Based Testing │
│ Confidence: 0.85 │
├──────────────────────────────────────────────────────────────┤
│ Phase 3: PEP Specification Compliance │
│ Confidence: 0.80–0.85 │
├──────────────────────────────────────────────────────────────┤
│ Phase 4: Static Flow Analysis │
│ Confidence: 0.80 │
├──────────────────────────────────────────────────────────────┤
│ Agent Resolution: LLM-based verdict for UNCERTAIN cases │
│ Confidence: 0.70 │
└──────────────────────────────────────────────────────────────┘
Phase 0 — Oracle (oracle.py)
The oracle performs AST-based source analysis to find violations that MUST exist according to PEP specifications, independent of any checker output.
How It Works
`source_analysis.py` parses the source code and emits `SourceFinding` objects — e.g., “line 12 assigns `int` to a variable annotated `str`, violating PEP 484”:

```python
@dataclass
class SourceFinding:
    line: int
    rule_id: str        # e.g., "ASSIGN001"
    pep: int            # e.g., 484
    message: str
    confidence: float   # only findings ≥ 0.85 are used
```

`oracle.py` converts these into `OracleFinding` objects with a category (e.g., `INCOMPATIBLE_ASSIGNMENT`) and then matches them against each checker’s diagnostics. Matching uses two criteria:
- Line tolerance ±5 — the checker’s error must be within 5 lines of the oracle’s finding
- Error code matching — the checker’s error code must be in the known set for that category
```python
# Per-category error codes for each checker
CHECKER_ERROR_CODES = {
    "INCOMPATIBLE_ASSIGNMENT": {
        "mypy": ["assignment", "arg-type", "misc"],
        "pyrefly": ["bad-assignment"],
        "ty": ["invalid-assignment"],
        "zuban": ["assignment", "arg-type", "misc"],
    },
    # ... 30+ categories
}
```

Verdicts: if all oracle findings are matched → `CORRECT`. If any are missed → `INCORRECT`. If upstream checker errors blocked detection → `UNCERTAIN`.
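The line-tolerance and error-code criteria can be sketched together as follows (the helper name and the `(line, error_code)` diagnostic shape are assumptions for illustration, not Pytifex's actual API):

```python
# Trimmed copy of the per-category table shown above.
CHECKER_ERROR_CODES = {
    "INCOMPATIBLE_ASSIGNMENT": {"mypy": ["assignment", "arg-type", "misc"]},
}

def oracle_match(oracle_line, category, checker_name, diagnostics, line_tolerance=5):
    """Return True if any checker diagnostic sits within the line tolerance
    of the oracle finding AND carries a known error code for the category."""
    allowed = set(CHECKER_ERROR_CODES[category][checker_name])
    return any(
        abs(line - oracle_line) <= line_tolerance and code in allowed
        for line, code in diagnostics
    )

# A mypy "assignment" error two lines from the oracle finding matches:
print(oracle_match(12, "INCOMPATIBLE_ASSIGNMENT", "mypy", [(14, "assignment")]))  # -> True
# The same error 18 lines away does not:
print(oracle_match(12, "INCOMPATIBLE_ASSIGNMENT", "mypy", [(30, "assignment")]))  # -> False
```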
Supported Rule Categories
The oracle covers 30+ rule categories including:
| Category | Example rule IDs |
|---|---|
| Method override incompatibility | LSP001, LSP002 |
| Final reassignment / subclassing | FINAL001–FINAL004 |
| Protocol instantiation / conformance | PROTO001–PROTO003 |
| TypedDict field violations | TDICT001–TDICT004 |
| ParamSpec / Concatenate misuse | PARAMSPEC001–PARAMSPEC004 |
| TypeGuard / TypeIs misuse | TYPEGUARD001, TYPEIS001 |
| Variance violations | VARIANCE001, VARIANCE002 |
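As a concrete illustration of an AST-based oracle rule, here is a toy detector for the Final-reassignment family (the real oracle's rules are richer; the detection logic and sample source below are illustrative only):

```python
import ast

SOURCE = """
from typing import Final

MAX_RETRIES: Final = 3
MAX_RETRIES = 5  # PEP 591 violation: reassigning a Final name
"""

def find_final_reassignments(source):
    """Flag module-level reassignment of names annotated Final,
    roughly what a rule like FINAL001 would report."""
    final_names = set()
    findings = []
    for node in ast.walk(ast.parse(source)):
        # `x: Final = ...` or `x: Final[int] = ...`
        if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            ann = ast.unparse(node.annotation)
            if ann == "Final" or ann.startswith("Final["):
                final_names.add(node.target.id)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id in final_names:
                    findings.append((node.lineno, target.id))
    return findings

print(find_final_reassignments(SOURCE))  # -> [(5, 'MAX_RETRIES')]
```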
Phase 1 — Runtime Crash Detection
Executes the generated code and catches type-related runtime exceptions:
```python
TYPE_ERROR_EXCEPTIONS = (TypeError, KeyError, AttributeError)
```

Full Traceback Walking
Phase 1 doesn’t just catch the top-level exception — it walks the entire traceback and inspects exception chains (__cause__ / __context__):
```python
def _collect_chained_exceptions(exc):
    chain = []
    current = exc
    while current is not None:
        chain.append(current)
        current = current.__cause__ or current.__context__
    chain.reverse()  # root-cause first
    return chain
```

Isolating Swallowed Errors
Code often contains try/except blocks that catch and hide type errors. Phase 1 uses AST analysis to find try bodies and re-executes them in isolation:
```python
try_bodies = _extract_try_bodies(source_code)
for start_line, end_line, body_source in try_bodies:
    isolated_bugs = _run_isolated_code(body_source)
    # Bugs found here were swallowed by the original try/except
```

Verdict Logic
| Runtime result | Checker said “ok” | Checker said “error” |
|---|---|---|
| Code crashes (TypeError, etc.) | INCORRECT (false negative) | CORRECT |
| Code runs fine | UNCERTAIN | UNCERTAIN (possible false positive) |
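For a toy version of the pattern the isolation step targets, consider a type bug that a broad `except` swallows, so a plain top-level run never crashes:

```python
def add_lengths(items):
    total = 0
    for item in items:
        total += len(item)  # TypeError when item is an int
    return total

swallowed = None
try:
    add_lengths(["ab", 3])  # bug: 3 has no len()
except Exception as exc:
    swallowed = exc  # crash hidden from a plain top-level run

# Re-running just the try body in isolation, as Phase 1 does, exposes the bug:
print(type(swallowed).__name__)  # -> TypeError
```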
Phase 2 — Hypothesis Property-Based Testing (hypothesis_tier2.py)
Uses Hypothesis to generate random inputs matching declared type annotations and run them through the actual code.
How It Works
Extract function signatures — Parse the AST to find all user-defined functions and methods.
Build a live namespace — Execute the source with
__name__ != "__main__"to get real callable objects.Introspect signatures — Use
inspect.signature()andtyping.get_type_hints()to resolve parameter types.Build Hypothesis strategies — Map type hints to
hypothesis.strategies:int→st.integers()str→st.text()list[int]→st.lists(st.integers())Optional[str]→st.none() | st.text()- Custom classes →
st.builds(ClassName, ...)
Run
@giventests — Call the function with generated inputs. AnyTypeError,KeyError, orAttributeErroris a confirmed type bug.
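Steps 2 and 3 can be sketched with the standard library alone (the sample source and variable names here are illustrative, not Pytifex's code):

```python
import inspect
import typing

SOURCE = """
def repeat(word: str, times: int) -> str:
    return word * times
"""

# Step 2: build a live namespace by executing the generated source.
namespace = {}
exec(compile(SOURCE, "<generated>", "exec"), namespace)

# Step 3: introspect the resulting callable's signature and resolved hints.
func = namespace["repeat"]
params = list(inspect.signature(func).parameters)
hints = typing.get_type_hints(func)

print(params)          # -> ['word', 'times']
print(hints["times"])  # -> <class 'int'>
```

From here, each resolved hint would be mapped to a Hypothesis strategy as listed above.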
Targeted Tests
In addition to Hypothesis fuzzing, targeted_tests.py generates specific test cases for known patterns (e.g., calling a protocol method with wrong argument types).
Phase 3 — PEP Specification Compliance
Matches a curated list of PEP_RULES against each checker’s output. Each rule is a regex pattern paired with the expected behavior:
```python
@dataclass
class PEPRule:
    pep_number: int
    pattern: str            # regex to match in checker output
    rule_description: str
    correct_behavior: str   # "error" or "ok"
```

Example Rules
```python
PEPRule(
    pep_number=484,
    pattern=r"NewType.*float",
    rule_description="float is a valid base for NewType (PEP 484)",
    correct_behavior="ok",
)

PEPRule(
    pep_number=591,
    pattern=r"[Cc]annot (?:assign|override).*Final",
    rule_description="Final variables cannot be reassigned (PEP 591)",
    correct_behavior="error",
)
```

Noise Filtering
Pytifex skips lines that are notes, info messages, or import-related errors:
```python
for text_line in output.splitlines():
    lower = text_line.lower()
    if "note:" in lower or "info[" in lower:
        continue  # skip noise
    if MODULE_IMPORT_RE.search(text_line):
        continue  # skip import errors
    if re.search(rule.pattern, text_line, re.IGNORECASE):
        matched = True
```

Covered PEPs
484, 526, 544, 586, 589, 591, 604, 612, 613, 634, 646, 647, 655, 673, 675, 681, 692, 695, 696, 698, 705, 742
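Putting Phase 3 together, matching one of the example rules against sample checker output might look like this (the sample output line is fabricated for illustration):

```python
import re
from dataclasses import dataclass

@dataclass
class PEPRule:
    pep_number: int
    pattern: str
    rule_description: str
    correct_behavior: str  # "error" or "ok"

rule = PEPRule(
    pep_number=591,
    pattern=r"[Cc]annot (?:assign|override).*Final",
    rule_description="Final variables cannot be reassigned (PEP 591)",
    correct_behavior="error",
)

output = """\
note: consider narrowing the type
main.py:4: error: Cannot assign to final name "MAX_RETRIES"
"""

matched = False
for text_line in output.splitlines():
    lower = text_line.lower()
    if "note:" in lower or "info[" in lower:
        continue  # skip noise, as in the filtering loop above
    if re.search(rule.pattern, text_line, re.IGNORECASE):
        matched = True

# The checker emitted an error and the rule expects "error", so this
# checker agrees with the PEP here.
print(matched)  # -> True
```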
Phase 4 — Static Flow Analysis (static_tier4.py)
A collection of lightweight AST-based checks that supply additional evidence when the earlier tiers leave the verdict unresolved:
| Analysis | What it checks |
|---|---|
| Import availability | Whether imported typing features exist in the claimed module/Python version |
| Variance constraints | Covariant TypeVar used in contravariant position, etc. |
| Type narrowing flow | Whether isinstance, TypeIs, TypeGuard narrowing is correctly applied |
| Nominal type boundaries | NewType and TypeAliasType nominal vs. structural behavior |
| Match exhaustiveness | Whether match/case covers all variants (unreachable case _) |
| Lambda inference | Complex inference limitations with lambdas |
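The import-availability check, for instance, reduces to asking whether the running interpreter's `typing` module actually exports a feature; a minimal sketch (helper name assumed, not Pytifex's actual function):

```python
import importlib

def typing_name_available(name):
    """Does the `typing` module on this interpreter export this feature?
    A checker that rejects an import of an existing name is suspect, and
    vice versa."""
    return hasattr(importlib.import_module("typing"), name)

print(typing_name_available("Protocol"))             # -> True (Python >= 3.8)
print(typing_name_available("NoSuchTypingFeature"))  # -> False
```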
Agent Resolution
Any verdict still UNCERTAIN after all phases is forwarded to a Gemini API call for LLM-based resolution:
```python
prompt = f"""
Source code: {source_code}
Checker: {checker_name}
Checker output: {checker_output}
Other checkers' outputs: {other_outputs}

Determine whether this type checker's behavior is CORRECT, INCORRECT, or UNCERTAIN.
Cite the specific PEP section that supports your verdict.
"""
```

The response is parsed as JSON with `verdict`, `reason`, `pep_citation`, and `confidence` fields.
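A defensive sketch of that parsing step, assuming only the four field names described above (the helper and sample response are illustrative, not Pytifex's code):

```python
import json

# Hypothetical raw LLM response following the documented field names.
raw = """{"verdict": "INCORRECT",
          "reason": "mypy missed the Final reassignment",
          "pep_citation": "PEP 591",
          "confidence": 0.7}"""

def parse_agent_response(text):
    """Parse the agent's JSON reply, falling back to UNCERTAIN on
    anything malformed or out of vocabulary."""
    data = json.loads(text)
    verdict = data.get("verdict", "UNCERTAIN")
    if verdict not in {"CORRECT", "INCORRECT", "UNCERTAIN"}:
        verdict = "UNCERTAIN"
    return (verdict, data.get("reason", ""),
            data.get("pep_citation", ""), float(data.get("confidence", 0.0)))

print(parse_agent_response(raw)[0])  # -> INCORRECT
```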
Running Evaluation
```shell
# Run full pipeline (generation + evaluation)
uv run main.py

# Evaluate existing results only
uv run main.py eval

# Use legacy LLM-based evaluation instead
uv run main.py --eval-method all
```

Evaluation Output
Results are saved to evaluation_comprehensive.json:
```json
{
  "method": "comprehensive_tiered",
  "tier_distribution": {"0": 3, "1": 2, "2": 5, "3": 4, "4": 1},
  "summary": {
    "mypy": {"correct": 8, "incorrect": 4, "uncertain": 3},
    "pyrefly": {"correct": 6, "incorrect": 5, "uncertain": 4}
  },
  "evaluations": [
    {
      "filename": "protocol-defaults-1.py",
      "tier_reached": 1,
      "checker_verdicts": {
        "mypy": {"verdict": "INCORRECT", "tier": 1, "confidence": 1.0},
        "pyrefly": {"verdict": "CORRECT", "tier": 1, "confidence": 1.0}
      }
    }
  ]
}
```

Verdict Meanings
| Verdict | Meaning |
|---|---|
| CORRECT | Checker accurately identified all issues (or correctly reported no issues) |
| INCORRECT | Checker missed a real bug (false negative) or flagged correct code (false positive) |
| UNCERTAIN | Not enough evidence to determine correctness |
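A downstream consumer can rebuild the per-checker summary from the `evaluations` array alone; a sketch, assuming only the schema shown above:

```python
import json

report = json.loads("""
{
  "evaluations": [
    {"filename": "protocol-defaults-1.py",
     "checker_verdicts": {
       "mypy": {"verdict": "INCORRECT", "tier": 1, "confidence": 1.0},
       "pyrefly": {"verdict": "CORRECT", "tier": 1, "confidence": 1.0}}}
  ]
}
""")

summary = {}
for entry in report["evaluations"]:
    for checker, result in entry["checker_verdicts"].items():
        counts = summary.setdefault(
            checker, {"correct": 0, "incorrect": 0, "uncertain": 0})
        counts[result["verdict"].lower()] += 1

print(summary["mypy"]["incorrect"])  # -> 1
```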