Architecture
Pytifex is a hybrid system that orchestrates cloud-based LLM generation with local type-checker execution and multi-phased evaluation. This page walks through every major subsystem so you can orient yourself in the codebase quickly.
System Overview
Pytifex splits work between two execution domains:
| Domain | Responsibilities |
|---|---|
| Cloud (Gemini API) | Code generation, refinement of non-divergent examples, and resolution of UNCERTAIN evaluation verdicts |
| Local machine | GitHub issue fetching, type-checker execution via subprocess, AST analysis, runtime crash detection, Hypothesis testing, and static flow analysis |
Important: Type-checker outputs are real. Every status comes from actually invoking `mypy`, `pyrefly`, `ty`, and `pyright` in a subprocess — never from LLM simulation.
┌──────────────────────────────────────────────────────────┐
│ Cloud (Gemini) │
│ ┌─────────────┐ ┌────────────┐ ┌───────────────────┐ │
│ │ Code Gen │ │ Refinement │ │ Agent Resolution │ │
│ └─────────────┘ └────────────┘ └───────────────────┘ │
└────────────────────────┬─────────────────────────────────┘
│ API calls
┌────────────────────────▼─────────────────────────────────┐
│ Local Machine │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ GitHub Issue │ │ Type Checker │ │ Multi-Tier │ │
│ │ Mining │ │ Execution │ │ Evaluation │ │
│ └──────────────┘ └──────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────┘
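The "real subprocess" guarantee above can be sketched as follows. The function name, status labels, timeout, and exit-code convention here are assumptions for illustration, not the actual `run_checkers.py` API:

```python
import subprocess

def run_checker(command: list[str], source_file: str, timeout: int = 60) -> str:
    """Invoke one type checker as a real subprocess and classify the result.

    `command` is the checker's CLI prefix, e.g. ["mypy"] or ["pyright"].
    Status names ("ok", "error", "crash", "timeout") are illustrative.
    """
    try:
        proc = subprocess.run(
            command + [source_file],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    # Exit code 1 conventionally means diagnostics were reported;
    # any other nonzero code is treated as a checker crash.
    return "error" if proc.returncode == 1 else "crash"
```

Because the status is derived from a real process exit code, no LLM output can influence it.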
Pipeline Flow
The core pipeline lives in pipeline.py and runs five sequential steps.
Step 0 — Seed Mining
Module: github_issues.py
Fetches closed bug-fix issues from upstream type-checker repositories via the GitHub REST API:
- `python/mypy`
- `facebook/pyrefly`
- `astral-sh/ty`
- `microsoft/pyright`
- `zubanls/zuban`
Only confirmed bugs are kept — issues whose `state_reason` is `"not_planned"` are filtered out. Python code blocks are extracted from issue bodies with a fenced-code-block regex. For Pyrefly, sandbox URLs are also handled by base64-decoding the encoded source.
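The fenced-code-block extraction can be sketched with a small regex; the pattern and function shape below are illustrative, not the exact implementation in `github_issues.py`:

```python
import re

# Matches triple-backtick fenced blocks, optionally tagged "python" or "py",
# anywhere in a GitHub issue body.
FENCE_RE = re.compile(r"`{3}(?:python|py)?\s*\n(.*?)`{3}", re.DOTALL)

def extract_python_blocks(body: str) -> list[str]:
    """Return the source inside each fenced code block of an issue body."""
    return [block.strip() for block in FENCE_RE.findall(body)]
```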
```python
# Simplified seed extraction logic
for issue in fetch_closed_issues(repo, label="bug"):
    if issue["state_reason"] == "not_planned":
        continue
    snippets = extract_python_blocks(issue["body"])
    seeds.extend(snippets)
```

Step 1 — Code Generation
Modules: prompts.py, agent.py
A batch of 3–5 seed examples is shown to Gemini with a request to generate new variations that are likely to trigger type-checker disagreements.
```python
# prompts.py
prompt = build_seed_based_prompt(seeds[:5])

# Falls back when no seeds are available
if not seeds:
    prompt = build_expert_prompt()
```

The `agent.py` module wraps the Gemini API client, using a Pydantic model for structured request/response handling.
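The structured-response idea can be sketched with a plain dataclass standing in for the Pydantic model; the field names and validation here are assumptions, not the actual schema in `agent.py`:

```python
from dataclasses import dataclass, field

@dataclass
class GenerationResult:
    """Stand-in for the structured model the agent parses Gemini output into.

    Field names are illustrative, not the actual agent.py schema.
    """
    code: str                       # the generated Python example
    rationale: str = ""             # model's explanation of the expected divergence
    target_checkers: list[str] = field(default_factory=list)

def parse_response(raw: dict) -> GenerationResult:
    # Enforce the minimal contract: a response must carry code.
    if not raw.get("code"):
        raise ValueError("response missing 'code' field")
    return GenerationResult(
        code=raw["code"],
        rationale=raw.get("rationale", ""),
        target_checkers=raw.get("target_checkers", []),
    )
```

Parsing into a typed model up front means malformed LLM output fails loudly at the API boundary rather than deep in the pipeline.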
Step 2 — Type Checker Execution
Module: pipeline.py (delegates to run_checkers.py)
Every generated example is run through all four checkers. The pipeline compares the resulting statuses and retains only disagreements — cases where at least two checkers produce different verdicts.
```python
statuses = run_all_checkers(example)  # e.g. {"mypy": "error", "pyrefly": "ok", ...}
if len(set(statuses.values())) > 1:
    divergent_examples.append(example)
```

Step 3 — Refinement
Module: pipeline.py
Examples that did not diverge are sent back to the LLM together with the real checker feedback. The refinement prompt encourages the model to tweak the code so that it triggers a disagreement.
```python
for attempt in range(max_refinements):
    refined = agent.generate(build_refinement_prompt(example, statuses))
    new_statuses = run_all_checkers(refined)
    if len(set(new_statuses.values())) > 1:
        divergent_examples.append(refined)
        break
```

Step 4 — Evaluation
Module: comprehensive_eval.py
All divergent examples are passed through a multi-tiered evaluation system that determines which checker is correct. See the next section for details.
Evaluation System (comprehensive_eval.py)
Evaluation proceeds phase-by-phase from highest confidence to lowest. Once a phase produces a confident verdict, later phases are skipped.
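That cascade can be sketched as a loop over (name, phase) pairs that stops at the first confident verdict; the names, return shape, and threshold below are illustrative, not the actual `comprehensive_eval.py` API:

```python
from typing import Callable, Optional

# Each phase returns (verdict, confidence), or (None, 0.0) when inconclusive.
Phase = Callable[[str], tuple[Optional[str], float]]

def evaluate(example: str, phases: list[tuple[str, Phase]],
             threshold: float = 0.9) -> tuple[str, str]:
    """Run phases in descending-confidence order and stop at the first
    verdict whose confidence clears the threshold."""
    for name, phase in phases:
        verdict, confidence = phase(example)
        if verdict is not None and confidence >= threshold:
            return name, verdict
    # Nothing was confident enough: hand off to LLM-based resolution.
    return "agent", "UNCERTAIN"
```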
Phase 0 — Oracle
Modules: oracle.py, source_analysis.py
AST-based analysis that identifies definitive PEP violations without running the code.
- `source_analysis.py` parses the source and extracts typing-rule facts (e.g., “line 12 assigns `int` to a variable annotated `str`”).
- `oracle.py` matches these findings against each checker’s diagnostics, using a line tolerance of ±5 and error-code matching.
```python
# oracle.py (simplified)
for finding in oracle_findings:
    for diag in checker_diagnostics:
        if (abs(finding.line - diag.line) <= 5
                and finding.error_code == diag.error_code):
            matched = True
```

Phase 1 — Runtime Crash Detection
Executes the source code in a sandboxed subprocess and catches runtime exceptions that signal a genuine type error:
- `TypeError`
- `KeyError`
- `AttributeError`
The tool walks the full traceback including exception chains (__cause__ / __context__) and also isolates try/except bodies to surface swallowed errors.
Confidence: 0.95–1.0
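Walking the chain through `__cause__` and `__context__` can be sketched as follows; the helper names are illustrative:

```python
TYPE_BUG_EXCEPTIONS = (TypeError, KeyError, AttributeError)

def chain(exc: BaseException):
    """Yield an exception and every exception chained behind it."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cycles in the chain
        yield exc
        # Prefer the explicit cause (`raise ... from ...`), fall back to
        # the implicit context (exception raised inside an except block).
        exc = exc.__cause__ or exc.__context__

def signals_type_bug(exc: BaseException) -> bool:
    """True if any link of the chain is one of the type-error signals."""
    return any(isinstance(e, TYPE_BUG_EXCEPTIONS) for e in chain(exc))
```

Following the chain matters because a genuine `TypeError` is often swallowed and re-raised as something generic by the example's own error handling.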
Phase 2 — Hypothesis Property-Based Testing
Module: hypothesis_tier2.py
Extracts function signatures via AST inspection, builds Hypothesis strategies from type hints, and runs @given tests to find runtime type violations.
```python
# hypothesis_tier2.py (conceptual)
sig = extract_signature(func_node)         # AST-based
strategy = build_strategy_from_hints(sig)  # maps hints → st.*

@given(strategy)
def test(args):
    func(*args)  # any TypeError ⇒ real bug
```

Targeted tests from `targeted_tests.py` are also executed in this phase.
Phase 3 — PEP Specification Compliance
Matches a curated set of PEP_RULES (regex patterns mapped to expected checker behaviour) against each checker’s output lines. Covered PEPs:
484, 526, 544, 586, 589, 591, 604, 612, 613, 634, 646, 647, 655, 673, 675, 681, 692, 695, 696, 698, 705, 742
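The rule matching can be sketched as diagnostic patterns mapped to the PEPs they enforce; the patterns and table shape below are illustrative, not the actual `PEP_RULES` contents:

```python
import re

# Illustrative subset: a diagnostic-message pattern mapped to the PEP it
# enforces. The real PEP_RULES table is larger and also encodes which
# checkers are expected to report each diagnostic.
PEP_RULES = [
    (re.compile(r"annot assign to final"), "PEP 591"),
    (re.compile(r"[Tt]yped[Dd]ict"), "PEP 589"),
]

def peps_covered(checker_output: str) -> set[str]:
    """Return the PEPs whose expected diagnostics appear in the output."""
    hits = set()
    for line in checker_output.splitlines():
        for pattern, pep in PEP_RULES:
            if pattern.search(line):
                hits.add(pep)
    return hits
```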
Phase 4 — Static Flow Analysis
Module: static_tier4.py
A collection of lightweight static checks:
| Check | Description |
|---|---|
| Import availability | Verifies that imported names actually exist in their modules |
| Variance constraints | Validates covariance / contravariance on generic parameters |
| Type narrowing flow | Traces narrowing guards (isinstance, is None, etc.) through control flow |
| Nominal type boundaries | Ensures structural types aren’t used where nominal types are required |
| Match exhaustiveness | Confirms match statements cover all variants |
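Of these, the import-availability check is the simplest to sketch. This simplified version (function name assumed, not the actual `static_tier4.py` API) combines `ast` with `importlib`:

```python
import ast
import importlib

def missing_imports(source: str) -> list[str]:
    """Return names that `from X import Y` statements reference but that
    do not actually exist in module X (a simplified availability check)."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        # Only absolute `from X import Y` statements (level == 0).
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                missing.extend(alias.name for alias in node.names)
                continue
            for alias in node.names:
                if not hasattr(module, alias.name):
                    missing.append(alias.name)
    return missing
```

A checker that errors on an import this check confirms as available (or accepts one it flags as missing) is contradicted by ground truth.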
Agent Resolution
Any example still marked UNCERTAIN after all tiers is forwarded to a Gemini API call for a final LLM-based verdict.
Disclaimer
Because Pytifex is a research tool under heavy development, the evaluation architecture may change in the future.