Bug Mining & Mutation
Pytifex’s core innovation is bug-seeded mutation: instead of asking an LLM to invent type-checking edge cases from scratch, it mines real bugs from type checker repositories and uses them as seeds to generate targeted variations. This page explains the mining, mutation, and refinement pipeline from a developer perspective.
Why Mine Real Bugs?
Type checker bugs cluster around specific patterns — TypedDict inheritance, ParamSpec decorators, Protocol variance. Real GitHub issues represent proven edge cases that already exposed a checker’s weakness. Using them as seeds makes the LLM’s code generation dramatically more effective at producing disagreements than prompting from scratch.
Real Bug Reports          LLM generates          Type Checkers test
(proven edge cases)   →   NEW variations    →    the NEW code
                          (not copies)
The pipeline never tests the original GitHub code directly. The seeds are inspiration; the tested code is always freshly generated.
Seed Mining (github_issues.py)
Repositories
Seeds are fetched from five type checker repositories:
REPOS = {
    "mypy": "python/mypy",
    "pyrefly": "facebook/pyrefly",
    "ty": "astral-sh/ty",
    "pyright": "microsoft/pyright",
    "zuban": "zubanls/zuban",
}

NOTE: Pytifex does not evaluate pyright, but its issue tracker is a rich source of code examples.
Filtering to Confirmed Bugs
Not all closed issues are real bugs — many are “won’t fix”, duplicates, or feature requests. Pytifex filters using the GitHub state_reason field:
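def is_confirmed_bug(issue: dict) -> bool:
    # Open issues have not been resolved yet, so they are not confirmed.
    if issue.get("state") != "closed":
        return False
    # "not_planned" marks issues closed without a fix (won't fix and similar).
    if issue.get("state_reason") == "not_planned":
        return False
    return True

Only issues with state_reason == "completed" (or null for older issues) pass the filter.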
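As a quick illustration, the filter can be applied to a handful of issue records. The sample dictionaries below are made up for this sketch and carry only the two fields the function reads; real GitHub API responses contain many more.

```python
def is_confirmed_bug(issue: dict) -> bool:
    if issue.get("state") != "closed":
        return False
    if issue.get("state_reason") == "not_planned":
        return False
    return True

# Hypothetical issue records (not real GitHub data).
sample_issues = [
    {"number": 1, "state": "closed", "state_reason": "completed"},
    {"number": 2, "state": "closed", "state_reason": "not_planned"},
    {"number": 3, "state": "closed", "state_reason": None},  # older issue
    {"number": 4, "state": "open", "state_reason": None},
]

confirmed = [issue["number"] for issue in sample_issues if is_confirmed_bug(issue)]
print(confirmed)  # [1, 3]
```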
Classifying False Positives / Negatives
Issues are further classified by their labels:
FALSE_POSITIVE_LABELS = ["false-positive", "false positive", "spurious", ...]
FALSE_NEGATIVE_LABELS = ["false-negative", "false negative", "missed-error", ...]

This classification is passed to the LLM prompt so it knows whether the seed represents a case where a checker over-reported (false positive) or under-reported (false negative).
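One way the label lookup might work is sketched below. The helper name classify_issue, the case-insensitive substring matching, and the "unknown" fallback are assumptions made for illustration; the lists contain only the labels quoted above, since the remaining entries are elided in the source.

```python
# Only the labels quoted above; the full lists are not shown in this page.
FALSE_POSITIVE_LABELS = ["false-positive", "false positive", "spurious"]
FALSE_NEGATIVE_LABELS = ["false-negative", "false negative", "missed-error"]


def classify_issue(label_names: list[str]) -> str:
    """Hypothetical helper: map an issue's label names to a bug category."""
    lowered = [name.lower() for name in label_names]
    # Assumption: a label matches if it contains one of the marker strings.
    if any(marker in name for name in lowered for marker in FALSE_POSITIVE_LABELS):
        return "false_positive"
    if any(marker in name for name in lowered for marker in FALSE_NEGATIVE_LABELS):
        return "false_negative"
    return "unknown"


print(classify_issue(["bug", "False-Positive"]))  # false_positive
print(classify_issue(["documentation"]))          # unknown
```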
Extracting Code from Issues
Code is extracted from issue bodies using two methods:
1. Fenced code blocks: standard markdown ```python blocks:
pattern = r"```(?:python|py)\n(.*?)```"
matches = re.findall(pattern, text, re.DOTALL | re.IGNORECASE)

2. Pyrefly sandbox URLs: Pyrefly issues often link to pyrefly.org/sandbox/?project=<base64>. Pytifex decodes the base64 payload and extracts the code:
sandbox_pattern = r'https://pyrefly\.org/sandbox/\?project=([A-Za-z0-9%+/=]+)'
# URL-decode → base64-decode → JSON parse → extract code

Snippets shorter than 50 characters are discarded as not useful.
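The two extraction methods could be combined roughly as follows. This is a sketch under assumptions: the function names are invented here, the length check is applied to the raw snippet, and the sandbox decoder stops at the parsed JSON object because the payload's internal layout is not described on this page.

```python
import base64
import json
import re
from urllib.parse import unquote

MIN_SNIPPET_CHARS = 50  # threshold stated above

# The fence marker is built at runtime only so that this example
# does not contain a literal triple backtick.
FENCE = "`" * 3
FENCED_PATTERN = FENCE + r"(?:python|py)\n(.*?)" + FENCE
SANDBOX_PATTERN = r"https://pyrefly\.org/sandbox/\?project=([A-Za-z0-9%+/=]+)"


def extract_fenced_snippets(text: str) -> list[str]:
    """Return fenced Python snippets that meet the minimum length."""
    matches = re.findall(FENCED_PATTERN, text, re.DOTALL | re.IGNORECASE)
    return [m for m in matches if len(m) >= MIN_SNIPPET_CHARS]


def decode_sandbox_payloads(text: str) -> list[object]:
    """URL-decode, base64-decode, and JSON-parse every sandbox link in text."""
    payloads = []
    for encoded in re.findall(SANDBOX_PATTERN, text):
        raw = base64.b64decode(unquote(encoded))
        # Where the code lives inside this object is not specified here.
        payloads.append(json.loads(raw))
    return payloads
```

Extracting the code from each decoded payload would be the final step; it depends on the sandbox's JSON schema.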
Rate Limits
Without a GitHub token, you get 60 API requests per hour. Setting GITHUB_TOKEN raises this to 5,000:
export GITHUB_TOKEN=ghp_your_token

Or skip GitHub entirely with --no-github.
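For readers writing their own mining script, a token from the environment is typically attached as an Authorization header. The helper below is a hypothetical sketch, not Pytifex's actual code; it builds the headers only and makes no network request.

```python
import os
from typing import Mapping, Optional


def github_headers(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Hypothetical helper: build GitHub API headers, authenticated if possible."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get the higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


print("Authorization" in github_headers({}))                        # False
print("Authorization" in github_headers({"GITHUB_TOKEN": "ghp_x"}))  # True
```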
Code Generation (Mutation)
Seed-Based Prompting
The primary generation strategy (build_seed_based_prompt in prompts.py) shows 3–5 real bug examples to Gemini and asks for variations:
prompt = f"""
## REAL BUG EXAMPLES FROM TYPE CHECKER ISSUES:
{seeds_text}
## YOUR TASK:
Using these real bugs as inspiration, generate {num_variations}
NEW Python code examples that:
1. Are VARIATIONS or EXTENSIONS of the patterns shown above
2. Target subtle type system edge cases likely to cause checker disagreements
3. Are self-contained and runnable (include all imports)
## STRATEGY FOR GENERATING DIVERGENCES:
- If a seed shows a false positive in mypy, create a similar case
that other checkers also get wrong
- If a seed shows a false negative, create variations that test
the boundaries of what gets caught
- Combine patterns: e.g., TypedDict + Protocol, ParamSpec + classmethod
"""

Every generated example must reference a seed_issue (e.g., python/mypy#12345). Examples without a valid seed reference are skipped during filtering.
Fallback Prompting
When no GitHub seeds are available (e.g., --no-github), the pipeline falls back to build_expert_prompt, which uses the divergence pattern descriptions from patterns.py instead of real issues.
Seed Rotation
To avoid generating repetitive code, the pipeline rotates through seeds across attempts:
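start_idx = (attempt - 1) * 3 % len(seed_examples)
# The slice does not wrap around, so a batch that starts near the end
# of the list may contain fewer than 5 seeds.
batch_seeds = seed_examples[start_idx:start_idx + 5]
Disagreement Detection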
After each example is generated, all four checkers run on it:
results = run_all_checkers(example.code)
# results = {"mypy": CheckerResult("ok", ...), "pyrefly": CheckerResult("error", ...), ...}
def has_disagreement(results):
    statuses = [r.status for r in results.values()]
    return len(set(statuses)) > 1  # at least one checker disagrees

If all four checkers agree (every status is "ok", or every status is "error"), the example is a candidate for refinement.
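The check can be exercised in isolation with a stand-in result type. The CheckerResult dataclass below is a minimal assumption (only its status attribute appears in the snippet above; the output field and the sample message are invented), and the results are hand-written rather than produced by real checkers.

```python
from dataclasses import dataclass


@dataclass
class CheckerResult:
    status: str       # "ok" or "error"
    output: str = ""  # hypothetical field for the raw checker output


def has_disagreement(results: dict[str, CheckerResult]) -> bool:
    statuses = [r.status for r in results.values()]
    return len(set(statuses)) > 1


# All four checkers accept the code: no divergence.
agree = {name: CheckerResult("ok") for name in ("mypy", "pyrefly", "ty", "zuban")}
# One checker reports an error: a divergence (message is made up).
split = dict(agree, pyrefly=CheckerResult("error", "bad-assignment"))

print(has_disagreement(agree))  # False
print(has_disagreement(split))  # True
```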
Refinement Loop
Non-divergent examples are sent back to the LLM with the real checker outputs as feedback:
def build_refinement_prompt(code, checker_results, seed_example):
    return f"""
The following Python code was tested with all 4 type checkers
but they ALL AGREED — meaning this is NOT a useful divergence example.
## ACTUAL CHECKER RESULTS (all agree):
{results_text}
## YOUR TASK:
Modify this code MINIMALLY to create a REAL divergence where at least
one checker disagrees with the others.
STRATEGIES:
1. If all passed: Add a subtle type error that only some checkers catch
2. If all failed: Fix the obvious error but keep a subtle edge case
3. Change the typing pattern slightly (add Protocol, use TypeGuard, ...)
"""

The refined code is re-checked. This loop repeats up to max_refinements times (default: 2). Refined examples are marked as DISAGREEMENT (refined) in the output.
Controlling the Pipeline
| Option | Effect |
|---|---|
| --num-examples N | Stop after finding N disagreements |
| --batch-size N | Generate N examples per LLM call |
| --max-attempts N | Max generation rounds before giving up |
| --max-refinements N | Max refinement passes per non-divergent example |
| --no-github | Use pattern-based generation only (no seed mining) |
| --model gemini-2.5-pro | Use a more capable model for better hit rate |