The Fine Art of a Minimal Reproducible Example
Two hours into a bug, what you're actually shipping is a GitHub issue, a JIRA ticket, or a Slack thread: symptoms, hunches, pasted logs, then one more file because someone asked nicely. Each round trip tugs another detail out of your repo while the runnable version still lives mostly on your machine. Then a follow-up makes you spell out the repro steps you sort of skipped the first time, and halfway through your reply the bug stops looking fuzzy. Nobody merged a fix for you. You just finally described it clearly enough to see it yourself.
That loop is about as expensive as debugging gets, and almost none of it is mandatory. The antidote is the minimal reproducible example (MRE): the smallest, most self-contained piece of code that reliably triggers the problem you're trying to explain. Being kind to whoever reads your ticket is a nice side effect. The main payoff is that your own picture of the break gets sharper. Most of the time you'll see the answer yourself before you hit send.
A Diagnostic Trilogy
An MRE has exactly three properties. Violate any one of them and you no longer have an MRE: you have either a code dump or a non-starter.
Minimal
Strip the code down to the smallest case that still triggers the issue. Remove authentication helpers, logging setup, database connections, unrelated configuration, and everything else that isn't directly involved in the failure. If you can delete a line and the bug still occurs, delete it.
This is harder than it sounds when you're deep in a problem. Everything feels relevant. It's not. The act of removing code is itself a debugging act: each deletion either moves you closer to the root cause or proves that section of code isn't involved.
Reproducible
The example must trigger the same behavior every time, in any environment, for anyone who runs it. "It only happens on my machine" isn't an MRE: it's still a hypothesis. Track down the conditions that consistently trigger the issue and make those conditions explicit in the example.
Your example shouldn't depend on:
- External services that require credentials or network access
- Files or state that exist only on your machine
- Configuration loaded from environment variables that aren't shown
- A sequence of prior steps that must be completed first
If external state is genuinely part of the problem, substitute the smallest possible local stand-in: a hardcoded string, an in-memory mock, or a local test fixture.
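Suppose the failure seems to involve an API response. A hardcoded payload stands in for the live service, so anyone can run the repro without credentials or network access (a sketch; the payload shape is hypothetical):

```python
import json

# Local stand-in for the real API response -- no network, no credentials.
payload = '{"count": "3"}'  # hypothetical: the service returns count as a string

data = json.loads(payload)
print(data["count"] + 1)  # TypeError: can only concatenate str (not "int") to str
```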
Example
An example is a program that runs. Not a fragment, not a diff, not pseudocode: a complete, standalone piece of code that anyone can paste into a clean environment and execute immediately. Include imports. Include dependency versions. Include the invocation. Remove every reason for someone to say "I can't run this."
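The shape matters as much as the content. Even with a deliberately trivial stand-in for the bug, a complete report looks something like this (filename and versions are illustrative):

```python
# repro.py -- Python 3.12, standard library only.
# Run with: python repro.py
# Expected: 0.3
# Observed: 0.30000000000000004

print(0.1 + 0.2)
```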
The Debug That Happens Before You Ask
Creating an MRE is a more reliable debugging technique than most developers give it credit for. The process forces you to form and test hypotheses. You delete a block and the problem disappears: that block was involved. You delete a different block and the problem persists: that block isn't. You replace a library call with a direct implementation and the behavior changes: the issue was in the library, not your code.
This is the rubber duck effect scaled up and made systematic. Explaining a problem to an inanimate object forces clear articulation, and clear articulation surfaces the gaps in your reasoning. Constructing an MRE does the same thing, but with code instead of words.
The Disappearing Bug
Plenty of bugs show themselves while you're carving an MRE. You close the ticket because you already found the cause. That still counts as a win: you understand what broke instead of only muting the symptom.
Building One: The Process in Four Steps
Start from a blank file. Not a copy of your project with things deleted: a new file from scratch. Rebuild only what is necessary to trigger the issue. This discipline prevents you from dragging irrelevant code along through the back door.
Add the minimum until it fails. Introduce just enough structure to trigger the behavior. Run it after each addition. Stop adding the moment the issue appears.
Remove everything that isn't the bug. Go the other direction. Delete functions, variables, imports, and configuration blocks one at a time. Run after each removal. If removing something makes the issue disappear, put it back. Everything else goes.
Pin and document your environment. Record the exact versions of every tool and dependency involved. What looks like a logic bug can be a version regression. If you're not sure whether version matters, include it anyway: let the person helping you decide.
When 50 Lines Won't Get Shorter
If you genuinely can't reduce your example below a certain threshold, that's itself meaningful diagnostic information: the problem requires that level of complexity to manifest. Say so explicitly. It narrows the search space even before anyone reads the code.
Five Examples Across Languages and Tools
Each example below follows the same pattern: a production scenario with too much noise to diagnose cleanly, then the MRE that isolates the exact issue. Skim the "before" code for the friction, then watch what stripping away the extras reveals.
The Scenario: Picture a Go service that processes a list of tasks concurrently. The full-fat version connects to a database, sets up an HTTP client, initializes logging and metrics, and launches a goroutine per task. Every goroutine looks like it's processing the same task, even though the loop reads fine on inspection.
```go
package main

import (
    "database/sql"
    "log"
    "net/http"
    "sync"
    "time"

    _ "github.com/lib/pq"
)

type Task struct {
    ID   int
    Name string
}

func fetchTasks(db *sql.DB) []Task {
    // ... 40 lines of query logic ...
    return []Task{{ID: 1, Name: "alpha"}, {ID: 2, Name: "beta"}, {ID: 3, Name: "gamma"}}
}

func processTask(t Task, client *http.Client, logger *log.Logger) error {
    // ... 80 lines of HTTP calls, retries, metrics ...
    logger.Printf("processing task %d: %s", t.ID, t.Name)
    return nil
}

func main() {
    db, _ := sql.Open("postgres", "host=localhost dbname=tasks")
    client := &http.Client{Timeout: 10 * time.Second}
    logger := log.Default()

    tasks := fetchTasks(db)

    var wg sync.WaitGroup
    for _, t := range tasks {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if err := processTask(t, client, logger); err != nil {
                logger.Printf("error: %v", err)
            }
        }()
    }
    wg.Wait()
}
```
The output shows gamma processed three times. Database query? Goroutine pool logic? The processTask implementation? With this much code in the way, every guess costs you time.
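Rebuilding from a blank file, the repro needs nothing from the service: no database, no HTTP client, no logger. A sketch of the reduced version (the slice and variable names here are illustrative):

```go
package main

import (
    "fmt"
    "sync"
)

func main() {
    items := []string{"alpha", "beta", "gamma"}

    var wg sync.WaitGroup
    for _, item := range items {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // The closure reads the loop variable itself, not a copy.
            fmt.Println("processing:", item)
        }()
    }
    wg.Wait()
}
```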
The database connection, HTTP client, and logging setup are just noise. The bug lives in a few lines. Each goroutine closes over the loop variable item, which shared one storage location across iterations before Go 1.22. By the time any goroutine runs, the loop has finished and item holds its final value: "gamma". On Go 1.22 and later, each iteration gets its own item, so you'll see alpha, beta, and gamma once each (order may vary). The output above still matches Go 1.21 and earlier, and the closure pitfall still shows up in older code.
The Fix: Pass item as a parameter to the goroutine function literal.
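Sketched against the reduced example (on Go 1.22 and later the original form is already safe):

```go
for _, item := range items {
    wg.Add(1)
    go func(item string) { // item is now a per-iteration copy
        defer wg.Done()
        fmt.Println("processing:", item)
    }(item)
}
wg.Wait()
```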
The Scenario: You've got a data pipeline that runs in two successive batches. Each batch reads a CSV, filters rows, and funnels results through a helper. After the second batch runs, the result set somehow contains rows from both batches. The real thing spans several files: CSV loading, configuration, row validation, and result handling.
```python
import csv
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_config(path: str) -> dict:
    # ... config loading logic ...
    return {"output_dir": "/tmp/results", "batch_size": 100}

def process_row(row: dict, config: dict) -> Optional[dict]:
    # ... validation, transformation, enrichment ...
    if row.get("active") == "true":
        return {"id": row["id"], "name": row["name"]}
    return None

def append_result(result: dict, accumulator: list = []) -> list:
    accumulator.append(result)
    return accumulator

def run_pipeline(input_file: str) -> None:
    config = load_config("config.yaml")
    results = []
    with open(input_file) as f:
        reader = csv.DictReader(f)
        for row in reader:
            processed = process_row(row, config)
            if processed:
                results = append_result(processed)
    logger.info("Collected %d results", len(results))

if __name__ == "__main__":
    run_pipeline("data_batch_1.csv")
    run_pipeline("data_batch_2.csv")
```
The second run reports more results than the second CSV actually contains, and some records clearly belong to the first batch. CSV loading? Config? Row validation logic? You won't spot the answer at this scale.
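Starting from a blank file, the whole repro fits in a handful of lines; a sketch with hardcoded values standing in for the CSV rows:

```python
def append_result(result, accumulator=[]):
    accumulator.append(result)
    return accumulator

print(append_result("a"))  # ['a']
print(append_result("b"))  # ['a', 'b'] -- 'a' survived into the second call
```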
The CSV loading, config parsing, and logging are just noise. The bug's sitting in the function signature. Default argument values in Python are evaluated once when the function is defined, not each time it's called. The [] is a single list object that lives for the lifetime of the module. Every call that doesn't pass an explicit accumulator is appending to the same list.
The Fix: Use None as the sentinel and initialize the list inside the function.
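One way to write it, applied to the reduced example:

```python
def append_result(result, accumulator=None):
    if accumulator is None:
        accumulator = []  # created at call time, so each call gets a fresh list
    accumulator.append(result)
    return accumulator

print(append_result("a"))  # ['a']
print(append_result("b"))  # ['b'] -- no leftover state
```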
The Scenario: You're cloning a Linux VM from a template in a vSphere-backed Terraform stack. The configuration sprawls across modules for networking, storage, and compute, with remote state in an S3 backend and dozens of resources. terraform apply dies with a guest OS customization error, even though the template and guest ID look fine in vCenter.
```hcl
# main.tf (condensed from a 600-line multi-module configuration)

module "network" {
  source     = "./modules/network"
  datacenter = var.datacenter
  # ...
}

module "storage" {
  source    = "./modules/storage"
  datastore = var.datastore
  # ...
}

resource "vsphere_virtual_machine" "app_server" {
  name             = "app-server-01"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = module.storage.datastore_id
  num_cpus         = 4
  memory           = 8192
  guest_id         = data.vsphere_virtual_machine.template.guest_id
  firmware         = data.vsphere_virtual_machine.template.firmware

  network_interface {
    network_id = module.network.portgroup_id
  }

  disk {
    label = "disk0"
    size  = data.vsphere_virtual_machine.template.disks.0.size
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.template.id

    customize {
      linux_options {
        host_name = "app-server-01"
        domain    = var.domain
      }
      network_interface {}
    }
  }
}
```
The error's buried under module output, state refresh logs, and provider diagnostics. The signal's in there somewhere, but you're wading through output that has nothing to do with the failing resource.
```hcl
terraform {
  required_providers {
    vsphere = {
      source  = "hashicorp/vsphere"
      version = "~> 2.11"
    }
  }
}

provider "vsphere" {
  vsphere_server       = var.vsphere_server
  user                 = var.vsphere_user
  password             = var.vsphere_password
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "dc" {
  name = "dc-01"
}

data "vsphere_datastore" "ds" {
  name          = "datastore-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_compute_cluster" "cluster" {
  name          = "cluster-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "network" {
  name          = "VM Network"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_virtual_machine" "template" {
  name          = "ubuntu-22.04-template"
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "vm" {
  name             = "mre-test-vm"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 2
  memory           = 2048
  guest_id         = data.vsphere_virtual_machine.template.guest_id
  firmware         = data.vsphere_virtual_machine.template.firmware

  network_interface {
    network_id = data.vsphere_network.network.id
  }

  disk {
    label = "disk0"
    size  = data.vsphere_virtual_machine.template.disks.0.size
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.template.id

    customize {
      linux_options {
        host_name = "mre-test-vm"
        domain    = "example.com"
      }
      network_interface {}
    }
  }
}

variable "vsphere_server" { type = string }
variable "vsphere_user" { type = string }
variable "vsphere_password" {
  type      = string
  sensitive = true
}
```
With the module hierarchy, remote state, and unrelated resources stripped away, the error and the failing resource configuration sit side by side. OS customization needs two things to succeed: VMware Tools in the template, and a vSphere service account with the Virtual machine.Provisioning.Customize privilege. Neither one jumps out from a 600-line multi-module configuration; both are easy to check against a 50-line MRE.
This is still an infrastructure MRE: you'll need a reachable vSphere environment, inventory names that match the data sources, and provider credentials you're supplying on purpose. The point isn't to eliminate the lab; it's to strip unrelated Terraform so the failure and the failing resource stay in the same view.
Terraform MREs: Inline Everything
Replace -var-file references, remote state backends, and module calls with inline variable blocks and hardcoded values. Remove every resource that isn't the one failing. You want a configuration you can run with terraform init && terraform apply in a clean directory against a fresh state file.
The Scenario: Your PowerCLI script filters VMs by name prefix and takes action on the matches. It's got audit logging, email notifications, error handling, and the polite disconnect at the end. The Where-Object filter keeps returning zero VMs when you'd expect it to match every VM with the web- prefix.
```powershell
param (
    [string]$VIServer = "vcenter.example.com",
    [string]$Cluster  = "Production",
    [string]$Prefix   = "web-"
)

Import-Module VMware.PowerCLI

function Write-AuditLog {
    param([string]$Message)
    Add-Content -Path "C:\audit\powercli.log" -Value "$(Get-Date) $Message"
}

Connect-VIServer -Server $VIServer -Credential (Get-Credential)

$cluster = Get-Cluster -Name $Cluster
$vms = Get-VM -Location $cluster | Where-Object { $_.Name -like $Prefix }

foreach ($vm in $vms) {
    Write-AuditLog "Processing: $($vm.Name)"
    # ... 60 more lines ...
}

Disconnect-VIServer -Server $VIServer -Confirm:$false
```
Zero VMs. The audit log stays empty. Is Get-VM returning anything? Wrong cluster reference? A scope issue with $Prefix? Once VMware's in the story, you're stuck ruling out connectivity and permissions before you even reach the filter expression, and those have nothing to do with the actual bug.
-like in PowerShell only does wildcard matching when you supply the wildcard. The pattern "web-" matches the literal string "web-" with nothing following it. No VM name is exactly that string, so Where-Object isn't wrong to filter everything out.
You can test the comparison directly:

```powershell
"web-01" -like "web-"   # False: matches the literal string only
"web-01" -like "web-*"  # True: * matches any suffix
```
The Fix: Append * to the pattern: $_.Name -like "$Prefix*".
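Dropped back into the production script, the filter line becomes:

```powershell
$vms = Get-VM -Location $cluster | Where-Object { $_.Name -like "$Prefix*" }
```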
The Scenario: Your playbook uses Ansible's default filter to fall back to a safe value when a configuration variable isn't explicitly set. Locally, the fallback behaves. In staging, it doesn't: the variable's defined as an empty string in the inventory, and default quietly ignores the fallback value.
```yaml
# site.yml
- name: Deploy application
  hosts: app_servers
  vars_files:
    - group_vars/all.yml
    - group_vars/staging.yml
  roles:
    - role: common
    - role: app_config
    - role: app_deploy
    - role: monitoring
```

```yaml
# roles/app_config/tasks/main.yml
- name: Show resolved environment
  ansible.builtin.debug:
    msg: "Deploying to: {{ app_env | default('production') }}"
```
The debug output shows "Deploying to: " in staging instead of "Deploying to: production". With roles, var files, and templates all in the mix, it isn't obvious which layer owns the variable or why the fallback never fires.
```yaml
- name: MRE for default filter with empty string
  hosts: localhost
  gather_facts: false
  vars:
    app_env: ""
  tasks:
    - name: Show environment with default filter
      ansible.builtin.debug:
        msg: "Deploying to: {{ app_env | default('production') }}"
```
Run with:
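```shell
ansible-playbook mre.yml -i localhost,
```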
The fallback still isn't applied. You don't need roles, var files, or inventory to see it.
In Ansible's Jinja2, default only applies when the variable is undefined. An empty string ("") still counts as defined: the variable exists, and its value is the empty string. default doesn't treat "empty" as a reason to fall back.
The Fix: Pass true as the second argument to default. This enables "default on falsy", which treats empty strings, zero, and None the same as undefined.
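Applied to the MRE's debug task, the message line becomes:

```yaml
msg: "Deploying to: {{ app_env | default('production', true) }}"
```

The run output confirms the fallback now fires: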
```text
TASK [Show environment with default filter] ***
ok: [localhost] => {
    "msg": "Deploying to: production"
}
```
hosts: localhost is Your Friend
Setting hosts: localhost and gather_facts: false makes any Ansible MRE easy to run without an inventory file or remote SSH. Use ansible-playbook mre.yml -i localhost, (the trailing comma makes a bare host list) to test variable behavior, filter logic, and task sequencing entirely on your laptop.
What to Include Alongside the Code
A runnable example without context is still an incomplete MRE. The code answers "what happens"; you also need to supply "what I expected to happen" and the environment in which it happened.
Include the following alongside any MRE:
| Include | Why It Matters |
|---|---|
| Expected behavior | Defines success; "it doesn't work" isn't a problem statement |
| Observed behavior | The actual output or error, verbatim |
| Tool and dependency versions | Behavior differences between versions are a common root cause |
| OS and runtime | Some bugs are platform-specific |
| Exact command to run the example | Remove any reason to guess |
Version Reference by Tool
| Tool | How to Capture the Version |
|---|---|
| Go | go version output and go.mod content |
| Python | python --version and pip show <package> |
| Terraform | terraform version output and the required_providers block |
| PowerShell / PowerCLI | $PSVersionTable and Get-Module VMware.PowerCLI -ListAvailable |
| Ansible | ansible --version |
Common Mistakes
Keeping Production Dependencies in the Example
A database that requires internal credentials, a secrets manager that needs an IAM role, an API endpoint only reachable inside the VPN: any of these makes the example non-reproducible for everyone except you. Replace each external dependency with the smallest local stand-in that triggers the same behavior.
Reporting the Side Effect Instead of the Root Failure
An exception stack trace is a symptom, not necessarily the root cause. A function that silently returns the wrong value, a resource created with incorrect configuration, a template that renders to an unexpected string: any of these can be the real failure even though nothing ever throws. Make sure your MRE demonstrates the actual behavior that surprised you, not a downstream artifact of it.
Stopping Too Early
"I simplified the code and the bug is still there" isn't a complete MRE when you still have 300 lines. Keep stripping. If the example genuinely can't be reduced below a certain threshold, say so and explain why. That constraint is itself diagnostic.
Not Running the Example Before Sharing
Run your MRE in a clean environment before you send it. You'll occasionally discover that it either doesn't reproduce the issue or has a missing dependency. Find this out before someone else does.
After you've built a few MREs, you notice a rhythm: you sit down to ask for help, and somewhere around the third deletion the story tells itself. Strip a dependency and the bug vanishes. Pin a version and the behavior shifts. Swap a library call for something tiny and obvious, and you suddenly see what the abstraction was hiding.
The MRE habit nudges you into real understanding before you outsource the thinking. Most bugs don't do well under that kind of light.
When an MRE doesn't crack the case alone, it still shrinks the gap between "I've got a bug" and "someone else gets my bug" from hours to minutes.
That's the quiet gift: you've already mapped the boundary so everyone can spend time fixing, not doing archaeology.