# Benchmark design
This document describes both the current benchmark infrastructure (Layer 2 harness in `xtask/src/bench/`, bench-proxy, Terraform-managed EC2 instances) and planned enhancements (Layer 1 protocol tests, Layer 3 cross-tool comparison, `/proc/net/dev` measurement, `pidstat` capture). Planned items are marked inline.
## Context
The v1 benchmark was built to answer "is `ocync` faster than `dregsy`/`regsync` on real registries?" and acquired additional responsibilities along the way: visibility via HTTP proxy capture, competitor-config generation, optimization firing-rate checks, and CI regression detection.
Putting all of those on a single code path produced a benchmark that:
- Measured infrastructure more than tools. mitmproxy’s single-core Python TLS capped measurable throughput at ~250 Mbps regardless of instance size.
- Couldn't distinguish optimization firing rate from optimization effectiveness. "Did `ocync`'s cross-repo mount save bytes in this run?" was not directly measurable. We only discovered that ECR returns 202 to every mount attempt after the corpus was large enough to force the path and the metric existed to count it.
- Papered over competitor-tool bugs. `regsync`'s per-scope token failures and `dregsy`'s exit-1-on-partial behavior both became bench-suite configuration code rather than reported findings. This made `ocync` look better than it should have, because competitors were silently partially failing.
- Grew ad hoc. Scenarios, metrics, and corpus entries were added in response to whatever question came up. There is no acceptance criterion for what a scenario must establish before it is useful.
## Goal
A benchmark that can be trusted: anyone reading a number knows what it means, what it excludes, and how to reproduce it, and the numbers stay trustworthy as `ocync` and the surrounding ecosystem evolve.
## Design
Three separate things, formerly conflated, now strictly separated.
### Layer 1: protocol tests
Question it answers: does the wire protocol actually do what we claim? For `ocync`: does each optimization take the designed fast path against the registry we target?
Where it lives: alongside the existing integration tests in `crates/ocync-distribution/tests/`.
- `registry2_*.rs` suites run mount/client/push fast-path assertions against the reference `registry:2` image via testcontainers. Cheap, deterministic, runs in CI on every PR.
- Registry-specific quirks that cannot be exercised against `registry:2` (e.g. ECR's "never fulfills mount" behavior) are captured as evidence in the per-registry documentation and pinned by engine-level integration tests asserting the adapted code path.
Hard rule: every `ocync` optimization that claims bytes/requests savings ships with a matching Layer 1 test that asserts the fast path was taken at the wire level, such as `mounts > 0 && status == 201` for cross-repo mount, PATCH 202/201 counts for chunked upload, and skipped-HEAD counts for cache hits. An optimization without a protocol test is considered unshipped; the PR is blocked.
Output format: standard Rust `#[test]` pass/fail. No aggregation, no dashboards. Either the protocol is correct or the build fails.
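As an illustration of the shape such a test takes, here is a minimal sketch of the cross-repo mount assertion. It assumes a `registry:2` container already listening on `127.0.0.1:5000` (the real suites start one via testcontainers), a blob already seeded into a source repo, and `reqwest` with the `blocking` feature; the port, repo names, and digest are placeholders, not the actual fixtures in `crates/ocync-distribution/tests/`.

```rust
// Sketch only: asserts the OCI cross-repo mount fast path at the wire level.
// A registry that fulfills the mount answers 201 Created; one that declines
// (as ECR does) answers 202 and expects a full blob upload instead.
#[test]
fn cross_repo_mount_takes_the_fast_path() {
    // Placeholder digest of the blob seeded into src-repo during test setup.
    let seeded_digest = "sha256:<seeded-blob-digest>";

    // POST /v2/<dst>/blobs/uploads/?mount=<digest>&from=<src> is the mount request.
    let url = format!(
        "http://127.0.0.1:5000/v2/dst-repo/blobs/uploads/?mount={seeded_digest}&from=src-repo"
    );
    let status = reqwest::blocking::Client::new()
        .post(url)
        .send()
        .expect("registry unreachable")
        .status();

    assert_eq!(status.as_u16(), 201, "mount fast path was not taken");
}
```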
### Layer 2: throughput benchmark
Question it answers: how fast is `ocync` in realistic conditions, and when it slows down, why?
Where it lives: `xtask/src/bench/`.
Scope: `ocync` only. No competitor tools. No MITM proxy. One scenario, one question, one headline number per run.
Measurement primitives:
- Wall clock: `std::time::Instant` around the `ocync` invocation.
- Egress bytes (planned): diff of the `ens5` `tx_bytes` counter in `/proc/net/dev` before and after the run (see the sketch after this list). Captures actual network effort, immune to implementation-level request counting.
- CPU and memory (planned): `pidstat -u -r -p <ocync pid> 1` streamed to a file; post-processed for p50/p95/max of each.
- Per-image completion timestamps: parsed from `ocync`'s `--json` stdout.
- API-level counts (request method histogram, HTTP status distribution, bytes split): not part of Layer 2. Those belong in Layer 1 protocol tests or in ad-hoc capture runs.
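A minimal sketch of the planned egress-bytes primitive, assuming the bench instance's interface is `ens5`; the helper name and the standalone `main` are illustrative, not the actual `xtask/src/bench/` code.

```rust
use std::fs;

// Return tx_bytes for one interface from /proc/net/dev, or None if it isn't listed.
// In /proc/net/dev, the 16 counters after "<iface>:" are 8 receive fields followed
// by 8 transmit fields, so transmit bytes is the 9th field (index 8).
fn tx_bytes(iface: &str) -> Option<u64> {
    let prefix = format!("{iface}:");
    let stats = fs::read_to_string("/proc/net/dev").ok()?;
    stats
        .lines()
        .map(str::trim_start)
        .find_map(|line| line.strip_prefix(prefix.as_str()))
        .and_then(|counters| counters.split_whitespace().nth(8))
        .and_then(|field| field.parse().ok())
}

fn main() {
    let before = tx_bytes("ens5").expect("interface ens5 not found");
    // ... timed ocync invocation goes here ...
    let after = tx_bytes("ens5").expect("interface ens5 not found");
    println!("egress bytes during run: {}", after.saturating_sub(before));
}
```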
Scenarios (designed to isolate one question each):
- `cold`: fresh ECR target, first-time sync of a representative corpus. Measures "how fast can `ocync` actually push bytes?"
- `warm`: re-sync of an already-synced corpus. Measures "how cheap is the no-op path?"
- `partial`: re-sync after ~5% of tags changed at source. Measures "does `ocync` reliably skip unchanged blobs while pushing changes?"
- `scale`: cold throughput across corpus sizes (10, 25, 50, full). Measures "does `ocync`'s throughput scale linearly with corpus size?"
Rules:
- CDN pre-warm is mandatory. Before the timed window, the harness HEADs every source manifest and GETs every unique blob (following CDN redirects) to equalize CDN edge cache state across runs (see the sketch after this list). Auth token fetches, DNS, and TCP setup are also amortized during pre-warm.
- Iterations >= 3, default 5; median reported, p10/p90 published. A single run cannot separate noise from a real regression.
- Results are versioned artifacts. Output lands at `bench-results/{git_sha}/{instance_type}/{corpus_sha}/{timestamp}/`. Nothing is overwritten. Regression detection compares runs at the same `(instance_type, corpus_sha)` coordinate.
- Failures are loud. A scenario that can't complete (source registry 403, ECR rate limit, partial tool failure) fails the run, not the tool. The report says "incomplete" and explains why.
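A minimal sketch of the pre-warm pass described in the first rule, assuming `reqwest` with the `blocking` feature and pre-resolved manifest/blob URL lists; registry auth headers are omitted, and the function name is illustrative rather than the harness's actual API.

```rust
// Sketch only: warm CDN edges, DNS, and TCP before the timed window starts.
fn prewarm(manifest_urls: &[String], blob_urls: &[String]) -> reqwest::Result<()> {
    // Follow CDN redirects so the edge cache actually holds the blob bytes,
    // not just the registry's 307 answer.
    let client = reqwest::blocking::Client::builder()
        .redirect(reqwest::redirect::Policy::limited(5))
        .build()?;

    // HEAD every source manifest once.
    for url in manifest_urls {
        client.head(url).send()?;
    }
    // GET every unique blob once; drain and discard the body.
    for url in blob_urls {
        let body = client.get(url).send()?.bytes()?;
        let _ = body.len();
    }
    Ok(())
}
```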
Headline number per scenario: one sentence. “ocync syncs 15 GB of
images in X seconds (p50, N=5) on c6in.4xlarge, Docker Hub to
us-east-2 ECR, from commit <sha>.” If a scenario cannot be
summarized in one sentence, it’s measuring too many things and needs
to be split.
### Layer 3: cross-tool comparison
Question it answers: positioning. "For users considering `dregsy` or `regsync`, what tradeoff are they making?"
Where it lives: `bench/competitors/` (planned; the directory does not exist yet), a separate directory with its own docs, harness, and runbook. Not part of `xtask bench`. Not part of CI. Runs are manually initiated.
Rules:
- Explicit caveats. Every output document leads with "`dregsy` and `regsync` are Go/skopeo-based, have different auth architectures, different concurrency defaults, different feature sets. Numbers are directional, not authoritative."
- Only wall clock and egress bytes are compared. Request counts are not fairly comparable (the skopeo subprocess boundary hides request fan-out). Response bytes are not fairly comparable (mount success reduces bytes asymmetrically).
- Separate EC2 instance per tool. Eliminates NAT and endpoint contention between tools.
- Competitor failures are reported, not compensated. If `regsync` needs `repoAuth: true` on cgr.dev, that's a `regsync` configuration note in the report, not silent harness magic. If `dregsy` exits 1 on partial success, the run is reported as partial. No more config generation that knows every quirk.
#### Cross-tool fairness for OCI 1.1 referrers
`ocync` syncs OCI 1.1 referrer artifacts (SBOM, SLSA attestations) by default. The compared tools do not implement the referrers API. A naive "ocync 55.4 GB > comparable 55.3 GB" reading of the source-bytes column would suggest `ocync` is less efficient, when the truth is an extra 117 MB of attestation content that `ocync` correctly transferred.
To surface this honestly, the bench tracks `referrer_calls` (GET requests to `/v2/<repo>/referrers/...`) per tool and emits a footnote in `summary.md` and the docs `performance.md` snippet whenever any tool's count is non-zero. The compared tools always show 0; `ocync`'s count surfaces the feature gap so the bytes/GETs comparison is read with context.
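A minimal sketch of that accounting, assuming a captured request log of `(method, path)` pairs per tool; the data shapes and footnote wording are illustrative, not the bench's actual `summary.md` format.

```rust
// Count GET requests against the OCI 1.1 referrers API in a captured request log.
fn referrer_calls(requests: &[(String, String)]) -> usize {
    requests
        .iter()
        .filter(|(method, path)| method == "GET" && path.contains("/referrers/"))
        .count()
}

// Build the summary footnote, but only when at least one tool hit the referrers API.
fn referrer_footnote(per_tool: &[(&str, usize)]) -> Option<String> {
    if per_tool.iter().all(|(_, n)| *n == 0) {
        return None;
    }
    let counts = per_tool
        .iter()
        .map(|(tool, n)| format!("{tool}: {n}"))
        .collect::<Vec<_>>()
        .join(", ");
    Some(format!(
        "Note: referrers API calls ({counts}). Tools without OCI 1.1 referrers \
         support always show 0; ocync's extra GETs and bytes include attestation content."
    ))
}
```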
## What this avoids
The wasted cycles in the original benchmark came from four questions being tangled into one number:
- "Is `ocync` slow?" (Layer 2 question)
- "Is the proxy slow?" (Layer 2 infrastructure question)
- “Does ECR honor mount?” (Layer 1 question)
- “Is docker.io misconfigured?” (Layer 1 question)
A byte count cannot answer those. A layered split maps each question
to its answerable home. The rule “every optimization ships with a
protocol test” means this kind of silent failure can’t recur. If the
mount optimization had shipped with a wire-level test asserting
`mounts > 0 && status == 201`, CI would have failed on the original PR
and the short-circuit would have shipped up front.
## What we do NOT do
- Multi-tool parallel benchmarking on shared infrastructure. Contention masks signal.
- Time-series resource capture by default. Adds scope for diminishing returns. Enable it when we have a concrete question.
- "Ocync wins" framing in any published numbers. Numbers are directional, not an endorsement. Positioning goes in marketing, not in design docs.
- Performance regression as a PR-blocking check. Too flaky at the scale we operate. Post-merge notification only.