Benchmarks

compatmalloc prioritizes security over raw performance. This page describes the performance characteristics, overhead sources, and how to run benchmarks to measure the impact on your workloads.

Latest CI Results (x86_64)

Auto-generated by CI on 2026-04-12 04:07 UTC from commit 7fc983c. Results are from GitHub Actions runners (shared infrastructure) and may vary between runs. Each allocator is run 3 times; the best (lowest latency) result is kept.

Multi-Allocator Comparison

Allocator	Weighted Overhead	Latency (64B)	Throughput 1T	Ratio 1T	Throughput 4T	Ratio 4T	Peak RSS
compatmalloc	+13.39%	14.2 ns	66.58 Mops/s	0.87x	147.25 Mops/s	0.90x	15212 KB
glibc	0%	11.8 ns	76.26 Mops/s	1.00x	163.24 Mops/s	1.00x	10788 KB
jemalloc	+41.98%	8.9 ns	102.97 Mops/s	1.35x	258.07 Mops/s	1.58x	41644 KB
mimalloc	+18.39%	8.1 ns	81.18 Mops/s	1.06x	186.93 Mops/s	1.14x	22860 KB
passthrough	+75.16%	21.5 ns	44.50 Mops/s	0.58x	18.30 Mops/s	0.11x	11008 KB
scudo	+297.43%	49.6 ns	19.99 Mops/s	0.26x	40.28 Mops/s	0.24x	14588 KB

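Ratio interpretation: each ratio is that allocator's throughput divided by glibc's throughput at the same thread count, so a ratio above 1.00x is faster than glibc and a ratio below 1.00x is slower. For the latency column, lower is better.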

Hardened allocators: compatmalloc, scudo. These have security features (guard pages, quarantine, etc.) that add overhead vs. pure-performance allocators.

Peak RSS measured via /usr/bin/time -v during a single benchmark run. Hardening features (quarantine, guard pages) increase memory usage.

malloc/free Latency by Size (glibc)

  size=      16:     12.6 ns
  size=      32:     11.9 ns
  size=      64:     11.8 ns
  size=     128:     11.8 ns
  size=     256:     11.8 ns
  size=     512:     12.0 ns
  size=    1024:     11.9 ns
  size=    4096:     24.7 ns
  size=   16384:     24.6 ns
  size=   65536:     25.0 ns
  size=  262144:     25.0 ns
  size=      16:     15.9 ns
  size=      64:     16.3 ns
  size=     256:     26.9 ns
  size=    1024:     29.0 ns
  size=    4096:     91.6 ns
  size=   65536:    758.6 ns

malloc/free Latency by Size (compatmalloc)

  size=      16:     14.9 ns
  size=      32:     14.2 ns
  size=      64:     14.2 ns
  size=     128:     14.3 ns
  size=     256:     14.3 ns
  size=     512:     14.8 ns
  size=    1024:     14.6 ns
  size=    4096:     14.4 ns
  size=   16384:     14.5 ns
  size=   65536:     24.5 ns
  size=  262144:     24.8 ns
  size=      16:     14.0 ns
  size=      64:     14.0 ns
  size=     256:     15.1 ns
  size=    1024:     19.9 ns
  size=    4096:     77.1 ns
  size=   65536:    751.2 ns

Multi-threaded Throughput (glibc)

  threads=1:  76.26 Mops/sec
  threads=2: 150.24 Mops/sec
  threads=4: 163.24 Mops/sec
  threads=8: 160.46 Mops/sec

Multi-threaded Throughput (compatmalloc)

  threads=1:  66.58 Mops/sec
  threads=2: 129.54 Mops/sec
  threads=4: 147.25 Mops/sec
  threads=8: 141.87 Mops/sec

Real-World Application Overhead

Application	glibc	compatmalloc	Overhead
python-json	0.0673s	0.0771s	+14.56%
redis	2.6586s	2.4729s	-6.98%
nginx	5.1036s	5.1030s	-0.01%
sqlite	0.1564s	0.1326s	-15.22%
git	0.5614s	0.5505s	-1.94%

Application benchmarks measure wall-clock time for real programs (Python, Redis, nginx, SQLite, Git). Overhead = (compatmalloc_time / glibc_time - 1) * 100%.
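The overhead formula can be applied directly to the timings in the table. The sketch below is illustrative only: the function name `overhead_percent` is hypothetical, and the timings are copied from the table above.

```python
def overhead_percent(compat_seconds: float, glibc_seconds: float) -> float:
    """Relative wall-clock overhead of compatmalloc versus glibc, in percent."""
    return (compat_seconds / glibc_seconds - 1.0) * 100.0


# (glibc, compatmalloc) wall-clock seconds, copied from the table above.
timings = {
    "python-json": (0.0673, 0.0771),
    "redis": (2.6586, 2.4729),
    "nginx": (5.1036, 5.1030),
    "sqlite": (0.1564, 0.1326),
    "git": (0.5614, 0.5505),
}

for app, (glibc_s, compat_s) in timings.items():
    print(f"{app:12s} {overhead_percent(compat_s, glibc_s):+7.2f}%")
```

A negative result means the program ran faster under compatmalloc than under glibc.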

ARM64 CI Results

Auto-generated by CI on 2026-04-12 04:07 UTC from commit 7fc983c. Runner architecture: aarch64 | Best-of-3 runs.

Multi-Allocator Comparison (ARM64)

Allocator	Weighted Overhead	Latency (64B)	Throughput 1T	Ratio 1T	Throughput 4T	Ratio 4T	Peak RSS
compatmalloc	+22.81%	15.1 ns	67.19 Mops/s	0.81x	255.62 Mops/s	0.79x	12688 KB
glibc	0%	11.7 ns	82.40 Mops/s	1.00x	321.44 Mops/s	1.00x	10272 KB
jemalloc	+34.62%	10.1 ns	97.38 Mops/s	1.18x	368.90 Mops/s	1.14x	12112 KB
mimalloc	+24.94%	7.2 ns	80.88 Mops/s	0.98x	311.97 Mops/s	0.97x	9268 KB

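Ratio interpretation: each ratio is that allocator's throughput divided by glibc's throughput at the same thread count, so a ratio above 1.00x is faster than glibc and a ratio below 1.00x is slower. For the latency column, lower is better.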

Peak RSS measured via /usr/bin/time -v during a single benchmark run. Hardening features (quarantine, guard pages) increase memory usage.

malloc/free Latency by Size - ARM64 (glibc)

  size=      16:     11.8 ns
  size=      32:     11.7 ns
  size=      64:     11.7 ns
  size=     128:     11.9 ns
  size=     256:     11.8 ns
  size=     512:     12.0 ns
  size=    1024:     12.1 ns
  size=    4096:     24.2 ns
  size=   16384:     24.5 ns
  size=   65536:     24.5 ns
  size=  262144:     25.1 ns
  size=      16:     17.6 ns
  size=      64:     18.0 ns
  size=     256:     27.9 ns
  size=    1024:     32.1 ns
  size=    4096:     51.9 ns
  size=   65536:    380.1 ns

malloc/free Latency by Size - ARM64 (compatmalloc)

  size=      16:     14.9 ns
  size=      32:     15.0 ns
  size=      64:     15.1 ns
  size=     128:     15.2 ns
  size=     256:     15.2 ns
  size=     512:     15.2 ns
  size=    1024:     15.2 ns
  size=    4096:     15.1 ns
  size=   16384:     15.5 ns
  size=   65536:     36.1 ns
  size=  262144:     36.1 ns
  size=      16:     16.8 ns
  size=      64:     17.1 ns
  size=     256:     19.6 ns
  size=    1024:     25.0 ns
  size=    4096:     46.1 ns
  size=   65536:    398.7 ns

Multi-threaded Throughput - ARM64 (glibc)

  threads=1:  82.40 Mops/sec
  threads=2: 163.38 Mops/sec
  threads=4: 321.44 Mops/sec
  threads=8: 308.48 Mops/sec

Multi-threaded Throughput - ARM64 (compatmalloc)

  threads=1:  67.19 Mops/sec
  threads=2: 131.87 Mops/sec
  threads=4: 255.62 Mops/sec
  threads=8: 248.84 Mops/sec

Performance characteristics

Expected overhead

Compared to glibc's ptmalloc2, compatmalloc adds overhead from several sources:

Source	Per-malloc cost	Per-free cost
Metadata table insert	Hash + linear probe + mutex	--
Metadata table lookup	--	Hash + linear probe + mutex
Canary write	`memset` of gap bytes	Canary check (byte comparison)
Poison fill	--	`memset` of allocation
Quarantine push/evict	--	Mutex + ring buffer enqueue
Zero-on-free	--	`memset` of allocation (on eviction)
Guard page setup	`mprotect` (large alloc only)	--

For small allocations (16-256 bytes), the dominant costs are the metadata table operations and the canary/poison fills. For large allocations, the mmap/munmap syscalls dominate regardless of hardening.
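To make the per-operation costs in the table concrete, the toy model below imitates the canary write, the canary check, and the poison fill on a Python `bytearray`. It is a sketch under stated assumptions, not compatmalloc's implementation: the byte values `CANARY` and `POISON`, the slot layout, and both function names are hypothetical.

```python
CANARY = 0xA5  # hypothetical canary byte, chosen for illustration
POISON = 0xFE  # hypothetical poison byte, chosen for illustration


def toy_malloc(slot: bytearray, requested: int) -> None:
    """Fill the gap between the requested size and the slot end with canary bytes."""
    gap = len(slot) - requested
    slot[requested:] = bytes([CANARY]) * gap  # the "memset of gap bytes"


def toy_free(slot: bytearray, requested: int) -> bool:
    """Check the canary, then poison the slot. Returns False if the gap was modified."""
    intact = all(b == CANARY for b in slot[requested:])  # byte comparison
    slot[:] = bytes([POISON]) * len(slot)  # the "memset of allocation"
    return intact


slot = bytearray(32)       # a 32-byte slot
toy_malloc(slot, 24)       # caller asked for 24 bytes, leaving an 8-byte gap
slot[0:24] = b"x" * 24     # in-bounds write
print(toy_free(slot, 24))  # True: canary untouched

slot = bytearray(32)
toy_malloc(slot, 24)
slot[0:25] = b"x" * 25     # one-byte overflow into the gap
print(toy_free(slot, 24))  # False: canary was overwritten
```

Both fills touch memory in proportion to the slot size, which is why they show up as `memset` costs in the table.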

Size class efficiency

The slab allocator uses 4-per-doubling size classes, which means internal fragmentation is at most 25% for any allocation. Size classes range from 16 bytes to 16,384 bytes (36 classes total).
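One way to arrive at 36 classes is sketched below. The layout is an assumption, not taken from the compatmalloc source: it spaces classes at a quarter of the enclosing power of two, but never closer together than 16 bytes. The function names `size_classes` and `waste_fraction` are hypothetical.

```python
import bisect


def size_classes(min_size: int = 16, max_size: int = 16384, steps: int = 4) -> list[int]:
    """Assumed layout: `steps` classes per doubling, never spaced closer than `min_size`."""
    classes = []
    size = min_size
    while size <= max_size:
        classes.append(size)
        pow2 = 1 << (size.bit_length() - 1)   # largest power of two <= size
        size += max(min_size, pow2 // steps)  # spacing within this doubling
    return classes


def waste_fraction(request: int, classes: list[int]) -> float:
    """Bytes lost to rounding up to the size class, relative to the request."""
    chosen = classes[bisect.bisect_left(classes, request)]
    return (chosen - request) / request


classes = size_classes()
print(len(classes))   # 36
print(classes[:10])   # [16, 32, 48, 64, 80, 96, 112, 128, 160, 192]

worst = max(waste_fraction(r, classes) for r in range(65, 16385))
print(f"{worst:.4f}")  # just under 0.25
```

Under this assumed layout the 25% bound holds for requests above 64 bytes. Smaller requests are rounded up to a multiple of 16 bytes, so their waste relative to the requested size can be higher.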

Arena contention

With the default arena count (one per CPU), contention is low for most workloads. Programs with many threads performing high-frequency allocations may benefit from explicitly setting COMPATMALLOC_ARENA_COUNT to a higher value.

Running benchmarks

Microbenchmark suite

The benchmark suite is a standalone binary that measures allocator performance via LD_PRELOAD:

# Build the library and benchmark
cargo build --release
rustc -O benches/src/micro.rs -o target/release/micro

# Run with glibc (baseline)
ALLOCATOR_NAME=glibc ./target/release/micro

# Run with compatmalloc
ALLOCATOR_NAME=compatmalloc \
  LD_PRELOAD=./target/release/libcompatmalloc.so \
  ./target/release/micro

Full comparison script

To compare against multiple allocators (glibc, jemalloc, mimalloc, scudo):

./benches/scripts/run_comparison.sh

Disabling hardening for comparison

To measure the overhead of hardening features, build with no features:

cargo build --release --no-default-features
ALLOCATOR_NAME=minimal \
  LD_PRELOAD=./target/release/libcompatmalloc.so \
  ./target/release/micro

LD_PRELOAD benchmarks with external programs

For realistic benchmarks, test with real applications:

# Time a build with and without compatmalloc
time cargo build --release

time LD_PRELOAD=./target/release/libcompatmalloc.so \
  cargo build --release

# Python workload
time python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"

time LD_PRELOAD=./target/release/libcompatmalloc.so python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"

Tuning for performance

If the overhead is too high for your use case, you can selectively disable features:

Configuration	Effect on overhead
Disable `zero-on-free`	Removes one `memset` per free
Disable `poison-on-free`	Removes one `memset` per free (and disables write-after-free check)
Reduce quarantine size	Reduces memory pressure and eviction processing
Disable `guard-pages`	Removes `mprotect` calls and reduces virtual address space usage
Disable `canaries`	Removes canary write/check per alloc/free
`COMPATMALLOC_DISABLE=1`	Bypasses all hardening (passthrough to glibc)

Weighted composite overhead

The headline "Weighted Overhead" metric computes a single overhead percentage that accounts for real-world allocation size distributions. Instead of reporting only the 64-byte latency, we weight each allocation size by its frequency in typical programs (based on jemalloc/tcmalloc telemetry data):

Size	Weight	Rationale
16B	20%	Most common (tiny objects, pointers, small structs)
32B	15%	Second most common
64B	15%	Common for small structs, string headers
128B	12%	Medium-small objects
256B	10%	Strings, small buffers
512B	8%	Buffers
1K	5%	Page-ish allocations
4K	5%	Page-aligned allocations
16K	4%	Large buffers
64K	3%	Near mmap threshold
256K	3%	Very large allocations

Formula: overhead = ((Σ_i weight_i × alloc_latency_i / glibc_latency_i) − 1) × 100%, where the weights sum to 100%.

A weighted overhead of +15% means compatmalloc is 15% slower than glibc across a representative workload mix. Negative values indicate compatmalloc is faster.
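As a worked example, the sketch below applies the formula to the x86_64 numbers on this page. The weights come from the table above, and the latencies are the first eleven entries of the glibc and compatmalloc per-size listings. The function name `weighted_overhead` is hypothetical, and this is not the CI script itself.

```python
# Weights from the table above (they sum to 1.0).
WEIGHTS = {16: 0.20, 32: 0.15, 64: 0.15, 128: 0.12, 256: 0.10, 512: 0.08,
           1024: 0.05, 4096: 0.05, 16384: 0.04, 65536: 0.03, 262144: 0.03}

# x86_64 malloc/free latencies in ns, copied from the per-size listings on this page.
GLIBC_NS = {16: 12.6, 32: 11.9, 64: 11.8, 128: 11.8, 256: 11.8, 512: 12.0,
            1024: 11.9, 4096: 24.7, 16384: 24.6, 65536: 25.0, 262144: 25.0}
COMPAT_NS = {16: 14.9, 32: 14.2, 64: 14.2, 128: 14.3, 256: 14.3, 512: 14.8,
             1024: 14.6, 4096: 14.4, 16384: 14.5, 65536: 24.5, 262144: 24.8}


def weighted_overhead(alloc_ns, glibc_ns, weights=WEIGHTS):
    """Weighted mean of per-size latency ratios, expressed as percent overhead."""
    ratio = sum(w * alloc_ns[size] / glibc_ns[size] for size, w in weights.items())
    return (ratio - 1.0) * 100.0


print(f"{weighted_overhead(COMPAT_NS, GLIBC_NS):+.2f}%")
```

With these rounded inputs the result lands within about a tenth of a percentage point of the reported +13.39%. An exact match is not expected, presumably because the published latencies are rounded to one decimal place.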

Methodology notes

When benchmarking allocators, keep the following in mind:

Warm up the allocator. The first few allocations may be slower due to slab initialization and metadata table growth.
Test with realistic workloads. Microbenchmarks of malloc/free loops do not represent real application behavior.
Measure RSS, not just time. Hardening features (quarantine, guard pages) increase resident memory. Use getrusage or /proc/self/status to measure VmRSS.
Account for variance. Run benchmarks multiple times and report medians. Allocator performance can be sensitive to ASLR and system load.
Best-of-3 selection. CI results use the minimum latency and maximum throughput from 3 runs. This filters out noise from shared infrastructure while reflecting the allocator's true capability.
Compare against other allocators. The comparison table includes jemalloc and mimalloc (performance-focused) alongside scudo (hardened, like compatmalloc). This provides context for the overhead of hardening features.
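The notes on variance, RSS, and best-of-N selection can be combined in a small driver. The sketch below is one possible approach for Linux, not the project's CI harness: `run_once` and `summarize` are hypothetical names, and the library path assumes the release build described under "Running benchmarks". It reads the child's peak RSS from `os.wait4`, which returns the same `rusage` data as `getrusage`.

```python
import os
import statistics
import subprocess
import sys
import time


def run_once(cmd, preload=None):
    """Run `cmd` once; return (wall_seconds, peak_rss_kb) for that child process."""
    env = dict(os.environ)
    if preload is not None:
        env["LD_PRELOAD"] = preload
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.DEVNULL)
    # wait4 reaps the child and returns its rusage; ru_maxrss is in KB on Linux.
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    proc.returncode = os.waitstatus_to_exitcode(status)  # requires Python 3.9+
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd!r} exited with status {proc.returncode}")
    return elapsed, usage.ru_maxrss


def summarize(cmd, runs=5, preload=None):
    """Report the median and best wall time plus the highest peak RSS over `runs` runs."""
    results = [run_once(cmd, preload) for _ in range(runs)]
    times = [t for t, _ in results]
    return {
        "median_s": statistics.median(times),
        "best_s": min(times),
        "peak_rss_kb": max(rss for _, rss in results),
    }


if __name__ == "__main__":
    workload = [sys.executable, "-c", "x = [bytes(64) for _ in range(100000)]"]
    print("glibc       ", summarize(workload))
    print("compatmalloc", summarize(workload, preload="./target/release/libcompatmalloc.so"))
```

Reporting both the median and the best run shows how much of a difference comes from noise. A large gap between the two suggests the machine was busy during the measurement.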