Benchmarks
compatmalloc prioritizes security over raw performance. This page describes the performance characteristics, overhead sources, and how to run benchmarks to measure the impact on your workloads.
Latest CI Results (x86_64)
Auto-generated by CI on 2026-03-08 04:47 UTC from commit 760fbb2. Results are from GitHub Actions runners (shared infrastructure) and may vary between runs. Each allocator is run 3 times; the best result (lowest latency, highest throughput) is kept.
Multi-Allocator Comparison
| Allocator | Weighted Overhead | Latency (64B) | Throughput 1T | Ratio | Throughput 4T | Ratio | Peak RSS |
|---|---|---|---|---|---|---|---|
| compatmalloc | +11.53% | 14.5 ns | 65.48 Mops/s | 0.87x | 150.78 Mops/s | 0.87x | 15096 KB |
| glibc | 0% | 12.3 ns | 74.53 Mops/s | 1.00x | 171.84 Mops/s | 1.00x | 10656 KB |
| jemalloc | +58.11% | 9.9 ns | 95.32 Mops/s | 1.27x | 257.40 Mops/s | 1.49x | 37576 KB |
| mimalloc | +16.62% | 8.9 ns | 81.17 Mops/s | 1.08x | 199.16 Mops/s | 1.15x | 25044 KB |
| passthrough | +64.59% | 20.9 ns | 43.21 Mops/s | 0.57x | 19.00 Mops/s | 0.11x | 10996 KB |
| scudo | +310.13% | 53.4 ns | 18.34 Mops/s | 0.24x | 38.73 Mops/s | 0.22x | 14664 KB |
Ratio interpretation: each Ratio column is the allocator's throughput divided by glibc's throughput at the same thread count, so a ratio > 1.0 means faster than glibc. For latency, lower is better.
Hardened allocators: compatmalloc, scudo. These have security features (guard pages, quarantine, etc.) that add overhead vs. pure-performance allocators.
Peak RSS measured via /usr/bin/time -v during a single benchmark run. Hardening features (quarantine, guard pages) increase memory usage.
malloc/free Latency by Size (glibc)
size= 16: 12.8 ns
size= 32: 12.3 ns
size= 64: 12.3 ns
size= 128: 12.4 ns
size= 256: 12.3 ns
size= 512: 12.3 ns
size= 1024: 12.3 ns
size= 4096: 23.2 ns
size= 16384: 23.5 ns
size= 65536: 24.0 ns
size= 262144: 24.4 ns
size= 16: 15.9 ns
size= 64: 16.6 ns
size= 256: 26.8 ns
size= 1024: 28.5 ns
size= 4096: 70.3 ns
size= 65536: 764.3 ns
malloc/free Latency by Size (compatmalloc)
size= 16: 14.9 ns
size= 32: 14.5 ns
size= 64: 14.5 ns
size= 128: 14.7 ns
size= 256: 14.5 ns
size= 512: 14.5 ns
size= 1024: 14.5 ns
size= 4096: 14.5 ns
size= 16384: 14.5 ns
size= 65536: 24.2 ns
size= 262144: 24.2 ns
size= 16: 13.7 ns
size= 64: 13.8 ns
size= 256: 14.2 ns
size= 1024: 19.8 ns
size= 4096: 60.7 ns
size= 65536: 765.3 ns
Multi-threaded Throughput (glibc)
threads=1: 74.53 Mops/sec
threads=2: 144.38 Mops/sec
threads=4: 171.84 Mops/sec
threads=8: 167.43 Mops/sec
Multi-threaded Throughput (compatmalloc)
threads=1: 65.48 Mops/sec
threads=2: 125.34 Mops/sec
threads=4: 150.78 Mops/sec
threads=8: 145.36 Mops/sec
Real-World Application Overhead
| Application | glibc | compatmalloc | Overhead |
|---|---|---|---|
| python-json | 0.0728s | 0.0857s | 17.72% |
| redis | 3.3375s | 3.3545s | 0.51% |
| nginx | 5.1043s | 5.1042s | 0.00% |
| sqlite | 0.2104s | 0.1281s | -39.12% |
| git | 0.3197s | 0.1816s | -43.20% |
Application benchmarks measure wall-clock time for real programs (Python, Redis, nginx, SQLite, Git). Overhead = (compatmalloc_time / glibc_time - 1) * 100%.
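As a quick check of the formula above, the following sketch recomputes the overhead column from the wall-clock timings in the table. The function name `overhead_pct` and the `TIMINGS` dictionary are illustrative only and are not part of the benchmark scripts.

```python
# Wall-clock timings in seconds, copied from the table above:
# application -> (glibc, compatmalloc)
TIMINGS = {
    "python-json": (0.0728, 0.0857),
    "redis": (3.3375, 3.3545),
    "nginx": (5.1043, 5.1042),
    "sqlite": (0.2104, 0.1281),
    "git": (0.3197, 0.1816),
}


def overhead_pct(glibc_seconds, compat_seconds):
    """Overhead = (compatmalloc_time / glibc_time - 1) * 100%."""
    return (compat_seconds / glibc_seconds - 1.0) * 100.0


for app, (glibc_s, compat_s) in TIMINGS.items():
    print(f"{app:12s} {overhead_pct(glibc_s, compat_s):+7.2f}%")
```

Negative values mean the application finished faster under compatmalloc than under glibc in that run.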
Performance characteristics
Expected overhead
Compared to glibc's ptmalloc2, compatmalloc adds overhead from several sources:
| Source | Per-malloc cost | Per-free cost |
|---|---|---|
| Metadata table insert | Hash + linear probe + mutex | -- |
| Metadata table lookup | -- | Hash + linear probe + mutex |
| Canary write | memset of gap bytes | Canary check (byte comparison) |
| Poison fill | -- | memset of allocation |
| Quarantine push/evict | -- | Mutex + ring buffer enqueue |
| Zero-on-free | -- | memset of allocation (on eviction) |
| Guard page setup | mprotect (large alloc only) | -- |
For small allocations (16-256 bytes), the dominant costs are the metadata table operations and the canary/poison fills. For large allocations, the mmap/munmap syscalls dominate regardless of hardening.
Size class efficiency
The slab allocator uses 4-per-doubling size classes, which means internal fragmentation is at most 25% for any allocation. Size classes range from 16 bytes to 16,384 bytes (36 classes total).
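To make the "4-per-doubling" scheme concrete, here is a minimal sketch of one layout that reproduces the stated range and count. It assumes 16-byte spacing up to 64 bytes and four evenly spaced classes per power-of-two interval above that; this layout is an assumption chosen because it yields 36 classes, and the allocator's actual table may differ. The names `size_classes` and `class_for` are hypothetical.

```python
def size_classes(min_size=16, linear_limit=64, max_size=16384, steps=4):
    """Assumed layout: 16-byte spacing up to 64 B, then 4 classes per doubling."""
    classes = list(range(min_size, linear_limit + 1, min_size))  # 16, 32, 48, 64
    base = linear_limit
    while base < max_size:
        step = base // steps
        # Four evenly spaced classes in (base, 2 * base]
        classes.extend(base + step * j for j in range(1, steps + 1))
        base *= 2
    return classes


def class_for(request, classes):
    """Smallest size class that can hold the request."""
    return next(c for c in classes if c >= request)


classes = size_classes()

# Worst case above the linear region: a request one byte past a class boundary.
worst = max(
    (class_for(prev + 1, classes) - (prev + 1)) / (prev + 1)
    for prev in classes[3:-1]
)

print(len(classes), classes[:8], classes[-1])
print(f"worst-case waste relative to request size: {worst:.2%}")
```

Under this assumed layout, the waste for requests above 64 bytes stays below 25% of the requested size. Requests in the 16-byte-spaced region can waste proportionally more, since a 17-byte request rounds up to 32 bytes.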
Arena contention
With the default arena count (one per CPU), contention is low for most workloads. Programs with many threads performing high-frequency allocations may benefit from explicitly setting COMPATMALLOC_ARENA_COUNT to a higher value.
Running benchmarks
Microbenchmark suite
The benchmark suite is a standalone binary that measures allocator performance via LD_PRELOAD:
# Build the library and benchmark
cargo build --release
rustc -O benches/src/micro.rs -o target/release/micro
# Run with glibc (baseline)
ALLOCATOR_NAME=glibc ./target/release/micro
# Run with compatmalloc
ALLOCATOR_NAME=compatmalloc \
LD_PRELOAD=./target/release/libcompatmalloc.so \
./target/release/micro
Full comparison script
To compare against multiple allocators (glibc, jemalloc, mimalloc, scudo):
./benches/scripts/run_comparison.sh
Disabling hardening for comparison
To measure the overhead of hardening features, build with no features:
cargo build --release --no-default-features
ALLOCATOR_NAME=minimal \
LD_PRELOAD=./target/release/libcompatmalloc.so \
./target/release/micro
LD_PRELOAD benchmarks with external programs
For realistic benchmarks, test with real applications:
# Time a build with and without compatmalloc
time cargo build --release
time LD_PRELOAD=./target/release/libcompatmalloc.so \
cargo build --release
# Python workload
time python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"
time LD_PRELOAD=./target/release/libcompatmalloc.so python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"
Tuning for performance
If the overhead is too high for your use case, you can selectively disable features:
| Configuration | Approximate overhead reduction |
|---|---|
| Disable zero-on-free | Removes one memset per free |
| Disable poison-on-free | Removes one memset per free (and disables write-after-free check) |
| Reduce quarantine size | Reduces memory pressure and eviction processing |
| Disable guard-pages | Removes mprotect calls and reduces virtual address space usage |
| Disable canaries | Removes canary write/check per alloc/free |
| COMPATMALLOC_DISABLE=1 | Bypasses all hardening (passthrough to glibc) |
Weighted composite overhead
The headline "Weighted Overhead" metric computes a single overhead percentage that accounts for real-world allocation size distributions. Instead of reporting only the 64-byte latency, we weight each allocation size by its frequency in typical programs (based on jemalloc/tcmalloc telemetry data):
| Size | Weight | Rationale |
|---|---|---|
| 16B | 20% | Most common (tiny objects, pointers, small structs) |
| 32B | 15% | Second most common |
| 64B | 15% | Common for small structs, string headers |
| 128B | 12% | Medium-small objects |
| 256B | 10% | Strings, small buffers |
| 512B | 8% | Buffers |
| 1K | 5% | Page-ish allocations |
| 4K | 5% | Page-aligned allocations |
| 16K | 4% | Large buffers |
| 64K | 3% | Near mmap threshold |
| 256K | 3% | Very large allocations |
Formula: overhead = (Σ weight_i × (alloc_latency_i / glibc_latency_i) − 1) × 100%
A weighted overhead of +15% means compatmalloc is 15% slower than glibc across a representative workload mix. Negative values indicate compatmalloc is faster.
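The formula can be sketched in a few lines using the weights from the table and the per-size malloc/free latencies published earlier on this page. The names `WEIGHTS`, `GLIBC_NS`, `COMPAT_NS`, and `weighted_overhead` are illustrative and are not taken from the CI scripts.

```python
# Weights from the table above (size in bytes -> fraction of allocations)
WEIGHTS = {
    16: 0.20, 32: 0.15, 64: 0.15, 128: 0.12, 256: 0.10, 512: 0.08,
    1024: 0.05, 4096: 0.05, 16384: 0.04, 65536: 0.03, 262144: 0.03,
}

# Published malloc/free latencies in nanoseconds, rounded to 0.1 ns
GLIBC_NS = {
    16: 12.8, 32: 12.3, 64: 12.3, 128: 12.4, 256: 12.3, 512: 12.3,
    1024: 12.3, 4096: 23.2, 16384: 23.5, 65536: 24.0, 262144: 24.4,
}
COMPAT_NS = {
    16: 14.9, 32: 14.5, 64: 14.5, 128: 14.7, 256: 14.5, 512: 14.5,
    1024: 14.5, 4096: 14.5, 16384: 14.5, 65536: 24.2, 262144: 24.2,
}


def weighted_overhead(alloc_ns, baseline_ns, weights=WEIGHTS):
    """Weighted mean of per-size latency ratios, minus 1, as a percentage."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    ratio = sum(w * alloc_ns[size] / baseline_ns[size] for size, w in weights.items())
    return (ratio - 1.0) * 100.0


print(f"compatmalloc vs glibc: {weighted_overhead(COMPAT_NS, GLIBC_NS):+.2f}%")
```

With these rounded inputs the result comes out at about +11.6%, close to but not identical to the +11.53% reported in the comparison table, presumably because CI computes it from unrounded latencies.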
Methodology notes
When benchmarking allocators, keep the following in mind:
- Warm up the allocator. The first few allocations may be slower due to slab initialization and metadata table growth.
- Test with realistic workloads. Microbenchmarks of malloc/free loops do not represent real application behavior.
- Measure RSS, not just time. Hardening features (quarantine, guard pages) increase resident memory. Use getrusage or /proc/self/status to measure VmRSS.
- Account for variance. Run benchmarks multiple times and report medians. Allocator performance can be sensitive to ASLR and system load.
- Best-of-3 selection. CI results use the minimum latency and maximum throughput from 3 runs. This filters out noise from shared infrastructure while reflecting the allocator's true capability.
- Compare against other allocators. The comparison table includes jemalloc and mimalloc (performance-focused) alongside scudo (hardened, like compatmalloc). This provides context for the overhead of hardening features.
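
Several of these points (repeated runs, best and median times, RSS via getrusage) can be combined in a small harness. The sketch below is one possible approach, not the script CI uses. The function `run_benchmark` and its `preload` parameter are hypothetical, and the child command is a stand-in workload.

```python
import os
import resource
import statistics
import subprocess
import sys
import time


def run_benchmark(cmd, runs=3, preload=None):
    """Run cmd several times; return best and median wall time plus peak child RSS."""
    env = dict(os.environ)
    if preload:
        # e.g. preload="./target/release/libcompatmalloc.so"
        env["LD_PRELOAD"] = preload
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    # For RUSAGE_CHILDREN, ru_maxrss is the largest peak RSS among all
    # waited-for children of this process (kilobytes on Linux).
    peak_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "best": min(samples),
        "median": statistics.median(samples),
        "peak_rss_kb": peak_rss_kb,
    }


# Stand-in workload: many small allocations in a child Python process.
result = run_benchmark(
    [sys.executable, "-c", "buf = [bytes(64) for _ in range(100_000)]"]
)
print(result)
```

Because the RSS figure is a maximum across every child the harness has waited for, comparing two allocators this way requires running each in a separate harness process.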