Benchmarks

compatmalloc prioritizes security over raw performance. This page describes the performance characteristics, overhead sources, and how to run benchmarks to measure the impact on your workloads.

Latest CI Results (x86_64)

Auto-generated by CI on 2026-03-08 04:47 UTC from commit 760fbb2. Results are from GitHub Actions runners (shared infrastructure) and may vary between runs. Each allocator is run 3 times; the best result (lowest latency, highest throughput) is kept.

Multi-Allocator Comparison

  Allocator     Weighted Overhead  Latency (64B)  Throughput 1T  Ratio  Throughput 4T  Ratio  Peak RSS
  compatmalloc  +11.53%            14.5 ns        65.48 Mops/s   0.87x  150.78 Mops/s  0.87x  15096 KB
  glibc         0%                 12.3 ns        74.53 Mops/s   1.00x  171.84 Mops/s  1.00x  10656 KB
  jemalloc      +58.11%            9.9 ns         95.32 Mops/s   1.27x  257.40 Mops/s  1.49x  37576 KB
  mimalloc      +16.62%            8.9 ns         81.17 Mops/s   1.08x  199.16 Mops/s  1.15x  25044 KB
  passthrough   +64.59%            20.9 ns        43.21 Mops/s   0.57x  19.00 Mops/s   0.11x  10996 KB
  scudo         +310.13%           53.4 ns        18.34 Mops/s   0.24x  38.73 Mops/s   0.22x  14664 KB

Ratio interpretation: ratios are relative to glibc. A throughput ratio > 1.0 means faster than glibc; a latency ratio < 1.0 means faster than glibc.

Hardened allocators: compatmalloc, scudo. These have security features (guard pages, quarantine, etc.) that add overhead vs. pure-performance allocators.

Peak RSS measured via /usr/bin/time -v during a single benchmark run. Hardening features (quarantine, guard pages) increase memory usage.

malloc/free Latency by Size (glibc)

  size=      16:     12.8 ns
  size=      32:     12.3 ns
  size=      64:     12.3 ns
  size=     128:     12.4 ns
  size=     256:     12.3 ns
  size=     512:     12.3 ns
  size=    1024:     12.3 ns
  size=    4096:     23.2 ns
  size=   16384:     23.5 ns
  size=   65536:     24.0 ns
  size=  262144:     24.4 ns

  size=      16:     15.9 ns
  size=      64:     16.6 ns
  size=     256:     26.8 ns
  size=    1024:     28.5 ns
  size=    4096:     70.3 ns
  size=   65536:    764.3 ns

malloc/free Latency by Size (compatmalloc)

  size=      16:     14.9 ns
  size=      32:     14.5 ns
  size=      64:     14.5 ns
  size=     128:     14.7 ns
  size=     256:     14.5 ns
  size=     512:     14.5 ns
  size=    1024:     14.5 ns
  size=    4096:     14.5 ns
  size=   16384:     14.5 ns
  size=   65536:     24.2 ns
  size=  262144:     24.2 ns

  size=      16:     13.7 ns
  size=      64:     13.8 ns
  size=     256:     14.2 ns
  size=    1024:     19.8 ns
  size=    4096:     60.7 ns
  size=   65536:    765.3 ns

Multi-threaded Throughput (glibc)

  threads=1:  74.53 Mops/sec
  threads=2: 144.38 Mops/sec
  threads=4: 171.84 Mops/sec
  threads=8: 167.43 Mops/sec

Multi-threaded Throughput (compatmalloc)

  threads=1:  65.48 Mops/sec
  threads=2: 125.34 Mops/sec
  threads=4: 150.78 Mops/sec
  threads=8: 145.36 Mops/sec

Real-World Application Overhead

  Application  glibc    compatmalloc  Overhead
  python-json  0.0728s  0.0857s       17.00%
  redis        3.3375s  3.3545s       0%
  nginx        5.1043s  5.1042s       -1.00%
  sqlite       0.2104s  0.1281s       -40.00%
  git          0.3197s  0.1816s       -44.00%

Application benchmarks measure wall-clock time for real programs (Python, Redis, nginx, SQLite, Git). Overhead = (compatmalloc_time / glibc_time - 1) * 100%, rounded down to the nearest whole percent in the table above.

Performance characteristics

Expected overhead

Compared to glibc's ptmalloc2, compatmalloc adds overhead from several sources:

  Source                 Per-malloc cost              Per-free cost
  Metadata table insert  Hash + linear probe + mutex  --
  Metadata table lookup  --                           Hash + linear probe + mutex
  Canary write           memset of gap bytes          Canary check (byte comparison)
  Poison fill            --                           memset of allocation
  Quarantine push/evict  --                           Mutex + ring buffer enqueue
  Zero-on-free           --                           memset of allocation (on eviction)
  Guard page setup       mprotect (large alloc only)  --

For small allocations (16-256 bytes), the dominant costs are the metadata table operations and the canary/poison fills. For large allocations, the mmap/munmap syscalls dominate regardless of hardening.

Size class efficiency

The slab allocator uses 4-per-doubling size classes, which means internal fragmentation is at most 25% for any allocation. Size classes range from 16 bytes to 16,384 bytes (36 classes total).
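The 25% bound can be illustrated with a quick sketch. Note this is a naive reconstruction of a 4-per-doubling table, not compatmalloc's actual class list; in particular it yields 41 classes for the 16 B to 16,384 B range rather than the 36 stated above, so the real table presumably spaces the smallest sizes differently.

```python
def size_classes(lo=16, hi=16384):
    """Naive 4-per-doubling table: 4 evenly spaced classes
    between each power of two from lo up to hi."""
    classes = [lo]
    base = lo
    while base < hi:
        step = base // 4
        classes.extend(base + k * step for k in range(1, 5))
        base *= 2
    return classes

classes = size_classes()

# Worst-case internal fragmentation relative to the requested size:
# a request one byte above class c is served from the next class up.
worst = max((nxt - (prev + 1)) / (prev + 1)
            for prev, nxt in zip(classes, classes[1:]))
print(len(classes), f"classes, worst-case fragmentation {worst:.1%}")
```

The worst case occurs for a request one byte above a power-of-two class, where the next class is 25% larger, so the wasted fraction approaches (but never reaches) 25%.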

Arena contention

With the default arena count (one per CPU), contention is low for most workloads. Programs with many threads performing high-frequency allocations may benefit from explicitly setting COMPATMALLOC_ARENA_COUNT to a higher value.
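The general idea behind per-thread arena assignment can be sketched as follows. This is an illustration of the technique, not compatmalloc's actual selection code; only the COMPATMALLOC_ARENA_COUNT variable comes from the documentation above.

```python
import os
import threading

# Illustrative arena selection (not compatmalloc's actual code):
# each thread maps stably to one of N arenas, so threads assigned
# to different arenas never contend on the same arena lock.
ARENA_COUNT = int(os.environ.get("COMPATMALLOC_ARENA_COUNT",
                                 os.cpu_count() or 1))

def arena_for_current_thread():
    # Stable per-thread choice: hash the thread id into the arena range.
    return threading.get_ident() % ARENA_COUNT

print(f"{ARENA_COUNT} arenas; this thread uses arena "
      f"{arena_for_current_thread()}")
```

With more arenas than simultaneously allocating threads, the probability that two hot threads collide on one arena drops accordingly, which is why raising the count can help highly threaded workloads.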

Running benchmarks

Microbenchmark suite

The benchmark suite is a standalone binary that measures allocator performance via LD_PRELOAD:

# Build the library and benchmark
cargo build --release
rustc -O benches/src/micro.rs -o target/release/micro

# Run with glibc (baseline)
ALLOCATOR_NAME=glibc ./target/release/micro

# Run with compatmalloc
ALLOCATOR_NAME=compatmalloc \
  LD_PRELOAD=./target/release/libcompatmalloc.so \
  ./target/release/micro
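The standalone binary is the accurate way to measure, but a rough sanity check is possible from Python via ctypes. Note this times Python and FFI overhead on top of the allocator, so absolute numbers will be far higher than the tables above; run it under the same LD_PRELOAD to compare allocators against each other.

```python
import ctypes
import time

# Resolve malloc/free from the process's own libc.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

def malloc_free_ns(size, iters=100_000):
    # Time paired malloc/free calls; report ns per pair
    # (includes Python call and FFI overhead).
    start = time.perf_counter_ns()
    for _ in range(iters):
        libc.free(libc.malloc(size))
    return (time.perf_counter_ns() - start) / iters

for size in (16, 64, 256, 4096):
    print(f"size={size:6d}: {malloc_free_ns(size):8.1f} ns (incl. FFI)")
```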

Full comparison script

To compare against multiple allocators (glibc, jemalloc, mimalloc, scudo):

./benches/scripts/run_comparison.sh

Disabling hardening for comparison

To measure the overhead of hardening features, build with no features:

cargo build --release --no-default-features
ALLOCATOR_NAME=minimal \
  LD_PRELOAD=./target/release/libcompatmalloc.so \
  ./target/release/micro

LD_PRELOAD benchmarks with external programs

For realistic benchmarks, test with real applications:

# Time a build with and without compatmalloc
time cargo build --release

time LD_PRELOAD=./target/release/libcompatmalloc.so \
  cargo build --release

# Python workload
time python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"

time LD_PRELOAD=./target/release/libcompatmalloc.so python3 -c "
import json
data = [{'key': str(i), 'value': list(range(100))} for i in range(10000)]
result = json.dumps(data)
parsed = json.loads(result)
"

Tuning for performance

If the overhead is too high for your use case, you can selectively disable features:

  Configuration            Approximate overhead reduction
  Disable zero-on-free     Removes one memset per free
  Disable poison-on-free   Removes one memset per free (and disables write-after-free check)
  Reduce quarantine size   Reduces memory pressure and eviction processing
  Disable guard-pages      Removes mprotect calls and reduces virtual address space usage
  Disable canaries         Removes canary write/check per alloc/free
  COMPATMALLOC_DISABLE=1   Bypasses all hardening (passthrough to glibc)

Weighted composite overhead

The headline "Weighted Overhead" metric computes a single overhead percentage that accounts for real-world allocation size distributions. Instead of reporting only the 64-byte latency, we weight each allocation size by its frequency in typical programs (based on jemalloc/tcmalloc telemetry data):

  Size  Weight  Rationale
  16B   20%     Most common (tiny objects, pointers, small structs)
  32B   15%     Second most common
  64B   15%     Common for small structs, string headers
  128B  12%     Medium-small objects
  256B  10%     Strings, small buffers
  512B  8%      Buffers
  1K    5%      Page-ish allocations
  4K    5%      Page-aligned allocations
  16K   4%      Large buffers
  64K   3%      Near mmap threshold
  256K  3%      Very large allocations

Formula: overhead = ((Σ_i weight_i × alloc_latency_i / glibc_latency_i) − 1) × 100%

A weighted overhead of +15% means compatmalloc is 15% slower than glibc across a representative workload mix. Negative values indicate compatmalloc is faster.
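The formula can be checked against the per-size latency tables earlier on this page. This sketch reproduces the weights and the first latency block for each allocator; it lands within rounding distance of the headline +11.53% (the small difference comes from the limited precision of the printed latencies).

```python
# Weights from the table above (fraction of allocations per size).
WEIGHTS = {16: .20, 32: .15, 64: .15, 128: .12, 256: .10,
           512: .08, 1024: .05, 4096: .05, 16384: .04,
           65536: .03, 262144: .03}

# malloc/free latencies (ns) from the per-size tables above.
GLIBC = {16: 12.8, 32: 12.3, 64: 12.3, 128: 12.4, 256: 12.3,
         512: 12.3, 1024: 12.3, 4096: 23.2, 16384: 23.5,
         65536: 24.0, 262144: 24.4}
COMPAT = {16: 14.9, 32: 14.5, 64: 14.5, 128: 14.7, 256: 14.5,
          512: 14.5, 1024: 14.5, 4096: 14.5, 16384: 14.5,
          65536: 24.2, 262144: 24.2}

def weighted_overhead(latency, baseline):
    ratio = sum(w * latency[s] / baseline[s] for s, w in WEIGHTS.items())
    return (ratio - 1) * 100

print(f"{weighted_overhead(COMPAT, GLIBC):+.2f}%")  # ~ +11.6%
```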

Methodology notes

When benchmarking allocators, keep the following in mind:

  1. Warm up the allocator. The first few allocations may be slower due to slab initialization and metadata table growth.
  2. Test with realistic workloads. Microbenchmarks of malloc/free loops do not represent real application behavior.
  3. Measure RSS, not just time. Hardening features (quarantine, guard pages) increase resident memory. Use getrusage or /proc/self/status to measure VmRSS.
  4. Account for variance. Run benchmarks multiple times and report medians. Allocator performance can be sensitive to ASLR and system load.
  5. Best-of-3 selection. CI results use the minimum latency and maximum throughput from 3 runs. This filters out noise from shared infrastructure while reflecting the allocator's true capability.
  6. Compare against other allocators. The comparison table includes jemalloc and mimalloc (performance-focused) alongside scudo (hardened, like compatmalloc). This provides context for the overhead of hardening features.
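Point 3 in the list above can be checked in-process without external tools; this sketch uses the getrusage and /proc/self/status interfaces mentioned there (Unix only).

```python
import os
import resource

# Peak resident set size for this process so far.
# On Linux, ru_maxrss is in kilobytes (macOS reports bytes).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb} KB")

# Current RSS from the kernel's status file (Linux only).
if os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                print(line.strip())
                break
```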