nanodlp scans 18× faster than Apache Tika and uses 30× less memory per worker. Those aren't numbers we print to look good; they're what makes customer-side DLP feasible without a dedicated infrastructure team.
Apache 2.0 · Single Rust binary · Runs on a laptop · Reproducible benchmarks
All numbers measured on macOS arm64 (Apple Silicon), Rust 1.95.0 with target-cpu=native + lto=fat, Java Corretto 21 with -Xmx2g -XX:+UseG1GC. Apache Tika is JIT-warmed (10 prior iterations) before each measurement. Cold-start comparisons would widen every ratio further.
| Format | nanodlp (parsekit) | Tika 2.9.2 | Speedup |
|---|---|---|---|
| txt | 0.005 ms · 40,689 MB/s | 1.18 ms · 164 MB/s | 250× |
| html | 0.013 ms · 12,721 MB/s | 2.99 ms · 56 MB/s | 227× |
| eml (RFC 5322 + MIME) | 0.014 ms · 14,101 MB/s | 1.67 ms · 116 MB/s | 122× |
| docx (small) | 0.029 ms · 174 MB/s | 2.20 ms · 2 MB/s | 75× |
| pptx | 0.33 ms · 85 MB/s | 14.59 ms · 2 MB/s | 44× |
| pdf (medium, 50 pp) | 1.05 ms · 77 MB/s | 26.41 ms · 3 MB/s | 25× |
| pdf (large, 500 pp) | 10.29 ms · 79 MB/s | 234.70 ms · 3 MB/s | 23× |
| docx (medium) | 0.26 ms · 173 MB/s | 4.77 ms · 9 MB/s | 18× |
| md (markdown) | 0.07 ms · 2,241 MB/s | 1.17 ms · 135 MB/s | 17× |
| csv | 0.44 ms · 831 MB/s | 5.62 ms · 64 MB/s | 13× |
| xml | 0.23 ms · 1,332 MB/s | 1.53 ms · 197 MB/s | 6.8× |
| rtf | 0.24 ms · 807 MB/s | 1.30 ms · 149 MB/s | 5.4× |
| docx (large) | 4.07 ms · 107 MB/s | 20.80 ms · 21 MB/s | 5.1× |
| docx (xlarge) | 21.01 ms · 103 MB/s | 103.04 ms · 21 MB/s | 4.9× |
| json | 0.27 ms · 1,049 MB/s | 0.87 ms · 323 MB/s | 3.2× |
| xlsx | 4.40 ms · 26 MB/s | 13.88 ms · 8 MB/s | 3.2× |
| Geomean across all 16 cases | | | ~21× |
| Median across all 16 cases | | | ~17× |
See the full reproducible benchmark setup → BENCHMARKS.md on GitHub
Most DLP vendors talk about speed. Memory is what actually locks you into heavy infrastructure. Here's why.
Most DLP workloads aren't single-doc pipelines — they're fanouts of many continuous workers. One per connector. One per tenant. One per region. One per concurrent fetch. The thing that breaks first isn't aggregate speed; it's RAM per worker.
| Metric | Apache Tika | nanodlp | What this means in a real deployment |
|---|---|---|---|
| Steady-state RSS per worker | 300–400 MB | 5–30 MB | nanodlp runs 30–80× more workers per box |
| Workers on a 16 GB laptop | ~40 | 3,000+ | nanodlp deploys on a Mac; Tika needs an EKS node pool |
| Workers on a 64 GB cloud node | ~150 | 12,000+ | nanodlp = one node per region; Tika = a 4–6 node pool |
| Cold start | 200–500 ms (JVM + class load) | <5 ms (binary exec) | Bursty workloads, daemon restarts, autoscaling latency |
| GC pauses | 50–500 ms STW under load | None (no GC) | Tail-latency stability; required for real-time policy enforcement |
| Container image size | 350–600 MB (JRE + Tika + POI + PDFBox) | ~22 MB stripped | k8s startup time; image-pull bandwidth cost |
A 5,000-person company's typical SaaS DLP corpus: 5,000 employees × ~100 GB each = ~500 TB. ~500 million documents at ~1 MB average.
Numbers are great. Here's what they translate to in real deployments.
| Workload | Tika typical setup | nanodlp typical setup |
|---|---|---|
| Solo professional, 50 GB Drive | n/a — Tika is too heavy | One Mac binary, daemon mode, 5 MB RSS |
| Mid-market SMB, 5 TB SaaS corpus | 4-node EKS cluster, $1,500/mo | One AWS small instance, $50/mo |
| Enterprise, 500 TB SaaS corpus | 12+ node cluster, $25,000+/mo | One small instance per data-residency region, $100–300/mo total |
| Hospital system, 200 TB cross-app | Heavyweight enterprise DLP, $200k+/yr | Single binary, customer-side, no BAA |
| Defense / FedRAMP boundary | Disqualified — JVM operational story required | Single static binary, easy SBOM, FedRAMP-friendly |
./nanodlp daemon."PDF is the worst-behaved common document format. The container is well-spec'd; the contents are font subsets, custom CMaps, ligatures, content-stream operator soup. Apache PDFBox has had ~20 years of optimization. Pure-Rust alternatives (pdf-extract, lopdf-based) are 2–3× slower than PDFBox.
We benchmarked three PDF backends before settling on the fastest:
| Backend | medium.pdf (81 KB / 50 pp) | large.pdf (810 KB / 500 pp) | vs Tika PDFBox |
|---|---|---|---|
| Tika PDFBox (Java) | 26.4 ms · 3 MB/s | 234.7 ms · 3 MB/s | baseline |
| pdf-extract (pure Rust, popular) | 56.7 ms · 1.4 MB/s | 603.7 ms · 1.3 MB/s | 0.5× ⚠ slower |
| pdfium-render → Google's pdfium (FFI) | 32.7 ms · 2.5 MB/s | 334.8 ms · 2.4 MB/s | 0.9× ≈ tied |
| parsekit custom tokenizer | 1.05 ms · 77 MB/s | 10.3 ms · 79 MB/s | 25× / 23× |
PDFBox and pdf-extract both spend the bulk of their runtime on reading-order reconstruction — sorting glyph rectangles into paragraphs, collapsing inter-glyph spacing, merging columns, detecting text direction. That work matters if you're building Acrobat Reader or doing OCR-quality text extraction.
It does not matter for DLP, RAG, or search. The bytes coming out of Tj/TJ operators in source order are the answer. We skip the reconstruction. Output equivalence with Tika: within 0.05% character count on every PDF we tested.
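To show how little machinery source-order extraction needs, here is a heavily simplified sketch in the spirit of that approach. It only handles literal parenthesized strings in an already-decompressed content stream; the function name and structure are ours for illustration, not parsekit's actual tokenizer, and real PDFs also need hex strings, full escape handling, encodings, and CMaps.

```rust
// Pull string literals out of a decompressed PDF content stream in source
// order, with no glyph positioning or reading-order reconstruction.
fn strings_in_source_order(stream: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        if stream[i] == b'(' {
            let start = i + 1;
            let mut depth = 1;
            i += 1;
            while i < stream.len() && depth > 0 {
                match stream[i] {
                    b'\\' => i += 1,    // skip the escaped byte
                    b'(' => depth += 1, // balanced parens are legal inside literals
                    b')' => depth -= 1,
                    _ => {}
                }
                i += 1;
            }
            if depth == 0 {
                out.push(stream[start..i - 1].to_vec()); // literal, in stream order
            }
        } else {
            i += 1;
        }
    }
    out
}
```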
We don't trash anyone — we explain the architectural tradeoff each vendor made and why it matters specifically for DLP.
Strong cloud DLP, but single-vendor blindness by design.
| Attribute | Microsoft Purview | Google Workspace DLP | nanodlp |
|---|---|---|---|
| Coverage of your SaaS sprawl | M365 + limited cross-cloud | Workspace only | Drive + M365 + Slack + Dropbox + GitHub + … |
| Where data lives during scan | Microsoft cloud | Google cloud | Your environment |
| BAA required (healthcare) | Yes (Microsoft as BA) | Yes (Google as BA) | No (architecturally not a BA) |
| Schrems II posture (EU customers) | US data flow disclosure | US data flow disclosure | Zero data flow |
Heavyweight enterprise DLP. Mature, expensive, slow to deploy.
| Attribute | Symantec / Forcepoint / BigID | nanodlp |
|---|---|---|
| Time to first scan | 4–12 weeks (dedicated implementation engineer) | 10 minutes (single binary install) |
| Deployment complexity | Multi-server, Postgres, Java, agents | Single binary |
| Per-doc throughput | ~5–15 MB/s (PDFBox/POI internals) | 18–25× faster |
| Cost | $30–80/seat/mo + $50k–500k implementation | $99/seat/mo, no implementation cost |
| UI vintage | 2010–2014 era | Modern (built last year) |
Different architectural choice. Their data goes to their cloud.
| Attribute | Nightfall AI | nanodlp |
|---|---|---|
| Where bytes go during scan | Nightfall's cloud | Your environment |
| BAA required | Yes (Nightfall as BA) | No |
| Subprocessor disclosure | Yes (Nightfall + their cloud + their model providers) | None |
| Latency model | API call per scan (~50–200ms p95 + network) | In-process (1–10ms p99) |
Same trust architecture as ours — different scope.
| Attribute | DSPM (Cyera / Sentra / etc.) | nanodlp |
|---|---|---|
| Coverage scope | Cloud data warehouses (S3, Snowflake, BigQuery, RDS) | SaaS app file content (Drive, M365, Slack, Dropbox, GitHub) |
| File format extraction | Shallow (mostly text-only) | parsekit: 12 formats, 18× faster than Tika |
| Detector breadth | ~50 PII / PCI / PHI patterns | 123 patterns including secrets, AI keys, intl PII |
| Architectural overlap | High — same trust boundary model | High — complementary, not competing |
Trufflehog has 700+ detectors. We have 123. Why pick nanodlp?
| Attribute | Trufflehog / Gitleaks | nanodlp |
|---|---|---|
| Code repo scanning | ✅ Excellent | ✅ Via GitHub connector |
| SaaS app scanning (Drive, M365, Slack) | ❌ | ✅ |
| Document format extraction (PDF, DOCX) | ❌ | ✅ parsekit |
| Compliance reporting (HITRUST, SOC2, GDPR) | ❌ | Roadmapped |
| Multi-org control plane | ❌ | Hosted SaaS |
| Detector count | 700+ | 123 (extensible via TOML; importing Trufflehog regexes) |
Four things, in roughly decreasing order of impact.
Apache Tika runs on the JVM. The JVM's GC is excellent — G1GC and ZGC are engineering marvels. But the GC must exist because the JVM allocates. Every String, every byte[], every SAX event object goes on the heap. Under load, GC pauses are 50–500ms. nanodlp is a single native Rust binary. RSS = input + inflated buffer + output. Deterministic. Bounded. Measurable.
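To make "deterministic, bounded" concrete, here is the back-of-envelope model that formula implies, as a sketch. The function name and the 10× inflation bound are illustrative assumptions, not parsekit internals.

```rust
// Worst-case worker RSS under the model above: one input buffer, one
// inflated (decompressed) buffer, one output buffer. No heap graph, no GC.
fn peak_rss_bytes(input_len: usize) -> usize {
    let inflated = input_len * 10; // assumed bound for zipped OOXML parts
    let output = inflated;         // extracted text can't exceed the inflated XML
    input_len + inflated + output  // RSS = input + inflated buffer + output
}

fn main() {
    // A 1 MB docx peaks around 21 MB, inside the 5-30 MB RSS band above.
    println!("{} bytes", peak_rss_bytes(1 << 20));
}
```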
Apache POI builds an XmlObject DOM for every part of every OOXML document. Every text node becomes a Java String allocation. Every cell in xlsx becomes an object. That's why POI's xlsx throughput is ~5–10 MB/s. parsekit walks quick-xml SAX events directly. Every text node is a &[u8] slice into a single inflated buffer. The output is Vec<u8>::extend_from_slice — a memcpy. No String allocation per element. No DOM. No GC.
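A minimal sketch of that SAX-style walk using quick-xml's streaming events; this is the shape of the technique, not parsekit's actual code, and real extraction would also unescape entities and filter by element.

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

/// Append every text node in `xml` to `out` without per-node String allocation.
fn extract_text(xml: &[u8], out: &mut Vec<u8>) -> Result<(), quick_xml::Error> {
    let mut reader = Reader::from_reader(xml);
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf)? {
            // The event derefs to &[u8]: copying it out is a single memcpy.
            // No DOM node, no String, nothing for a GC to track.
            Event::Text(e) => out.extend_from_slice(&e),
            Event::Eof => break,
            _ => {} // skip start/end tags, attributes, comments
        }
        buf.clear();
    }
    Ok(())
}
```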
PDF: skip reading-order reconstruction. We don't need glyph positions for DLP. XLSX: skip cell formula evaluation. The cell formula is what matters for DLP, not what it would compute to. DOCX: skip style resolution. Style of text doesn't affect whether it's a credit card. HTML: skip the full HTML5 tokenizer. SIMD-accelerated tag-stripping is correct for our use case. These aren't lazy decisions; they're deliberate scope cuts. We would lose against Tika on rendering tasks. We're not in that business.
We use memchr (BurntSushi's crate) under the hood for boundary scans: < in HTML, --boundary in MIME, ( in PDF content streams. Modern CPUs do this in 16–32 bytes per cycle. Tika's character-by-character scans cap out at hundreds of MB/s; ours cap at GB/s.
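For instance, the HTML tag-stripping path can be built almost entirely out of two memchr calls per tag. A toy version of that idea follows (ours for illustration; production code must special-case script/style bodies, comments, and quoted > characters):

```rust
use memchr::memchr;

/// Copy everything outside <...> tags from `html` into `out`.
fn strip_tags(html: &[u8], out: &mut Vec<u8>) {
    let mut pos = 0;
    while pos < html.len() {
        match memchr(b'<', &html[pos..]) {
            Some(i) => {
                out.extend_from_slice(&html[pos..pos + i]); // text before the tag
                // SIMD-scan to the closing '>' of this tag.
                match memchr(b'>', &html[pos + i..]) {
                    Some(j) => pos += i + j + 1,
                    None => break, // unterminated tag: stop
                }
            }
            None => {
                out.extend_from_slice(&html[pos..]); // trailing text
                break;
            }
        }
    }
}
```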
The numbers are real. Here are the cases where they're misleading.
We use a gen-corpus tool that produces lorem-ipsum + table + tab/break interleaving. Real-world Drive contents have more structural variety; expect ±20% variance from headline numbers.
For subset fonts (typical of Office-emitted PDFs), output is currently glyph indices, not readable text. ~200 LOC to fix; on the roadmap. Until then, PDFs from Office/LaTeX may need the pdfium-render fallback.
Dedicated xlsx parsers beat us on xlsx: ~50–150 MB/s vs our 26 MB/s. They've put more cycles into XLSX-specific optimization. We may absorb their approach if xlsx becomes a workload bottleneck.
MuPDF does ~10–25 MB/s vs our 77 MB/s on small PDFs; on large PDFs it inverts and MuPDF wins. But MuPDF is AGPL, license-incompatible with commercial use without a paid license, so we chose to write our own instead.
We benchmarked Tika 2.9.2; Tika 3.x has similar internals. Expect numbers within ±10%.
Cold-start would make Tika look much worse. We chose JIT-warmed because it's what services see in production. We're being conservative.
Don't trust marketing claims. Run the benchmark. Your hardware will produce different absolute numbers but the ratios should be within ±20% of ours.
```bash
# Clone the repo
git clone https://github.com/yourorg/nanodlp.git
cd nanodlp/parsekit

# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release --workspace --examples

# Generate the synthetic corpus
cargo run --release -p gen-corpus

# Run parsekit's benchmarks
cargo bench -p parsekit-docx

# Run the Tika side-by-side comparison (downloads tika-app jar on first run)
./tools/run-comparison.sh
```
The full reproducibility instructions are at parsekit/BENCHMARKS.md. All our numbers are from this script.