Built so you don't need a Kubernetes cluster
to scan your documents.

nanodlp scans 18× faster than Apache Tika and uses 30× less memory per worker. That's not a number we print to look good — it's what makes customer-side DLP feasible without a dedicated infrastructure team.

Apache 2.0 · Single Rust binary · Runs on a laptop · Reproducible benchmarks

21× geomean speedup vs Tika
across 16 benchmark cases

measured on M-series Mac
reproducible from source

The headline numbers.

All numbers measured on macOS arm64 (Apple Silicon), Rust 1.95.0 with target-cpu=native + lto=fat, Java Corretto 21 with -Xmx2g -XX:+UseG1GC. Apache Tika is JIT-warmed (10 prior iterations) before each measurement. Cold-start comparisons would widen every ratio further.

Format nanodlp (parsekit) Tika 2.9.2 Speedup
txt 0.005 ms · 40,689 MB/s 1.18 ms · 164 MB/s 250×
html 0.013 ms · 12,721 MB/s 2.99 ms · 56 MB/s 227×
eml (RFC 5322 + MIME) 0.014 ms · 14,101 MB/s 1.67 ms · 116 MB/s 122×
docx (small) 0.029 ms · 174 MB/s 2.20 ms · 2 MB/s 75×
pptx 0.33 ms · 85 MB/s 14.59 ms · 2 MB/s 44×
pdf (medium, 50 pp) 1.05 ms · 77 MB/s 26.41 ms · 3 MB/s 25×
pdf (large, 500 pp) 10.29 ms · 79 MB/s 234.70 ms · 3 MB/s 23×
docx (medium) 0.26 ms · 173 MB/s 4.77 ms · 9 MB/s 18×
md (markdown) 0.07 ms · 2,241 MB/s 1.17 ms · 135 MB/s 17×
csv 0.44 ms · 831 MB/s 5.62 ms · 64 MB/s 13×
xml 0.23 ms · 1,332 MB/s 1.53 ms · 197 MB/s 6.8×
rtf 0.24 ms · 807 MB/s 1.30 ms · 149 MB/s 5.4×
docx (large) 4.07 ms · 107 MB/s 20.80 ms · 21 MB/s 5.1×
docx (xlarge) 21.01 ms · 103 MB/s 103.04 ms · 21 MB/s 4.9×
json 0.27 ms · 1,049 MB/s 0.87 ms · 323 MB/s 3.2×
xlsx 4.40 ms · 26 MB/s 13.88 ms · 8 MB/s 3.2×
Geomean across all 16 cases ~21×
Median across all 16 cases ~17×
Where we win least: xlsx — calamine is faster than us specifically here (~50–150 MB/s vs our 26 MB/s). We may absorb their approach if xlsx becomes a workload bottleneck.

Where we win most: anything where Tika's per-call JVM overhead dominates — small files, single-doc latency-sensitive workloads, cold-path invocations.

See the full reproducible benchmark setup → BENCHMARKS.md on GitHub

Why memory matters more than speed for DLP.

Most DLP vendors talk about speed. Memory is the lock-in. Here's why.

Most DLP workloads aren't single-doc pipelines — they're fanouts of many continuous workers. One per connector. One per tenant. One per region. One per concurrent fetch. The thing that breaks first isn't aggregate speed; it's RAM per worker.

Metric Apache Tika nanodlp What this means in a real deployment
Steady-state RSS per worker 300–400 MB 5–30 MB nanodlp runs 30–80× more workers per box
Workers on a 16 GB laptop ~40 3,000+ nanodlp deploys on a Mac; Tika needs an EKS node pool
Workers on a 64 GB cloud node ~150 12,000+ nanodlp = one node per region; Tika = a 4–6 node pool
Cold start 200–500 ms (JVM + class load) <5 ms (binary exec) Bursty workloads, daemon restarts, autoscaling latency
GC pauses 50–500 ms STW under load None (no GC) Tail-latency stability; required for real-time policy enforcement
Container image size 350–600 MB (JRE + Tika + POI + PDFBox) ~22 MB stripped k8s startup time; image-pull bandwidth cost

Concrete deployment example

A 5,000-person company's typical SaaS DLP corpus: 5,000 employees × ~100 GB each = ~500 TB. ~500 million documents at ~1 MB average.

With Tika
500 GB RSS per worker
× 1,500 workers needed
= 750 GB RAM
= ~12 i3.4xlarge instances
$25,000/month minimum
With nanodlp
5 MB RSS per worker
× 1,500 workers
= 7.5 GB RAM
= ~1 r5.large instance
$91/month
The 18× speed advantage is necessary but not sufficient. The 60× memory advantage is what makes "data plane in customer's environment" actually deployable — instead of "data plane in customer's environment IF they're willing to spin up an EKS cluster, which is a no-go for their security team."

What's actually deployable.

Numbers are great. Here's what they translate to in real deployments.

Workload Tika typical setup nanodlp typical setup
Solo professional, 50 GB Drive n/a — Tika is too heavy One Mac binary, daemon mode, 5 MB RSS
Mid-market SMB, 5 TB SaaS corpus 4-node EKS cluster, $1,500/mo One AWS small instance, $50/mo
Enterprise, 500 TB SaaS corpus 12+ node cluster, $25,000+/mo One small region per data residency boundary, $100–300/mo total
Healthcare hospital, 200 TB cross-app Heavyweight enterprise DLP, $200k+/yr Single binary, customer-side, no BAA
Defense / FedRAMP boundary Disqualified — JVM operational story required Single static binary, easy SBOM, FedRAMP-friendly
This is the difference: Tika's architecture is "we'll ship Java, you operationalize it." nanodlp's architecture is "we'll ship a binary, you ./nanodlp daemon."

For a security team without a dedicated platform group — which is most security teams under 50 people — Tika's architecture is a non-starter. nanodlp's is a Tuesday-afternoon deploy.

PDF is the format that breaks every DLP product.
Here's how we beat Tika by 25×.

PDF is the worst-behaved common document format. The container is well-spec'd; the contents are font subsets, custom CMaps, ligatures, content-stream operator soup. Apache PDFBox has had ~20 years of optimization. Pure-Rust alternatives (pdf-extract, lopdf-based) are 2–3× slower than PDFBox.

We benchmarked three PDF backends before settling on the fastest:

Backend medium.pdf (81 KB / 50 pp) large.pdf (810 KB / 500 pp) vs Tika PDFBox
Tika PDFBox (Java) 26.4 ms · 3 MB/s 234.7 ms · 3 MB/s baseline
pdf-extract (pure Rust, popular) 56.7 ms · 1.4 MB/s 603.7 ms · 1.3 MB/s 0.5× ⚠ slower
pdfium-render → Google's pdfium (FFI) 32.7 ms · 2.5 MB/s 334.8 ms · 2.4 MB/s 0.9× ≈ tied
parsekit custom tokenizer 1.05 ms · 77 MB/s 10.3 ms · 79 MB/s 25× / 23×

The trick is what we don't do.

PDFBox and pdf-extract both spend the bulk of their runtime on reading-order reconstruction — sorting glyph rectangles into paragraphs, collapsing inter-glyph spacing, merging columns, detecting text direction. That work matters if you're building Acrobat Reader or doing OCR-quality text extraction.

It does not matter for DLP, RAG, or search. The bytes coming out of Tj/TJ operators in source order are the answer. We skip the reconstruction. Output equivalence with Tika: within 0.05% character count on every PDF we tested.

That's the entire 25× win. It's not a clever new algorithm; it's a scope cut that everyone else missed because their primary use case was different.

Honest comparison. Architectural tradeoffs explained.

We don't trash anyone — we explain the architectural tradeoff each vendor made and why it matters specifically for DLP.

vs Microsoft Purview / Google Workspace DLP

Strong cloud DLP, but single-vendor blindness by design.

AttributeMicrosoft PurviewGoogle Workspace DLPnanodlp
Coverage of your SaaS sprawl M365 + limited cross-cloud Workspace only Drive + M365 + Slack + Dropbox + GitHub + …
Where data lives during scan Microsoft cloud Google cloud Your environment
BAA required (healthcare) Yes (Microsoft as BA) Yes (Google as BA) No (architecturally not a BA)
Schrems II posture (EU customers) US data flow disclosure US data flow disclosure Zero data flow
When customers pick us over Purview: cross-cloud organizations (most enterprises today) or any healthcare / EU customer who can't credibly justify another Business Associate.

vs Symantec / Broadcom DLP / Forcepoint / BigID

Heavyweight enterprise DLP. Mature, expensive, slow to deploy.

AttributeSymantec / Forcepoint / BigIDnanodlp
Time to first scan 4–12 weeks (dedicated implementation engineer) 10 minutes (single binary install)
Deployment complexity Multi-server, Postgres, Java, agents Single binary
Per-doc throughput ~5–15 MB/s (PDFBox/POI internals) 18–25× faster
Cost $30–80/seat/mo + $50k–500k implementation $99/seat/mo, no implementation cost
UI vintage 2010–2014 era Modern (built last year)
When customers pick us over Symantec: any time-pressured deal where "deploy it next week" matters, or any deal where the buyer doesn't have an enterprise procurement team already familiar with Symantec's quirks.

vs Nightfall AI

Different architectural choice. Their data goes to their cloud.

AttributeNightfall AInanodlp
Where bytes go during scan Nightfall's cloud Your environment
BAA required Yes (Nightfall as BA) No
Subprocessor disclosure Yes (Nightfall + their cloud + their model providers) None
Latency model API call per scan (~50–200ms p95 + network) In-process (1–10ms p99)
When customers pick us over Nightfall: any procurement review with "where does our data go?" as a top-3 question.

vs Cyera / Sentra / Laminar / Dig (DSPM)

Same trust architecture as ours — different scope.

AttributeDSPM (Cyera / Sentra / etc.)nanodlp
Coverage scope Cloud data warehouses (S3, Snowflake, BigQuery, RDS) SaaS app file content (Drive, M365, Slack, Dropbox, GitHub)
File format extraction Shallow (mostly text-only) parsekit: 12 formats, 18× faster than Tika
Detector breadth ~50 PII / PCI / PHI patterns 123 patterns including secrets, AI keys, intl PII
Architectural overlap High — same trust boundary model High — complementary, not competing
When customers pick us over DSPM: when their sensitive data lives in Drive / OneDrive / Slack docs (most companies), not just in Snowflake. We're complementary; some customers will run both.

vs Trufflehog / Gitleaks

Trufflehog has 700+ detectors. We have 123. Why pick nanodlp?

AttributeTrufflehog / Gitleaksnanodlp
Code repo scanning ✅ Excellent ✅ Via GitHub connector
SaaS app scanning (Drive, M365, Slack)
Document format extraction (PDF, DOCX) ✅ parsekit
Compliance reporting (HITRUST, SOC2, GDPR) Roadmapped
Multi-org control plane Hosted SaaS
Detector count 700+ 123 (extensible via TOML; importing Trufflehog regexes)
When customers pick us over Trufflehog: when they need a product, not a CLI. We import compatible detector regexes from Trufflehog directly (Apache 2.0); the detector count gap is closing.

How we got here.

Three things, in roughly decreasing order of impact.

a. No JVM. No GC. No heap.

Apache Tika runs on the JVM. The JVM's GC is excellent — G1GC and ZGC are engineering marvels. But the GC must exist because the JVM allocates. Every String, every byte[], every SAX event object goes on the heap. Under load, GC pauses are 50–500ms. nanodlp is a single native Rust binary. RSS = input + inflated buffer + output. Deterministic. Bounded. Measurable.

b. We don't build DOMs.

Apache POI builds an XmlObject DOM for every part of every OOXML document. Every text node becomes a Java String allocation. Every cell in xlsx becomes an object. That's why POI's xlsx throughput is ~5–10 MB/s. parsekit walks quick-xml SAX events directly. Every text node is a &[u8] slice into a single inflated buffer. The output is Vec<u8>::extend_from_slice — a memcpy. No String allocation per element. No DOM. No GC.

c. We skip work nobody needed.

PDF: skip reading-order reconstruction. We don't need glyph positions for DLP. XLSX: skip cell formula evaluation. The cell formula is what matters for DLP, not what it would compute to. DOCX: skip style resolution. Style of text doesn't affect whether it's a credit card. HTML: skip the full HTML5 tokenizer. SIMD-accelerated tag-stripping is correct for our use case. These aren't lazy decisions; they're deliberate scope cuts. We would lose against Tika on rendering tasks. We're not in that business.

d. SIMD where it counts.

memchr (BurntSushi) under the hood for boundary scans (< in HTML, --boundary in MIME, ( in PDF content streams). Modern CPUs do this in 16–32 bytes per cycle. Tika's character-by-character scans cap out at hundreds of MB/s; we cap at GB/s.

What we don't claim.

The numbers are real. Here are the cases where they're misleading.

Benchmarks are synthetic

We use a gen-corpus tool that produces lorem-ipsum + table + tab/break interleaving. Real-world Drive contents have more structural variety; expect ±20% variance from headline numbers.

PDF /ToUnicode CMaps not yet handled

For subset fonts (typical of Office-emitted PDFs), output is currently glyph indices, not readable text. ~200 LOC to fix; on the roadmap. Until then, PDFs from Office/LaTeX may need the pdfium-render fallback.

calamine beats us on xlsx

Specifically ~50–150 MB/s vs our 26 MB/s. They've put more cycles into XLSX-specific optimization. We may absorb their approach if xlsx becomes a workload bottleneck.

MuPDF beats us on small PDFs

~10–25 MB/s vs our 77 MB/s on small PDFs. On large PDFs it inverts. MuPDF is AGPL — license-incompatible for commercial use without a paid license. We chose to write our own instead.

Benchmarked against Tika 2.9.2, not 3.x

Tika 3.x has similar internals. Expect numbers within ±10%.

JIT-warmed comparison

Cold-start would make Tika look much worse. We chose JIT-warmed because it's what services see in production. We're being conservative.

Reproduce it yourself.

Don't trust marketing claims. Run the benchmark. Your hardware will produce different absolute numbers but the ratios should be within ±20% of ours.

bash
# Clone the repo
git clone https://github.com/yourorg/nanodlp.git
cd nanodlp/parsekit

# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release --workspace --examples

# Generate the synthetic corpus
cargo run --release -p gen-corpus

# Run parsekit's benchmarks
cargo bench -p parsekit-docx

# Run the Tika side-by-side comparison (downloads tika-app jar on first run)
./tools/run-comparison.sh

The full reproducibility instructions are at parsekit/BENCHMARKS.md. All our numbers are from this script.

Get started → · Read the architecture → · See full benchmark methodology →