nanodlp scans 18× faster than Apache Tika and uses 30× less memory per worker. Those aren't numbers we print to look good; they're what makes customer-side DLP feasible without a dedicated infrastructure team.
Apache 2.0 · Single Rust binary · Runs on a laptop · Reproducible benchmarks
All numbers measured on macOS arm64 (Apple Silicon), Rust 1.95.0 with target-cpu=native + lto=fat, Java Corretto 21 with -Xmx2g -XX:+UseG1GC. Apache Tika is JIT-warmed (10 prior iterations) before each measurement. Cold-start comparisons would widen every ratio further.
| Format | nanodlp (parsekit) | Tika 2.9.2 | Speedup |
|---|---|---|---|
| txt | 0.005 ms · 40,689 MB/s | 1.18 ms · 164 MB/s | 250× |
| html | 0.013 ms · 12,721 MB/s | 2.99 ms · 56 MB/s | 227× |
| eml (RFC 5322 + MIME) | 0.014 ms · 14,101 MB/s | 1.67 ms · 116 MB/s | 122× |
| docx (small) | 0.029 ms · 174 MB/s | 2.20 ms · 2 MB/s | 75× |
| pptx | 0.33 ms · 85 MB/s | 14.59 ms · 2 MB/s | 44× |
| pdf (medium, 50 pp) | 1.05 ms · 77 MB/s | 26.41 ms · 3 MB/s | 25× |
| pdf (large, 500 pp) | 10.29 ms · 79 MB/s | 234.70 ms · 3 MB/s | 23× |
| docx (medium) | 0.26 ms · 173 MB/s | 4.77 ms · 9 MB/s | 18× |
| md (markdown) | 0.07 ms · 2,241 MB/s | 1.17 ms · 135 MB/s | 17× |
| csv | 0.44 ms · 831 MB/s | 5.62 ms · 64 MB/s | 13× |
| xml | 0.23 ms · 1,332 MB/s | 1.53 ms · 197 MB/s | 6.8× |
| rtf | 0.24 ms · 807 MB/s | 1.30 ms · 149 MB/s | 5.4× |
| docx (large) | 4.07 ms · 107 MB/s | 20.80 ms · 21 MB/s | 5.1× |
| docx (xlarge) | 21.01 ms · 103 MB/s | 103.04 ms · 21 MB/s | 4.9× |
| json | 0.27 ms · 1,049 MB/s | 0.87 ms · 323 MB/s | 3.2× |
| xlsx | 4.40 ms · 26 MB/s | 13.88 ms · 8 MB/s | 3.2× |
| Geomean across all 16 cases | | | ~21× |
| Median across all 16 cases | | | ~17× |
See the full reproducible benchmark setup → BENCHMARKS.md on GitHub
Most DLP vendors talk about speed. Memory is what actually locks you into heavy infrastructure. Here's why.
Most DLP workloads aren't single-doc pipelines — they're fanouts of many continuous workers. One per connector. One per tenant. One per region. One per concurrent fetch. The thing that breaks first isn't aggregate speed; it's RAM per worker.
| Metric | Apache Tika | nanodlp | What this means in a real deployment |
|---|---|---|---|
| Steady-state RSS per worker | 300–400 MB | 5–30 MB | nanodlp runs 30–80× more workers per box |
| Workers on a 16 GB laptop | ~40 | 3,000+ | nanodlp deploys on a Mac; Tika needs an EKS node pool |
| Workers on a 64 GB cloud node | ~150 | 12,000+ | nanodlp = one node per region; Tika = a 4–6 node pool |
| Cold start | 200–500 ms (JVM + class load) | <5 ms (binary exec) | Bursty workloads, daemon restarts, autoscaling latency |
| GC pauses | 50–500 ms STW under load | None (no GC) | Tail-latency stability; required for real-time policy enforcement |
| Container image size | 350–600 MB (JRE + Tika + POI + PDFBox) | ~22 MB stripped | k8s startup time; image-pull bandwidth cost |
A 5,000-person company's typical SaaS DLP corpus: 5,000 employees × ~100 GB each = ~500 TB. ~500 million documents at ~1 MB average.
Numbers are great. Here's what they translate to in real deployments.
| Workload | Tika typical setup | nanodlp typical setup |
|---|---|---|
| Solo professional, 50 GB Drive | n/a — Tika is too heavy | One Mac binary, daemon mode, 5 MB RSS |
| Mid-market SMB, 5 TB SaaS corpus | 4-node EKS cluster, $1,500/mo | One AWS small instance, $50/mo |
| Enterprise, 500 TB SaaS corpus | 12+ node cluster, $25,000+/mo | One small instance per data-residency region, $100–300/mo total |
| Hospital system, 200 TB cross-app | Heavyweight enterprise DLP, $200k+/yr | Single binary, customer-side, no BAA |
| Defense / FedRAMP boundary | Disqualified — JVM operational story required | Single static binary, easy SBOM, FedRAMP-friendly |
./nanodlp daemon."PDF is the worst-behaved common document format. The container is well-spec'd; the contents are font subsets, custom CMaps, ligatures, content-stream operator soup. Apache PDFBox has had ~20 years of optimization. Pure-Rust alternatives (pdf-extract, lopdf-based) are 2–3× slower than PDFBox.
We benchmarked three PDF backends before settling on the fastest:
| Backend | medium.pdf (81 KB / 50 pp) | large.pdf (810 KB / 500 pp) | vs Tika PDFBox |
|---|---|---|---|
| Tika PDFBox (Java) | 26.4 ms · 3 MB/s | 234.7 ms · 3 MB/s | baseline |
| pdf-extract (pure Rust, popular) | 56.7 ms · 1.4 MB/s | 603.7 ms · 1.3 MB/s | 0.5× ⚠ slower |
| pdfium-render → Google's pdfium (FFI) | 32.7 ms · 2.5 MB/s | 334.8 ms · 2.4 MB/s | 0.9× ≈ tied |
| parsekit custom tokenizer | 1.05 ms · 77 MB/s | 10.3 ms · 79 MB/s | 25× / 23× |
PDFBox and pdf-extract both spend the bulk of their runtime on reading-order reconstruction — sorting glyph rectangles into paragraphs, collapsing inter-glyph spacing, merging columns, detecting text direction. That work matters if you're building Acrobat Reader or doing OCR-quality text extraction.
It does not matter for DLP, RAG, or search. The bytes coming out of Tj/TJ operators in source order are the answer. We skip the reconstruction. Output equivalence with Tika: within 0.05% character count on every PDF we tested.
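To show how little machinery source-order extraction needs, here is a heavily simplified sketch in the spirit of that approach. It only handles literal parenthesized strings in an already-decompressed content stream; the function name and structure are ours for illustration, not parsekit's actual tokenizer, and real PDFs also need hex strings, full escape handling, encodings, and CMaps.

```rust
// Pull string literals out of a decompressed PDF content stream in source
// order, with no glyph positioning or reading-order reconstruction.
fn strings_in_source_order(stream: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        if stream[i] == b'(' {
            let start = i + 1;
            let mut depth = 1;
            i += 1;
            while i < stream.len() && depth > 0 {
                match stream[i] {
                    b'\\' => i += 1,    // skip the escaped byte
                    b'(' => depth += 1, // balanced parens are legal inside literals
                    b')' => depth -= 1,
                    _ => {}
                }
                i += 1;
            }
            if depth == 0 {
                out.push(stream[start..i - 1].to_vec()); // literal, in stream order
            }
        } else {
            i += 1;
        }
    }
    out
}
```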
We don't trash anyone — we explain the architectural tradeoff each vendor made and why it matters specifically for DLP.
Strong cloud DLP, but single-vendor blindness by design.
| Attribute | Microsoft Purview | Google Workspace DLP | nanodlp |
|---|---|---|---|
| Coverage of your SaaS sprawl | M365 + limited cross-cloud | Workspace only | Drive + M365 + Slack + Dropbox + GitHub + … |
| Where data lives during scan | Microsoft cloud | Google cloud | Your environment |
| BAA required (healthcare) | Yes (Microsoft as BA) | Yes (Google as BA) | No (architecturally not a BA) |
| Schrems II posture (EU customers) | US data flow disclosure | US data flow disclosure | Zero data flow |
Heavyweight enterprise DLP. Mature, expensive, slow to deploy.
| Attribute | Symantec / Forcepoint / BigID | nanodlp |
|---|---|---|
| Time to first scan | 4–12 weeks (dedicated implementation engineer) | 10 minutes (single binary install) |
| Deployment complexity | Multi-server, Postgres, Java, agents | Single binary |
| Per-doc throughput | ~5–15 MB/s (PDFBox/POI internals) | 18–25× faster |
| Cost | $30–80/seat/mo + $50k–500k implementation | $99/seat/mo, no implementation cost |
| UI vintage | 2010–2014 era | Modern (built last year) |
Different architectural choice. Their data goes to their cloud.
| Attribute | Nightfall AI | nanodlp |
|---|---|---|
| Where bytes go during scan | Nightfall's cloud | Your environment |
| BAA required | Yes (Nightfall as BA) | No |
| Subprocessor disclosure | Yes (Nightfall + their cloud + their model providers) | None |
| Latency model | API call per scan (~50–200ms p95 + network) | In-process (1–10ms p99) |
Same trust architecture as ours — different scope.
| Attribute | DSPM (Cyera / Sentra / etc.) | nanodlp |
|---|---|---|
| Coverage scope | Cloud data warehouses (S3, Snowflake, BigQuery, RDS) | SaaS app file content (Drive, M365, Slack, Dropbox, GitHub) |
| File format extraction | Shallow (mostly text-only) | parsekit: 12 formats, 18× faster than Tika |
| Detector breadth | ~50 PII / PCI / PHI patterns | 123 patterns including secrets, AI keys, intl PII |
| Architectural overlap | High — same trust boundary model | High — complementary, not competing |
Trufflehog has 700+ detectors. We have 123. Why pick nanodlp?
| Attribute | Trufflehog / Gitleaks | nanodlp |
|---|---|---|
| Code repo scanning | ✅ Excellent | ✅ Via GitHub connector |
| SaaS app scanning (Drive, M365, Slack) | ❌ | ✅ |
| Document format extraction (PDF, DOCX) | ❌ | ✅ parsekit |
| Compliance reporting (HITRUST, SOC2, GDPR) | ❌ | Roadmapped |
| Multi-org control plane | ❌ | Hosted SaaS |
| Detector count | 700+ | 123 (extensible via TOML; importing Trufflehog regexes) |
Four things, in roughly decreasing order of impact.
Apache Tika runs on the JVM. The JVM's GC is excellent — G1GC and ZGC are engineering marvels. But the GC must exist because the JVM allocates. Every String, every byte[], every SAX event object goes on the heap. Under load, GC pauses are 50–500ms. nanodlp is a single native Rust binary. RSS = input + inflated buffer + output. Deterministic. Bounded. Measurable.
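To make "deterministic, bounded" concrete, here is the back-of-envelope model that formula implies, as a sketch. The function name and the 10× inflation bound are illustrative assumptions, not parsekit internals.

```rust
// Worst-case worker RSS under the model above: one input buffer, one
// inflated (decompressed) buffer, one output buffer. No heap graph, no GC.
fn peak_rss_bytes(input_len: usize) -> usize {
    let inflated = input_len * 10; // assumed bound for zipped OOXML parts
    let output = inflated;         // extracted text can't exceed the inflated XML
    input_len + inflated + output  // RSS = input + inflated buffer + output
}

fn main() {
    // A 1 MB docx peaks around 21 MB, inside the 5-30 MB RSS band above.
    println!("{} bytes", peak_rss_bytes(1 << 20));
}
```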
Apache POI builds an XmlObject DOM for every part of every OOXML document. Every text node becomes a Java String allocation. Every cell in xlsx becomes an object. That's why POI's xlsx throughput is ~5–10 MB/s. parsekit walks quick-xml SAX events directly. Every text node is a &[u8] slice into a single inflated buffer. The output is Vec<u8>::extend_from_slice — a memcpy. No String allocation per element. No DOM. No GC.
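A minimal sketch of that SAX-style walk using quick-xml's streaming events; this is the shape of the technique, not parsekit's actual code, and real extraction would also unescape entities and filter by element.

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

/// Append every text node in `xml` to `out` without per-node String allocation.
fn extract_text(xml: &[u8], out: &mut Vec<u8>) -> Result<(), quick_xml::Error> {
    let mut reader = Reader::from_reader(xml);
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf)? {
            // The event derefs to &[u8]: copying it out is a single memcpy.
            // No DOM node, no String, nothing for a GC to track.
            Event::Text(e) => out.extend_from_slice(&e),
            Event::Eof => break,
            _ => {} // skip start/end tags, attributes, comments
        }
        buf.clear();
    }
    Ok(())
}
```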
PDF: skip reading-order reconstruction. We don't need glyph positions for DLP. XLSX: skip cell formula evaluation. The cell formula is what matters for DLP, not what it would compute to. DOCX: skip style resolution. Style of text doesn't affect whether it's a credit card. HTML: skip the full HTML5 tokenizer. SIMD-accelerated tag-stripping is correct for our use case. These aren't lazy decisions; they're deliberate scope cuts. We would lose against Tika on rendering tasks. We're not in that business.
We use memchr (BurntSushi's crate) under the hood for boundary scans: < in HTML, --boundary in MIME, ( in PDF content streams. Modern CPUs do this in 16–32 bytes per cycle. Tika's character-by-character scans cap out at hundreds of MB/s; ours cap at GB/s.
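For instance, the HTML tag-stripping path can be built almost entirely out of two memchr calls per tag. A toy version of that idea follows (ours for illustration; production code must special-case script/style bodies, comments, and quoted > characters):

```rust
use memchr::memchr;

/// Copy everything outside <...> tags from `html` into `out`.
fn strip_tags(html: &[u8], out: &mut Vec<u8>) {
    let mut pos = 0;
    while pos < html.len() {
        match memchr(b'<', &html[pos..]) {
            Some(i) => {
                out.extend_from_slice(&html[pos..pos + i]); // text before the tag
                // SIMD-scan to the closing '>' of this tag.
                match memchr(b'>', &html[pos + i..]) {
                    Some(j) => pos += i + j + 1,
                    None => break, // unterminated tag: stop
                }
            }
            None => {
                out.extend_from_slice(&html[pos..]); // trailing text
                break;
            }
        }
    }
}
```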
The numbers are real. Here are the cases where they're misleading.
We use a gen-corpus tool that produces lorem-ipsum + table + tab/break interleaving. Real-world Drive contents have more structural variety; expect ±20% variance from headline numbers.
For subset fonts (typical of Office-emitted PDFs), output is currently glyph indices, not readable text. ~200 LOC to fix; on the roadmap. Until then, PDFs from Office/LaTeX may need the pdfium-render fallback.
Dedicated xlsx parsers beat us on xlsx: ~50–150 MB/s vs our 26 MB/s. They've put more cycles into XLSX-specific optimization. We may absorb their approach if xlsx becomes a workload bottleneck.
MuPDF does ~10–25 MB/s vs our 77 MB/s on small PDFs; on large PDFs it inverts and MuPDF wins. But MuPDF is AGPL, license-incompatible with commercial use without a paid license, so we chose to write our own instead.
We benchmarked Tika 2.9.2; Tika 3.x has similar internals. Expect numbers within ±10%.
Cold-start would make Tika look much worse. We chose JIT-warmed because it's what services see in production. We're being conservative.
Don't trust marketing claims. Run the benchmark. Your hardware will produce different absolute numbers but the ratios should be within ±20% of ours.
```bash
# Clone the repo
git clone https://github.com/yourorg/nanodlp.git
cd nanodlp/parsekit

# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release --workspace --examples

# Generate the synthetic corpus
cargo run --release -p gen-corpus

# Run parsekit's benchmarks
cargo bench -p parsekit-docx

# Run the Tika side-by-side comparison (downloads tika-app jar on first run)
./tools/run-comparison.sh
```
The full reproducibility instructions are at parsekit/BENCHMARKS.md. All our numbers are from this script.