🚨 OpenMed just mass-released 35 state-of-the-art PII detection models to the open-source community!
All Apache 2.0. All free. Forever. 🍀
Here's what @OpenMed_AI built and why it matters for healthcare AI safety. Built to support HIPAA, GDPR, and other privacy compliance work.
Thread 🧵👇
The results speak for themselves:
- 96.08% F1 on our best model (DeBERTa-v3-large)
- Top 10 models all above 95.7% F1
- 35 models ranging from 33M to 600M parameters
- Built on top of Nemotron-PII by NVIDIA
This isn't a single model release. It's an entire ecosystem.
OpenMed's PII models detect 54 types of sensitive information:
Identifiers: SSN, passport, medical record numbers, credit cards, API keys, passwords...
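Quick sketch of what you can do once a detector hands back entity spans: swap each one for a placeholder. This is not OpenMed's API. The `Span` type, the label names, and the hard-coded offsets below are assumptions for illustration, and the helper assumes spans do not overlap.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    label: str  # hypothetical label name, e.g. "SSN"


def redact(text: str, spans: list[Span]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out


# Hand-written spans standing in for real model output.
note = "Patient SSN 123-45-6789, MRN 00123."
spans = [Span(12, 23, "SSN"), Span(29, 34, "MRN")]
redacted = redact(note, spans)
print(redacted)
```

In practice the spans would come from whichever model you load, with its own label set.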
- One place to connect Images ↔ Cases ↔ Articles
- Stable IDs and clean joins
- Built for classification, retrieval, grounding, VQA/doc-QA, and multimodal modeling
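To make "stable IDs and clean joins" concrete, here is a minimal sketch of walking Image → Case → Article through ID lookups. The field names (`image_id`, `case_id`, `article_id`) and the records are made up for illustration; the real schema may differ.

```python
# Hypothetical records keyed by stable IDs; not the dataset's actual schema.
articles = {"art-1": {"title": "Example article"}}
cases = {"case-1": {"article_id": "art-1", "summary": "Example case"}}
images = [
    {"image_id": "img-1", "case_id": "case-1", "caption": "Example figure A"},
    {"image_id": "img-2", "case_id": "case-1", "caption": "Example figure B"},
]


def join_image(image: dict) -> dict:
    """Follow the ID links from one image to its case and article."""
    case = cases[image["case_id"]]
    article = articles[case["article_id"]]
    return {
        "image_id": image["image_id"],
        "caption": image["caption"],
        "case_summary": case["summary"],
        "article_title": article["title"],
    }


rows = [join_image(img) for img in images]
```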
• @vllm_project for local inference on Llama-3.1-70B Instruct
• @GroqInc for blazing fast end-to-end testing
• @huggingface Inference Endpoints for Cohere Command R+ comparisons
• @Gradio for an intuitive, responsive UI
🚀🔥
The implementation process was fascinating. I started with question generation, moved to clustering, and then dove into the iterative demonstration unification. Each step presented unique challenges and opportunities for optimization.
One of the most interesting aspects was seeing how different models handled the ECHO process. The performance improvements were notable, especially in complex reasoning tasks.
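For the curious, a rough sketch of the unification loop as I understand it: each demonstration's rationale gets regenerated with the other demonstrations as in-context examples. Assumptions: question generation and clustering have already produced one demo per cluster, `generate` is a hypothetical callable wrapping whatever LLM you use, and the loop goes round-robin rather than sampling demos at random.

```python
from typing import Callable

Demo = dict[str, str]  # {"question": ..., "rationale": ...}


def build_prompt(examples: list[Demo], question: str) -> str:
    """Few-shot prompt: the other demos first, then the target question."""
    shots = "".join(
        f"Q: {d['question']}\nA: {d['rationale']}\n\n" for d in examples
    )
    return f"{shots}Q: {question}\nA: Let's think step by step."


def unify(
    demos: list[Demo],
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> text
    rounds: int = 2,
) -> list[Demo]:
    """Iteratively rewrite each rationale using the others as examples."""
    demos = [dict(d) for d in demos]  # copy so the caller's list is untouched
    for _ in range(rounds):
        for i, demo in enumerate(demos):
            others = demos[:i] + demos[i + 1:]
            demo["rationale"] = generate(build_prompt(others, demo["question"]))
    return demos
```

Because later rewrites see the already-updated rationales, the demonstrations tend to drift toward a shared style over a few rounds.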
1/4 🚀 Exciting news for AI enthusiasts! Check out NuExtract, a cutting-edge LLM designed for structured extraction tasks. It transforms any text into a structured output with just a template!
3/4 🛠️ This project used Llama-3 70B to annotate 50k documents, then fine-tuned Phi-3-mini, Phi-3-small, and Qwen1.5-0.5B on them. Reducing hallucinations was a key challenge.
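To illustrate the template idea from 1/4: a template can be thought of as a JSON skeleton with empty values, and the model's output should come back with the same shape. The checker below is a sketch, not NuExtract's code. The template fields, the hand-written `raw_output`, and the rule that every leaf is a string are all assumptions.

```python
import json


def matches_template(template, output) -> bool:
    """True if output has exactly the template's nested keys and shapes."""
    if isinstance(template, dict):
        return (
            isinstance(output, dict)
            and template.keys() == output.keys()
            and all(matches_template(template[k], output[k]) for k in template)
        )
    if isinstance(template, list):
        if not isinstance(output, list):
            return False
        # An empty template list accepts any list; otherwise its first
        # element describes the shape of every item.
        return not template or all(
            matches_template(template[0], item) for item in output
        )
    # Assumption: every leaf value is a string (empty when nothing was found).
    return isinstance(output, str)


# Hypothetical template and a hand-written stand-in for model output.
template = {"model": {"name": "", "parameters": ""}, "base_models": [""]}
raw_output = (
    '{"model": {"name": "NuExtract", "parameters": ""}, '
    '"base_models": ["Phi-3-mini"]}'
)
ok = matches_template(template, json.loads(raw_output))
```

A check like this is one cheap way to catch outputs that add or drop fields before they reach downstream code.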