Development of a De-identification Model for Electronic Health Records in Buenos Aires City Healthcare (SALA 2026)

Author

María Nanton

Published

March 17, 2026

Together with Ariana Bardauil, Juliana Reyes Szemere, Camila Ebensrtejin, Hugo Neumayer, Sofía Anastasía, Alejandro Domazet, Max Wolf, Cecilia Palermo, and Milagro Teruel, we presented a poster on our de-identification project at the Summit of AI in Latin America (SALA 2026).

The work is being developed at the Dirección General de Sistemas de Información Sanitaria (DGSISAN), Ministerio de Salud, Ciudad Autónoma de Buenos Aires. The Buenos Aires City public health system’s Electronic Health Records database contains over 45 million free-text progress notes written by clinicians during consultations. These notes can contain sensitive patient identifiers such as full names, ID numbers, addresses, and phone numbers. The goal is to build a named entity recognition (NER) model that detects and masks these identifiers, enabling broader data availability while protecting patient privacy.
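To make the "detect and mask" step concrete, here is a minimal sketch of the masking half, assuming a detector has already returned character spans. The `mask_spans` helper, the `(start, end, label)` span format, the bracketed placeholders, and the example note are all illustrative assumptions, not the project's actual code or data.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already masked.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example note; the spans stand in for a detector's output.
note = "Paciente Juan Perez, DNI 12345678, consulta por tos."
spans = [(9, 19, "NAME"), (25, 33, "ID")]
masked = mask_spans(note, spans)
print(masked)  # Paciente [NAME], DNI [ID], consulta por tos.
```

Replacing identifiers with a label rather than deleting them keeps the sentence readable for downstream use, which is one common design choice among several.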

We annotated a sample of 5,000 clinical progress notes across specialties and levels of care, and we are currently exploring three modeling approaches: a rule-based system adapted for Spanish and Argentine-specific identifiers, a BiLSTM NER model trained on our annotated data, and a hybrid ensemble combining both. All three are designed to run on lightweight City infrastructure.
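As a rough illustration of how a rule-based detector and a hybrid ensemble could fit together, the sketch below uses two regular expressions and a span union. Everything in it is an assumption made for the example: the patterns (a 7 to 8 digit ID number with optional dots, and a phone number starting with 11 or 15), the function names, the union strategy, and the hard-coded `model_spans` standing in for a trained tagger's output. The project's real rules and ensemble logic may differ.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Illustrative patterns only; real rules would need far broader coverage.
DNI_RE = re.compile(r"\b\d{1,2}\.?\d{3}\.?\d{3}\b")
PHONE_RE = re.compile(r"\b(?:11|15)[\s-]?\d{4}[\s-]?\d{4}\b")


def rule_spans(text: str) -> List[Span]:
    """Find candidate phone and ID-number spans with regular expressions."""
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "ID") for m in DNI_RE.finditer(text)]
    return spans


def merge_spans(*span_lists: List[Span]) -> List[Span]:
    """Union of all detectors' spans; overlapping spans are fused,
    keeping the label of the span that sorts first."""
    merged: List[Span] = []
    for start, end, label in sorted(s for spans in span_lists for s in spans):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged


text = "Juan Perez, DNI 12.345.678, tel 11 4567-8901"
# Hypothetical output of a trained NER tagger (hard-coded for the sketch).
model_spans = [(0, 10, "NAME"), (16, 26, "ID")]
combined = merge_spans(rule_spans(text), model_spans)
print(combined)  # [(0, 10, 'NAME'), (16, 26, 'ID'), (32, 44, 'PHONE')]
```

A union favors recall, since a span flagged by either detector is kept. That is a plausible default for de-identification, where a missed identifier typically costs more than an over-masked token.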