A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish

dc.article.number: 204
dc.catalogador: yvc
dc.contributor.author: Dunstan Escudero, Jocelyn Mariel
dc.contributor.author: Vakili, Thomas
dc.contributor.author: Miranda Huerta, Luis Alberto
dc.contributor.author: Villena, Fabián
dc.contributor.author: Aracena, Claudio
dc.contributor.author: Quiroga Curin, Tamara Nancy
dc.contributor.author: Vera, Paulina
dc.contributor.author: Viteri Valenzuela, Sebastián
dc.contributor.author: Rocco, Victor
dc.date.accessioned: 2024-08-01T23:33:57Z
dc.date.available: 2024-08-01T23:33:57Z
dc.date.issued: 2024
dc.date.updated: 2024-07-28T00:04:31Z
dc.description.abstract: Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.
dc.description.auspiciador: Stockholm University (Open Access Funding)
dc.description.funder: ANID Chile Fondo Basal, Centro de Excelencia FB210005 (CMM)
dc.description.funder: PhD Visiting Scholarship No. 2023 (TV)
dc.description.funder: Millennium Science Initiative Program ICN17_002 (IMFD)
dc.description.funder: ANID Fondecyt No. 1241825 (JD)
dc.description.funder: ANID National Doctoral Scholarship 21220200 (FV), 21211659 (CA) and 21220586 (TQ)
dc.description.funder: DataLEASH (TV) ACHS 304-2023
dc.format.extent: 10 pages
dc.identifier.citation: BMC Medical Informatics and Decision Making. 2024, 24(1):204
dc.identifier.doi: 10.1186/s12911-024-02609-w
dc.identifier.uri: https://doi.org/10.1186/s12911-024-02609-w
dc.identifier.uri: https://repositorio.uc.cl/handle/11534/87253
dc.information.autoruc: Escuela de Ingeniería; Dunstan Escudero, Jocelyn Mariel; S/I; 1285723
dc.information.autoruc: Escuela de Ingeniería; Miranda Huerta, Luis Alberto; S/I; 66497
dc.information.autoruc: Escuela de Ingeniería; Quiroga Curin, Tamara Nancy; S/I; 1207385
dc.language.iso: en
dc.nota.acceso: full content
dc.publisher: Springer Nature
dc.revista: BMC Medical Informatics and Decision Making
dc.rights: open access
dc.rights.holder: The Author(s)
dc.rights.license: CC BY Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Natural language processing
dc.subject: Privacy
dc.subject: Named entity recognition
dc.subject: Corpus annotation
dc.subject.ddc: 610
dc.subject.dewey: Medicine and health
dc.title: A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish
dc.type: article
dc.volumen: 24
sipa.codpersvinculados: 1285723
sipa.codpersvinculados: 66497
sipa.codpersvinculados: 1207385
Files
Original bundle
Name: 12911_2024_Article_2609.pdf
Size: 1.54 MB
Format: Adobe Portable Document Format