A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish

Dunstan Escudero, Jocelyn Mariel; Vakili, Thomas; Miranda Huerta, Luis Alberto; Villena, Fabián; Aracena, Claudio; Quiroga Curin, Tamara Nancy; Vera, Paulina; Viteri Valenzuela, Sebastián; Rocco, Victor

dc.article.number	204
dc.catalogador	yvc
dc.contributor.author	Dunstan Escudero, Jocelyn Mariel
dc.contributor.author	Vakili, Thomas
dc.contributor.author	Miranda Huerta, Luis Alberto
dc.contributor.author	Villena, Fabián
dc.contributor.author	Aracena, Claudio
dc.contributor.author	Quiroga Curin, Tamara Nancy
dc.contributor.author	Vera, Paulina
dc.contributor.author	Viteri Valenzuela, Sebastián
dc.contributor.author	Rocco, Victor
dc.date.accessioned	2024-08-01T23:33:57Z
dc.date.available	2024-08-01T23:33:57Z
dc.date.issued	2024
dc.date.updated	2024-07-28T00:04:31Z
dc.description.abstract	Despite their high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus annotated to anonymize PII in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER), trained on referrals from primary care physicians, to identify diseases, body parts, and medications in this work-related text. We analyzed in detail the differences between the models and the gold standard curated by a physician. Moreover, we compared the performance of the NER model on the original narratives, on narratives where personal information has been masked, and on texts where the personal data is replaced by a similar surrogate value (pseudonymization). With this publication, we share the annotation guidelines and the annotated corpus.
dc.description.auspiciador	Stockholm University (Open Access Funding)
dc.description.funder	ANID Chile Fondo Basal, Centro de Excelencia FB210005 (CMM)
dc.description.funder	PhD Visiting Scholarship No. 2023 (TV)
dc.description.funder	Millennium Science Initiative Program ICN17_002 (IMFD)
dc.description.funder	ANID Fondecyt No. 1241825 (JD)
dc.description.funder	ANID National Doctoral Scholarship 21220200 (FV), 21211659 (CA) and 21220586 (TQ)
dc.description.funder	DataLEASH (TV) ACHS 304-2023
dc.format.extent	10 pages
dc.identifier.citation	BMC Medical Informatics and Decision Making. 2024, 24(1):204
dc.identifier.doi	10.1186/s12911-024-02609-w
dc.identifier.uri	https://doi.org/10.1186/s12911-024-02609-w
dc.identifier.uri	https://repositorio.uc.cl/handle/11534/87253
dc.information.autoruc	Escuela de Ingeniería; Dunstan Escudero, Jocelyn Mariel; S/I; 1285723
dc.information.autoruc	Escuela de Ingeniería; Miranda Huerta, Luis Alberto; S/I; 66497
dc.information.autoruc	Escuela de Ingeniería; Quiroga Curin, Tamara Nancy; S/I; 1207385
dc.language.iso	en
dc.nota.acceso	full content
dc.publisher	Springer Nature
dc.revista	BMC Medical Informatics and Decision Making
dc.rights	open access
dc.rights.holder	The Author(s)
dc.rights.license	CC BY Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Natural language processing
dc.subject	Privacy
dc.subject	Named entity recognition
dc.subject	Corpus annotation
dc.subject.ddc	610
dc.subject.dewey	Medicina y salud	es_ES
dc.title	A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish
dc.type	article
dc.volumen	24
sipa.codpersvinculados	1285723
sipa.codpersvinculados	66497
sipa.codpersvinculados	1207385

Files

Original bundle

Name: 12911_2024_Article_2609.pdf
Size: 1.54 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.98 KB
Format: Item-specific license agreed upon to submission

Collections

Journal articles