Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations

dc.article.number: 110076
dc.contributor.author: Baez, Pablo
dc.contributor.author: Campillos-Llanos, Leonardo
dc.contributor.author: Nunez, Fredy
dc.contributor.author: Dunstan, Jocelyn
dc.date.accessioned: 2024-08-01T08:00:09Z
dc.date.available: 2024-08-01T08:00:09Z
dc.date.issued: 2024
dc.description.abstract: Entity normalization is a common strategy to resolve ambiguities by mapping all the synonym mentions to a single concept identifier in standard terminology. Normalizing medical entities is challenging, especially for languages other than English, where lexical variation is considerably under-represented. Here, we report a new linguistic resource for medical entity normalization in Spanish. We applied a UMLS-based medical lexicon (MedLexSp) to automatically normalize mentions from 2000 medical referrals of the Chilean Waiting List Corpus. Three medical students manually revised the automatic normalization. The inter-coder agreement was computed, and the distribution of concepts, errors, and linguistic sources of variation was analyzed. The automatic method normalized 52% of the mentions, compared to 91% after manual revision. The lowest agreement between automatic and automatic-manual normalization was observed for Finding, Disease, and Procedure entities. Errors in normalization were associated with ortho-typographic, semantic, and grammatical linguistic inadequacies, mainly of the hyponymy/hyperonymy, polysemy/metonymy, and acronym-abbreviation types. This new resource can enrich dictionaries and lexicons with new mentions to improve the functioning of modern entity normalization methods. The linguistic analysis offers insight into the sources of lexical variety in the Spanish clinical environment related to error generation using lexicon-based normalization methods. This article also introduces a workflow that can serve as a benchmark for comparison in studies replicating our analysis in Romance languages.
dc.description.funder: ANID FONDECYT
dc.description.funder: ANID
dc.format.extent: 14 pages
dc.fuente.origen: WOS
dc.identifier.doi: 10.1007/s10579-024-09755-7
dc.identifier.eissn: 1574-0218
dc.identifier.issn: 1574-020X
dc.identifier.scopusid: SCOPUS_ID:85194494230
dc.identifier.uri: https://doi.org/10.1007/s10579-024-09755-7
dc.identifier.uri: https://repositorio.uc.cl/handle/11534/87237
dc.identifier.wosid: WOS:001260434000001
dc.information.autoruc: Facultad de Letras; Núñez Torres, Fredy Rodrigo; S/I; 157277
dc.issue.numero: 3
dc.language.iso: en
dc.nota.acceso: No attachment
dc.pagina.final: 516
dc.pagina.inicio: 489
dc.revista: LANGUAGE RESOURCES AND EVALUATION
dc.rights: bibliographic record
dc.subject: Clinical text
dc.subject: Entity linking
dc.subject: Lexical variation
dc.subject: Linguistic resources
dc.subject: Medical lexicon
dc.subject: Normalization
dc.subject.ddc: 370
dc.subject.dewey: Education
dc.title: Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations
dc.type: article
dc.volumen: 27
sipa.codpersvinculados: 157277
sipa.index: WOS
sipa.trazabilidad: WOS-SCOPUS load; 01-08-2024