Development and application of natural language processing methods to medical causes of death for public health purposes

Published on 12 November 2019. Updated on 26 February 2024.

Introduction - Medical causes of death are recorded by physicians on death certificates as free text, using a wide variety of expressions. Natural language processing (NLP) methods make it possible to analyze these data quickly. This article describes the approach taken to develop such methods and illustrates their use for public health alerts.

Methods - High-performing methods were identified through an international challenge. Participating teams received a dataset of free-text descriptions paired with ICD-10 codes, the latter treated as the gold standard, and used it to develop their ICD-10 code prediction tools. The tools' performance was then evaluated independently on a test set. Some of these methods were also used to classify free-text causes into groups relevant to reactive mortality surveillance.

Results - The best results were obtained with neural networks on the U.S. dataset and with rule-based methods on the French dataset. A hybrid method combining rules with support vector machine (SVM) classification produced better or comparable results on both datasets. Analysis of trends over time in four cause groupings for reactive mortality surveillance highlighted both expected events, such as epidemics, and unusual ones.

Discussion - The experience of the challenge and its application to alert-oriented surveillance demonstrate the value and performance of NLP methods in supporting the reactive use of mortality data for public health.

Author(s): Robert Aude, Baghdadi Yasmine, Zweigenbaum Pierre, Morgand Claire, Grouin Cyril, Lavergne Thomas, Névéol Aurélie, Fouillet Anne, Rey Grégoire

Publishing year: 2019

Pages: 603-609

Weekly Epidemiological Bulletin, 2019, n° 29-30, p. 603-609
