
PASCLex: A comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon derived from electronic health record clinical notes, 2022, Wang et al.

Discussion in 'Long Covid research' started by CRG, Apr 5, 2023.

  1. CRG

    CRG Senior Member (Voting Rights)

    Journal of Biomedical Informatics

    PASCLex: A comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon derived from electronic health record clinical notes

    Liqin Wang, Dinah Foer, Erin MacPhaul, Ying-Chih Lo, David W. Bates, Li Zhou



    Objective

    To develop a comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon (PASCLex) from clinical notes to support PASC symptom identification and research.

    Methods

    We identified 26,117 COVID-19 positive patients from the Mass General Brigham’s electronic health records (EHR) and extracted 328,879 clinical notes from their post-acute infection period (day 51–110 from first positive COVID-19 test). PASCLex incorporated Unified Medical Language System® (UMLS) Metathesaurus concepts and synonyms based on selected semantic types. The MTERMS natural language processing (NLP) tool was used to automatically extract symptoms from a development dataset. The lexicon was iteratively revised with manual chart review, keyword search, concept consolidation, and evaluation of NLP output. We assessed the comprehensiveness of PASCLex and the NLP performance using a validation dataset and reported the symptom prevalence across the entire corpus.

    Results

    PASCLex included 355 symptoms consolidated from 1520 UMLS concepts of 16,466 synonyms. NLP achieved an averaged precision of 0.94 and an estimated recall of 0.84. Symptoms with the highest frequency included pain (43.1%), anxiety (25.8%), depression (24.0%), fatigue (23.4%), joint pain (21.0%), shortness of breath (20.8%), headache (20.0%), nausea and/or vomiting (19.9%), myalgia (19.0%), and gastroesophageal reflux (18.6%).

    Discussion and conclusion

    PASC symptoms are diverse. A comprehensive lexicon of PASC symptoms can be derived using an ontology-driven, EHR-guided and NLP-assisted approach. By using unstructured data, this approach may improve identification and analysis of patient symptoms in the EHR, and inform prospective study design, preventative care strategies, and therapeutic interventions for patient care.

    Open: https://www.sciencedirect.com/science/article/pii/S153204642100280X
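
    For readers who want a feel for what lexicon-based symptom extraction involves, here is a minimal sketch. It is not MTERMS and not the authors' pipeline: the toy lexicon, the function names, the plain regex matching, and the choice to count prevalence per patient are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon (NOT taken from PASCLex):
# consolidated symptom -> synonyms that should map to it.
LEXICON = {
    "fatigue": ["fatigue", "tiredness", "exhaustion"],
    "shortness of breath": ["shortness of breath", "dyspnea", "sob"],
    "headache": ["headache", "cephalalgia"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word regex per consolidated symptom."""
    patterns = {}
    for symptom, synonyms in lexicon.items():
        # Longest synonyms first so multi-word phrases are tried before short ones.
        alternatives = sorted((re.escape(s) for s in synonyms), key=len, reverse=True)
        patterns[symptom] = re.compile(
            r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE
        )
    return patterns


def extract_symptoms(note, patterns):
    """Return the set of consolidated symptoms mentioned anywhere in one note."""
    return {symptom for symptom, pattern in patterns.items() if pattern.search(note)}


def prevalence(notes_by_patient, patterns):
    """Fraction of patients with at least one mention of each symptom (assumed unit)."""
    counts = defaultdict(int)
    for notes in notes_by_patient.values():
        found = set()
        for note in notes:
            found |= extract_symptoms(note, patterns)
        for symptom in found:
            counts[symptom] += 1
    total = len(notes_by_patient)
    return {symptom: count / total for symptom, count in counts.items()}


def precision_recall(predicted, gold):
    """Compare extracted symptoms against a manually reviewed (gold) set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

    A sketch like this ignores negation ("denies fatigue"), context, and who actually reported the symptom, so it only ever sees what the clinician chose to write down.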
  2. rvallee

    rvallee Senior Member (Voting Rights)

    The way language is used here is still very problematic. IMO anxiety and depression should be considered symptoms, but they are also commonly used as shorthand for literally anything. In most cases they simply stand for a collection of symptoms, which makes them redundant here unless properly defined, and in some cases they stand in for all the other symptoms. Sometimes clinicians add them just because they feel it. Confused raw data makes for very confusing analysis.

    This looks like a smart use of NLP, but if there's GI there will be a lot of GO. It will remain very difficult to do anything with language when it is so commonly weaponized, distorted or otherwise modified. I am really looking forward to the transformative potential of AI to summarize giant sets of raw data. That is, as long as it is actually raw data, straight from the patients' accounts, and not the selective recording that clinicians do.

    That said, this does fully validate the initial patient-led studies that were mocked in part for having so many symptoms. I actually wonder how common this is: how often, even in common medical conditions, we would find a large difference between the symptoms that are recorded and the actual lived reality of patients. I frankly assume it is the case in literally every single disease or condition, though to a varying degree.

    Doing a quick skim, they even use "fatigue" as shorthand for CFS. There is a LOT of missing data here, very lossy compression. Still, with better tools using LLMs (e.g. ChatGPT) this will be more useful over time.
  3. RedFox

    RedFox Senior Member (Voting Rights)

    I knew you'd have a thorough criticism of this. Any research based on health records will be biased through the lens of doctors, which means some symptoms will be missed or recorded as something else.
