Artificial intelligence in medicine

Discussion in 'Other health news and research' started by RedFox, Apr 11, 2023.

  1. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Although large language models (LLMs) have been the main innovation in AI in recent years, much of the last year has been spent on optimizing prompting strategies, basically how to get LLMs to reason over what they have learned from their training data. How you ask a question is critical to getting a valid answer; LLMs need to be directed on how to think.

    Microsoft published research today showing that GPT-4, steered largely through prompt optimization in the form of a strategy called Medprompt, achieves state-of-the-art results on all nine MultiMedQA benchmark datasets and surpasses 90% on the MedQA medical board exam questions for the first time.

    Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
    https://arxiv.org/abs/2311.16452

    Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.​

    [IMG]

    Also a blog post from Microsoft (who developed Medprompt) on their approach to prompt optimization: https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/. The same prompting approach also reaches passing grades on several other professional certification exams:

    [IMG]
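
    For a concrete sense of what "a composition of several prompting strategies" means in the abstract, here is a minimal sketch of a Medprompt-style pipeline: dynamic few-shot selection by embedding similarity, chain-of-thought exemplars, and choice-shuffling ensembling. The `embed` and `llm_answer` functions are hypothetical stand-ins, and the stored `cot` fields stand in for the self-generated chains of thought the paper describes; this is an illustration of the idea, not Microsoft's implementation.

```python
import random
from collections import Counter

def embed(text):
    # Hypothetical embedding step; a real pipeline would call an embedding
    # model. Here: a crude bag-of-words vector, purely for illustration.
    return Counter(text.lower().split())

def similarity(a, b):
    # Cosine similarity between two bag-of-words vectors.
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def llm_answer(prompt):
    # Hypothetical LLM call; a real pipeline would send the prompt to GPT-4
    # and parse the final answer letter out of its chain-of-thought reply.
    return random.choice("ABCD")

def medprompt_style_answer(question, options, train_set, k=5, ensembles=5):
    # 1) Dynamic few-shot: pick the k training questions most similar to the
    #    test question as exemplars (each carries a stored chain of thought).
    q_vec = embed(question)
    exemplars = sorted(train_set,
                       key=lambda ex: similarity(q_vec, embed(ex["question"])),
                       reverse=True)[:k]
    shots = "\n\n".join(
        f"Q: {ex['question']}\nLet's think step by step. {ex['cot']}\nAnswer: {ex['answer']}"
        for ex in exemplars)

    # 2) Chain-of-thought prompting with 3) choice-shuffling ensembling:
    #    shuffle the answer options on every call and majority-vote the result.
    votes = Counter()
    for _ in range(ensembles):
        shuffled = random.sample(options, len(options))
        letters = "ABCD"[:len(shuffled)]
        prompt = (shots + f"\n\nQ: {question}\n"
                  + "\n".join(f"{l}) {o}" for l, o in zip(letters, shuffled))
                  + "\nLet's think step by step, then give the answer letter.")
        letter = llm_answer(prompt)
        if letter in letters:
            votes[shuffled[letters.index(letter)]] += 1
    return votes.most_common(1)[0][0] if votes else None
```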
     
  2. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Towards Accurate Differential Diagnosis with Large Language Models
    https://arxiv.org/abs/2312.00164

    An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.


    The bottom tier isn't very useful, since all clinicians will make use of resources to assist in diagnosis, but the upper one is quite significant: the LLM alone (top-10 accuracy 59.1%) did better than LLM-assisted clinicians (51.7%).

    [IMG]
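
    For reference, the "top-10 accuracy" quoted above just measures how often the correct final diagnosis appears anywhere in the ranked differential list of up to ten entries. A minimal sketch of the metric (in the study itself, matches were judged by specialist raters rather than by string comparison, and the case names below are made up):

```python
def top_n_accuracy(cases, n=10):
    # Fraction of cases whose correct diagnosis appears among the first n
    # entries of the ranked differential list (case-insensitive match here).
    hits = sum(
        any(dx.lower() == case["correct"].lower() for dx in case["ddx"][:n])
        for case in cases)
    return hits / len(cases)

# Hypothetical example: the right diagnosis is listed in 2 of 3 cases.
cases = [
    {"correct": "Sarcoidosis", "ddx": ["Tuberculosis", "Sarcoidosis", "Lymphoma"]},
    {"correct": "Addison disease", "ddx": ["Hypothyroidism", "Anaemia"]},
    {"correct": "Coeliac disease", "ddx": ["Coeliac disease", "IBS"]},
]
print(top_n_accuracy(cases))  # 0.666...
```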
     
    Peter Trewhitt and mariovitali like this.
  3. Andy

    Andy Committee Member

    Messages:
    22,007
    Location:
    Hampshire, UK
    Is AI leading to a reproducibility crisis in science?

    "During the COVID-19 pandemic in late 2020, testing kits for the viral infection were scant in some countries. So the idea of diagnosing infection with a medical technique that was already widespread — chest X-rays — sounded appealing. Although the human eye can’t reliably discern differences between infected and non-infected individuals, a team in India reported that artificial intelligence (AI) could do it, using machine learning to analyse a set of X-ray images1.

    The paper — one of dozens of studies on the idea — has been cited more than 900 times. But the following September, computer scientists Sanchari Dhar and Lior Shamir at Kansas State University in Manhattan took a closer look [2]. They trained a machine-learning algorithm on the same images, but used only blank background sections that showed no body parts at all. Yet their AI could still pick out COVID-19 cases at well above chance level.

    The problem seemed to be that there were consistent differences in the backgrounds of the medical images in the data set. An AI system could pick up on those artefacts to succeed in the diagnostic task, without learning any clinically relevant features — making it medically useless."

    https://www.nature.com/articles/d41586-023-03817-6
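
    A toy demonstration of the shortcut described above, assuming numpy and scikit-learn are available: two classes whose images come from sources with slightly different background intensity, and a classifier trained on background-only pixels that still scores far above chance despite never seeing any anatomy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic "background corner" patches: class-1 images come from a source
# whose backgrounds are, on average, slightly brighter than class 0's.
labels = rng.integers(0, 2, size=n)
backgrounds = rng.normal(loc=0.50 + 0.02 * labels[:, None], scale=0.05, size=(n, 64))

X_train, X_test, y_train, y_test = train_test_split(
    backgrounds, labels, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("Accuracy on background-only pixels:", clf.score(X_test, y_test))
# Well above the 0.5 chance level, despite containing no clinical signal at all.
```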
     
  4. SNT Gatchaman

    SNT Gatchaman Senior Member (Voting Rights)

    Messages:
    4,506
    Location:
    Aotearoa New Zealand
    Eric Topol at TED: Can AI catch what doctors miss? (14 mins)

    "And in the medical community the thing that we don't talk much about are diagnostic medical errors."
     
  5. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Performance of Large Language Models on a Neurology Board–Style Examination
    JAMA Network Open: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620

    Key Points

    Question What is the performance of large language models on neurology board–style examinations?

    Findings In this cross-sectional study, a newer version of the large language model significantly outperformed the mean human score when given questions from a question bank approved by the American Board of Psychiatry and Neurology, answering 85.0% of questions correctly compared with the mean human score of 73.8%, while the older model scored below the human average (66.8%). Both models used confident or very confident language, even when incorrect.

    Meaning These findings suggest that with further refinements, large language models could have significant applications in clinical neurology.


    Abstract

    Importance
    Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.

    Objective To assess the performance of LLMs on neurology board–style examinations.

    Design, Setting, and Participants This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.

    Main Outcomes and Measures Overall percentage scores of 2 LLMs.

    Results LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2’s performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological–related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.

    Conclusions and Relevance Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2’s results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
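
    A minimal sketch of the kind of analysis described in the abstract, splitting accuracy by Bloom level and by whether repeated runs gave a reproducible answer. The record layout is a hypothetical illustration, not the authors' code:

```python
from collections import defaultdict

def summarise(records):
    # records: dicts with 'bloom' ('lower' or 'higher'), 'correct' (the keyed
    # answer) and 'runs' (the model's answers across repeated runs).
    by_bloom = defaultdict(lambda: [0, 0])        # level -> [correct, total]
    by_consistency = defaultdict(lambda: [0, 0])  # reproducible -> [correct, total]
    for r in records:
        graded = r["runs"][0]                      # grade the first-run answer
        reproducible = len(set(r["runs"])) == 1    # same answer on every run
        is_correct = graded == r["correct"]
        by_bloom[r["bloom"]][0] += is_correct
        by_bloom[r["bloom"]][1] += 1
        by_consistency[reproducible][0] += is_correct
        by_consistency[reproducible][1] += 1
    for level, (c, t) in by_bloom.items():
        print(f"{level}-order questions: {c}/{t} ({100 * c / t:.1f}%)")
    for repro, (c, t) in by_consistency.items():
        label = "reproducible" if repro else "inconsistent"
        print(f"{label} answers: {c}/{t} ({100 * c / t:.1f}%)")
```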
     
  6. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    ChatGPT outperforming human doctors in behavioral, cognitive, and psychological–related questions is both hilarious and not the least bit surprising. This is by far the weakest area in all medicine, and the gap will only grow wider.

    AIs don't read between the lines or pick up on thinly veiled language and its alternative meanings. Instead they weigh the bulk of what's out there, and the bulk of it disagrees with the "special menu" that we get served from with a wink and a smirk.
     
  7. tmrw

    tmrw Established Member (Voting Rights)

    Messages:
    54
    Location:
    Germany
    https://www.cnbc.com/2023/12/13/how-doctors-are-using-googles-new-ai-models-for-health-care.html

    One interesting point in this article:

    My bolding. So the one area where AI probably could actually make a difference for patients like pwME is the one the doctors are not interested in. Who would have thought?
     
    JemPD likes this.
  8. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Discovery of a structural class of antibiotics with explainable deep learning
    https://www.nature.com/articles/s41586-023-06887-8

    (Paragraph breaks added for legibility; seriously, what's up with academia and illegible walls of text?)

    The discovery of novel structural classes of antibiotics is urgently needed to address the ongoing antibiotic resistance crisis [1–9]. Deep learning approaches have aided in exploring chemical spaces [1,10–15]; these typically use black box models and do not provide chemical insights. Here we reasoned that the chemical substructures associated with antibiotic activity learned by neural network models can be identified and used to predict structural classes of antibiotics.

    We tested this hypothesis by developing an explainable, substructure-based approach for the efficient, deep learning-guided exploration of chemical spaces. We determined the antibiotic activities and human cell cytotoxicity profiles of 39,312 compounds and applied ensembles of graph neural networks to predict antibiotic activity and cytotoxicity for 12,076,365 compounds.

    Using explainable graph algorithms, we identified substructure-based rationales for compounds with high predicted antibiotic activity and low predicted cytotoxicity. We empirically tested 283 compounds and found that compounds exhibiting antibiotic activity against Staphylococcus aureus were enriched in putative structural classes arising from rationales.

    Of these structural classes of compounds, one is selective against methicillin-resistant S. aureus (MRSA) and vancomycin-resistant enterococci, evades substantial resistance, and reduces bacterial titres in mouse models of MRSA skin and systemic thigh infection. Our approach enables the deep learning-guided discovery of structural classes of antibiotics and demonstrates that machine learning models in drug discovery can be explainable, providing insights into the chemical substructures that underlie selective antibiotic activity.
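
    At a very high level, the screening step amounts to scoring a large compound library with model ensembles, keeping compounds with high predicted antibiotic activity and low predicted cytotoxicity, and grouping the survivors by the substructure rationale attached by the explainability step. A rough sketch of that filtering logic, with the ensemble predictors and rationale extraction stubbed out as hypothetical placeholders (the real work sits inside the graph neural networks, which are not reproduced here):

```python
from collections import defaultdict
from statistics import mean

def ensemble_score(models, compound):
    # Average the predictions of an ensemble of models; here any callables
    # mapping a compound representation to a probability will do.
    return mean(model(compound) for model in models)

def screen(compounds, activity_models, cytotox_models, rationale_fn,
           activity_cutoff=0.5, cytotox_cutoff=0.2):
    # Keep compounds predicted active and non-cytotoxic, grouped by the
    # substructure rationale assigned by the explainability step.
    classes = defaultdict(list)
    for c in compounds:
        activity = ensemble_score(activity_models, c)
        cytotox = ensemble_score(cytotox_models, c)
        if activity >= activity_cutoff and cytotox <= cytotox_cutoff:
            # rationale_fn stands in for the explainable graph algorithm that
            # identifies the substructure driving the high activity prediction.
            classes[rationale_fn(c)].append((c, activity, cytotox))
    return classes
```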
     
    Amw66 likes this.
  9. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Towards Conversational Diagnostic AI
    https://arxiv.org/abs/2401.05654

    At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
    AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts.

    We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE).

    The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
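
    The "superior performance on 28 of 32 axes" style of result is, operationally, a per-axis comparison of how the two arms were rated. A minimal sketch of that tally (the rating layout is hypothetical, and the paper's actual statistics are more careful than a simple comparison of means):

```python
def axes_where_a_rated_higher(ratings_a, ratings_b):
    # ratings_*: dict mapping an evaluation axis (e.g. "empathy") to the list
    # of rater scores for that arm. Returns the axes where arm A's mean is higher.
    wins = [axis for axis in ratings_a
            if sum(ratings_a[axis]) / len(ratings_a[axis])
             > sum(ratings_b[axis]) / len(ratings_b[axis])]
    return wins, len(ratings_a)
```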
     
    Sean likes this.
  10. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    Blog post from Google AI about the above paper:

    AMIE: A research AI system for diagnostic medical reasoning and conversations
    https://blog.research.google/2024/01/amie-research-ai-system-for-diagnostic.html

    Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on a LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

    [IMG]

    The comparison is somewhat limited in that, to account for the limitations of the AI system, participating PCPs only interacted through a text chat rather than the in-person conversation that is typical in health care. However, the benefits of remote, asynchronous conversation that isn't limited by physician time and clinic space are still massively significant, and it's only a matter of time before real-time conversation is available, more likely this year than not.
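
    On the training side, the "self-play based simulated environment with automated feedback" described in the quote is essentially an LLM playing both doctor and patient, with a critic pass scoring each dialogue so the next round of training can improve on it. A minimal sketch of one such round, where `llm` is a hypothetical text-completion callable rather than Google's actual system:

```python
def self_play_round(llm, condition_vignette, turns=6):
    # Simulate one doctor-patient dialogue for a given condition vignette,
    # then have a critic pass rate it for use as automated training feedback.
    dialogue = []
    for _ in range(turns):
        doctor = llm("You are the doctor. Continue the consultation by asking "
                     "one question or giving your assessment.\n" + "\n".join(dialogue))
        dialogue.append("Doctor: " + doctor)
        patient = llm("You are a patient with: " + condition_vignette +
                      ". Answer the doctor's last question realistically.\n" +
                      "\n".join(dialogue))
        dialogue.append("Patient: " + patient)
    critique = llm("Rate this consultation for history-taking, diagnostic "
                   "accuracy, management reasoning and empathy, and suggest "
                   "improvements:\n" + "\n".join(dialogue))
    return dialogue, critique

# The critiques are what gets folded back into the next training iteration,
# which is the "automated feedback" part of the loop.
```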
     
    tmrw, Sean and Amw66 like this.
  11. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    12,531
    Location:
    Canada
    US FDA clears DermaSensor's AI-powered skin cancer detecting device
    https://www.reuters.com/business/he...ered-skin-cancer-detecting-device-2024-01-17/

    The FDA clearance is based on a study which showed that the device had a 96% sensitivity in detecting skin cancers. A negative result through the device had a 97% chance of being benign, according to the company.

    When brought in contact with skin, the device emits light and captures the wavelengths of light reflecting off cellular structures beneath the skin's surface.

    It subsequently utilizes an algorithm to analyze the reflected light and detect the presence of skin cancer.
    ...
    Company CEO Cody Simmons said the device will be priced through a subscription model at $199 a month for five patients or $399 a month for unlimited use.

    DermaSensor is currently commercially available in Europe and Australia.
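
    The two figures quoted are different quantities: 96% sensitivity (of all true cancers, the fraction the device flags) and a 97% chance that a lesion with a negative result is benign, which is a negative predictive value. A minimal sketch of each, using made-up counts purely for illustration:

```python
def sensitivity(tp, fn):
    # Of all true cancers, what fraction did the device flag as positive?
    return tp / (tp + fn)

def negative_predictive_value(tn, fn):
    # Of all negative device results, what fraction were truly benign?
    return tn / (tn + fn)

# Hypothetical counts (not the trial's data): 100 cancers, 200 benign lesions.
tp, fn = 96, 4     # cancers flagged / missed
tn, fp = 130, 70   # benign lesions correctly cleared / falsely flagged
print(f"Sensitivity: {sensitivity(tp, fn):.0%}")                # 96%
print(f"NPV:         {negative_predictive_value(tn, fn):.0%}")  # 97%
```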
     
    tmrw and Trish like this.
