Microsoft Releases Convincing Case Study Showing Chain of Thought (CoT) with GPT 4 Versus Fine Tuned Models via Medprompt and CoT Prompting Strategies

https://arxiv.org/pdf/2311.16452

A great read. I'll pull out the important parts.

November 2023

https://preview.redd.it/cyf6y5fubl3c1.png?width=1059&format=png&auto=webp&s=2a1b559ebfdd0900ab7dc84d3dc7088470b3bb2a

Figure 1: (a) Comparison of performance on MedQA. (b) GPT-4 with Medprompt achieves SoTA on a wide range of medical challenge questions.

A core metric for characterizing the performance of foundation models is the accuracy of next word prediction. Accuracy with next word prediction is found to increase with scale in training data, model parameters, and compute, in accordance with empirically derived “neural model scaling laws” [3, 12]. However, beyond predictions of scaling laws on basic measures such as next word prediction, foundation models show the sudden emergence of numerous problem-solving capabilities at different thresholds of scale [33, 27, 24].

Despite the observed emergence of sets of general capabilities, questions remain about whether truly exceptional performance can be achieved on challenges within specialty areas like medicine in the absence of extensive specialized training or fine-tuning of the general models. Most explorations of foundation model capability on biomedical applications rely heavily on domain- and task-specific fine-tuning. With first-generation foundation models, the community found an unambiguous advantage with domain-specific pretraining, as exemplified by popular models in biomedicine such as PubMedBERT [10] and BioGPT [19]. But it is unclear whether this is still the case with modern foundation models pretrained at much larger scale.

We present results and methods of a case study on steering GPT-4 to answer medical challenge questions with innovative prompting strategies. We include a consideration of best practices for studying prompting in an evaluative setting, including the holding out of a true eyes-off evaluation set. We discover that GPT-4 indeed possesses deep specialist capabilities that can be evoked via prompt innovation. The performance was achieved via a systematic exploration of prompting strategies. As a design principle, we chose to explore prompting strategies that were inexpensive to execute and not customized for our benchmarking workload. We converged on a top prompting strategy for GPT-4 for medical challenge problems, which we refer to as Medprompt. Medprompt unleashes medical specialist skills in GPT-4 in the absence of expert crafting, easily topping existing benchmarks for all standard medical question-answering datasets. The approach outperforms GPT-4 with the simple prompting strategy and state-of-the-art specialist models such as Med-PaLM 2 by large margins. On the MedQA dataset (USMLE exam), Medprompt produces a 9 absolute point gain in accuracy, surpassing 90% for the first time on this benchmark.

As part of our investigation, we undertake a comprehensive ablation study that reveals the relative significance of the contributing components of Medprompt. We discover that a combination of methods, including in-context learning and chain-of-thought, can yield synergistic effects. Perhaps most interestingly, we find that the best strategy for steering a generalist model like GPT-4 to excel on the medical specialist workload that we study is to use a generalist prompt. We find that GPT-4 benefits significantly from being allowed to design its prompt, specifically by coming up with its own chain of thought to be used for in-context learning. This observation echoes other reports that GPT-4 has an emergent self-improving capability via introspection, such as self-verification [9].

>>> Extractions from [9] https://openreview.net/pdf?id=SBbJICrglS Published: 20 Jun 2023, Last Modified: 19 Jul 2023 <<<

https://preview.redd.it/wb3kj4btbl3c1.png?width=1027&format=png&auto=webp&s=0268c29e1f8bbeb898577bd712fdfa1042fb5d7d

Experiments on various clinical information extraction tasks and various LLMs, including ChatGPT (GPT-4) (OpenAI, 2023) and ChatGPT (GPT-3.5) (Ouyang et al., 2022), show the efficacy of self-verification (SV). In addition to improving accuracy, we find that the extracted interpretations match human judgements of relevant information, enabling auditing by a human and helping to build a path towards trustworthy extraction of clinical information in resource-constrained scenarios.

Fig. 1 shows the four different steps of the introduced SV pipeline. The pipeline takes in a raw text input, e.g. a clinical note, and outputs information in a pre-specified format, e.g. a bulleted list. It consists of four steps, each of which calls the same LLM with different prompts in order to refine and ground the original output. The original extraction step uses a task-specific prompt which instructs the model to output a variable-length bulleted list. In the toy example in Fig. 1, the goal is to identify the two diagnoses Hypertension and Right adrenal mass, but the original extraction step finds only Hypertension. After the original LLM extraction, the Omission step finds missing elements in the output; in the Fig. 1 example it finds Right adrenal mass and Liver fibrosis. For tasks with long inputs (mean input length greater than 2,000 characters), we repeat the omission step to find more potential missed elements (we repeat five times, and continue repeating until the omission step stops finding new omissions).
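
To make the extraction-plus-omission loop concrete, here is a minimal Python sketch of that part of the pipeline. The `call_llm` helper, the prompt wording, and the bullet parsing are assumptions for illustration rather than the authors' exact prompts, and the pipeline's remaining grounding/pruning steps are omitted.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat-completion API."""
    raise NotImplementedError


def parse_bullets(text: str) -> list[str]:
    # Each extracted element is expected on its own bulleted line.
    return [line.lstrip("-* ").strip() for line in text.splitlines() if line.strip()]


def sv_extract(note: str, task_prompt: str, max_omission_rounds: int = 5) -> list[str]:
    # Step 1: original extraction as a variable-length bulleted list.
    extracted = parse_bullets(call_llm(f"{task_prompt}\n\nInput:\n{note}"))

    # Step 2: omission step, repeated (up to five times for long inputs) until
    # the model stops finding new missed elements.
    for _ in range(max_omission_rounds):
        omission_prompt = (
            f"Input:\n{note}\n\nExtracted so far:\n"
            + "\n".join(f"- {e}" for e in extracted)
            + "\n\nList any elements that were missed, one per line, or reply NONE."
        )
        missed = [m for m in parse_bullets(call_llm(omission_prompt))
                  if m.upper() != "NONE" and m not in extracted]
        if not missed:
            break
        extracted.extend(missed)

    # The remaining SV steps that refine and ground the output are omitted here.
    return extracted
```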

3. Results

3.1. Self-verification improves prediction performance

Table 2 shows the results for clinical extraction performance with and without self-verification. Across different models and tasks, SV consistently provides a performance improvement. The performance improvement is occasionally quite large (e.g. ChatGPT (GPT-4) shows more than a 0.1 improvement in F1 for clinical trial arm extraction and more than a 0.3 improvement for medication status extraction), and the average F1 improvement across models and tasks is 0.056. We also compare to a baseline where we concatenate the prompts across different steps into a single large prompt which is then used to make a single LLM call for information extraction. We find that this large-prompt baseline performs slightly worse than the baseline reported in Table 2, which uses a straightforward prompt for extraction (see comparison details in Table A5).

<<< Reference [9] end >>>

2.2 Prompting Strategies

Prompting in the context of language models refers to the input given to a model to guide the output that it generates. Empirical studies have shown that the performance of foundation models on a specific task can be heavily influenced by the prompt, often in surprising ways. For example, recent work shows that model performance on the GSM8K benchmark dataset can vary by over 10% without any changes to the model’s learned parameters [35]. Prompt engineering refers to the process of developing effective prompting techniques that enable foundation models to better solve specific tasks. Here, we briefly introduce a few key concepts that serve as building blocks for our Medprompt approach.

Chain of Thought (CoT) is a prompting methodology that employs intermediate reasoning steps prior to introducing the final answer [34]. By breaking down complex problems into a series of smaller steps, CoT is thought to help a foundation model to generate a more accurate answer. CoT ICL prompting integrates the intermediate reasoning steps of CoT directly into the few-shot demonstrations. As an example, in the Med-PaLM work, a panel of clinicians was asked to craft CoT prompts tailored for complex medical challenge problems [29]. Building on this work, we explore in this paper the possibility of moving beyond reliance on human specialist expertise to mechanisms for generating CoT demonstrations automatically using GPT-4 itself. As we shall describe in more detail, we can do this successfully by providing [question, correct answer] pairs from a training dataset. We find that GPT-4 is capable of autonomously generating high-quality, detailed CoT prompts, even for the most complex medical challenges.
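
As a rough illustration of what CoT ICL prompting looks like mechanically, the sketch below assembles few-shot demonstrations that each carry their own chain of thought before the answer. The field names and formatting are hypothetical, not the paper's templates.

```python
def format_cot_exemplar(question: str, choices: list[str], cot: str, answer: str) -> str:
    # One few-shot demonstration: question, options, chain of thought, then the answer.
    opts = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nReasoning: {cot}\nAnswer: {answer}"


def build_cot_icl_prompt(exemplars: list[dict], question: str, choices: list[str]) -> str:
    # Concatenate the CoT demonstrations, then pose the test question and stop at
    # "Reasoning:" so the model produces its own chain of thought before answering.
    demos = "\n\n".join(
        format_cot_exemplar(e["question"], e["choices"], e["cot"], e["answer"])
        for e in exemplars
    )
    opts = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"{demos}\n\nQuestion: {question}\n{opts}\nReasoning:"
```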

Self-Generated Chain of Thought

https://preview.redd.it/47qku12dcl3c1.png?width=820&format=png&auto=webp&s=a8e3a393e92e7dac8acdd5b25310933f72d38788

Chain-of-thought (CoT) [34] uses natural language statements, such as “Let’s think step by step,” to explicitly encourage the model to generate a series of intermediate reasoning steps. The approach has been found to significantly improve the ability of foundation models to perform complex reasoning. Most approaches to chain-of-thought center on the use of experts to manually compose few-shot examples with chains of thought for prompting [30]. Rather than rely on human experts, we pursued a mechanism to automate the creation of chain-of-thought examples. We found that we could simply ask GPT-4 to generate chain-of-thought for the training examples using the following prompt:

https://preview.redd.it/irfh2hnkcl3c1.png?width=907&format=png&auto=webp&s=fbc6d4d6749b630658de932a80a4bd4b7b97d003

A key challenge with this approach is that self-generated CoT rationales have an implicit risk of including hallucinated or incorrect reasoning chains. We mitigate this concern by having GPT-4 generate both a rationale and an estimation of the most likely answer to follow from that reasoning chain. If this answer does not match the ground truth label, we discard the sample entirely, under the assumption that we cannot trust the reasoning. While hallucinated or incorrect reasoning can still yield the correct final answer (i.e. false positives), we found that this simple label-verification step acts as an effective filter for false negatives.
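
A minimal sketch of that generate-and-verify filter, reusing the hypothetical `call_llm` helper from the earlier sketch; the prompt wording and answer parsing here are illustrative assumptions, while the paper's own generation prompt appears in the screenshot above.

```python
def generate_verified_cot(question: str, choices: list[str], gold_answer: str) -> str | None:
    # Ask the model for step-by-step reasoning plus its own predicted answer.
    opts = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"Question: {question}\n{opts}\n"
        "Think step by step, then finish with a line of the form 'Answer: <letter>'."
    )
    completion = call_llm(prompt)  # hypothetical helper from the earlier sketch
    rationale, found, answer_line = completion.rpartition("Answer:")
    predicted = answer_line.strip()[:1].upper()
    # Keep the rationale only when the model's own answer matches the ground-truth
    # label; otherwise discard the sample, since the reasoning cannot be trusted.
    if not found or predicted != gold_answer.strip().upper():
        return None
    return rationale.strip()
```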

We observe that, compared with the CoT examples used in Med-PaLM 2 [30], which are handcrafted by clinical experts, CoT rationales generated by GPT-4 are longer and provide finer-grained step-by-step reasoning logic. Concurrent with our study, recent works [35, 7] also find that foundation models write better prompts than experts do.

https://preview.redd.it/lcb8lae1dl3c1.png?width=904&format=png&auto=webp&s=c321e625136360622a254d41852a3980b60de624

Medprompt combines intelligent few-shot exemplar selection, self-generated chain of thought steps, and a majority vote ensemble, as detailed above in Sections 4.1, 4.2, and 4.3, respectively. The composition of these methods yields a general purpose prompt-engineering strategy. A visual depiction of the performance of the Medprompt strategy on the MedQA benchmark, with the additive contributions of each component, is displayed in Figure 4. We provide a corresponding algorithmic description in Algorithm 1.

Medprompt consists of two stages: a preprocessing phase and an inference step, where a final prediction is produced on a test case.

Algorithm 1 Algorithmic specification of Medprompt, corresponding to the visual representation of the strategy in Figure 4.

We note that, while Medprompt achieves record performance on medical benchmark datasets, the algorithm is general purpose and is not restricted to the medical domain or to multiple choice question answering. We believe the general paradigm of combining intelligent few-shot exemplar selection, self-generated chain of thought reasoning steps, and majority vote ensembling can be broadly applied to other problem domains, including less constrained problem solving tasks (see Section 5.3 for details on how this framework can be extended beyond multiple choice questions).
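
Putting the pieces together, here is a schematic sketch of the two Medprompt stages described above, reusing the hypothetical helpers from the earlier sketches (`call_llm`, `build_cot_icl_prompt`, `generate_verified_cot`). The embedding call, `k`, and the ensemble size are illustrative defaults, not the paper's exact configuration.

```python
import collections
import math
import random


def embed(text: str) -> list[float]:
    """Hypothetical embedding call; replace with your provider's embedding API."""
    raise NotImplementedError


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def preprocess(training_set: list[dict]) -> list[dict]:
    # Preprocessing stage: keep only training examples whose self-generated chain of
    # thought leads to the correct answer, and store an embedding of each question.
    store = []
    for ex in training_set:
        cot = generate_verified_cot(ex["question"], ex["choices"], ex["answer"])
        if cot is not None:
            store.append({**ex, "cot": cot, "embedding": embed(ex["question"])})
    return store


def predict(question: str, choices: list[str], store: list[dict],
            k: int = 5, ensemble_size: int = 5) -> str:
    # Inference stage: select the k most similar exemplars for the few-shot context.
    query = embed(question)
    neighbors = sorted(store, key=lambda ex: cosine(ex["embedding"], query),
                       reverse=True)[:k]

    votes = collections.Counter()
    for _ in range(ensemble_size):
        # Shuffle the answer options to reduce position bias, then map the model's
        # letter choice back to the corresponding option text before voting.
        shuffled = random.sample(choices, len(choices))
        prompt = build_cot_icl_prompt(neighbors, question, shuffled)
        letter = call_llm(prompt).rpartition("Answer:")[2].strip()[:1].upper()
        idx = ord(letter) - ord("A")
        if 0 <= idx < len(shuffled):
            votes[shuffled[idx]] += 1

    # Majority vote across the ensemble of shuffled runs gives the final prediction.
    return votes.most_common(1)[0][0]
```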

Results

https://preview.redd.it/jeckyxlvdl3c1.png?width=766&format=png&auto=webp&s=844c8c890a2c0025776dca2c95fa8919ffbc94c1

By harnessing the prompt engineering methods described in Section 4 and their effective combination as Medprompt, GPT-4 achieves state-of-the-art performance on every one of the nine benchmark datasets in MultiMedQA.

submitted by /u/Xtianus21