function ResearchAloe() {
    return <section className="container section">
        <h2 className="is-size-2 mb-2">Motivation</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
            The aim of this project is to develop high-quality language models tailored to the medical domain, capable of providing accurate and reliable answers to complex medical questions. These models are designed to be open, free to use, multilingual, and aligned with ethical principles, ensuring their accessibility and trustworthiness.
            <br /><br />
            The medical domain represents one of the most promising frontiers for AI advancement, with the potential to revolutionize healthcare delivery, accelerate medical research, and improve patient outcomes worldwide.  
            Through foundational LLMs, AI can process vast amounts of complex medical text and provide users with an intuitive conversational interface. These models address critical challenges in healthcare by supporting decision-making, improving efficiency, and enabling broader access to medical expertise.
            <br /><br />
            While similar models exist, many fall short in openness or accessibility. Med-PaLM-2, for example, is a promising model but remains entirely private. The absence of openly available, high-quality medical LLMs limits their impact on healthcare, particularly in under-resourced settings where clinical experts are scarce. By developing open and specialized LLMs, we aim to bridge this gap, creating tools that empower healthcare providers and improve outcomes worldwide.
             </p>
        </div>
        <h2 className="is-size-2 mb-2">Aloe Family</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
            The Aloe family is a coordinated effort to push for more accurate open-source medical models. It currently comprises two model versions:

                <ul> 
                    <li><strong>Aloe Alpha</strong> (April 2024): Built on top of Llama-3, Aloe Alpha marks a key milestone in our journey. Through extensive testing and refinement of our training strategy, we achieved state-of-the-art results at the time of release. We released a single version, with <a href="https://huggingface.co/HPAI-BSC/Llama3-Aloe-8B-Alpha">8B parameters</a>.</li> 
                    <li><strong>Aloe Beta</strong> (November-December 2024): Building on the success of Aloe Alpha, Aloe Beta uses a significantly expanded training set and improvements in model efficiency. It consists of four models of different sizes built on two leading model families: Llama3.1 and Qwen2.5. We released <a href="https://huggingface.co/HPAI-BSC/Qwen2.5-Aloe-Beta-7B">7B</a>, <a href="https://huggingface.co/HPAI-BSC/Llama3.1-Aloe-Beta-8B">8B</a>, <a href="https://huggingface.co/HPAI-BSC/Llama3.1-Aloe-Beta-70B">70B</a>, and <a href="https://huggingface.co/HPAI-BSC/Qwen2.5-Aloe-Beta-72B">72B</a> versions, all following the same training recipe and achieving state-of-the-art results across the full range of model sizes.</li>
                </ul>
                
            </p>
        </div>

        <h2 className="is-size-2 mb-2">Training recipe</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
                Our method follows a structured, multi-step approach designed to create robust and reliable models for the healthcare domain:
                <ol type="1">
                    <li><strong>Data preprocessing:</strong> Our approach starts by curating a diverse training dataset that includes both expert-curated medical datasets and synthetically generated data. This dataset is tailored to enhance the model's versatility, covering a range of tasks crucial for clinical applications.</li>
                    <li><strong>Supervised Fine-Tuning:</strong> Large volumes of formatted healthcare data are used to enrich the model's representation of medical concepts and to align its output behavior with that of a helpful assistant. This step is essential to adapting the model to the intricacies of the healthcare domain.</li>
                    <li><strong>Model Merging:</strong> In the subsequent phase we employ model merging techniques to integrate the learned representations of models with analogous architectures. This process, which combines parameter sets rather than adding parameters, aims to leverage the strengths of diverse models, mitigating individual model biases and increasing robustness.</li>
                    <li><strong>Model Alignment:</strong> The last step involves training the model to produce responses that are fair, accurate, and safe for use in healthcare settings, explicitly addressing risks related to bias, toxicity, and other harms. To maximize the efficiency and impact of this last effort, we use a two-stage Direct Preference Optimization (DPO) training process. In the first stage, medical preference, general preference, and safety data are combined. In the second stage, DPO is conducted exclusively on a custom red-teaming dataset.</li>
                </ol>

                <br /><br />
                <center>
                    <embed src="/images/work/methods_aloe.png"  className="responsive-image"/>
                    Aloe Beta Training Pipeline: An overview of the sequential stages.
                </center>
                <br /><br />
            </p>
        </div>

        <h2 className="is-size-2 mb-2">In-Context Learning</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
        
                In-Context Learning (ICL) is a technique that enhances the performance of large language models by modifying the input context to improve output accuracy. Instead of retraining the model, ICL adjusts the way information is presented to guide the model's predictions more effectively.<br /><br />

                To maximize the performance of Aloe, we incorporate ICL through a method known as <a href="https://arxiv.org/abs/2311.16452">Medprompt</a>. This Retrieval-Augmented Generation (RAG) system is designed for the healthcare domain and combines strategies such as Chain-of-Thought (CoT), self-consistency, and choice shuffling. These techniques improve the model's reasoning abilities and overall reliability on complex medical tasks. <br /><br />

                Our Medprompt implementation uses a custom database built from the MedMCQA and MedQA datasets, containing 190K examples generated with Llama-3.1-70B-Instruct. By retrieving the most relevant examples based on their embeddings, we enrich the model's prompts, improving the accuracy and explainability of the generated responses. For the reported In-Context Learning evaluations, we use the SFR-Embedding-Mistral model to select the five most relevant examples from the database, which are then added to the prompt. We generate 20 completions with choice shuffling, and the final answer is determined by majority voting. <br /><br />

                A dedicated repository for running these Medprompt experiments is available on our <a href="https://github.com/HPAI-BSC/prompt_engine">GitHub</a>. <br /><br />
                <center>
                    <embed src="/images/work/medprompt_diagram.png"  className="responsive-image"/>
                    Diagram of the Medprompt-based prompt strategy. K refers to the number of few-shot examples included in the prompt.
                </center>  
            </p>
        </div>
        <h2 className="is-size-2 mb-2">Aloe Beta</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
            Aloe Beta represents the next step in our mission to create open and high-quality medical language models tailored for healthcare applications. Building on the success of Aloe Alpha, Aloe Beta introduces significant advancements in training methodology, dataset curation, and model architecture. It delivers state-of-the-art performance across multiple benchmarks, setting a new standard for accessibility and quality in medical LLMs. 
            The result is a family of models specifically optimized for healthcare, in four sizes (7B, 8B, 70B, and 72B parameters), built on top of Llama3.1 and Qwen2.5 base models, and available on Hugging Face:
            <br /><br />
            <ul>
                <li><a href="https://huggingface.co/HPAI-BSC/Qwen2.5-Aloe-Beta-7B">Qwen2.5-Aloe-Beta-7B</a></li>
                <li><a href="https://huggingface.co/HPAI-BSC/Llama3.1-Aloe-Beta-8B">Llama3.1-Aloe-Beta-8B</a></li>
                <li><a href="https://huggingface.co/HPAI-BSC/Llama3.1-Aloe-Beta-70B">Llama3.1-Aloe-Beta-70B</a></li>
                <li><a href="https://huggingface.co/HPAI-BSC/Qwen2.5-Aloe-Beta-72B">Qwen2.5-Aloe-Beta-72B</a></li>

            </ul>
            </p>
        </div>
        <h2 className="is-size-3 mb-5">Data</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
            This model version uses a greatly expanded training dataset, curated to enhance its medical expertise. The dataset contains a total of <strong>1.62 million</strong> medical instructions spanning <strong>20 distinct medical tasks</strong>, such as question answering, diagnosis, and summarization. To ensure both depth and breadth, the data comes from three key sources:
            <ul>
                <li><strong>Curated Medical Data:</strong> These datasets are sourced directly from reputable, expert-curated healthcare sources, ensuring the inclusion of highly specific and reliable medical information. While these sources offer high fidelity, their volume is inherently limited.</li>
                <li><strong>Synthetically Enhanced Medical Datasets:</strong> To overcome the volume limitations of the human-curated medical data, we augment the dataset with data extended via LLMs. This approach maximizes the training sample size, but careful design is needed to guarantee the quality of the generated data. We used Llama-3.1-70B-Instruct to generate the synthetic data.</li>
                <li><strong>General-Purpose Datasets:</strong> To mitigate the risks of catastrophic forgetting and model collapse, a carefully selected subset of general-purpose datasets is incorporated. These datasets, which are not specific to healthcare, ensure that the model retains its proficiency in general language understanding and instruction following.</li>    
            </ul> 
            <br />
            To further enhance the model's safety and reliability, we curated a dedicated alignment dataset. The purpose of the model alignment phase is to bias the LLM's outputs towards a desirable form and style. The alignment data consists of 262K instructions, focusing on three main topics to address diverse aspects of user preferences:
            <ul>
                <li><strong>Medical preference datasets</strong>: to align the responses with preferences in the healthcare domain.</li>
                <li><strong>Human preference datasets</strong>: to align the responses with general social preferences, mitigating dangerous outcomes (toxicity, self-harm, stereotypes, etc.).</li>
                <li><strong>Safety preference data</strong>: to enhance alignment with user expectations related to safety and ethical standards. These datasets are selected for their ability to capture preferences related to avoiding harmful or inappropriate responses.</li>
            </ul>


            <br /><br />
            <center>
                <embed src="/images/work/aloe_diagram.png"  className="responsive-image"/>
                Summary of Aloe Beta training stages and data sources.
            </center>            
            </p>

        </div>
        <h2 className="is-size-3 mb-2">Results</h2>
        <div className="columns is-multiline mb-4 display-linebreak">
            <p align="justify">
            To compare Aloe with the most competitive open models (both general-purpose and healthcare-specific), we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA, and the six medical tasks of MMLU), together with the new and highly reliable CareQA.
            <br />
            <center>
                <img src="/images/work/mcqa_evals_small.png" width="50%" />
                <img src="/images/work/mcqa_evals_big.png" width="50%" />
                Medical Multiple-Choice Question Answering benchmarks.
            </center>
            <br /><br />

            However, while MCQA benchmarks provide valuable insights into a model's ability to handle structured queries, they fall short of representing the full range of challenges faced in medical practice. Because Aloe Beta was developed to perform well beyond multiple-choice questions, we also evaluated it on a broad set of other medical tasks:
            <br /> <br />
            <center>
                <img src="/images/work/medical_tasks_small_1.png" width="50%" />
                <img src="/images/work/medical_tasks_small_2.png" width="50%" />
                Evaluation of small models on medical tasks.
            </center>
            <center>
                <img src="/images/work/medical_tasks_big_1.png" width="50%" />
                <img src="/images/work/medical_tasks_big_2.png" width="50%" />
                Evaluation of large models on medical tasks.
            </center>

            <br />
            We also evaluated the model in the general domain, using the Open LLM Leaderboard benchmarks. Aloe Beta achieves results competitive with current state-of-the-art general models on the most widely used general benchmarks, and outperforms the other medical models:
            <br /><br />
            <center>
                <img src="/images/work/general_evals_small.png" width="50%" />
                <img src="/images/work/general_evals_big.png" width="50%" />
                Evaluation on general tasks (Open LLM Leaderboard).
            </center>
            <br /><br />
            Benchmark results show that Aloe's training has significantly enhanced its performance, achieving results on par with leading private models such as Med-PaLM-2 and GPT-4. Aloe Beta also stands out by outperforming other medical models on the Open LLM Leaderboard and excelling in various medical tasks, including Medical Factuality and Treatment Recommendations. These results position Aloe Beta as one of the top models currently available in healthcare.
            <br /><br />
            Additionally, Aloe's performance is further boosted through advanced prompting techniques. Specifically, our In-Context Learning method increases accuracy by approximately 8% for smaller models and 4% for larger models. This enables Aloe Beta to surpass all existing models that do not use RAG evaluation.
            </p>
        </div>
        <br /><br />
        
        
    </section>;
}

export default ResearchAloe;
