2023 was the year that witnessed the rise of various Large Language Models (LLMs) in the Generative AI space. LLMs have incredible power and potential, but productionizing them has been a consistent challenge for users. An especially prevalent question is: which LLM should one use? And more specifically, how can one evaluate an LLM for accuracy? This becomes especially challenging when there is a large number of models to choose from, different datasets for fine-tuning/RAG, and a variety of prompt engineering/tuning techniques to consider.
To solve this problem, we need to establish DevOps best practices for LLMs: a workflow or pipeline that helps us evaluate different models, datasets, and prompts. This emerging field is becoming known as LLMOps/FMOps. Some of the parameters that can be considered in LLMOps are shown below, in an (extremely) simplified flow:
In this article, we'll tackle this problem by building a pipeline that fine-tunes, deploys, and evaluates a Llama 7B model. You can also scale this example by using it as a template to compare multiple LLMs, datasets, and prompts. For this example, we'll use the following tools to build the pipeline:
- SageMaker JumpStart: SageMaker JumpStart provides a variety of FMs/LLMs out of the box for both fine-tuning and deployment. Both of these processes can be quite complicated, so JumpStart abstracts away the specifics and lets you specify your dataset and model metadata to conduct fine-tuning and deployment. In this case, we select Llama 7B and conduct instruction fine-tuning, which is supported out of the box (a minimal fine-tuning sketch follows this list). For a deeper introduction to JumpStart fine-tuning, please refer to this blog and this Llama code sample, which we'll use as a reference.
- SageMaker Clarify/FMEval: SageMaker Clarify provides a Foundation Model Evaluation tool via the SageMaker Studio UI and the open-source Python FMEval library (an evaluation sketch also follows this list). The feature comes with a variety of built-in algorithms spanning different NLP…
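To make the JumpStart workflow concrete, here is a minimal sketch of instruction fine-tuning and deploying Llama 7B with the SageMaker Python SDK's `JumpStartEstimator`. The `model_id`, instance type, hyperparameter values, and the S3 path to the instruction dataset are assumptions for illustration; refer to the linked blog and Llama code sample for the exact setup used in this article.

```python
# Minimal sketch: instruction fine-tune and deploy Llama 7B via SageMaker JumpStart.
# The model_id, instance type, hyperparameters, and S3 path below are illustrative
# assumptions, not the exact values used in this article.
from sagemaker.jumpstart.estimator import JumpStartEstimator

# S3 prefix containing the instruction dataset (train.jsonl) and its template.json.
train_data_location = "s3://<your-bucket>/llama-instruction-dataset/"

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",  # assumed JumpStart ID for Llama 7B
    environment={"accept_eula": "true"},        # accept the Llama EULA
    instance_type="ml.g5.12xlarge",             # assumed training instance
)

# Instruction fine-tuning is supported out of the box for this model.
estimator.set_hyperparameters(instruction_tuned="True", epoch="1", max_input_length="1024")
estimator.fit({"training": train_data_location})

# Deploy the fine-tuned model to a real-time endpoint for downstream evaluation.
predictor = estimator.deploy()
response = predictor.predict(
    {"inputs": "Summarize the following article: ...", "parameters": {"max_new_tokens": 256}},
    custom_attributes="accept_eula=true",
)
print(response)
```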
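Similarly, once the fine-tuned endpoint is up, the open-source fmeval library can score it against a labeled dataset. The sketch below assumes a JSON Lines evaluation dataset and uses the Factual Knowledge algorithm as one example of the built-in evaluators; the endpoint name, dataset path, field names, and generation parameters are placeholders.

```python
# Minimal sketch: evaluate the deployed JumpStart endpoint with the fmeval library.
# Endpoint name, dataset path/fields, and generation parameters are illustrative
# placeholders; Factual Knowledge is just one of the built-in algorithms.
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

# Point fmeval at a JSON Lines dataset with one prompt/reference pair per record.
config = DataConfig(
    dataset_name="eval_dataset",
    dataset_uri="eval_dataset.jsonl",       # hypothetical local path
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",        # field holding the prompt
    target_output_location="answer",        # field holding the reference answer
)

# Wrap the fine-tuned Llama endpoint so fmeval knows how to invoke it and parse output.
model_runner = JumpStartModelRunner(
    endpoint_name="llama-7b-finetuned-endpoint",  # hypothetical endpoint name
    model_id="meta-textgeneration-llama-2-7b",
    model_version="*",
    output="[0].generation",  # JMESPath into the endpoint's response payload
    content_template='{"inputs": $prompt, "parameters": {"max_new_tokens": 256}}',
)

# Run one of the built-in evaluation algorithms and collect aggregate scores.
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, save=True)
print(eval_output)
```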