Large Language Models (LLMs) have recently started to find their footing in business, and their adoption will only expand further. As companies begin to understand the benefits of implementing LLMs, data teams adjust the models to the business requirements.
The optimal path for a business is to use a cloud platform to scale whatever LLM capabilities it needs. However, many hurdles can hinder LLM performance in the cloud and drive up usage costs, which is exactly what we want to avoid.
That's why this article outlines a strategy you can use to optimize the performance of an LLM in the cloud while keeping costs under control. What's the strategy? Let's get into it.
We must understand our financial condition before implementing any strategy to optimize performance and cost. The budget we are willing to invest in the LLM becomes our limit. A higher budget could lead to better performance, but the spending might not be optimal if it doesn't support the business.
The budget plan needs extensive discussion with the various stakeholders so it doesn't go to waste. Identify the critical problems your business wants to solve and assess whether an LLM is worth the investment.
This strategy also applies to any solo business or individual. Setting a budget you are willing to spend on the LLM will help your finances in the long run.
With the advancement of research, there are many kinds of LLMs we can choose from to solve our problem. A model with fewer parameters is faster and cheaper to run but might not have the ability to solve your business problems, while a bigger model has a richer knowledge base and more creativity but costs more to compute.
There are trade-offs between performance and cost as the LLM size changes, and we need to take them into account when deciding on the model. Do we need a bigger model with better performance but higher cost, or vice versa? That's a question we need to ask, so try to assess your needs carefully.
Additionally, the cloud hardware affects performance as well. More GPU memory can mean faster response times, support for more complex models, and reduced latency. However, more memory also means higher cost.
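As a rough rule of thumb, the memory footprint of a model's weights is the parameter count multiplied by the bytes per parameter, plus headroom for activations and the KV cache. The sketch below is a back-of-the-envelope estimate in Python; the 20% overhead factor is an assumption, not a precise figure:

```python
def estimate_gpu_memory_gb(num_params_billion: float,
                           bytes_per_param: int = 2,
                           overhead: float = 0.2) -> float:
    """Rough estimate of GPU memory needed to serve a model.

    bytes_per_param: 2 for FP16/BF16, 4 for FP32, 1 for INT8.
    overhead: extra fraction for activations and the KV cache (assumed).
    """
    weights_gb = num_params_billion * 1e9 * bytes_per_param / 1024**3
    return weights_gb * (1 + overhead)


# A 7B-parameter model in FP16 needs roughly 13 GB for the weights alone,
# so a 16 GB GPU is a tight fit while a 24 GB GPU leaves headroom.
print(f"7B FP16:  ~{estimate_gpu_memory_gb(7):.1f} GB")
print(f"13B FP16: ~{estimate_gpu_memory_gb(13):.1f} GB")
```

Running the estimate for a few candidate model sizes against the GPU instances your cloud provider offers makes the performance-versus-cost trade-off much easier to reason about.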
Depending on the cloud platform, there are many inference options to choose from. Your application's workload requirements determine which option fits best, and each option allocates resources differently, which in turn affects the cost.
If we take Amazon SageMaker's inference options as an example, you can choose from:
- Real-Time Inference. The inference processes the response instantly as input arrives. It's typically used for real-time applications such as chatbots and translation. Because it always requires low latency, the application needs high computing resources even during low-demand periods, which means an LLM on real-time inference can lead to higher costs without any benefit when the demand isn't there.
- Serverless Inference. The cloud platform scales and allocates resources dynamically as required. Performance might suffer slightly because of the latency incurred each time resources are initiated for a request, but it's the most cost-effective option since we only pay for what we use.
- Batch Transform. Requests are processed in batches. This makes the option suitable only for offline workloads, since requests aren't processed immediately. It might not fit applications that need an instant response, as the delay is always there, but it doesn't cost much.
- Asynchronous Inference. This option runs inference tasks in the background and makes the results available for retrieval later. It's a good fit for models with long processing times, as it can handle various tasks concurrently in the background, and it can be cost-effective thanks to better resource allocation.
Try to assess what your application needs so you choose the most effective inference option.
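To illustrate the serverless option, the sketch below deploys a Hugging Face model to a SageMaker serverless endpoint with the SageMaker Python SDK. The model ID, memory size, concurrency, and framework versions are placeholder assumptions; check what your region supports before running it.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# Assumes this runs inside SageMaker; otherwise pass an IAM role ARN directly.
role = sagemaker.get_execution_role()

# Placeholder model and framework versions; adjust to your use case and region.
model = HuggingFaceModel(
    role=role,
    env={
        "HF_MODEL_ID": "gpt2",          # example model, swap for your own
        "HF_TASK": "text-generation",
    },
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Serverless config: SageMaker allocates resources per request,
# so you only pay for the compute used while handling traffic.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,  # assumed size; raise it for larger models
    max_concurrency=5,
)

predictor = model.deploy(serverless_inference_config=serverless_config)
response = predictor.predict({"inputs": "Summarize: cloud costs rise when"})
print(response)
```

The same `deploy` call can target real-time or asynchronous inference instead by passing the corresponding configuration, so switching options later doesn't require rebuilding the model.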
LLMs are a particular case because the number of tokens affects the cost we pay. That's why we need to build prompts that use the minimum number of tokens, for both input and output, while still maintaining output quality.
Try to build prompts that specify a certain number of paragraphs in the output or use concluding instructions such as "summarize" or "be concise." Also, construct the input prompt precisely to generate exactly the output you need, and don't let the model generate more than that.
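One way to keep an eye on prompt size is to count tokens before sending a request. The sketch below uses the tiktoken library as an example tokenizer; the encoding name and the price per 1,000 tokens are placeholder assumptions, since each provider tokenizes and prices differently:

```python
import tiktoken

# Example encoding; the right one depends on the model you call.
encoding = tiktoken.get_encoding("cl100k_base")


def estimate_prompt_cost(prompt: str, price_per_1k_tokens: float = 0.0005):
    """Return the prompt's token count and a rough input-cost estimate.

    price_per_1k_tokens is a placeholder; use your provider's actual rate.
    """
    num_tokens = len(encoding.encode(prompt))
    return num_tokens, num_tokens / 1000 * price_per_1k_tokens


verbose_prompt = (
    "I would like you to please read the following customer review and then "
    "write a long, detailed explanation of everything it says: ..."
)
concise_prompt = "Summarize this review in 2 sentences: ..."

for name, prompt in [("verbose", verbose_prompt), ("concise", concise_prompt)]:
    tokens, cost = estimate_prompt_cost(prompt)
    print(f"{name}: {tokens} tokens, ~${cost:.5f} input cost")
```

Multiplying the difference by thousands of daily requests shows how quickly tighter prompts translate into real savings.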
Some information will be asked for repeatedly and produce the same response every time. To reduce the number of queries, we can cache all the typical information in a database and call it when it's required.
Typically, the data is stored in a vector database such as Pinecone or Weaviate, but most cloud platforms offer their own vector databases as well. The responses we want to cache are converted into vector form and stored for future queries.
There are a few challenges in caching responses effectively: we need policies for handling cases where the cached response is inadequate to answer the input query, and some cached entries are similar to each other, which could lead to the wrong response being returned. Manage the responses well, and an adequate cache database can help reduce costs.
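To make the idea concrete, here is a minimal in-memory sketch of a semantic cache. The `embed()` function is a hypothetical stand-in and the 0.9 similarity threshold is an assumption to tune; in practice you would use a real embedding model and a managed vector database such as Pinecone or Weaviate.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a deterministic random vector per text within a run.

    Real semantic matching requires a real embedding model here.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # assumed cutoff; tune on real traffic
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, query: str):
        """Return a cached response if a past query is similar enough."""
        if not self.vectors:
            return None
        q = embed(query)
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.vectors.append(embed(query))
        self.responses.append(response)


cache = SemanticCache()
answer = cache.lookup("What is your refund policy?")
if answer is None:
    answer = "Refunds are available within 30 days."  # stand-in for an LLM call
    cache.store("What is your refund policy?", answer)
print(answer)
```

The cache check happens before the LLM call, so every cache hit is a request you don't pay the model for.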
The LLM we deploy might end up costing us too much and performing inaccurately if we don't treat it right. Here are the strategies you can employ to optimize the performance and cost of your LLM in the cloud:
- Have a clear budget plan,
- Decide on the right model size and hardware,
- Choose the suitable inference option,
- Construct effective prompts,
- Cache responses.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.