
This blog post provides a practical guide to selecting and deploying an open-source Large Language Model (LLM) locally. It offers a high-level overview of essential considerations, including model characteristics, hardware requirements, and deployment frameworks, to help you make informed decisions for on-premise or self-hosted LLMs. For further technical details, we refer to the relevant sources.
Open-source LLMs offer significant governance advantages, particularly in privacy, security, and compliance. By self-hosting models — especially with on-premise deployments — organizations ensure that sensitive data never leaves their own infrastructure. This eliminates third-party exposure and greatly reduces regulatory risks for industries like healthcare and finance, where strict compliance requirements often necessitate on-premise deployments.
Beyond security, open-source models provide greater independence and control. Unlike proprietary solutions from providers like OpenAI or Anthropic, they allow organizations to avoid vendor lock-in and maintain full authority over their technology stack. This provides greater long-term flexibility and enables organizations to implement a cloud exit strategy if needed. In short, open models empower organizations with data sovereignty and control over costs, which is pivotal for strategic governance.
Open-source models are released by providers who make the model weights available for download, typically under specific license terms. When considering a particular provider or model, make sure that your intended use is permitted by the license in question.
Historically, research institutions were the primary providers of open-source models. However, the landscape has evolved, and today, many companies are also major contributors, particularly in the field of LLMs. As a result, multiple open-source models are available, each with different strengths and capabilities. Some excel in reasoning and coding tasks, while others focus on multilingual proficiency or efficiency. Leading examples include:
Selecting the right LLM requires understanding fundamental characteristics and their trade-offs. Several factors impact a model's performance, efficiency, and suitability for different use cases.
This refers to the number of learnable weights in the model, which influences its ability to understand and generate text. Larger models (70B+ parameters) tend to offer more nuanced reasoning and knowledge, whereas smaller models (7B–13B) are more lightweight and efficient.
Measured in tokens, the context window determines how much text the model can process at once. Modern models vary widely on this: some support around 8K tokens, while others extend to 100K+ tokens in context. Larger context windows are useful for handling long documents or maintaining long conversations without losing context.
Precision refers to the numerical format used for the model weights and calculations. LLMs are typically trained in high precision (FP32 or BF16) for maximum accuracy. Quantization converts the weights to a lower precision such as INT8 or INT4, which dramatically reduces model size and speeds up inference, typically at a small cost in accuracy.
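As an illustration, the snippet below is a minimal sketch of loading a model in 4-bit precision using the Hugging Face transformers and bitsandbytes libraries; the model id and configuration values are placeholders rather than recommendations, and the exact options depend on your setup.

```python
# Minimal sketch: loading an LLM with 4-bit quantization via transformers + bitsandbytes.
# The model id is illustrative; any compatible Hugging Face checkpoint can be used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matrix multiplications in BF16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```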
Training data is the text corpus used to teach the model language skills, while methodology refers to the training techniques (pre-training, fine-tuning, and alignment via reinforcement learning from human feedback, or RLHF). Base models (pre-trained on broad data without human alignment) often exhibit more raw and creative outputs, whereas RLHF-aligned models are fine-tuned to follow instructions, be helpful, and avoid unsafe content. Additionally, domain-specific fine-tuning (e.g., on medical or legal text) can make a model perform much better in specialized fields.
While most open-source LLMs handle text only, some are multimodal – they can process images, audio, or even video in addition to text. For example, vision-language models can interpret images in a prompt, and others incorporate speech recognition or generation.
Capabilities of LLMs can include advanced reasoning, complex problem-solving, arithmetic skills, and the ability to learn from just a few examples (in-context learning). Some advanced LLMs can also use tools or carry out multi-step reasoning (chain-of-thought) to solve complex tasks. These emergent abilities become more pronounced as model size and the amount of training data increase.
Although some models are trained on non-English languages, English is still the primary language of LLMs due to the vast amount of training data available. For open-source models, the performance degradation in other languages can be quite significant. If you expect your model to be used in a language other than English, it is important to consider its performance in that particular language (see the following section).
Assessing an LLM's effectiveness requires objective evaluation across multiple dimensions. Both standardized benchmarks and user-based assessments are useful to gauge a model's strengths and weaknesses.
Common benchmark tests include:
Public leaderboards tracking these benchmarks can be found online. Notable sources include:
Additionally, user feedback, often gathered through blind model comparisons and preference rankings, provides insights beyond benchmarks. A notable example is the Chatbot Arena LLM Leaderboard, a platform where models are pitted against each other and humans vote on the better response, yielding a ranking of chat models. If you are specifically considering smaller LLMs, the GPU-Poor LLM Leaderboard is another useful resource.
Deploying LLMs for inference — the process of generating predictions or responses from a trained model — requires careful consideration of hardware. In almost all cases, GPUs (graphics processing units) are essential for running large models. CPUs are generally far too slow for anything beyond very small models because they are designed for general-purpose, sequential processing. LLM inference is highly parallelizable and memory-intensive, which plays to the strengths of GPUs.
Inference tasks are often more memory-bound than compute-bound. This is because LLM inference requires loading and accessing massive model weights from memory repeatedly, which creates a bottleneck in memory bandwidth and latency, whereas the actual compute operations (matrix multiplications) are relatively efficient on modern GPUs. Therefore, the primary factors in choosing a GPU and a model are the GPU's VRAM (video RAM, the GPU's on-board memory) capacity and the model's size, respectively.
Note: Because of this, and as a simplification, we primarily focus on VRAM requirements in this discussion. While factors like compute power (FLOPs), latency, and time to first token (TTFT) can be crucial for real-world performance, we do not analyze them in depth here.
For efficient inference, the model's weights must fit entirely within the combined VRAM of your GPUs. You can estimate the VRAM required to load the model using the following formula:
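VRAM for model weights ≈ number of parameters × bytes per parameter

With FP16, each parameter takes 2 bytes; with INT8, 1 byte; with INT4, half a byte.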
For example, a 70B-parameter model would have the following memory requirements at different levels of quantization: FP16 ~140 GB, INT8 ~70 GB, INT4 ~35 GB.
The KV cache (Key-Value cache) is a crucial component during LLM inference, especially for long prompts or dialogues. As the model generates output, it stores intermediate results (the "keys" and "values" for each attention layer at each token position) so that it doesn't recompute them every time it processes a new token. This cache grows linearly with the number of tokens being processed. In practical terms, longer context = more VRAM usage dedicated to the cache. The size of the KV cache can be estimated using the formula:
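KV cache size ≈ 2 × number of layers × hidden size × precision (bytes) × number of tokens

Here the factor 2 accounts for storing both keys and values. (Models that use grouped-query attention keep fewer key-value heads and therefore need less cache than this estimate suggests.)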
For example, for a model like Llama 70B with 80 layers and a hidden size of 8192, using FP16 precision and an average of 750 tokens per request, we get approximately 2 GB of KV cache memory per request.
Serving multiple users or requests concurrently further strains memory: you need to accommodate N separate contexts (and their caches) to handle N sessions in parallel. To determine the maximum number of concurrent users, you therefore need to know the average number of tokens per request, which can be difficult to estimate in advance. Continuing the KV cache example above, a GPU with 40 GB of VRAM set aside for the cache (in addition to the memory holding the model weights) could serve roughly 20 concurrent users without performance degradation (2 GB × 20 = 40 GB).
Experiment with the model and user specifications and see the effect on hardware utilization. Choose pre-set models and GPUs from the dropdowns or specify the parameters yourself. Note: the calculations are approximations.
Model Weight Size: 14.00 GB
Estimated KV Cache per User: 1.07 GB
Total KV Cache (All Users): 1.07 GB
Total VRAM Used: 15.07 GB
VRAM Remaining: 8.93 GB (37.2%)
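For reference, here is a minimal Python sketch of this kind of estimate. It assumes a 7B-parameter model served in FP16 (32 layers, hidden size 4096), a single user with a 2048-token context, and a 24 GB GPU; these assumptions are illustrative and chosen to land close to the example output above.

```python
# Rough VRAM estimate for serving an LLM locally (decimal GB, illustrative assumptions).

GPU_VRAM_GB = 24  # e.g., a single 24 GB GPU

def estimate_vram_gb(params_billion, bytes_per_param, n_layers, hidden_size,
                     kv_bytes, tokens_per_user, n_users):
    """Return (weights, kv_per_user, kv_total, total_used) in decimal GB."""
    weights = params_billion * 1e9 * bytes_per_param / 1e9
    # Keys and values are stored for every layer at every token position.
    kv_per_user = 2 * n_layers * hidden_size * kv_bytes * tokens_per_user / 1e9
    kv_total = kv_per_user * n_users
    return weights, kv_per_user, kv_total, weights + kv_total

# Illustrative setup: 7B model in FP16 (32 layers, hidden size 4096),
# one user with a 2048-token context.
weights, kv_user, kv_total, used = estimate_vram_gb(
    params_billion=7, bytes_per_param=2, n_layers=32, hidden_size=4096,
    kv_bytes=2, tokens_per_user=2048, n_users=1,
)
remaining = GPU_VRAM_GB - used

print(f"Model weight size:           {weights:.2f} GB")
print(f"Estimated KV cache per user: {kv_user:.2f} GB")
print(f"Total KV cache (all users):  {kv_total:.2f} GB")
print(f"Total VRAM used:             {used:.2f} GB")
print(f"VRAM remaining:              {remaining:.2f} GB ({remaining / GPU_VRAM_GB:.1%})")
```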
Selecting the right open-source LLM for local deployment requires balancing performance, hardware constraints, and specific use case requirements.
Once an open-source LLM is selected, the next step is deploying it efficiently. This requires the right infrastructure, optimized inference frameworks, and reliable API servers to handle requests, reduce latency, and manage resources effectively.
To run an LLM in production, a serving framework is essential. This is a piece of software that loads the model, manages its execution, and efficiently handles user queries. Without one, models are slow, memory-intensive, and difficult to scale. Popular frameworks include vLLM (high-speed inference), TGI (optimized for text generation), Llama.cpp (lightweight CPU/GPU execution), and CTransformers (fast C++ backend).
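For illustration, here is a minimal sketch of offline (batch) inference with vLLM, one of the frameworks mentioned above; the model id is a placeholder and the sampling settings are arbitrary.

```python
# Minimal vLLM sketch: load a model and generate completions for a batch of prompts.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")  # placeholder model id

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Explain the difference between FP16 and INT4 quantization.",
    "Summarize the benefits of self-hosting an LLM.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

In recent versions, vLLM can also expose the same model through an OpenAI-compatible HTTP endpoint, which leads directly into the API server layer discussed next.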
An API server connects applications to the deployed model, handling requests and security. Essential components include an HTTP/REST interface, request processing, security controls (authentication, authorization, rate limiting), and monitoring/logging for performance tracking and debugging.
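To make this concrete, the sketch below shows an application calling a locally hosted model through an OpenAI-compatible endpoint, as exposed by serving frameworks such as vLLM or TGI; the base URL, API key, and model name are placeholders for your own deployment.

```python
# Minimal sketch: querying a locally hosted LLM via an OpenAI-compatible API.
# The base URL, API key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local inference server
    api_key="not-needed-locally",         # many local servers accept any token
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
    messages=[{"role": "user", "content": "What are the benefits of self-hosting an LLM?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```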
Deploying an LLM requires more than just picking a serving framework—it demands efficient infrastructure, scalability, and seamless integration with applications. Managing GPU resources, optimizing inference, and ensuring security can quickly become complex. This is where the GRACE Platform provides a complete, hassle-free solution:
In our upcoming blog, we will explore the process of deploying LLMs using the GRACE Platform, highlighting best practices, performance optimizations, and real-world deployment strategies. Stay tuned for more insights.
For further details on LLM inference, hardware sizing, and deployment best practices, refer to the following resources: