
This blog post provides a practical guide to selecting and deploying an open-source Large Language Model (LLM) locally. It offers a high-level overview of essential considerations, including model characteristics, hardware requirements, and deployment frameworks, to help you make informed decisions for on-premise or self-hosted LLMs. For further technical details, we refer to the relevant sources.
Open-source LLMs offer significant governance advantages, particularly in privacy, security, and compliance. By self-hosting models — especially with on-premise deployments — organizations ensure that sensitive data never leaves their own infrastructure. This eliminates third-party exposure and greatly reduces regulatory risks for industries like healthcare and finance, where strict compliance requirements often necessitate on-premise deployments.
Beyond security, open-source models provide greater independence and control. Unlike proprietary solutions from providers like OpenAI or Anthropic, they allow organizations to avoid vendor lock-in and maintain full authority over their technology stack. This provides greater long-term flexibility and enables organizations to implement a cloud exit strategy if needed. In short, open models empower organizations with data sovereignty and control over costs, which is pivotal for strategic governance.
Open-source models are released by providers who make the model weights available for download, typically under specific license terms. When considering a particular provider or model, make sure that your intended use is permitted by the license in question.
Historically, research institutions were the primary providers of open-source models. However, the landscape has evolved, and today, many companies are also major contributors, particularly in the field of LLMs. As a result, multiple open-source models are available, each with different strengths and capabilities. Some excel in reasoning and coding tasks, while others focus on multilingual proficiency or efficiency. Leading examples include:
Selecting the right LLM requires understanding fundamental characteristics and their trade-offs. Several factors impact a model's performance, efficiency, and suitability for different use cases.
This refers to the number of learnable weights in the model, which influences its ability to understand and generate text. Larger models (70B+ parameters) tend to offer more nuanced reasoning and knowledge, whereas smaller models (7B–13B) are more lightweight and efficient.
Measured in tokens, the context window determines how much text the model can process at once. Modern models vary widely on this: some support around 8K tokens, while others extend to 100K+ tokens in context. Larger context windows are useful for handling long documents or maintaining long conversations without losing context.
Precision refers to the numerical format used for the model weights and calculations. LLMs are typically trained in high precision (FP32 or BF16) for maximum accuracy. Quantization converts the weights to a lower precision such as INT8 or INT4, which dramatically reduces model size and speeds up inference, typically at a small cost in accuracy.
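As an illustration, the snippet below is a minimal sketch of loading a model in 4-bit precision using the Hugging Face transformers and bitsandbytes libraries; the model id and configuration values are placeholders rather than recommendations, and the exact options depend on your setup.

```python
# Minimal sketch: loading an LLM with 4-bit quantization via transformers + bitsandbytes.
# The model id is illustrative; any compatible Hugging Face checkpoint can be used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matrix multiplications in BF16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```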
Training data is the text corpus used to teach the model language skills, while methodology refers to the training techniques (pre-training, fine-tuning, and alignment via reinforcement learning from human feedback, or RLHF). Base models (pre-trained on broad data without human alignment) often exhibit more raw and creative outputs, whereas RLHF-aligned models are fine-tuned to follow instructions, be helpful, and avoid unsafe content. Additionally, domain-specific fine-tuning (e.g., on medical or legal text) can make a model perform much better in specialized fields.
While most open-source LLMs handle text only, some are multimodal – they can process images, audio, or even video in addition to text. For example, vision-language models can interpret images in a prompt, and others incorporate speech recognition or generation.
Capabilities of LLMs can include advanced reasoning, complex problem-solving, arithmetic skills, and the ability to learn from just a few examples (in-context learning). Some advanced LLMs can also use tools or carry out multi-step reasoning (chain-of-thought) to solve complex tasks. These emergent abilities become more pronounced as model size and the amount of training data increase.
Although some models are trained on non-English languages, English is still the primary language of LLMs due to the vast amount of training data available. For open-source models, the performance degradation in other languages can be quite significant. If you expect your model to be used in a language other than English, it is important to consider its performance in that particular language (see the following section).
Assessing an LLM's effectiveness requires objective evaluation across multiple dimensions. Both standardized benchmarks and user-based assessments are useful to gauge a model's strengths and weaknesses.
Common benchmark tests include:
Public leaderboards tracking these benchmarks can be found online. Notable sources include:
Additionally, user feedback, often gathered through blind model comparisons and preference rankings, provides insights beyond benchmarks. A notable example is the Chatbot Arena LLM Leaderboard, a platform where models are pitted against each other and humans vote on the better response, yielding a ranking of chat models. If you are specifically considering smaller LLMs, the GPU-Poor LLM Leaderboard is another useful resource.
Deploying LLMs for inference — the process of generating predictions or responses from a trained model — requires careful consideration of hardware. In almost all cases, GPUs (graphics processing units) are essential for running large models. CPUs are generally far too slow for anything beyond very small models because they are designed for general-purpose, sequential processing. LLM inference is highly parallelizable and memory-intensive, which plays to the strengths of GPUs.
Inference tasks are often more memory-bound than compute-bound. This is because LLM inference requires loading and accessing massive model weights from memory repeatedly, which creates a bottleneck in memory bandwidth and latency, whereas the actual compute operations (matrix multiplications) are relatively efficient on modern GPUs. Therefore, the primary factors in choosing a GPU and a model are the GPU's VRAM (video RAM, the GPU's on-board memory) capacity and the model's size, respectively.
Note: Because of this, and as a simplification, we primarily focus on VRAM requirements in this discussion. While factors like compute power (FLOPs), latency, and time to first token (TTFT) can be crucial for real-world performance, we do not analyze them in depth here.
For efficient inference, the model's weights must fit entirely within the combined VRAM of your GPUs. You can estimate the VRAM required to load the model using the following formula:
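VRAM for model weights ≈ number of parameters × bytes per parameter

With FP16, each parameter takes 2 bytes; with INT8, 1 byte; with INT4, half a byte.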
For example, a 70B-parameter model would have the following memory requirements at different levels of quantization: FP16 ~140 GB, INT8 ~70 GB, INT4 ~35 GB.
The KV cache (Key-Value cache) is a crucial component during LLM inference, especially for long prompts or dialogues. As the model generates output, it stores intermediate results (the "keys" and "values" for each attention layer at each token position) so that it doesn't recompute them every time it processes a new token. This cache grows linearly with the number of tokens being processed. In practical terms, longer context = more VRAM usage dedicated to the cache. The size of the KV cache can be estimated using the formula:
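KV cache size ≈ 2 × number of layers × hidden size × precision (bytes) × number of tokens

Here the factor 2 accounts for storing both keys and values. (Models that use grouped-query attention keep fewer key-value heads and therefore need less cache than this estimate suggests.)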
For example, for a model like Llama 70B with 80 layers and a hidden size of 8192, using FP16 precision and an average of 750 tokens per request, we get approximately 2 GB of KV cache memory per request.
Serving multiple users or requests concurrently further strains memory: you need to accommodate N separate contexts (and their caches) to handle N sessions in parallel. To determine the maximum number of concurrent users, you therefore need to know the average number of tokens per request, which can be difficult to estimate in advance. Continuing the KV cache example above, a GPU with 40 GB of VRAM set aside for the cache (in addition to the memory holding the model weights) could serve roughly 20 concurrent users without performance degradation (2 GB × 20 = 40 GB).
Experiment with the model and user specifications and see the effect on hardware utilization. Choose pre-set models and GPUs from the dropdowns or specify the parameters yourself. Note: the calculations are approximations.
Model Weight Size: 14.00 GB
Estimated KV Cache per User: 1.07 GB
Total KV Cache (All Users): 1.07 GB
Total VRAM Used: 15.07 GB
VRAM Remaining: 8.93 GB (37.2%)
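For reference, here is a minimal Python sketch of this kind of estimate. It assumes a 7B-parameter model served in FP16 (32 layers, hidden size 4096), a single user with a 2048-token context, and a 24 GB GPU; these assumptions are illustrative and chosen to land close to the example output above.

```python
# Rough VRAM estimate for serving an LLM locally (decimal GB, illustrative assumptions).

GPU_VRAM_GB = 24  # e.g., a single 24 GB GPU

def estimate_vram_gb(params_billion, bytes_per_param, n_layers, hidden_size,
                     kv_bytes, tokens_per_user, n_users):
    """Return (weights, kv_per_user, kv_total, total_used) in decimal GB."""
    weights = params_billion * 1e9 * bytes_per_param / 1e9
    # Keys and values are stored for every layer at every token position.
    kv_per_user = 2 * n_layers * hidden_size * kv_bytes * tokens_per_user / 1e9
    kv_total = kv_per_user * n_users
    return weights, kv_per_user, kv_total, weights + kv_total

# Illustrative setup: 7B model in FP16 (32 layers, hidden size 4096),
# one user with a 2048-token context.
weights, kv_user, kv_total, used = estimate_vram_gb(
    params_billion=7, bytes_per_param=2, n_layers=32, hidden_size=4096,
    kv_bytes=2, tokens_per_user=2048, n_users=1,
)
remaining = GPU_VRAM_GB - used

print(f"Model weight size:           {weights:.2f} GB")
print(f"Estimated KV cache per user: {kv_user:.2f} GB")
print(f"Total KV cache (all users):  {kv_total:.2f} GB")
print(f"Total VRAM used:             {used:.2f} GB")
print(f"VRAM remaining:              {remaining:.2f} GB ({remaining / GPU_VRAM_GB:.1%})")
```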
Selecting the right open-source LLM for local deployment requires balancing performance, hardware constraints, and specific use case requirements.
Once an open-source LLM is selected, the next step is deploying it efficiently. This requires the right infrastructure, optimized inference frameworks, and reliable API servers to handle requests, reduce latency, and manage resources effectively.
To run an LLM in production, a serving framework is essential. This is a piece of software that loads the model, manages its execution, and efficiently handles user queries. Without one, models are slow, memory-intensive, and difficult to scale. Popular frameworks include vLLM (high-speed inference), TGI (optimized for text generation), Llama.cpp (lightweight CPU/GPU execution), and CTransformers (fast C++ backend).
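For illustration, here is a minimal sketch of offline (batch) inference with vLLM, one of the frameworks mentioned above; the model id is a placeholder and the sampling settings are arbitrary.

```python
# Minimal vLLM sketch: load a model and generate completions for a batch of prompts.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")  # placeholder model id

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Explain the difference between FP16 and INT4 quantization.",
    "Summarize the benefits of self-hosting an LLM.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

In recent versions, vLLM can also expose the same model through an OpenAI-compatible HTTP endpoint, which leads directly into the API server layer discussed next.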
An API server connects applications to the deployed model, handling requests and security. Essential components include an HTTP/REST interface, request processing, security controls (authentication, authorization, rate limiting), and monitoring/logging for performance tracking and debugging.
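To make this concrete, the sketch below shows an application calling a locally hosted model through an OpenAI-compatible endpoint, as exposed by serving frameworks such as vLLM or TGI; the base URL, API key, and model name are placeholders for your own deployment.

```python
# Minimal sketch: querying a locally hosted LLM via an OpenAI-compatible API.
# The base URL, API key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local inference server
    api_key="not-needed-locally",         # many local servers accept any token
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
    messages=[{"role": "user", "content": "What are the benefits of self-hosting an LLM?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```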
Deploying an LLM requires more than just picking a serving framework—it demands efficient infrastructure, scalability, and seamless integration with applications. Managing GPU resources, optimizing inference, and ensuring security can quickly become complex. This is where the GRACE Platform provides a complete, hassle-free solution:
In our upcoming blog, we will explore the process of deploying LLMs using the GRACE Platform, highlighting best practices, performance optimizations, and real-world deployment strategies. Stay tuned for more insights.
For further details on LLM inference, hardware sizing, and deployment best practices, refer to the following resources: