Introduction
LaunchDarkly’s new AI Configs (in Early Access) now support bringing your own model. This capability unlocks more flexibility than ever before for supporting fine-tuned models, or even running your own models on local hardware.
In this post, we’ll compare two open source models (DeepSeek-R1 and Alibaba’s Qwen) using Ollama and LaunchDarkly AI configs.
Prerequisites
- A developer environment with Python and pip installed. (I’m using Python 3.12 and I’d recommend 3.6 or newer).
- A computer with at least 16 GB of RAM.
- Ideally, a fast internet connection; otherwise, downloading the models might take a while.
Locally running LLMs: why and how
Running LLMs on your own hardware has a few advantages over using cloud-provided models:
- Enhanced data privacy. Your data stays on your machine, so you can be confident you’re not leaking anything to a model provider.
- Accessibility. Locally running models work without an internet connection.
- Sustainability and reduced costs. Small local models can take less power (and money) to run than calls to a large hosted model.
I was intimidated by the idea of running a model locally. Fortunately, tooling and models have come a long way! You don’t need super specialized knowledge, or particularly powerful hardware.
Ollama is an awesome open-source tool we’ll be using to run large language models on local hardware.
Choosing our models
Ultimately, which model to choose is an extremely complex question that depends on your use case, hardware, latency, and accuracy requirements.
Reasoning models are designed to provide more accurate answers to complex tasks such as coding or solving math problems. DeepSeek made a splash by releasing its open source R1 reasoning model in January.
The open source DeepSeek models are distillations of the Qwen or Llama models. Distillation means training a smaller, more efficient model to mimic the behavior and knowledge of a larger, more complex model. Let’s pit the distilled version against the original here and see how they stack up.
In this post, we’ll use small versions of these models (deepseek-r1:1.5b and qwen:1.8b) to make this tutorial accessible and fast for those without access to advanced hardware. Feel free to try whatever models best suit your needs as you follow along.
Installing and configuring Ollama
Head to the Ollama download page. Follow the instructions for your operating system of choice.
To install our first model and start it running, type the following command in your terminal:
ollama run deepseek-r1:1.5b
Let’s run a test query to ensure that Ollama can generate results. At the prompt, type a question:
>>> "why is the sky blue?"
</think>
The color of the sky appears blue due to a
combination of factors related to light
refraction and reflection. Here's a
step-by-step explanation:
...
To exit the Ollama prompt, use the /bye command.
Follow the same process to install qwen:1.8b:
ollama run qwen:1.8b
Although you have exited the interactive prompt, Ollama is still running as a background server on your machine. That lets us run code that queries the models we’ve downloaded. Next, we’ll create a Python project that connects with Ollama.
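If you want to confirm that the server is up and that both models downloaded before moving on, here are a couple of quick checks you can run in your terminal:
# list the models Ollama has downloaded
ollama list
# or query the local REST API directly (it listens on localhost:11434 by default)
curl http://localhost:11434/api/tags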
Connecting Ollama with a Python project
Run the following commands to set up a new Python project, create and activate a virtual environment, and install dependencies.
mkdir ollama-python
cd ollama-python
python -m venv venv
source venv/bin/activate
pip install ollama launchdarkly-server-sdk-ai dotenv
What are these dependencies and why do we need them?
- ollama lets us run queries against a locally running model.
- launchdarkly-server-sdk-ai lets us fetch AI Config data from LaunchDarkly and send metrics back.
- dotenv loads environment variables from a .env file so we can keep our credentials out of our code.
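If you’d like to sanity-check the install before writing any code, you can try importing each package from inside the virtual environment. (Note that the importable module names differ a little from the package names on PyPI.)
python -c "import ollama, ldai, dotenv; print('all imports OK')"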
Create a new file named main.py. Enter the following code:
import ollama

stream = ollama.chat(
    model='deepseek-r1:1.5b',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Run this script to verify that local inference is working.
python main.py
You should see terminal output that answers the question:
<think>
</think>
The color of the sky, also known as atmospheric red or violet, is primarily due to a process called Rayleigh scattering.
...
Adding a custom model to LaunchDarkly AI Configs
Head over to the model configuration page in the LaunchDarkly app. Click the Add AI Model Config button. Fill out the form using the following configuration:
- Name: deepseek-r1:1.5b
- Provider: DeepSeek
- Model ID: deepseek-r1:1.5b
Model ID represents how you intend to refer to the model in your code. Name is how the model will show up in the LaunchDarkly UI. If you are using a model with one of those terribly long gobbledygook names like llama-mama-10-gajillion-u234oiru32-preview-instrukt-latest, giving it a shorter, human-readable Name might make your dashboard more legible. In this case, we’re sticking with deepseek-r1:1.5b since it’s relatively short and clear. Click Save.
Click the Add AI Model Config button again to add the second model.
- Name: qwen:1.8b
- Provider: Custom (Qwen)
- Model ID: qwen:1.8b
Click Save.
Next, we’ll create the AI Config variation representing the prompt and model combinations that we can swap out at runtime. Variations are editable and versioned, so it’s okay to make mistakes. Click the Create button, and select AI Config from the menu.
Enter “model-showdown” in the AI Config name field. Click Create. Configure the next page like so:
- Name: deepseek-r1:1.5b
- Model: deepseek-r1:1.5b
- Role: User. (DeepSeek reasoning models aren’t optimized for System prompts.)
- Message: Why is the sky blue?
Save changes.
To use qwen:1.8b, click Add another variation. Variations record metrics separately, so to measure how each model performs we need one variation per model. To compare the models fairly, keep the prompt the same in both variations. Configure the second variation like so:
- Name: qwen:1.8b
- Model: qwen:1.8b
- Role: User
- Message: Why is the sky blue?
When you’re done, Save changes.
On the Targeting tab, edit the default rule to serve the deepseek-r1:1.5b variation, then click the toggle to enable the AI Config. Click Review and save.
If your LaunchDarkly permissions require it, enter a comment to explain these changes.
There’s one more configuration step. On the … dropdown next to the Test environment on the Targeting tab, select SDK key to copy your SDK key to the clipboard.
Create a .env file at the root of your project. Paste in the following line, replacing “YOUR KEY HERE” with your actual key.
LAUNCHDARKLY_SDK_KEY="YOUR KEY HERE"
Save the .env file. If your project is in source control, be careful not to check this file in. Keep those secrets safe.
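If your project lives in a git repository, one quick way to keep the key out of version control is to ignore the file explicitly:
echo ".env" >> .gitignore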
Connecting LaunchDarkly AI Configs to Ollama
Back in your Python project, create a new file named generate.py. Add the following lines of code:
from dotenv import load_dotenv
load_dotenv()

import os
import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai.client import LDAIClient, AIConfig, ModelConfig, LDMessage
from ldai.tracker import TokenUsage
import ollama

ldclient.set_config(Config(os.getenv("LAUNCHDARKLY_SDK_KEY")))
ld_client = ldclient.get()
ld_ai_client = LDAIClient(ld_client)


def generate(**kwargs):
    context = Context.builder("user-123").kind("user").name("Sandy").build()
    default_config = AIConfig(
        enabled=True,
        model=ModelConfig(name="deepseek-r1:1.5b"),
        messages=[LDMessage(role="user", content="Why is the sky blue?")]
    )
    ai_config_key = "model-showdown"
    default_value = default_config
    config_value, tracker = ld_ai_client.config(
        ai_config_key,
        context,
        default_value,
        kwargs
    )
    model_name = config_value.model.name
    print("CONFIG VALUE: ", config_value)
    print("MODEL NAME: ", model_name)
    messages = [] if config_value.messages is None else config_value.messages
    prompt = messages[0].content
    response = ollama.generate(model=model_name, prompt=prompt)
    print(response)


if __name__ == "__main__":
    generate()
Run this code with python generate.py in your terminal. The output should show that the response comes from deepseek-r1:1.5b.
(venv) tthurium@Tildes-MacBook-Pro ollama-python % python generate.py
CONFIG VALUE: AIConfig(enabled=True, model=<ldai.client.ModelConfig object at 0x102d08a50>, messages=[LDMessage(role='user', content='why is the sky blue?')], provider=None)
MODEL NAME: deepseek-r1:1.5b
model='deepseek-r1:1.5b' created_at='2025-03-25T22:40:23.572951Z' done=True done_reason='stop' total_duration=6717361959 load_duration=39487667 prompt_eval_count=9 prompt_eval_duration=546461916 eval_count=549 eval_duration=6130588667 response="<think>\n\n</think>\n\nThe color of the sky, known as ** APPARENT SHADOW**, is caused by a combination of factors in our atmosphere. Here are the key reasons why it appears blue:\n\n### 1. **Meteors and Turbulence**\n
...
Cool! Let’s try the other model. Back in the LaunchDarkly app, edit the default rule for the model-showdown AI Config to serve the qwen:1.8b variation. Save changes.
Rerun generate.py and you’ll see the response from qwen:1.8b:
(venv) tthurium@Tildes-MacBook-Pro ollama-python % python generate.py
CONFIG VALUE: AIConfig(enabled=True, model=<ldai.client.ModelConfig object at 0x1028d8a50>, messages=[LDMessage(role='user', content='Why is the sky blue?')], provider=None)
MODEL NAME: qwen:1.8b
model='qwen:1.8b' created_at='2025-03-25T22:36:31.093725Z' done=True done_reason='stop' total_duration=10908628833 load_duration=8974848583 prompt_eval_count=14 prompt_eval_duration=692508541 eval_count=121 eval_duration=1240502667 response="The blue color of the sky appears due to a phenomenon known as Rayleigh scattering. This process occurs when sunlight enters Earth's atmosphere, primarily consisting of nitrogen (N2) and oxygen (O2)
...
Some other queries you can try to test reasoning models’ capabilities:
- What is 456 plus 789?
- What is the color most closely matching this HEX representation: #8002c6 ?
Google maintains an awesome list of questions to evaluate reasoning models on GitHub. You’ll get the most applicable results if you stick to questions that are close to your use case.
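If you’d rather experiment with these prompts quickly, without editing the AI Config each time, here’s a minimal sketch that queries both local models directly through the ollama library. (This bypasses LaunchDarkly entirely, so no metrics are recorded; it’s just for eyeballing the answers.)
import ollama

prompts = [
    "What is 456 plus 789?",
    "What is the color most closely matching this HEX representation: #8002c6 ?",
]

for model in ("deepseek-r1:1.5b", "qwen:1.8b"):
    for prompt in prompts:
        result = ollama.generate(model=model, prompt=prompt)
        # print only the start of each answer to keep the output readable
        print(f"--- {model} | {prompt}\n{result.response[:300]}\n")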
Tracking Ollama metrics, Python style
Let’s implement a function to track metrics in our app. Below the generate function and before the __main__ block, add the following lines of code:
def track_metrics(response, tracker):
    tracker.track_success()
    token_usage = TokenUsage(
        output=response.eval_count,
        input=response.prompt_eval_count,
        total=response.eval_count + response.prompt_eval_count
    )
    tracker.track_tokens(token_usage)
    # Ollama reports durations in nanoseconds; convert to milliseconds for LaunchDarkly
    duration_in_milliseconds = response.total_duration / 1_000_000
    tracker.track_duration(duration_in_milliseconds)
Update the generate function to call the track_metrics function. While we’re at it, let’s wrap the generation call in a try block so we can track errors too. Your finished function should look like this:
def generate(**kwargs):
    context = Context.builder("user-123").kind("user").name("Sandy").build()
    default_config = AIConfig(
        enabled=True,
        model=ModelConfig(name="deepseek-r1:1.5b"),
        messages=[LDMessage(role="user", content="Why is the sky blue?")]
    )
    ai_config_key = "model-showdown"
    default_value = default_config
    config_value, tracker = ld_ai_client.config(
        ai_config_key,
        context,
        default_value,
        kwargs
    )
    model_name = config_value.model.name
    print("CONFIG VALUE: ", config_value)
    print("MODEL NAME: ", model_name)
    messages = [] if config_value.messages is None else config_value.messages
    try:
        prompt = messages[0].content
        response = ollama.generate(model=model_name, prompt=prompt)
        track_metrics(response, tracker)
        # metrics won't be sent until the client connection is closed
        ldclient.get().close()
        print(response)
        return response
    except Exception as e:
        print(e)
        tracker.track_error()
Important: you must close the connection to the LaunchDarkly client before metrics will be sent.
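If you plan to call generate more than once per run, closing the client inside the function becomes awkward. One alternative sketch (assuming you remove the ldclient.get().close() call from the try block) is to close the client once in the main guard, inside a finally, so metrics are flushed even when generation fails:
if __name__ == "__main__":
    try:
        generate()
    finally:
        # closing the client flushes any queued analytics events to LaunchDarkly
        ldclient.get().close()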
Wrapping it up: bring your own model, track your own metrics, take the next steps
In this tutorial you’ve learned how to run large language models locally with Ollama and query them from a Python application. Furthermore, you’ve created a custom model AI Config in LaunchDarkly that tracks metrics such as latency, token usage, and generation count.
There’s so much more we could do with AI configs on top of this foundation.
One upgrade would be to add additional metrics. For example, you could track output satisfaction and let users rate the quality of the response. If you are using LLMs in production, AI configs even support running A/B tests and other kinds of experiments to determine which variation performs the best for your use case using the power of statistics.
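To track output satisfaction, the Python AI SDK’s tracker exposes a feedback hook you could wire up to a thumbs-up / thumbs-down button. The exact API may differ from this sketch, so check the launchdarkly-server-sdk-ai docs; this assumes a track_feedback method and a FeedbackKind enum in ldai.tracker:
from ldai.tracker import FeedbackKind

# record whether the user liked the response served by this AI Config variation
tracker.track_feedback({"kind": FeedbackKind.Positive})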
AI configs also have advanced targeting capabilities. For example, you could use a more expensive model for potentially high-value customers with enterprise-y email addresses. Or you could give users a more linguistically localized experience by serving them a model trained in the language specified in their Accept-Language header.
If you want to learn more about runtime model management, here’s some further reading:
- Compare AI Models in Python Flask Applications — Using LaunchDarkly AI configs
- Upgrade OpenAI models in ExpressJS applications — using LaunchDarkly AI configs
Thanks so much for following along. Hit me up on Bluesky if you found this tutorial useful. You can also reach me via email (tthurium@launchdarkly.com) or LinkedIn.