Artificial Intelligence · Machine Learning · Development Tools

Running LLMs Locally with Ollama: A Practical Guide

by Daniil Lypenets

Full Stack Developer

May 2026 · 14 min read

Introduction

For most of the past three years, "using AI" meant one thing — sending your data to someone else's servers and waiting for a response. That model worked for plenty of use cases, but it failed completely for others. Sensitive documents that could not legally leave the building. Offline environments where the cloud was simply not available. Latency-critical workflows where every network hop hurt. And the ever-present per-token bill that scaled with usage instead of value.

In 2026 the math changed. Consumer hardware became powerful enough, quantization techniques matured, and a generation of small-but-capable open models hit the scene. Suddenly running a capable model on your own laptop was not a hobby project; it was a viable production strategy.

The tool most teams reach for is Ollama. It is the cleanest entry point into local inference: one install command, one pull command, one chat command. This article walks through what makes local LLMs practical now, how to set up Ollama end to end, and where local inference belongs in your stack.

Why Local LLMs Got Practical

Three things changed at roughly the same time, and they reinforced each other.

Models got smaller without getting worse. A 7-billion-parameter model from 2026 routinely beats a 70-billion-parameter model from 2023 on most benchmarks. The architectures got better, the training data got cleaner, and Mixture-of-Experts designs let you activate only a fraction of the weights per token.

Quantization stopped being costly. Modern 4-bit quantization (Q4_K_M and friends) is still lossy, but it retains roughly 95% of a model's quality while shrinking the memory footprint to about a quarter of the 16-bit original. That difference is why a model that used to require an A100 now runs on a MacBook Air.
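The factor-of-four figure follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: the helper name `weights_gb` is hypothetical, it counts the weights only (no KV cache or runtime overhead), and it assumes exactly 4 bits per weight, whereas real Q4_K_M files average somewhat more.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16 = weights_gb(7, 16)  # 7B model stored as 16-bit weights
q4 = weights_gb(7, 4)     # same model at an idealized 4 bits per weight

print(f"7B at 16-bit: {fp16:.1f} GB")   # 14.0 GB
print(f"7B at 4-bit:  {q4:.1f} GB")     # 3.5 GB
print(f"Reduction:    {fp16 / q4:.0f}x")  # 4x
```

Under these assumptions a 7B model drops from about 14 GB to about 3.5 GB, which is the difference between needing a data-center GPU and fitting in laptop memory.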

Consumer hardware caught up. Apple Silicon, modern Snapdragon X chips, and consumer GPUs with 12+ GB of VRAM removed the hardware ceiling for everything below the frontier tier.

The result is a practical floor that most teams underestimate. A 7B model running locally on a five-year-old laptop is sufficient for:

Summarization and rewriting;
Document Q&A over a known corpus;
Code completion and refactoring suggestions;
Most structured-extraction tasks;
Privacy-sensitive pipelines that must never touch a cloud API.

If your use case does not specifically need GPT-5 or Claude Opus, you are paying for capability you are not using.

That observation is moving teams toward local inference faster than the hype predicted.

Installing Ollama in Five Minutes

Ollama is genuinely simple to install. On macOS or Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, the official installer is a standard .exe. After installation, Ollama runs as a background service exposing a local HTTP API on port 11434.

Pull a model:

ollama pull llama3.2

That command downloads a quantized version of the model and stores it locally. The first pull takes a few minutes; later runs reuse the local copy and skip the download.

Chat with it:

ollama run llama3.2

You now have a working LLM running entirely on your machine. No API keys. No usage limits. No network calls.
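To confirm that the background service on port 11434 is actually up, you can ask it which models are installed. This is a minimal sketch using only the standard library and Ollama's native `/api/tags` endpoint; the helper names `model_names` and `list_local_models` are made up for illustration, and the 2-second timeout is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"  # default address of the local service


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL,
                      timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable on port 11434")
    else:
        print(f"Installed models: {models}")
```

If the pull succeeded, the list should include an entry for llama3.2.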

Picking the Right Model

The Ollama model library is large enough to be confusing. A few rules of thumb cut through it:

For general chat and writing, start with llama3.2 or qwen3. Both are well-rounded and run on modest hardware;
For coding, use codellama, deepseek-coder, or qwen3-coder. They are trained specifically on code, and several of them come in small sizes;
For reasoning-heavy tasks, look at deepseek-r1 or any model with -thinking in the name. These are slower but produce stronger structured outputs;
For multimodal (image + text), use llama3.2-vision or qwen2.5vl. Capabilities here are improving fastest;
For tiny hardware, phi4-mini is the floor: it fits in under 4 GB of RAM and still produces usable output.

Model size matters most for memory:

3B models need roughly 4 GB of RAM;
7–8B models need roughly 8 GB;
13–14B models need roughly 12 GB;
30B+ models start needing a dedicated GPU or Apple Silicon with 32+ GB of unified memory.

Pick the smallest model that solves your problem. The temptation is always to grab the biggest one, and the result is usually a slow, frustrating experience for tasks a smaller model would have handled at 30 tokens per second.
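The memory rules of thumb above can be turned into a quick upper bound on what is worth trying. This is only a sketch: the table `RAM_GUIDE` copies the approximate figures from the list, the function name `largest_size_class` is hypothetical, and "free RAM" is assumed to mean memory actually available to the model, not the machine's total.

```python
from typing import Optional

# (size class in billions of parameters, approximate RAM needed in GB),
# copied from the rules of thumb above.
RAM_GUIDE = [(3, 4), (8, 8), (14, 12)]


def largest_size_class(free_ram_gb: float) -> Optional[int]:
    """Largest size class (billions of parameters) that fits, or None."""
    fitting = [size for size, ram in RAM_GUIDE if ram <= free_ram_gb]
    return max(fitting) if fitting else None


for ram in (4, 8, 16):
    print(f"{ram} GB free -> up to {largest_size_class(ram)}B")
# 4 GB free -> up to 3B
# 8 GB free -> up to 8B
# 16 GB free -> up to 14B
```

The result is a ceiling, not a recommendation. The smallest model that solves the task is still the better pick.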

Building Apps Against a Local Model

Ollama exposes an HTTP API that is intentionally close to OpenAI's. If you have existing code that calls https://api.openai.com/v1/chat/completions, swapping it to http://localhost:11434/v1/chat/completions is often a one-line change.

A minimal Python example:

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any non-empty string works
)
 
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Explain quantization in one paragraph."}
    ]
)
 
print(response.choices[0].message.content)

That is the entire bridge. Streaming, tool calling, and structured outputs largely work with the same code patterns you already use, though support can vary by model.

For applications that need lower-level control, Ollama also exposes its native REST API. The native API supports more advanced features like custom context-window settings, raw token access, and embedding generation.
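As an illustration of the native API, the sketch below builds a request for the `/api/chat` endpoint with a custom context window (`options.num_ctx`) and a `keep_alive` hint. It uses only the standard library. The helper names, the 8192-token context, and the 120-second timeout are assumptions for the example, not recommended settings.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "llama3.2",
                       num_ctx: int = 8192, keep_alive: str = "5m") -> dict:
    """Build a non-streaming request body for Ollama's native /api/chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
        "keep_alive": keep_alive,         # how long the model stays loaded
    }


def chat(prompt: str, base_url: str = "http://localhost:11434", **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running and the model pulled, `print(chat("Explain quantization in one paragraph."))` would mirror the OpenAI-style example, with the extra options applied.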

The Hardware Question

The question almost everyone asks first is "do I need a GPU?" The honest answer is: it depends on the model size and the latency you can tolerate.

Hardware	Best Model Size	Tokens/sec	Use Case
8 GB RAM laptop	3B (Phi-4-mini)	15–20	Light tasks, learning, demos
16 GB RAM laptop	7–8B (e.g. Llama 3.1 8B)	8–15	Daily development, prototyping
Apple Silicon, 32 GB	14B or smaller	20–40	Production-quality local workflows
Apple Silicon, 64 GB+	30B+ MoE	30–60	Near-frontier quality, fully local
RTX 4090 (24 GB VRAM)	14B or smaller	60–100	High-throughput inference

The cliff is roughly at 7B parameters. Below it, almost anything runs comfortably. Above it, you need to think about hardware.

Comfortable reading speed is around 5–8 tokens per second. Anything above 15 tok/s feels real-time for interactive use.

That threshold is the right one to optimize for. There is no point in running a 70B model at 2 tokens per second when a 14B at 30 tokens per second produces good-enough results.
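The trade-off is easy to see as wait time. The sketch below assumes a 300-token answer (an arbitrary illustrative length) and ignores prompt processing, so it understates real latency for long prompts.

```python
def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating the answer, ignoring prompt processing."""
    return output_tokens / tokens_per_second


ANSWER_TOKENS = 300  # assumed answer length, for illustration only

for label, rate in [("70B at 2 tok/s", 2), ("14B at 30 tok/s", 30)]:
    wait = seconds_to_generate(ANSWER_TOKENS, rate)
    print(f"{label}: {wait:.0f} s")
# 70B at 2 tok/s: 150 s
# 14B at 30 tok/s: 10 s
```

A two-and-a-half-minute wait against ten seconds is usually a bigger usability gap than the quality difference between the two models.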

Production Patterns

A few patterns separate hobby use from production use:

Run Ollama as a service behind a reverse proxy. Do not expose port 11434 directly to the internet without authentication;
Use a separate model per workload. Keep coding workloads on a code model and chat workloads on a chat model — the cost of switching is one API parameter;
Cache aggressively. Local inference is cheap but not free. If you are summarizing the same document twice, cache the output;
Monitor memory pressure. Ollama keeps models in memory after first use. On constrained machines, configure OLLAMA_KEEP_ALIVE to evict idle models;
Use embeddings locally too. Ollama serves embedding models with the same simple API. Local RAG pipelines are entirely viable without ever touching a cloud vector database.

For team setups, run Ollama on a shared server with a GPU and treat it like an internal API. That gives you central management of model versions, shared caching, and one source of truth for what's deployed.
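The caching advice above can be sketched as a thin wrapper around whatever function calls the model. Everything here is illustrative: `CachedModel` and `fake_generate` are hypothetical names, the cache is an in-memory dict, and `fake_generate` stands in for a real call to Ollama so the example runs without a server.

```python
import hashlib
from typing import Callable, Dict


class CachedModel:
    """Wrap a generate(model, prompt) callable with an in-memory cache."""

    def __init__(self, generate: Callable[[str, str], str]) -> None:
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the model was actually called

    def __call__(self, model: str, prompt: str) -> str:
        # Key on both model and prompt so different models never share entries.
        key = hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._generate(model, prompt)
        return self._cache[key]


def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for a real inference call, used only for this demo."""
    return f"[{model}] summary of {len(prompt)} characters"


summarize = CachedModel(fake_generate)
first = summarize("llama3.2", "long document text")
second = summarize("llama3.2", "long document text")
print(first == second, summarize.misses)  # True 1
```

This only makes sense when repeating the same answer is acceptable. For a shared server, the dict would be swapped for a store that all clients can reach.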

When to Stick With the Cloud

Local inference is not the right answer for every use case. Stay with cloud APIs when:

You need the absolute best available model (GPT-5, Claude Opus, Gemini Ultra);
Your workload has long, bursty spikes that would idle expensive hardware;
Your team does not want to operate model infrastructure;
You need built-in tools like web search or code execution that cloud providers bundle in;
The model is the entire product and quality is a competitive moat.

The pragmatic 2026 architecture is hybrid. Use local models for the 80% of workloads where they are sufficient, and route the remaining 20% (the genuinely hard reasoning, the latest capabilities, the edge cases) to the cloud. Most production teams running cost-optimized AI are doing exactly this.
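A hybrid setup needs some rule for deciding where each request goes. The sketch below is one possible shape for that rule, not a prescribed design: the `Task` fields, the `LOCAL_KINDS` set, and the `route` function are all hypothetical, and the task kinds are taken from the list of workloads this article calls suitable for a local 7B model.

```python
from dataclasses import dataclass

# Workloads the article lists as a good fit for a local model.
LOCAL_KINDS = {"summarize", "rewrite", "doc_qa", "code_completion", "extraction"}


@dataclass
class Task:
    kind: str
    sensitive: bool = False       # data must never leave the machine
    needs_frontier: bool = False  # requires the best available model


def route(task: Task) -> str:
    """Return 'local' or 'cloud' for a task."""
    if task.sensitive:
        return "local"  # privacy outranks model quality
    if task.needs_frontier or task.kind not in LOCAL_KINDS:
        return "cloud"
    return "local"


print(route(Task("summarize")))                       # local
print(route(Task("research", needs_frontier=True)))   # cloud
print(route(Task("research", sensitive=True)))        # local
```

Putting the privacy check first is a deliberate choice in this sketch: a sensitive task stays local even if a cloud model would answer it better.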

Conclusion

Local LLMs are not a niche anymore. The combination of better models, smarter quantization, and faster hardware turned what was a research curiosity into a viable production strategy in well under three years.

For developers, the practical takeaway is simple. Install Ollama. Pull a 7B model. Run a real task through it. The first time you watch a capable assistant generate fluent output on a machine that makes no network calls for inference, your sense of where AI has to run shifts.

Cloud AI is no longer the default — it is one option among several.

Once you internalize that, the architecture decisions get sharper and the bills get smaller.

FAQ

Q: Do I need an internet connection to use Ollama?

A: Only for the initial model download. After that, Ollama runs entirely offline. This is what makes it suitable for air-gapped environments and privacy-sensitive work.

Q: How does Ollama compare to LM Studio or llama.cpp?

A: Ollama is the most opinionated and easiest to start with. LM Studio offers a richer GUI for model comparison. llama.cpp is the underlying engine that powers most of them and gives you maximum control at the cost of more setup work.

Q: Can I run Ollama on a Raspberry Pi?

A: Yes, but only with the smallest models (around 1–3B parameters). Performance is usable for simple tasks but will not match a modern laptop.

Q: Is local inference actually cheaper than the cloud?

A: Past a certain volume, yes — sometimes dramatically. The break-even point depends on your hardware cost and per-request volume, but teams running thousands of inferences a day on local hardware typically see costs an order of magnitude lower than API equivalents.
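A rough way to find your own break-even point is to divide the hardware cost by the daily saving. All numbers in the sketch below are made-up placeholders, and the function name `break_even_days` is hypothetical; plug in your own hardware price, running cost, and API pricing.

```python
from typing import Optional


def break_even_days(hardware_cost: float, local_cost_per_day: float,
                    requests_per_day: float,
                    api_cost_per_request: float) -> Optional[float]:
    """Days until local hardware pays for itself, or None if it never does."""
    daily_saving = requests_per_day * api_cost_per_request - local_cost_per_day
    if daily_saving <= 0:
        return None
    return hardware_cost / daily_saving


# Placeholder figures: a 2000 machine, 1 per day to run it,
# 5000 requests per day that would cost 0.01 each through an API.
days = break_even_days(2000, 1, 5000, 0.01)
print(f"Break-even after about {days:.0f} days")
```

At low volume the function returns None, which matches the "past a certain volume" caveat: below that point the API is simply cheaper.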

Q: Can Ollama models call tools and use function calling?

A: Yes. Tool calling support has matured substantially and works with the same OpenAI-compatible patterns. Quality varies by model — Llama 3.2 and Qwen3 are currently the most reliable for tool use.

Q: What about fine-tuning? Can I fine-tune models in Ollama?

A: Ollama itself does not handle training. You fine-tune models with tools like Unsloth, Axolotl, or Hugging Face TRL, then convert the result to GGUF and import it into Ollama for inference.

Q: Is local AI a security risk?

A: It is generally safer than cloud AI for data privacy, since data never leaves the machine. But you still need to authenticate the API if you expose it on a network, scan models for any known security issues, and treat the model file itself like any other binary you would not run blindly.
