Ollama

Ollama is an open-source tool for running AI language models on your own computer. After a one-time model download, your prompts, the model's responses, and any documents you feed in all stay on your device. Nothing is sent to a cloud server during normal use.

Why it matters for privacy

Most AI assistants work by sending your questions to a company's servers, which process them, store them, and in some cases use them to improve future models. That is a reasonable tradeoff for a free service, but it means your queries are being read, logged, and retained somewhere outside your control.

Ollama changes that arrangement. The model runs locally. When you type a prompt, it goes to a process on your own machine and the response comes back the same way. Ollama's own privacy policy states plainly that they do not collect, store, transmit, or have access to your prompts, responses, or model interactions from local use.

This is useful for anyone who wants to use AI tools without creating a record of what they asked. It is also useful for working with sensitive documents, such as medical notes, legal drafts, and private correspondence, without uploading them to a third party.

What Ollama helps with

Using AI without your queries being logged by a cloud provider
Working with sensitive documents locally without uploading them anywhere
Running AI in situations with limited or no internet access, after the initial model download
Avoiding accounts and the identity trail that comes with them
Experimenting with AI models without usage data being tied to your identity

What Ollama does not do

It does not make model downloads anonymous. When you run ollama pull llama3.1, the request goes to Ollama's servers. Ollama's privacy policy confirms they collect IP address, device metadata, and which models are downloaded. Only the inference itself is local.

It does not protect you from other software on your device. If other processes on your machine can read your filesystem or memory, they could in principle read your prompts and responses. Local does not mean isolated.

It does not replace a full privacy setup. Running a local AI model is one layer. If your OS, browser, or other applications are leaking data, Ollama does not address that.

It does not match frontier model quality at low hardware specs. Smaller local models are less capable than the largest cloud models. The quality gap narrows as hardware improves, but it exists.

The Ollama application itself collects some telemetry. Device information, request counts, and similar metadata are collected. This does not include prompt or response content, but it is worth knowing the app is not entirely silent.

Tradeoffs to be aware of

Running AI locally requires hardware capable of holding the model in memory. A 7-billion parameter model needs around 8GB of RAM to run comfortably. Larger models need more. A modern GPU accelerates inference significantly. CPU-only inference works but is noticeably slower for larger models.

Model quality scales roughly with size. A 7B model is capable but not as thorough as a 70B model, which in turn needs a workstation or server-grade GPU to run at reasonable speed. For most everyday queries, a 7B or 14B model on a machine with a capable GPU is a reasonable starting point.

Ollama runs models from its own library at ollama.com/library. The available models include Llama (Meta), Gemma (Google), Mistral, DeepSeek, Qwen (Alibaba), and dozens of others. All are open-weight, meaning the model files can be downloaded and inspected.

Practical guidance

Ollama is available for macOS, Windows, and Linux. Installation is straightforward.

On macOS and Linux, the install script is available at ollama.com. On Windows, there is a standard installer. On Linux, it also works via Docker if you prefer not to install it directly.

Once installed, getting a model running takes a single command: ollama run gemma3 downloads the model if you do not already have it, then opens an interactive session. The model stays on disk and does not need to be re-downloaded on future runs.
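If you would rather script a single question than sit in an interactive session, the same command accepts a prompt as an extra argument. The sketch below wraps that in Python. The function names are made up for illustration, and it assumes the ollama executable is on your PATH and the model name is one you have access to.

```python
import shutil
import subprocess


def ollama_run_command(model: str, prompt: str) -> list[str]:
    """Build the argument list for a one-shot, non-interactive `ollama run`."""
    return ["ollama", "run", model, prompt]


def ask_once(model: str, prompt: str) -> str:
    """Run the model once and return whatever it printed to stdout."""
    # Fail early with a clear message if Ollama is not installed.
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama executable not found on PATH")
    result = subprocess.run(
        ollama_run_command(model, prompt),
        capture_output=True,
        text=True,
        check=True,  # raise if ollama exits with an error
    )
    return result.stdout.strip()
```

Calling ask_once("gemma3", "Summarise this in one line: ...") would trigger the download on first use, exactly as the interactive command does, so the first call can take a while.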

For most people starting out, a mid-size model in the 7B to 14B range is the practical entry point. These run acceptably on machines with a modern GPU and 16GB of RAM. Smaller models like 3B work on machines with less RAM but with a corresponding reduction in response quality.

Ollama also exposes a local REST API at http://localhost:11434, which means it can be used with compatible front-end apps, code editors, and tools that speak the OpenAI API format. Several open-source chat interfaces work with it out of the box.
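As a rough illustration of talking to that local API directly, the sketch below posts a prompt to the /api/generate endpoint using only the Python standard library. It assumes Ollama is running on the default port and that the model named in the call has already been pulled. The helper names are illustrative, not part of any Ollama library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address from the text above


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST to /api/generate asking for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read())
    return body["response"]
```

With the server running, generate("gemma3", "Why is the sky blue?") returns the reply as a string. The request never leaves localhost, which is the property the rest of this page is about.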

Going deeper

Model format. Ollama uses GGUF format model files, a standard originally developed for llama.cpp. GGUF files are self-contained: the model weights, tokeniser, and configuration are all in one file. You can use models from sources other than Ollama's library by importing GGUF files directly.

Quantisation. Large models are often distributed in quantised form, meaning the weights are stored at lower precision to reduce file size and memory requirements. A quantised 7B model might need 4GB of RAM rather than 14GB. This trades some accuracy for practicality. Models in Ollama's library are typically published already quantised, so pulling one gives you the smaller version without any extra steps.
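The 14GB and 4GB figures follow from simple arithmetic: parameter count times bits per weight. The sketch below is a back-of-envelope estimate of the weights alone. It deliberately ignores the extra memory used for context and runtime overhead, and real 4-bit formats store slightly more than 4 bits per weight, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare a 7B model at full 16-bit precision with 8-bit and 4-bit quantisation.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

At 16 bits the weights come to roughly 14 GB, and at 4 bits roughly 3.5 GB, which rounds up to the 4GB figure once overhead is included.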

Modelfiles. Ollama supports a Modelfile format that lets you define a custom model configuration, including a system prompt, temperature, and base model. This is useful for creating a persistent local assistant with a specific role or set of instructions, without modifying the underlying model weights.
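To make this concrete, the sketch below generates a small Modelfile from Python using the FROM, PARAMETER, and SYSTEM instructions. The base model, system prompt, and temperature are placeholder values chosen for illustration, and the helper function is not part of Ollama.

```python
def render_modelfile(base_model: str, system_prompt: str, temperature: float) -> str:
    """Return Modelfile text with a base model, one sampling parameter, and a system prompt."""
    return (
        f"FROM {base_model}\n"
        f"PARAMETER temperature {temperature}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )


# Placeholder values: swap in your own model and instructions.
modelfile_text = render_modelfile(
    base_model="gemma3",
    system_prompt="You are a concise assistant for reviewing private notes.",
    temperature=0.3,
)
print(modelfile_text)
```

Saving the output as a file named Modelfile and running ollama create notes-helper -f Modelfile (where notes-helper is any name you choose) registers the configuration, after which ollama run notes-helper starts a session with those settings applied.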

No network traffic during inference. The Ollama API documentation confirms that inference, embeddings, and completions all run on your local server. The only documented external calls are model pulls, pushes, and standard web telemetry on the Ollama application itself. There is no mechanism for prompts to be transmitted externally during a local session.

Open source. Ollama's server, CLI, and official client libraries are published on GitHub under the ollama organisation. The codebase can be audited and built from source.

Foldy tip

Your prompts stay on your machine. That is the whole point, and it is worth taking seriously.

Threat modeling: helps clarify whether local AI is actually the right tool for your situation
Metadata: what data travels alongside even local tool use
Tails: an amnesic OS that leaves no trace, relevant if you want no local record of AI sessions
Qubes OS: compartmentalisation by VM, useful for isolating AI workloads from other activity