Free Setup Guide

Foundry Local: run Phi on your machine for fast offline document Q&A

Shipping document chat normally means uploading PDFs to a cloud bot and paying per message. That is slow to iterate and risky for private files.

Adam Burge, Nano Flow

Foundry Local is the on-device path. You run a small Phi model next to your app, index your docs locally, and answer questions without sending data out. I set this up with Ollama (a local LLM runner) and Phi-3 mini, plus a tiny API for your app. You test features fast, keep files on disk, and skip cloud bills while you prototype.

You will set up a local Phi-3 model, verify it can chat, then wire a tiny API that answers questions over your PDFs. Everything runs on your laptop. No uploads.

First, install Ollama (a local model runner) so you can run Phi-3 with one command. On macOS:

bash
brew install ollama

Start the server, pull a small Phi model, and a tiny embedding model for indexing:

bash
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text

Verify the model answers locally. This runs the chat in your terminal, no internet needed:

bash
ollama run phi3:mini
> You are a helpful assistant.
> What is 2+2?

If you see "4" in the reply, your local LLM is live. Next you will stand up a small FastAPI server that exposes /ask to your app, chunk PDFs, embed with nomic-embed-text, and retrieve top passages for Phi-3. Stop here if you just wanted a quick sanity check.
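The "retrieve top passages" step is easier to reason about once you have seen it in isolation. The sketch below is a dependency-free illustration, not the service code: it ranks passages by cosine similarity, which is the same idea as searching an inner-product index over normalized vectors. The 3-dimensional vectors and the helper names `normalize` and `top_k` are made up for the example; real embeddings from nomic-embed-text have many more dimensions.

```python
import math
from typing import List, Tuple


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query: List[float], passages: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (passage_index, cosine_score) pairs, best match first."""
    q = normalize(query)
    scored = []
    for i, p in enumerate(passages):
        # Dot product of two unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(p)))
        scored.append((i, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings", invented for illustration only.
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
hits = top_k([1.0, 0.0, 0.0], passages, k=2)
print([i for i, _ in hits])  # [0, 2]
```

Passage 1 points in an unrelated direction, so it drops out of the top two. In the full setup, the text behind the winning indices is what gets pasted into the prompt for Phi-3.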


Setup

Pick your platform and get Ollama running. Then pull the Phi and embedding models.

macOS (Apple Silicon or Intel):

bash
brew install ollama
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text

Linux (Debian, Ubuntu, Fedora, Arch):

bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull phi3:mini
ollama pull nomic-embed-text

Windows 11:

bash
winget install Ollama.Ollama
# Open a new terminal after install
ollama serve
# In another terminal
ollama pull phi3:mini
ollama pull nomic-embed-text

Now create a minimal local RAG service. It indexes a docs/ folder, then answers questions via HTTP.

Install Python deps in a virtualenv:

bash
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Quote the extras so zsh does not treat the brackets as a glob pattern
pip install fastapi "uvicorn[standard]" faiss-cpu pypdf requests

Create main.py:

py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import os, glob
from pypdf import PdfReader
import requests
import faiss
import numpy as np

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
GEN_MODEL = "phi3:mini"
INDEX_PATH = "index.faiss"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120
DOCS_DIR = "docs"

app = FastAPI()

# Simple in-memory store for chunks
docs_chunks: List[str] = []
index = None

def embed(texts: List[str]) -> np.ndarray:
    # One request per text; returns a float32 matrix of shape (len(texts), dim)
    vecs = []
    for t in texts:
        r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": EMBED_MODEL, "prompt": t})
        r.raise_for_status()
        vecs.append(r.json()["embedding"])
    return np.array(vecs).astype("float32")

def load_pdfs(dirpath: str) -> List[str]:
    # Reads PDFs and .txt files, then splits each into overlapping character chunks
    chunks = []
    for path in glob.glob(os.path.join(dirpath, "**", "*.pdf"), recursive=True):
        reader = PdfReader(path)
        buf = []
        for page in reader.pages:
            buf.append(page.extract_text() or "")
        text = "\n".join(buf)
        i = 0
        while i < len(text):
            chunks.append(text[i:i+CHUNK_SIZE])
            i += CHUNK_SIZE - CHUNK_OVERLAP
    for path in glob.glob(os.path.join(dirpath, "**", "*.txt"), recursive=True):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            t = f.read()
        i = 0
        while i < len(t):
            chunks.append(t[i:i+CHUNK_SIZE])
            i += CHUNK_SIZE - CHUNK_OVERLAP
    return [c.strip() for c in chunks if c.strip()]

def build_index():
    global docs_chunks, index
    docs_chunks = load_pdfs(DOCS_DIR)
    if not docs_chunks:
        raise RuntimeError("No docs found. Put PDFs or .txt files under ./docs")
    vecs = embed(docs_chunks)
    dim = vecs.shape[1]
    index = faiss.IndexFlatIP(dim)
    # Normalize so inner product equals cosine similarity
    faiss.normalize_L2(vecs)
    index.add(vecs)
    faiss.write_index(index, INDEX_PATH)

if os.path.exists(INDEX_PATH):
    index = faiss.read_index(INDEX_PATH)
    docs_chunks = load_pdfs(DOCS_DIR)
    # Rebuild if the docs changed since the index was written
    if index.ntotal != len(docs_chunks):
        build_index()
else:
    os.makedirs(DOCS_DIR, exist_ok=True)
    build_index()

class AskBody(BaseModel):
    question: str
    k: int = 4

@app.post("/ask")
def ask(body: AskBody):
    qv = embed([body.question])
    faiss.normalize_L2(qv)
    D, I = index.search(qv, body.k)
    # FAISS pads with -1 when fewer than k vectors exist, so filter those out
    ctx = "\n\n".join([docs_chunks[i] for i in I[0] if 0 <= i < len(docs_chunks)])
    prompt = f"Use the context to answer. If unknown, say you do not know.\n\nContext:\n{ctx}\n\nQuestion: {body.question}"
    r = requests.post(f"{OLLAMA}/api/chat", json={
        "model": GEN_MODEL,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        # /api/chat streams by default; request a single JSON response instead
        "stream": False,
        "options": {"num_ctx": 4096}
    })
    r.raise_for_status()
    data = r.json()
    content = data.get("message", {}).get("content", "")
    return {"answer": content.strip()}
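The chunking loop inside load_pdfs is the part most people end up tuning, so here it is pulled out as a standalone sketch. The helper name `chunk_text` is hypothetical and not part of main.py; the defaults mirror CHUNK_SIZE and CHUNK_OVERLAP above, and the guard against a non-positive step is an addition of mine.

```python
from typing import List


def chunk_text(text: str, size: int = 800, overlap: int = 120) -> List[str]:
    """Split text into character windows of `size` that share `overlap` characters."""
    if not 0 <= overlap < size:
        # Without this check, size - overlap <= 0 would never advance.
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = [text[i:i + size] for i in range(0, len(text), step)]
    return [c.strip() for c in chunks if c.strip()]


# 2,000 characters of digits stand in for extracted document text.
sample = "".join(str(i % 10) for i in range(2000))
parts = chunk_text(sample)
print(len(parts), [len(p) for p in parts])  # 3 [800, 800, 640]
```

Each window starts 680 characters after the previous one, so the last 120 characters of one chunk reappear at the start of the next. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.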

Create a docs/ folder with a couple of PDFs or .txt files, then run the API:

bash
mkdir -p docs
uvicorn main:app --reload --port 8000

Test it from another terminal:

bash
curl -s -X POST localhost:8000/ask \
-H 'content-type: application/json' \
-d '{"question": "Summarize the key points in our doc"}' | jq

Verify

Quick model sanity check:

bash
ollama run phi3:mini <<'EOF'
You are a helpful assistant.
What is the capital of France?
EOF

Expected: a short text reply that includes "Paris". If the model streams tokens in your terminal, the local runtime is working.

API sanity check:

bash
curl -s localhost:8000/ask \
-H 'content-type: application/json' \
-d '{"question": "What does the first page say?"}'

Expected JSON with an "answer" field that mentions content from your docs.
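If your app is written in Python, the same check can be made without curl. This is a minimal standard-library sketch, assuming the API is listening on port 8000 as started earlier; `build_ask_request` and `ask` are hypothetical helper names, and the 120-second timeout is a guess that leaves room for a slow first generation.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # assumes the uvicorn command from this guide


def build_ask_request(question: str, k: int = 4, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request that /ask expects: a JSON body with question and k."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, k: int = 4, timeout: float = 120.0) -> str:
    """Send the question to the local API and return the "answer" field."""
    request = build_ask_request(question, k)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["answer"]
```

With the API running, `print(ask("What does the first page say?"))` should print the same text that the curl call returns in its "answer" field.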

Configuration tips

  • Choose model size. For speed on laptops, use phi3:mini. For better recall on longer questions, try phi3:medium and set options.num_ctx to 4096 or higher.
  • Control context window. Set num_ctx in the chat payload to fit your chunking strategy. Keep chunk size plus question under the limit.
  • Speed up retrieval. Precompute and persist FAISS once. Rebuild only when docs change. You already write index.faiss to disk.
  • GPU acceleration. On Apple Silicon, Ollama uses Metal by default. On NVIDIA, install a current NVIDIA driver; Ollama should then detect the GPU on its own when the server starts.
  • Caching. Wrap embed() with a tiny sqlite or disk cache keyed by sha256(text). This avoids re-embedding unchanged chunks.
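The caching tip can be sketched as a small wrapper around any function that embeds a single string. Everything here is illustrative: `make_cached_embed`, the `cache` table, and the default file name are hypothetical, and `fake_embed` stands in for a real call to the embedding model so the example runs without Ollama.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


def make_cached_embed(
    embed_one: Callable[[str], List[float]],
    db_path: str = "embed_cache.sqlite",
) -> Callable[[str], List[float]]:
    """Wrap embed_one with a sqlite cache keyed by sha256(text)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT NOT NULL)")

    def cached(text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        row = conn.execute("SELECT vec FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: skip the model call
        vec = embed_one(text)
        conn.execute(
            "INSERT OR REPLACE INTO cache (key, vec) VALUES (?, ?)",
            (key, json.dumps(vec)),
        )
        conn.commit()
        return vec

    return cached


# Demo with a fake embedder and an in-memory database.
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text)), 1.0]


cached_embed = make_cached_embed(fake_embed, db_path=":memory:")
first = cached_embed("hello")
second = cached_embed("hello")
print(first == second, len(calls))  # True 1
```

Because the key covers only the text, clear the cache (or add the model name to the hashed string) whenever you switch embedding models; otherwise old vectors would be served for the new model.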

Troubleshooting

  • Connection refused at http://localhost:11434. Fix: start the server with ollama serve and keep it running. Then retry pulls and API calls.
  • Model not found error in chat or embeddings. Fix: ollama pull phi3:mini and ollama pull nomic-embed-text. Restart your API after new pulls.
  • Embedding dimension mismatch in FAISS. Fix: delete index.faiss and rebuild after switching embedding models to keep vector dims consistent.
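For the connection-refused case, a quick port probe tells you whether anything is listening before you dig further. This is a standard-library sketch; `port_open` is a hypothetical helper, and it only confirms that the TCP port accepts connections, not that the process behind it is Ollama.

```python
import socket


def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if port_open():
    print("Something is listening on 11434")
else:
    print("Nothing listening on 11434; start the server with `ollama serve`")
```

Calling this once at API startup can turn an opaque stack trace from requests into a one-line hint.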

When it beats cloud chat tools

  • Prototyping doc chat inside your app. You change prompts or chunking and get instant feedback, with no queue time or per-message costs.
  • Handling private support docs during early testing. You keep files on disk and avoid accidental uploads while you tune retrieval.
  • Shipping offline helpers. Field teams with spotty internet still get answers over local manuals and notes.
  • Cost control for long docs. Embeddings run locally, so you test with 100 to 1,000 pages without spiking a cloud bill.

© 2026 Nano Flow. All rights reserved.