Running LLMs locally saves money. It keeps your data private. But you're stuck on one machine. ngrok fixes that.
I use ngrok to expose my local models to the internet. Now I can test from my phone. Share with teammates. Demo without shipping my laptop. I've been doing this daily for months, and it's become a core part of how I develop AI-powered features.
The Problem
Local LLMs run on localhost. That means:
- No mobile testing
- No sharing with the team
- No remote access from a coffee shop or second machine
- No webhook integrations (Slack bots, Telegram bots, etc.)
- Deployment is overkill when you just want to test a prompt
You could deploy to a cloud GPU, but that costs real money for experimentation. You could set up a VPN, but that's heavy infrastructure for a quick demo. Whether you're running a RAG pipeline or a custom fine-tuned model, remote access to local inference is essential.
ngrok creates a secure tunnel. It gives you a public HTTPS URL pointing to your local port. Takes 30 seconds to set up. No firewall rules, no DNS configuration, no cloud bills.
Quick Setup
Step 1: Run Your LLM Server
You need a local LLM server running before ngrok has anything to tunnel to. Here are the three most common options.
Ollama (recommended for beginners):
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model (llama3 is a good default)
ollama pull llama3
# Start the server (runs on port 11434)
ollama serve
Verify it works locally before adding ngrok:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3", "prompt": "Say hello", "stream": false}'LM Studio:
Download from lmstudio.ai. Load a model from the built-in browser. Go to the "Local Server" tab and click Start. Default port is 1234. LM Studio exposes an OpenAI-compatible API, which makes it easy to swap into existing code.
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Say hello"}]
}'
llama.cpp (for maximum control):
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Start the server with a GGUF model
./llama-server -m models/llama-3-8b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 35
llama.cpp also exposes an OpenAI-compatible endpoint at /v1/chat/completions, plus its own native API at /completion.
Step 2: Install ngrok
# macOS
brew install ngrok
# Ubuntu/Debian
curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
| sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
&& echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
| sudo tee /etc/apt/sources.list.d/ngrok.list \
&& sudo apt update && sudo apt install ngrok
# Or download directly from ngrok.com/download for any platform
Step 3: Authenticate
Sign up at ngrok.com and grab your auth token from the dashboard. Run:
ngrok config add-authtoken YOUR_TOKEN
This saves the token to your local config file. Skip this step and every tunnel attempt will fail with an authentication error.
Step 4: Create the Tunnel
# For Ollama (port 11434)
ngrok http 11434
# For LM Studio (port 1234)
ngrok http 1234
# For llama.cpp (port 8080)
ngrok http 8080
You get output like:
Session Status online
Account alex@example.com (Plan: Free)
Forwarding https://abc123.ngrok-free.app -> http://localhost:11434
Connections ttl opn rt1 rt5 p50 p90
0 0 0.00 0.00 0.00 0.00
That HTTPS URL is your public endpoint. Copy it.
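If you'd rather grab the URL from a script than copy it by hand, the ngrok agent also exposes a small local API on port 4040 while a tunnel is running. A quick Python sketch, assuming the /api/tunnels endpoint shape from the agent docs (adjust the field names if your agent version differs):
import requests
# Ask the local ngrok agent which tunnels are currently open
resp = requests.get("http://localhost:4040/api/tunnels", timeout=5)
resp.raise_for_status()
# Each tunnel entry carries its public_url; pick the HTTPS one
urls = [t["public_url"] for t in resp.json().get("tunnels", []) if t["public_url"].startswith("https")]
print(urls[0] if urls else "No HTTPS tunnel found")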
Step 5: Test It
For Ollama:
curl https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'For LM Studio or llama.cpp (OpenAI-compatible):
curl https://abc123.ngrok-free.app/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Hello"}]
}'
If you see a JSON response with generated text, you're done.
Security Considerations
This is important. You're exposing a GPU-heavy service to the public internet.
Always add authentication. Without it, anyone who discovers your URL can send unlimited requests and peg your GPU at 100%.
ngrok http 11434 --basic-auth="user:strongpassword123"
Clients then need to include the auth header:
curl -u user:strongpassword123 \
https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'Restrict by IP if possible. On paid ngrok plans, you can whitelist specific IP ranges:
ngrok http 11434 --cidr-allow="203.0.113.0/24"
Monitor traffic in real time. ngrok provides a local dashboard at http://localhost:4040. Every request is logged with full headers and body. This is invaluable for debugging and spotting unauthorized access.
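You can also pull that request log from a script instead of the browser. A rough sketch against the agent's local inspection API, assuming the /api/requests/http endpoint and field names from the agent docs (adjust if your ngrok version differs):
import requests
# Fetch recently captured requests from the local ngrok agent
resp = requests.get("http://localhost:4040/api/requests/http", timeout=5)
resp.raise_for_status()
for item in resp.json().get("requests", []):
    req = item.get("request", {})
    res = item.get("response", {})
    # Method, path, and status make unexpected callers easy to spot
    print(req.get("method"), req.get("uri"), "->", res.get("status_code"))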
Watch system resources. Every inference request burns GPU/CPU and memory. Run htop or nvidia-smi -l 1 in another terminal so you can see when something hammers your machine.
Understand free tier limits. Free tunnels get a randomly generated URL that changes every time you restart ngrok. Sessions also have a limited duration. The paid plan ($8/month) gives you a stable custom domain and persistent connections.
Never leave a tunnel running unattended unless you've added authentication. I've seen people post ngrok URLs in Slack channels and forget about them. Anyone with that link owns your GPU.
Config File for Repeated Use
Typing the same ngrok command every day gets old. Create a config file at ~/.config/ngrok/ngrok.yml (the default location on Linux; ngrok config check prints the exact path on your OS):
version: "3"
authtoken: YOUR_TOKEN
tunnels:
ollama:
proto: http
addr: 11434
basic_auth:
- "user:strongpassword123"
lmstudio:
proto: http
addr: 1234
basic_auth:
- "user:strongpassword123"Now start a specific tunnel by name:
# Start just the Ollama tunnel
ngrok start ollama
# Start all tunnels at once
ngrok start --all
One command. No flags to remember.
Using the Tunnel from Python
Here's a complete example that works with Ollama's API:
import requests
import os

NGROK_URL = os.environ["NGROK_LLM_URL"]
AUTH = ("user", "strongpassword123")

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        f"{NGROK_URL}/api/generate",
        auth=AUTH,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        timeout=120
    )
    response.raise_for_status()
    return response.json()["response"]

print(query_ollama("Explain recursion in one sentence."))
And here's one for OpenAI-compatible servers (LM Studio, llama.cpp):
import os
import base64
from openai import OpenAI

# If the tunnel has basic auth enabled, send it as a default header
basic_auth = "Basic " + base64.b64encode(b"user:strongpassword123").decode()

client = OpenAI(
    base_url=os.environ["NGROK_LLM_URL"] + "/v1",
    api_key="not-needed",
    default_headers={"Authorization": basic_auth}
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is ngrok?"}],
    temperature=0.7
)

print(response.choices[0].message.content)
The advantage of the OpenAI SDK approach is that your code works the same whether you point it at a local model via ngrok or at the real OpenAI API. Just change the base_url.
Store the ngrok URL in an environment variable. On the free tier, it changes every time you restart the tunnel.
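One way to make that switch explicit (the env var names below are just illustrative, not from any library): read both the base URL and the key from the environment, so the same code talks to the tunnel in development and to OpenAI in production.
import os
from openai import OpenAI

# LLM_BASE_URL / LLM_API_KEY are hypothetical names for this sketch.
# Dev: LLM_BASE_URL=https://abc123.ngrok-free.app/v1, LLM_API_KEY=not-needed
# Prod: LLM_BASE_URL=https://api.openai.com/v1, LLM_API_KEY=<real key>
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed"),
)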
Real Use Cases
Mobile app development. I build a chat interface on iOS, point it at the ngrok URL, and test inference on my phone without deploying anything. Change the model, tweak parameters, instantly see results.
Team demos. When I want to show a colleague what a local fine-tuned model can do, I send them the ngrok URL and basic auth credentials. They can test from their browser or Postman.
Webhook integrations. Building a Slack bot or Telegram bot that calls your local LLM? Those platforms need a public URL to send events to. ngrok gives you one without deploying a server.
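Here's roughly what that looks like as a small Flask app that receives the webhook and forwards the message to Ollama. The /webhook route and the "text" field are simplified placeholders; real Slack and Telegram payloads differ. You'd expose the bot itself with its own tunnel, e.g. ngrok http 5000.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"  # the bot runs next to Ollama

@app.route("/webhook", methods=["POST"])
def webhook():
    # Payload shape varies by platform; "text" is a stand-in field
    user_text = request.json.get("text", "")
    llm = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": user_text, "stream": False},
        timeout=120,
    )
    llm.raise_for_status()
    return jsonify({"reply": llm.json()["response"]})

if __name__ == "__main__":
    app.run(port=5000)  # expose with: ngrok http 5000
Point the platform's webhook settings at the bot's ngrok URL; the LLM itself never needs to be publicly exposed in this setup.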
CI/CD testing. Run integration tests in GitHub Actions against a local model exposed via ngrok. Useful when you want to test prompt templates against a specific model version.
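A minimal shape for that kind of test with pytest, assuming the tunnel URL and basic-auth password are injected as CI secrets (the env var names here are just examples):
import os
import pytest
import requests

NGROK_URL = os.environ.get("NGROK_LLM_URL")

@pytest.mark.skipif(not NGROK_URL, reason="NGROK_LLM_URL not set")
def test_prompt_returns_text():
    # Exercise the exposed Ollama endpoint with a fixed prompt
    resp = requests.post(
        f"{NGROK_URL}/api/generate",
        auth=("user", os.environ.get("NGROK_LLM_PASSWORD", "")),
        json={"model": "llama3", "prompt": "Say hello", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    assert resp.json()["response"].strip()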
Common Issues
Connection refused: Your LLM server isn't running, or it's bound to a different port. Verify with curl http://localhost:PORT before starting ngrok.
Slow responses: Large models on CPU are slow. Use quantized models (Q4_K_M or Q5_K_M for a good quality/speed tradeoff). If you have a GPU, make sure the model is actually loaded on it — check with nvidia-smi. For a deeper dive into quantization techniques, see our guide on fine-tuning LLMs with QLoRA.
Tunnel drops after a few hours: Free tier limitation. Restart ngrok, or write a simple shell script that restarts it automatically:
#!/bin/bash
while true; do
  ngrok start ollama
  echo "Tunnel dropped. Restarting in 5 seconds..."
  sleep 5
done
ngrok-free.app browser warning: Free tier URLs show a warning page on first visit in a browser. API calls (curl, Python requests) are not affected. Add the ngrok-skip-browser-warning header to bypass it programmatically:
curl -H "ngrok-skip-browser-warning: true" \
https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'When Not to Use ngrok
ngrok is for development and experimentation. For production workloads, consider:
- Cloudflare Tunnels — free alternative with stable URLs, but more setup involved
- Tailscale — peer-to-peer mesh VPN, great for accessing your machine from your own devices without exposing to the public internet
- VPS with Docker — deploy Ollama on a cheap GPU instance if you need uptime guarantees
- Cloud LLM APIs — if latency and cost are acceptable, skip local entirely
My Daily Workflow
- Start Ollama (ollama serve)
- Run ngrok start ollama
- Copy the URL, export it: export NGROK_LLM_URL=https://...
- Test with curl to confirm it's live
- Use the URL in whatever I'm building that day
- Monitor at localhost:4040 if debugging
It takes 30 seconds from cold start to a working remote LLM endpoint. No cloud bills. No deployment pipeline. Just a tunnel.
Summary
ngrok turns your local LLM into a remote API with one command. The setup is: install ngrok, authenticate, run ngrok http <port>, add basic auth. That's it.
Start with the free tier to see if the workflow fits. Upgrade to a paid plan when you get tired of changing URLs every restart. Pair it with Ollama for the simplest possible local LLM setup, or use LM Studio and llama.cpp if you need more control over model loading and quantization. If you're new to running LLMs locally, our guide on how to train an LLM covers the fundamentals of model selection and setup.