Running LLMs locally saves money. It keeps your data private. But you're stuck on one machine. ngrok fixes that.
I use ngrok to expose my local models to the internet. Now I can test from my phone. Share with teammates. Demo without shipping my laptop. I've been doing this daily for months, and it's become a core part of how I develop AI-powered features.
The Problem
Local LLMs run on localhost. That means:
- No mobile testing
- No sharing with the team
- No remote access from a coffee shop or second machine
- No webhook integrations (Slack bots, Telegram bots, etc.)
- Deployment is overkill when you just want to test a prompt
You could deploy to a cloud GPU, but that costs real money for experimentation. You could set up a VPN, but that's heavy infrastructure for a quick demo. Whether you're running a RAG pipeline or a custom fine-tuned model, remote access to local inference is essential.
ngrok creates a secure tunnel. It gives you a public HTTPS URL pointing to your local port. Takes 30 seconds to set up. No firewall rules, no DNS configuration, no cloud bills.
Quick Setup
Step 1: Run Your LLM Server
You need a local LLM server running before ngrok has anything to tunnel to. Here are the three most common options.
Ollama (recommended for beginners):
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model (llama3 is a good default)
ollama pull llama3
# Start the server (runs on port 11434)
ollama serve
Verify it works locally before adding ngrok:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3", "prompt": "Say hello", "stream": false}'LM Studio:
Download from lmstudio.ai. Load a model from the built-in browser. Go to the "Local Server" tab and click Start. Default port is 1234. LM Studio exposes an OpenAI-compatible API, which makes it easy to swap into existing code.
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Say hello"}]
}'
llama.cpp (for maximum control):
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Start the server with a GGUF model
./llama-server -m models/llama-3-8b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 35
llama.cpp also exposes an OpenAI-compatible endpoint at /v1/chat/completions, plus its own native API at /completion.
Step 2: Install ngrok
# macOS
brew install ngrok
# Ubuntu/Debian
curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
| sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
&& echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
| sudo tee /etc/apt/sources.list.d/ngrok.list \
&& sudo apt update && sudo apt install ngrok
# Or download directly from ngrok.com/download for any platform
Step 3: Authenticate
Sign up at ngrok.com and grab your auth token from the dashboard. Run:
ngrok config add-authtoken YOUR_TOKEN
This saves the token to your local config file. Skip this step and every tunnel attempt will fail with an authentication error.
Step 4: Create the Tunnel
# For Ollama (port 11434)
ngrok http 11434
# For LM Studio (port 1234)
ngrok http 1234
# For llama.cpp (port 8080)
ngrok http 8080
You get output like:
Session Status online
Account alex@example.com (Plan: Free)
Forwarding https://abc123.ngrok-free.app -> http://localhost:11434
Connections ttl opn rt1 rt5 p50 p90
0 0 0.00 0.00 0.00 0.00
That HTTPS URL is your public endpoint. Copy it.
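If you'd rather grab the URL from a script than copy it by hand, the ngrok agent also exposes a small local API on port 4040 while a tunnel is running. A quick Python sketch, assuming the /api/tunnels endpoint shape from the agent docs (adjust the field names if your agent version differs):
import requests
# Ask the local ngrok agent which tunnels are currently open
resp = requests.get("http://localhost:4040/api/tunnels", timeout=5)
resp.raise_for_status()
# Each tunnel entry carries its public_url; pick the HTTPS one
urls = [t["public_url"] for t in resp.json().get("tunnels", []) if t["public_url"].startswith("https")]
print(urls[0] if urls else "No HTTPS tunnel found")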
Step 5: Test It
For Ollama:
curl https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'For LM Studio or llama.cpp (OpenAI-compatible):
curl https://abc123.ngrok-free.app/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Hello"}]
}'
If you see a JSON response with generated text, you're done.
Security Considerations
This is important. You're exposing a GPU-heavy service to the public internet.
Always add authentication. Without it, anyone who discovers your URL can send unlimited requests and peg your GPU at 100%.
ngrok http 11434 --basic-auth="user:strongpassword123"
Clients then need to include the auth header:
curl -u user:strongpassword123 \
https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'Restrict by IP if possible. On paid ngrok plans, you can whitelist specific IP ranges:
ngrok http 11434 --cidr-allow="203.0.113.0/24"
Monitor traffic in real time. ngrok provides a local dashboard at http://localhost:4040. Every request is logged with full headers and body. This is invaluable for debugging and spotting unauthorized access.
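You can also pull that request log from a script instead of the browser. A rough sketch against the agent's local inspection API, assuming the /api/requests/http endpoint and field names from the agent docs (adjust if your ngrok version differs):
import requests
# Fetch recently captured requests from the local ngrok agent
resp = requests.get("http://localhost:4040/api/requests/http", timeout=5)
resp.raise_for_status()
for item in resp.json().get("requests", []):
    req = item.get("request", {})
    res = item.get("response", {})
    # Method, path, and status make unexpected callers easy to spot
    print(req.get("method"), req.get("uri"), "->", res.get("status_code"))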
Watch system resources. Every inference request burns GPU/CPU and memory. Run htop or nvidia-smi -l 1 in another terminal so you can see when something hammers your machine.
Understand free tier limits. Free tunnels get a randomly generated URL that changes every time you restart ngrok. Sessions also have a limited duration. The paid plan ($8/month) gives you a stable custom domain and persistent connections.
Never leave a tunnel running unattended unless you've added authentication. I've seen people post ngrok URLs in Slack channels and forget about them. Anyone with that link owns your GPU.
Config File for Repeated Use
Typing the same ngrok command every day gets old. Create a config file at ~/.config/ngrok/ngrok.yml (the default location on Linux; ngrok config check prints the exact path on your OS):
version: "3"
authtoken: YOUR_TOKEN
tunnels:
ollama:
proto: http
addr: 11434
basic_auth:
- "user:strongpassword123"
lmstudio:
proto: http
addr: 1234
basic_auth:
- "user:strongpassword123"Now start a specific tunnel by name:
# Start just the Ollama tunnel
ngrok start ollama
# Start all tunnels at once
ngrok start --all
One command. No flags to remember.
Using the Tunnel from Python
Here's a complete example that works with Ollama's API:
import requests
import os

NGROK_URL = os.environ["NGROK_LLM_URL"]
AUTH = ("user", "strongpassword123")

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        f"{NGROK_URL}/api/generate",
        auth=AUTH,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        timeout=120
    )
    response.raise_for_status()
    return response.json()["response"]

print(query_ollama("Explain recursion in one sentence."))
And here's one for OpenAI-compatible servers (LM Studio, llama.cpp):
import os
import base64
from openai import OpenAI

# If the tunnel has basic auth enabled, send it as a default header
basic_auth = "Basic " + base64.b64encode(b"user:strongpassword123").decode()

client = OpenAI(
    base_url=os.environ["NGROK_LLM_URL"] + "/v1",
    api_key="not-needed",
    default_headers={"Authorization": basic_auth}
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What is ngrok?"}],
    temperature=0.7
)

print(response.choices[0].message.content)
The advantage of the OpenAI SDK approach is that your code works the same whether you point it at a local model via ngrok or at the real OpenAI API. Just change the base_url.
Store the ngrok URL in an environment variable. On the free tier, it changes every time you restart the tunnel.
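One way to make that switch explicit (the env var names below are just illustrative, not from any library): read both the base URL and the key from the environment, so the same code talks to the tunnel in development and to OpenAI in production.
import os
from openai import OpenAI

# LLM_BASE_URL / LLM_API_KEY are hypothetical names for this sketch.
# Dev: LLM_BASE_URL=https://abc123.ngrok-free.app/v1, LLM_API_KEY=not-needed
# Prod: LLM_BASE_URL=https://api.openai.com/v1, LLM_API_KEY=<real key>
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed"),
)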
Real Use Cases
Mobile app development. I build a chat interface on iOS, point it at the ngrok URL, and test inference on my phone without deploying anything. Change the model, tweak parameters, instantly see results.
Team demos. When I want to show a colleague what a local fine-tuned model can do, I send them the ngrok URL and basic auth credentials. They can test from their browser or Postman.
Webhook integrations. Building a Slack bot or Telegram bot that calls your local LLM? Those platforms need a public URL to send events to. ngrok gives you one without deploying a server.
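Here's roughly what that looks like as a small Flask app that receives the webhook and forwards the message to Ollama. The /webhook route and the "text" field are simplified placeholders; real Slack and Telegram payloads differ. You'd expose the bot itself with its own tunnel, e.g. ngrok http 5000.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"  # the bot runs next to Ollama

@app.route("/webhook", methods=["POST"])
def webhook():
    # Payload shape varies by platform; "text" is a stand-in field
    user_text = request.json.get("text", "")
    llm = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": user_text, "stream": False},
        timeout=120,
    )
    llm.raise_for_status()
    return jsonify({"reply": llm.json()["response"]})

if __name__ == "__main__":
    app.run(port=5000)  # expose with: ngrok http 5000
Point the platform's webhook settings at the bot's ngrok URL; the LLM itself never needs to be publicly exposed in this setup.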
CI/CD testing. Run integration tests in GitHub Actions against a local model exposed via ngrok. Useful when you want to test prompt templates against a specific model version.
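A minimal shape for that kind of test with pytest, assuming the tunnel URL and basic-auth password are injected as CI secrets (the env var names here are just examples):
import os
import pytest
import requests

NGROK_URL = os.environ.get("NGROK_LLM_URL")

@pytest.mark.skipif(not NGROK_URL, reason="NGROK_LLM_URL not set")
def test_prompt_returns_text():
    # Exercise the exposed Ollama endpoint with a fixed prompt
    resp = requests.post(
        f"{NGROK_URL}/api/generate",
        auth=("user", os.environ.get("NGROK_LLM_PASSWORD", "")),
        json={"model": "llama3", "prompt": "Say hello", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    assert resp.json()["response"].strip()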
Common Issues
Connection refused: Your LLM server isn't running, or it's bound to a different port. Verify with curl http://localhost:PORT before starting ngrok.
Slow responses: Large models on CPU are slow. Use quantized models (Q4_K_M or Q5_K_M for a good quality/speed tradeoff). If you have a GPU, make sure the model is actually loaded on it — check with nvidia-smi. For a deeper dive into quantization techniques, see our guide on fine-tuning LLMs with QLoRA.
Tunnel drops after a few hours: Free tier limitation. Restart ngrok, or write a simple shell script that restarts it automatically:
#!/bin/bash
while true; do
  ngrok start ollama
  echo "Tunnel dropped. Restarting in 5 seconds..."
  sleep 5
done
ngrok-free.app browser warning: Free tier URLs show a warning page on first visit in a browser. API calls (curl, Python requests) are not affected. Add the ngrok-skip-browser-warning header to bypass it programmatically:
curl -H "ngrok-skip-browser-warning: true" \
https://abc123.ngrok-free.app/api/generate \
-d '{"model": "llama3", "prompt": "Hello", "stream": false}'When Not to Use ngrok
ngrok is for development and experimentation. For production workloads, consider:
- Cloudflare Tunnels — free alternative with stable URLs, but more setup involved
- Tailscale — peer-to-peer mesh VPN, great for accessing your machine from your own devices without exposing to the public internet
- VPS with Docker — deploy Ollama on a cheap GPU instance if you need uptime guarantees
- Cloud LLM APIs — if latency and cost are acceptable, skip local entirely
My Daily Workflow
- Start Ollama (ollama serve)
- Run ngrok start ollama
- Copy the URL, export it: export NGROK_LLM_URL=https://...
- Test with curl to confirm it's live
- Use the URL in whatever I'm building that day
- Monitor at localhost:4040 if debugging
It takes 30 seconds from cold start to a working remote LLM endpoint. No cloud bills. No deployment pipeline. Just a tunnel.
Summary
ngrok turns your local LLM into a remote API with one command. The setup is: install ngrok, authenticate, run ngrok http <port>, add basic auth. That's it.
Start with the free tier to see if the workflow fits. Upgrade to a paid plan when you get tired of changing URLs every restart. Pair it with Ollama for the simplest possible local LLM setup, or use LM Studio and llama.cpp if you need more control over model loading and quantization. If you're new to running LLMs locally, our guide on how to train an LLM covers the fundamentals of model selection and setup.