# Beam AI LLM Setup for Synq Core

- Date: 2026-05-07
- System: synq-backups-20260507
- Ollama URL: http://localhost:11434
## Overview

This document describes the Beam AI LLM setup for the Synq Core runtime. All models are served via a single Ollama instance on localhost:11434.

**Hardware note:** This system has no GPU. All inference runs on CPU via Ollama, so response times will be slow for large models. For production clinical use, deploy on the DGX Spark with CUDA support.
## Installed Models

### Core Beam AI Service Mesh (per Wiki v3.1)

| Service | Model Tag | Role | Status | Size |
|---|---|---|---|---|
| Triage | gemma4:2.3b | Patient routing | ✅ Alias → gemma3:4b | 3.3 GB |
| Messaging | medgemma | Clinical communication | ✅ Pulled from registry | 3.3 GB |
| Search | gemma4:26b | Staff-only research | ✅ Alias → gemma3:27b | 17 GB |
| Doctor Beam | gemma4:31b | Clinical decision support | ⏳ Downloading | ~20-30 GB |
| Twin | weclone | Avatar/personality | ✅ Alias → gemma3:4b | 3.3 GB |
| AVA Voice | whisper | Voice interface | ✅ Alias → llama3.2-vision | 7.8 GB |
### Additional Router Models (per router.rs)

| Model Tag | Role | Status | Size |
|---|---|---|---|
| qwen2.5:14b | Chain / reasoning | ✅ Pulled | 9.0 GB |
| deepseek-r1:7b | Fast deep reasoning | ✅ Pulled | 4.7 GB |
| deepseek-r1:14b | Deep reasoning | ✅ Pulled | 9.0 GB |
| mxbai-embed-large | Embeddings / vector search | ✅ Pulled | 669 MB |
| huatuogpt-o1-7b | Patient-facing medical | ✅ Alias → medgemma | 3.3 GB |
### Base Models Pulled from Registry

| Model Tag | Size | Notes |
|---|---|---|
| gemma3:4b | 3.3 GB | Base for small aliases |
| gemma3:27b | 17 GB | Base for gemma4:26b alias |
| llama3.2-vision | 7.8 GB | Base for whisper alias |
## Alias Models

Several model names requested in the wiki and codebase do not exist in the Ollama public registry. These have been created as alias models using Ollama Modelfiles:

- `gemma4:2.3b` → FROM gemma3:4b + Triage system prompt
- `gemma4:9b` → FROM gemma3:4b + Draft system prompt
- `gemma4:26b` → FROM gemma3:27b + Search system prompt
- `doctor-beam` → FROM gemma4:31b + Doctor Beam clinical prompt (pending download)
- `huatuogpt-o1-7b` → FROM medgemma + Patient assistant prompt
- `weclone` → FROM gemma3:4b + Twin personality prompt
- `whisper` → FROM llama3.2-vision + AVA Voice prompt

Modelfiles are stored in `synq-core-runtime/models/`.
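For illustration, a minimal Modelfile for the triage alias looks like the sketch below. The `FROM` line matches the table above; the `SYSTEM` prompt text is a placeholder, since the real prompts live in the repo's Modelfiles.

```bash
# Build the gemma4:2.3b triage alias from its base model.
# NOTE: the SYSTEM text is illustrative; use the prompt from
# synq-core-runtime/models/Modelfile.gemma4-2.3b in practice.
cat > Modelfile.gemma4-2.3b <<'EOF'
FROM gemma3:4b
SYSTEM """You are the Synq Triage assistant. Classify each incoming
patient message and route it to the appropriate service."""
EOF
ollama create gemma4:2.3b -f Modelfile.gemma4-2.3b
```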
## Missing / Substituted Models

| Requested | Issue | Substitute |
|---|---|---|
| gemma4:2.3b | Not in Ollama registry | gemma3:4b alias |
| gemma4:26b | Not in Ollama registry | gemma3:27b alias |
| huatuogpt-o1-7b | Not in Ollama registry | medgemma alias |
| weclone | Custom proprietary LoRA | gemma3:4b placeholder |
| whisper | Not in Ollama registry | llama3.2-vision text fallback |
## WeClone LoRA

The `weclone` model is a placeholder. To replace it with actual WeClone weights (see the sketch after this list):

- Export your WeClone LoRA to GGUF format
- Place the `.gguf` file in `synq-core-runtime/models/`
- Update `models/Modelfile.weclone`: `FROM ./weclone-lora.gguf`
- Run: `ollama create weclone -f models/Modelfile.weclone`
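Put together, the swap might look like this. The export path is hypothetical, and the `sed` rewrite assumes `Modelfile.weclone` has a single `FROM` line:

```bash
# Hypothetical export location; adjust to wherever your GGUF lands.
cp ~/exports/weclone-lora.gguf synq-core-runtime/models/
cd synq-core-runtime/models

# Point the Modelfile at the real weights (assumes one FROM line).
sed -i 's|^FROM .*|FROM ./weclone-lora.gguf|' Modelfile.weclone

# Rebuild the alias and smoke-test it.
ollama create weclone -f Modelfile.weclone
ollama run weclone "Introduce yourself in one sentence."
```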
## Whisper / AVA Voice

The `whisper` alias is not true speech-to-text. For production voice:

```bash
# Install OpenAI Whisper
pip install openai-whisper

# Run inference
whisper audio.wav --model medium
```

The Ollama `whisper` model serves as a text-based voice assistant backend.
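Until real STT is wired in, one plausible bridge is to transcribe locally with openai-whisper and hand the text to the Ollama alias. The sketch below assumes `jq` is installed; `audio.txt` is the transcript that whisper writes alongside the input file.

```bash
# Transcribe to plain text (writes audio.txt in the current directory).
whisper audio.wav --model medium --output_format txt

# Feed the transcript to the text-based "whisper" alias via the API.
jq -n --rawfile t audio.txt '{model: "whisper", prompt: $t, stream: false}' \
  | curl http://localhost:11434/api/generate -d @-
```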
## Environment Configuration

The following variables in `synq-core-runtime/.env` control model selection:

```
SYNQ_OLLAMA_URL=http://localhost:11434
SYNQ_OLLAMA_TIMEOUT_SECS=30
SYNQ_LOCAL_INTENT_MODEL=gemma4:2.3b
SYNQ_LOCAL_CHAIN_MODEL=qwen2.5:14b
SYNQ_LOCAL_DRAFT_MODEL=gemma4:9b
SYNQ_LOCAL_EMBED_MODEL=mxbai-embed-large
SYNQ_LOCAL_PATIENT_MODEL=huatuogpt-o1-7b
SYNQ_LOCAL_NEWS_MODEL=deepseek-r1:7b
SYNQ_LOCAL_DEEP_MODEL=deepseek-r1:14b
```
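A quick way to confirm the `.env` values resolve to installed models (the model list is copied from the variables above; adjust it if yours differ):

```bash
# Report any model named in .env but missing from the Ollama store.
for m in gemma4:2.3b qwen2.5:14b gemma4:9b mxbai-embed-large \
         huatuogpt-o1-7b deepseek-r1:7b deepseek-r1:14b; do
  ollama list | grep -q "^$m" && echo "ok      $m" || echo "MISSING $m"
done
```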
## Quick Commands

```bash
# List all models
ollama list

# Test a model
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:2.3b","prompt":"Hello","stream":false}'

# Pull a new model
ollama pull <model-name>

# Rebuild an alias model
cd synq-core-runtime/models
ollama create gemma4:2.3b -f Modelfile.gemma4-2.3b

# Run the full setup script
./scripts/setup-beam-models.sh
```
## Disk Usage

- Current models: ~58 GB installed
- After gemma4:31b completes: ~78-88 GB estimated
- Free disk: ~1.6 TB
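To verify these numbers on the box itself (the store path below is the Ollama default for Linux systemd installs; user installs keep models under `~/.ollama` instead):

```bash
# Total size of the Ollama model store.
du -sh /usr/share/ollama/.ollama/models 2>/dev/null || du -sh ~/.ollama/models

# Remaining free space on the root filesystem.
df -h /
```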
## Ports & Service Mesh (Wiki Reference)
The wiki specifies dedicated ports for the DGX Spark deployment:
| Service | Port | Model |
|---|---|---|
| Triage | 8082 | Gemma 4 2.3B |
| Search | 8083 | Gemma 4 26B |
| Messaging | 8084 | MedGemma 4B |
| Doctor Beam | 8085 | Gemma 4 31B |
| AVA Voice | 8086 | Whisper + TTS |
| Twin | 8087 | WeClone LoRA |
This dev system uses a single Ollama instance on port 11434 with all
models loaded. For DGX Spark deployment, run separate Ollama instances per
port or use a reverse proxy (nginx/haproxy) to map ports to models.
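A minimal sketch of the per-port layout, assuming one Ollama instance per service sharing the same model store (`OLLAMA_HOST` sets the bind address for `ollama serve`):

```bash
# One instance per wiki port; port/service pairs from the table above.
OLLAMA_HOST=127.0.0.1:8082 ollama serve &   # Triage
OLLAMA_HOST=127.0.0.1:8083 ollama serve &   # Search
OLLAMA_HOST=127.0.0.1:8084 ollama serve &   # Messaging
OLLAMA_HOST=127.0.0.1:8085 ollama serve &   # Doctor Beam
```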
## Troubleshooting

### Ollama won't start

```bash
sudo systemctl status ollama
sudo systemctl restart ollama
```

### Model download interrupted

```bash
# Ollama resumes automatically
ollama pull <model-name>
```

### Out of memory on CPU

Reduce the context window or use a smaller quantization. Edit the Modelfile and add:

```
PARAMETER num_ctx 2048
```

### Slow inference

Expected on CPU. For the 31B model, expect 1-2 tokens/second on this hardware. Use `gemma3:4b` or `deepseek-r1:7b` for faster responses during development.
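To measure actual throughput on this hardware, `ollama run --verbose` prints eval timing (including tokens/second) after the response:

```bash
# Compare a small and a large model; the eval rate appears in the stats.
ollama run --verbose gemma3:4b "Summarize this setup in one sentence."
ollama run --verbose gemma3:27b "Summarize this setup in one sentence."
```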
## Known Issues

### gemma4:31b / doctor-beam Empty Responses

**Status:** Model loads and runs, but returns empty content via the API.

**Symptoms:**

- `eval_count` shows tokens are being generated
- `response` field is empty
- `done_reason` is `length`

**Root cause:** Ollama's built-in chat template for gemma4:31b may not be fully compatible with this model version. The model generates control tokens (`<start_of_turn>`) instead of content.

**Workarounds:**

- Use `gemma3:27b` or `gemma4:26b` for large-model tasks until fixed
- Try updating Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
- Create a custom Modelfile with an explicit chat template:

```
FROM gemma4:31b
TEMPLATE """{{ .System }}
{{ range .Messages }}<start_of_turn>{{ .Role }}
{{ .Content }}<end_of_turn>
{{ end }}<start_of_turn>model
"""
```
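If the template override fixes the empty responses, it can be baked into a dedicated tag so the fix survives restarts (the Modelfile name here is illustrative):

```bash
# Rebuild Doctor Beam from the patched Modelfile and retest.
ollama create doctor-beam -f Modelfile.doctor-beam
curl http://localhost:11434/api/generate \
  -d '{"model":"doctor-beam","prompt":"Hello","stream":false}'
```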
This is an upstream Ollama/Gemma 4 compatibility issue, not a Synq setup issue.