Beam AI LLM Setup for Synq Core

Date: 2026-05-07
System: synq-backups-20260507
Ollama URL: http://localhost:11434


Overview

This document describes the Beam AI LLM setup for the Synq Core runtime. All models are served via a single Ollama instance on localhost:11434.

Hardware Note: This system has no GPU. All inference runs on CPU via Ollama. Response times will be slow for large models. For production clinical use, deploy on the DGX Spark with CUDA support.
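
To confirm the server is reachable and see exactly which models it is serving, two standard Ollama endpoints are handy:

# Verify the Ollama server and list installed models
curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags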


Installed Models

Core Beam AI Service Mesh (per Wiki v3.1)

| Service | Model Tag | Role | Status | Size |
|---------|-----------|------|--------|------|
| Triage | gemma4:2.3b | Patient routing | Alias → gemma3:4b | 3.3 GB |
| Messaging | medgemma | Clinical communication | Pulled from registry | 3.3 GB |
| Search | gemma4:26b | Staff-only research | Alias → gemma3:27b | 17 GB |
| Doctor Beam | gemma4:31b | Clinical decision support | Downloading | ~20-30 GB |
| Twin | weclone | Avatar/personality | Alias → gemma3:4b | 3.3 GB |
| AVA Voice | whisper | Voice interface | Alias → llama3.2-vision | 7.8 GB |

Additional Router Models (per router.rs)

| Model Tag | Role | Status | Size |
|-----------|------|--------|------|
| qwen2.5:14b | Chain / reasoning | Pulled | 9.0 GB |
| deepseek-r1:7b | Fast deep reasoning | Pulled | 4.7 GB |
| deepseek-r1:14b | Deep reasoning | Pulled | 9.0 GB |
| mxbai-embed-large | Embeddings / vector search | Pulled | 669 MB |
| huatuogpt-o1-7b | Patient-facing medical | Alias → medgemma | 3.3 GB |

Base Models Pulled from Registry

| Model Tag | Size | Notes |
|-----------|------|-------|
| gemma3:4b | 3.3 GB | Base for small aliases |
| gemma3:27b | 17 GB | Base for gemma4:26b alias |
| llama3.2-vision | 7.8 GB | Base for whisper alias |

Alias Models

Several model names requested in the wiki and codebase do not exist in the Ollama public registry. These have been created as alias models using Ollama Modelfiles:

gemma4:2.3b   → FROM gemma3:4b   + Triage system prompt
gemma4:9b     → FROM gemma3:4b   + Draft system prompt
gemma4:26b    → FROM gemma3:27b  + Search system prompt
doctor-beam   → FROM gemma4:31b  + Doctor Beam clinical prompt (pending download)
huatuogpt-o1-7b → FROM medgemma  + Patient assistant prompt
weclone       → FROM gemma3:4b   + Twin personality prompt
whisper       → FROM llama3.2-vision + AVA Voice prompt

Modelfiles are stored in synq-core-runtime/models/.
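
For illustration, an alias can be rebuilt from scratch with a heredoc; the SYSTEM prompt below is a stand-in, not the actual Triage prompt shipped in synq-core-runtime/models/:

# Sketch: recreate the Triage alias from its base model (placeholder prompt)
cat > Modelfile.gemma4-2.3b <<'EOF'
FROM gemma3:4b
SYSTEM """You are the Synq Triage assistant. Route the patient to the right service."""
EOF
ollama create gemma4:2.3b -f Modelfile.gemma4-2.3b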


Missing / Substituted Models

| Requested | Issue | Substitute |
|-----------|-------|------------|
| gemma4:2.3b | Not in Ollama registry | gemma3:4b alias |
| gemma4:26b | Not in Ollama registry | gemma3:27b alias |
| huatuogpt-o1-7b | Not in Ollama registry | medgemma alias |
| weclone | Custom proprietary LoRA | gemma3:4b placeholder |
| whisper | Not in Ollama registry | llama3.2-vision text fallback |

WeClone LoRA

The weclone model is a placeholder. To replace it with the actual WeClone weights (a verification sketch follows the steps):

  1. Export your WeClone LoRA to GGUF format
  2. Place the .gguf file in synq-core-runtime/models/
  3. Update models/Modelfile.weclone:
    FROM ./weclone-lora.gguf
    
  4. Run: ollama create weclone -f models/Modelfile.weclone
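
A quick way to verify the swap took effect (weclone-lora.gguf is the example file name from step 1):

# Confirm the alias now points at the GGUF rather than gemma3:4b
ollama show weclone --modelfile
ollama run weclone "Introduce yourself."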

Whisper / AVA Voice

The whisper alias is not true speech-to-text. For production voice input, run a real Whisper pipeline alongside Ollama:

# Install OpenAI Whisper
pip install openai-whisper

# Run inference
whisper audio.wav --model medium

The Ollama whisper model serves as a text-based voice assistant backend.
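
Until a real voice pipeline exists, the two pieces can be glued together by hand. A sketch only (audio.wav is a placeholder, and jq is used to build the JSON safely):

# Transcribe locally, then feed the text to the Ollama 'whisper' alias
whisper audio.wav --model medium --output_format txt --output_dir /tmp
PROMPT=$(cat /tmp/audio.txt)
curl http://localhost:11434/api/generate \
  -d "$(jq -n --arg p "$PROMPT" '{model:"whisper", prompt:$p, stream:false}')"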


Environment Configuration

The following variables in synq-core-runtime/.env control model selection:

SYNQ_OLLAMA_URL=http://localhost:11434
SYNQ_OLLAMA_TIMEOUT_SECS=30

SYNQ_LOCAL_INTENT_MODEL=gemma4:2.3b
SYNQ_LOCAL_CHAIN_MODEL=qwen2.5:14b
SYNQ_LOCAL_DRAFT_MODEL=gemma4:9b
SYNQ_LOCAL_EMBED_MODEL=mxbai-embed-large
SYNQ_LOCAL_PATIENT_MODEL=huatuogpt-o1-7b
SYNQ_LOCAL_NEWS_MODEL=deepseek-r1:7b
SYNQ_LOCAL_DEEP_MODEL=deepseek-r1:14b
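
A quick sanity check that every model named in .env is actually installed (the list is hardcoded here for simplicity):

# Flag any .env model missing from the local Ollama store
for m in gemma4:2.3b qwen2.5:14b gemma4:9b mxbai-embed-large \
         huatuogpt-o1-7b deepseek-r1:7b deepseek-r1:14b; do
  ollama show "$m" >/dev/null 2>&1 || echo "MISSING: $m"
done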

Quick Commands

# List all models
ollama list

# Test a model
curl http://localhost:11434/api/generate \
  -d '{"model":"gemma4:2.3b","prompt":"Hello","stream":false}'

# Pull a new model
ollama pull <model-name>

# Rebuild an alias model
cd synq-core-runtime/models
ollama create gemma4:2.3b -f Modelfile.gemma4-2.3b

# Run the full setup script
./scripts/setup-beam-models.sh
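
The embedding model goes through a different endpoint than generate; for example:

# Test the embedding model (returns a JSON vector)
curl http://localhost:11434/api/embeddings \
  -d '{"model":"mxbai-embed-large","prompt":"chest pain triage"}'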

Disk Usage

Current models: ~58 GB installed
After gemma4:31b completes: ~78-88 GB estimated
Free disk: ~1.6 TB
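
These figures can be re-checked at any time; the model store location depends on how Ollama was installed, so both common Linux paths are tried:

# Measure the Ollama model store and remaining disk space
du -sh /usr/share/ollama/.ollama/models 2>/dev/null || du -sh ~/.ollama/models
df -h /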


Ports & Service Mesh (Wiki Reference)

The wiki specifies dedicated ports for the DGX Spark deployment:

| Service | Port | Model |
|---------|------|-------|
| Triage | 8082 | Gemma 4 2.3B |
| Search | 8083 | Gemma 4 26B |
| Messaging | 8084 | MedGemma 4B |
| Doctor Beam | 8085 | Gemma 4 31B |
| AVA Voice | 8086 | Whisper + TTS |
| Twin | 8087 | WeClone LoRA |

This dev system uses a single Ollama instance on port 11434 that serves all of the models above, loading each on demand. For DGX Spark deployment, run a separate Ollama instance per port or put a reverse proxy (nginx/haproxy) in front to map each port to its model.
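
One way to realize the per-port layout without a proxy is one Ollama instance per service, since OLLAMA_HOST sets the bind address of ollama serve. A sketch only; the shared OLLAMA_MODELS path is an assumption to avoid duplicating model files:

# Dedicated Ollama instances per Beam service (ports from the table above)
OLLAMA_HOST=127.0.0.1:8082 OLLAMA_MODELS=/srv/ollama/models ollama serve &   # Triage
OLLAMA_HOST=127.0.0.1:8083 OLLAMA_MODELS=/srv/ollama/models ollama serve &   # Search
OLLAMA_HOST=127.0.0.1:8085 OLLAMA_MODELS=/srv/ollama/models ollama serve &   # Doctor Beam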


Troubleshooting

Ollama won't start

sudo systemctl status ollama
sudo systemctl restart ollama
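
If the restart does not help, the service journal usually shows why (assuming the standard systemd install):

# Inspect recent Ollama service logs
journalctl -u ollama -n 50 --no-pager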

Model download interrupted

# Ollama resumes automatically
ollama pull <model-name>

Out of memory on CPU

# Reduce context window or use smaller quantization
# Edit the Modelfile and add:
PARAMETER num_ctx 2048
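
A parameter change only takes effect after the model is rebuilt; for the Triage alias, for example:

# Rebuild the alias so the new num_ctx applies
ollama create gemma4:2.3b -f Modelfile.gemma4-2.3b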

Slow inference

Expected on CPU. For the 31B model, expect 1-2 tokens/second on this hardware. Use gemma3:4b or deepseek-r1:7b for faster responses during development.
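
Actual throughput can be measured from the generate API's timing fields, since eval_duration is reported in nanoseconds (requires jq):

# Rough tokens/second for a given model
curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma3:4b","prompt":"Count to ten.","stream":false}' \
  | jq '.eval_count / .eval_duration * 1e9'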


Known Issues

gemma4:31b / doctor-beam Empty Responses

Status: Model loads and runs, but returns empty content via API.

Symptoms:

  • eval_count shows tokens are being generated
  • response field is empty
  • done_reason is length

Root Cause: Ollama's built-in chat template for gemma4:31b may not be fully compatible with this model version. The model generates control tokens (<start_of_turn>) instead of content.

Workarounds:

  1. Use gemma3:27b or gemma4:26b for large-model tasks until fixed
  2. Try updating Ollama: curl -fsSL https://ollama.com/install.sh | sh
  3. Create a custom Modelfile with an explicit chat template:
    FROM gemma4:31b
    TEMPLATE """{{ .System }}
    {{ range .Messages }}<start_of_turn>{{ .Role }}
    {{ .Content }}<end_of_turn>
    {{ end }}<start_of_turn>model
    """
    

This is an upstream Ollama/Gemma 4 compatibility issue, not a Synq setup issue.