When I first built OneCamp as the self-hosted, anti-SaaS workspace, my goal was simple: give teams absolute control over their communication, data, and workflows without paying massive recurring subscription bills.
But as artificial intelligence shifted from a novelty to an essential teammate, I faced a new engineering dilemma. Standard SaaS platforms integrate AI by funneling all your private company discussions directly into a single, proprietary cloud LLM.
For a self-hosted workspace, that model is fundamentally broken. Some teams want absolute privacy, running small local models (like Llama 3 or Phi 3) entirely offline on their own hardware. Others want to leverage cloud endpoints (like OpenAI or Anthropic) with their own keys, while some want to route queries through specialized custom endpoints (like vLLM, LM Studio, or OpenRouter).
To solve this, I spent the last week re-engineering OneCamp into a fully model-agnostic, AI-native operating hub.
In this post, I will break down how I engineered an AI-agnostic provider system, solved the silent context-window truncation problem using an Exponentially Weighted Moving Average (EWMA) token calibrator, and built an Active Workspace Memory Layer that synchronizes decisions, commitments, and open questions across three separate databases in real-time.
To support local runtimes, cloud APIs, and arbitrary developer endpoints without duplicating LLM prompt logic across every feature, I decoupled the execution layer into a unified interface in Go.
                +--------------------------+
                |    AI Service Manager    |
                +--------------------------+
                              |
       +----------------------+----------------------+
       |                      |                      |
+------v------+        +------v------+        +------v------+
|   Ollama    |        |   OpenAI    |        |  Anthropic  |
|   (Local)   |        | (Reasoning) |        |   (Cloud)   |
+-------------+        +-------------+        +-------------+
Every LLM capability inside OneCamp (the chat assistant, the daily briefing generator, the transcript recap bot) calls through a small set of interfaces defined in services/AI/interfaces.go:
type ModelLister interface {
	ListModels(ctx context.Context) ([]ModelInfo, error)
}

type ModelManager interface {
	PullModel(ctx context.Context, tag string, onProgress func(percent float64)) error
	DeleteModel(ctx context.Context, tag string) error
}

type Summarizer interface {
	Summarize(ctx context.Context, text string, systemPrompt string) (string, error)
}
By decoupling these services, features like our Chat Composer or Document AI can remain completely oblivious to where the model is running. Swapping from a local Ollama model to Claude 3.5 Sonnet requires zero logic changes in the communication controllers.
In self-hosted environments, administrators expect configuration changes to take effect instantly.
When an admin updates API credentials or toggles reasoning budgets in the settings console, the backend triggers ai.ReloadAIService(), which applies the change without restarting the Go binary.
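One common way to get this kind of restart-free reload in Go is to publish the active service through an atomic pointer. The sketch below assumes that approach; the Config, Service, Reload, and Current names are hypothetical and are not taken from OneCamp's source. It relies on atomic.Pointer, which needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Config is a hypothetical subset of the admin AI settings.
type Config struct {
	Provider string
	APIKey   string
}

// Service is a hypothetical provider client built from a Config.
type Service struct{ cfg Config }

func (s *Service) Provider() string { return s.cfg.Provider }

// active holds the service that request handlers currently use.
var active atomic.Pointer[Service]

// Reload validates the new settings, builds a fresh service, and publishes
// it atomically. Requests that already loaded the old pointer finish
// against the old service; a rejected config leaves the old one in place.
func Reload(cfg Config) error {
	if cfg.Provider == "" {
		return fmt.Errorf("reload: provider must not be empty")
	}
	active.Store(&Service{cfg: cfg})
	return nil
}

// Current returns the service in effect right now (nil before the first Reload).
func Current() *Service { return active.Load() }

func main() {
	_ = Reload(Config{Provider: "ollama"})
	fmt.Println(Current().Provider())

	_ = Reload(Config{Provider: "anthropic", APIKey: "sk-example"})
	fmt.Println(Current().Provider())
}
```

Handlers call Current() once per request, so a settings change takes effect on the next request without locks on the hot path.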
When an openai_compatible address is provided, the reload passes the dialer through an SSRF guard (httpguard.go), which verifies that the destination IP does not map to localhost, link-local addresses, or cloud instance metadata endpoints. Without that check, a rogue LLM agent could use the endpoint to map the server's private network.

If you've ever integrated local LLMs served by Ollama, you've likely hit a silent, highly frustrating bug.
Local models are typically initialized with a fixed context window (e.g., num_ctx = 8192 tokens). When your assembled prompt (which includes system instructions, tool schemas, multi-turn history, and retrieved database context) exceeds that window, the runtime silently truncates the prompt from the front to make it fit.
Because system instructions and tool definitions sit at the very front of the prompt, the model suddenly loses its core guidelines. It begins outputting raw prose instead of JSON, hallucinating formatting, or hallucinating tools.
To solve this, I designed a dynamic context-budgeting system (contextBudget.go).
+-------------------------------------------------------------------+
|                       Total Usable Context                        |
+--------------------------------+----------------------------------+
| Response Reserve (1024)        | Scaffold / System Prompt (1024)  |
+--------------------------------+----------------------------------+
| Session History (25% of input) | Workspace Context (75% of input) |
+--------------------------------+----------------------------------+
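The budget estimator queries the active model's context ceiling and splits it into the strict, guarded pools shown above: a response reserve, a scaffold reserve for the system prompt, and the remaining input divided between session history and workspace context.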
If the history or database context exceeds its allotted budget, the system recursively trims old turns or low-ranking semantic search items, appending a visible tag: "\n\n[Context truncated to fit the model's window.]".
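As a rough illustration of the arithmetic, the sketch below uses the reserve sizes and the 25/75 split from the diagram. The Budget, Split, and TrimHistory names are hypothetical, and the token estimator passed in is a placeholder, so treat this as one possible shape rather than the contents of contextBudget.go.

```go
package main

import "fmt"

const (
	responseReserve = 1024 // tokens held back for the model's answer
	scaffoldReserve = 1024 // tokens held back for system prompt and tool schemas
	truncationTag   = "\n\n[Context truncated to fit the model's window.]"
)

// Budget holds the token allowance for each input pool.
type Budget struct {
	History   int
	Workspace int
}

// Split divides what remains of the context window after the two reserves:
// 25% goes to session history and the rest (75%) to workspace context.
func Split(contextWindow int) Budget {
	input := contextWindow - responseReserve - scaffoldReserve
	if input < 0 {
		input = 0
	}
	history := input / 4
	return Budget{History: history, Workspace: input - history}
}

// TrimHistory drops the oldest turns until the estimated total fits the
// budget. estimate is any token estimator supplied by the caller.
func TrimHistory(turns []string, budget int, estimate func(string) int) (kept []string, truncated bool) {
	total := 0
	for _, t := range turns {
		total += estimate(t)
	}
	start := 0
	for start < len(turns) && total > budget {
		total -= estimate(turns[start])
		start++
	}
	return turns[start:], start > 0
}

func main() {
	b := Split(8192)
	fmt.Println(b.History, b.Workspace) // 1536 4608

	// Placeholder estimator: roughly 4 characters per token, rounded up.
	estimate := func(s string) int { return (len(s) + 3) / 4 }
	turns := []string{"oldest turn, quite long indeed", "middle turn", "newest turn"}
	kept, truncated := TrimHistory(turns, 6, estimate)
	fmt.Println(len(kept), truncated)
	if truncated && len(kept) > 0 {
		fmt.Printf("%q\n", kept[len(kept)-1]+truncationTag)
	}
}
```

With an 8192-token window, 2048 tokens are reserved up front, leaving 6144 for input: 1536 for history and 4608 for workspace context.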
But how do we count tokens accurately? Using heavy tokenizers for every model family (tiktoken, Llama BPE, etc.) in a self-hosted Go backend is extremely expensive and practically impossible for custom models.
While a static 4 characters/token heuristic is standard, real tokenizers diverge wildly from it (especially on code blocks, markdown tables, or non-English text).
To solve this, I built a self-correcting calibrator that learns the active model’s true ratio at runtime:
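A minimal sketch of such a calibrator, assuming the runtime reports the real prompt token count after each call (many APIs return usage figures of this kind). The Calibrator type, its method names, and the smoothing factor are illustrative assumptions, not values from OneCamp.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"unicode/utf8"
)

// Calibrator tracks a running characters-per-token ratio for one model.
type Calibrator struct {
	mu    sync.Mutex
	ratio float64 // current chars/token estimate
	alpha float64 // smoothing factor in (0, 1]; higher reacts faster
}

// NewCalibrator starts from the common 4 chars/token heuristic.
func NewCalibrator(alpha float64) *Calibrator {
	return &Calibrator{ratio: 4.0, alpha: alpha}
}

// Observe folds in one real measurement: the prompt's character count and
// the token count the runtime reported for it.
func (c *Calibrator) Observe(chars, actualTokens int) {
	if chars <= 0 || actualTokens <= 0 {
		return // ignore unusable samples
	}
	sample := float64(chars) / float64(actualTokens)
	c.mu.Lock()
	defer c.mu.Unlock()
	// EWMA update: new = alpha*sample + (1-alpha)*old
	c.ratio = c.alpha*sample + (1-c.alpha)*c.ratio
}

// Ratio returns the current chars/token estimate.
func (c *Calibrator) Ratio() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.ratio
}

// Estimate predicts the token count for text, rounding up.
func (c *Calibrator) Estimate(text string) int {
	chars := utf8.RuneCountInString(text)
	return int(math.Ceil(float64(chars) / c.Ratio()))
}

func main() {
	c := NewCalibrator(0.5)
	c.Observe(3000, 1000) // this prompt really cost 3.0 chars/token
	fmt.Printf("ratio=%.2f\n", c.Ratio())
	fmt.Println("estimate:", c.Estimate("How many tokens will this prompt need?"))
}
```

Each observation pulls the ratio part of the way toward the measured value, so a single odd prompt cannot swing the estimate, while a sustained shift (say, a conversation full of code) is picked up within a few calls.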
Most AI integrations rely solely on raw document chunk vectors. When you ask, “What did we decide during yesterday’s planning meeting?”, the model runs a vector search on the transcript and hopes the relevant text chunks get pulled.
This often fails because raw chat transcripts are noisy, disorganized, and full of context shifts.
To fix this, I designed the Structured Workspace Memory Engine (memoryExtractor.go). Instead of storing raw data, OneCamp runs an ambient, opt-in agent that converts chat threads, transcripts, and documents into structured, atomic facts categorized into three kinds: decisions, commitments, and open questions.
To make these memories performant, searchable, and secure, the engine processes each extracted fact by writing it to three databases simultaneously:
                      +--------------------------+
                      |  Memory Fact Extracted   |
                      +--------------------------+
                                    |
           +------------------------+------------------------+
           |                        |                        |
+----------v----------+  +----------v----------+  +----------v----------+
|      Postgres       |  |     OpenSearch      |  |       DGraph        |
| (System of Record)  |  |  (Vector Semantic)  |  |     (GraphRAG)      |
+---------------------+  +---------------------+  +---------------------+
Postgres holds each fact in the workspace_memories table, securing dates, confidence scores, ownerships, and parent team/project bindings. OpenSearch stores the fact text (e.g., "Decision: Migrated file-upload pipelines to zero-trust magic-byte signatures.") as a high-density vector embedding, allowing instant semantic query matching.

To guarantee compliance and trust, I added a granular opt-out system. Administrators or channel moderators can exclude specific channels, group chats, or projects from the memory layer entirely.
Before the extractor runs, it validates exclusions via scopeIsExcluded():
func scopeIsExcluded(ctx context.Context, scope MemoryScope) bool {
	// check is a lookup helper that is not shown in this excerpt.
	return check(memoryModels.ExclusionChannel, scope.ChannelUUID) ||
		check(memoryModels.ExclusionProject, scope.ProjectUUID) ||
		check(memoryModels.ExclusionChatGrp, scope.ChatGrpID)
}
If a scope is excluded, the extraction pipeline aborts immediately, ensuring private discussions remain completely invisible to the semantic index.
To make this complex backend architecture feel fast and responsive to the user, I had to completely redesign the frontend client workflows in Next.js.
To render rich, formatted AI responses safely, we use DOMPurify on the client. But running plain DOMPurify during Next.js server-side rendering (SSR) throws errors, because DOMPurify needs a DOM and the server has none unless something like JSDOM supplies it.
This typically leads developers to use ad-hoc checks, resulting in dangerous React hydration mismatches because the server layout differs from the client paint.
To solve this, I built the <SafeHtml> component (SafeHtml.tsx).
It outputs an empty tag on both the server and the first client paint, ensuring a perfect 1:1 match. Then, inside useLayoutEffect (which executes synchronously after DOM mutations but before the browser paints), the component sanitizes the HTML string and updates state.
The user experiences a seamless, safe layout paint with zero layout shifting or unstyled content flashes.
For reasoning models (like DeepSeek R1 or OpenAI o3-mini), the model yields two streams: the thinking trace (complex logical chains) and the final output.
I built streamFetch.ts using direct browser ReadableStream readers. It parses Server-Sent Event (SSE) chunks on the fly, allowing our chat interfaces to render both the expandable “thought process” block and the final Markdown response as the characters are generated in real time.
[SSE Stream Source] ──> [ReadableStream Reader] ──> [Split SSE Frame \n\n]
                                                              |
                                          +-------------------+-------------------+
                                          |                                       |
                            [Update Thinking Trace State]            [Update Chat Output State]
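The framing step in the diagram is language-independent, so here it is sketched in Go to stay consistent with the backend examples (the real streamFetch.ts runs in the browser). The Splitter type and the "thinking" and "output" event names are assumptions for illustration. The sketch handles only "\n" line endings, although SSE also permits "\r\n" and "\r", and it needs Go 1.20 or later for strings.CutPrefix.

```go
package main

import (
	"fmt"
	"strings"
)

// Frame is one parsed SSE event.
type Frame struct {
	Event string
	Data  string
}

// Splitter buffers raw chunks, which may end mid-frame, and emits only
// complete frames (those terminated by a blank line).
type Splitter struct{ buf string }

// Feed appends a chunk and returns every frame completed by it.
func (s *Splitter) Feed(chunk string) []Frame {
	s.buf += chunk
	var frames []Frame
	for {
		i := strings.Index(s.buf, "\n\n")
		if i < 0 {
			return frames // keep the partial frame buffered
		}
		raw := s.buf[:i]
		s.buf = s.buf[i+2:]

		f := Frame{Event: "message"} // SSE default event type
		var data []string
		for _, line := range strings.Split(raw, "\n") {
			if v, ok := strings.CutPrefix(line, "event:"); ok {
				f.Event = strings.TrimSpace(v)
			} else if v, ok := strings.CutPrefix(line, "data:"); ok {
				data = append(data, strings.TrimPrefix(v, " "))
			}
		}
		f.Data = strings.Join(data, "\n")
		frames = append(frames, f)
	}
}

func main() {
	var sp Splitter
	var thinking, output strings.Builder

	// Network chunks rarely line up with frame boundaries.
	chunks := []string{
		"event: thinking\ndata: Compare the two ",
		"options\n\nevent: output\ndata: Use ",
		"option B.\n\n",
	}
	for _, c := range chunks {
		for _, f := range sp.Feed(c) {
			switch f.Event {
			case "thinking":
				thinking.WriteString(f.Data)
			default:
				output.WriteString(f.Data)
			}
		}
	}
	fmt.Printf("thinking=%q\noutput=%q\n", thinking.String(), output.String())
}
```

Buffering until the blank-line delimiter arrives is what lets the UI update two separate pieces of state without ever rendering half a frame.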
By combining model-agnostic client abstractions, EWMA-based token constraints, and a multi-database GraphRAG workspace memory layer, OneCamp delivers on its promise of an enterprise-grade workspace that operates entirely on your own terms.
You get the power of intelligent workspace briefings, instant unread digests (“Catch Me Up”), automated meeting recaps, and deep semantic searches—without streaming a single byte of your proprietary company data to external SaaS aggregators.
The new AI configuration panel and local model browser are now live in the Admin Settings > AI Configuration dashboard. Give it a spin, pull down a local Llama model, and let me know how it performs on your hardware!
Previous posts: Universal Import Engine: Migrating from 8 SaaS Platforms · OneCamp v2.0: GitHub Sync, Webhooks, Archiving · Why we use two databases · Building the Anti-SaaS Workspace
For real-time updates on self-hosting and self-correcting systems, follow me on Twitter.