Proxy mode (BYOK)

When to choose proxy over full mode

Proxy and full mode are different products, not different price tiers. Pick proxy when at least one of these applies:

Compliance / data residency. Internal policy or regulation says user data must not flow through a third-party LLM proxy. With proxy mode, the LLM call happens entirely inside your perimeter: Vilow never sits between you and your provider, and sees only the user message you send in prepare and the reply you post to absorb.
You've invested in a custom model. Fine-tuned GPT-4, an in-house Mistral, or a Llama-on-A100 cluster — proxy mode lets you keep using it while plugging in our personality, memory, and relationship engine.
Provider-key isolation. Your security team doesn't want OpenAI / Anthropic credentials sitting in a third-party vendor's environment. With proxy you never share them with us.
Your stack already does inference. Many teams have a centralised LLM gateway with budgeting, observability, and prompt-caching baked in. Proxy mode plugs the character intelligence layer into your gateway instead of duplicating those concerns on our side.
Regional or sovereign clouds. Run on Azure EU, AWS GovCloud, or a local provider. Your code talks to Vilow over HTTPS for prepare and absorb, no matter where your inference runs.

If none of the above apply, full mode is simpler and faster to integrate.

	Full mode	Proxy mode
Best for	Indie devs, startups, fastest path to working bot	Enterprise, compliance, custom or on-prem inference
Who runs the LLM?	Vilow (Grok)	You (any provider, any model)
Where does the LLM key live?	With us	Only with you
Inference observability / budgets	Surfaced via Vilow dashboard	Stays in your existing tooling
Latency	One round-trip via Vilow	Direct to your LLM
Streaming output	Built in (`/send-stream`)	Use your provider's streaming, then absorb
Adult / intimate features	Available with consent	Not available (most public LLM providers ban explicit content)

How it works

1. POST /v1/proxy/chat/{external_id}/{character_id}/prepare: you send the user message; we return a session_id and a system_prompt tailored to the character's current mood, memory, and relationship.

2. Your code calls your LLM with system_prompt + the user message. Your provider, your key, your costs.

3. POST /v1/proxy/chat/{session_id}/absorb: you post the LLM's reply back. We update memory and emotional state, ready for the next turn.

Always close the loop. Without absorb, the character's memory and emotions don't advance — replies will start to feel detached after a few turns. The official SDKs handle absorb for you.
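The three steps above can be sketched without the SDK as a single function. This is a minimal illustration, not the SDK's implementation: `run_turn`, `post`, and `llm` are hypothetical names, and `post(path, body)` stands in for whatever HTTP client you use to send JSON with your `X-API-Key` header and return the parsed response.

```python
from typing import Callable

# Hypothetical transport: post(path, json_body) -> parsed JSON response.
# A real integration would send this over HTTPS with the X-API-Key header.
Post = Callable[[str, dict], dict]
# Your LLM call: (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]


def run_turn(post: Post, llm: LLM, external_id: str, character_id: int,
             user_message: str) -> tuple[str, dict]:
    """Run one proxy-mode turn: prepare, call your LLM, absorb."""
    # Step 1: ask Vilow for a session and the tailored system prompt.
    prep = post(
        f"/v1/proxy/chat/{external_id}/{character_id}/prepare",
        {"user_message": user_message},
    )
    # Step 2: your provider, your key, your costs.
    reply = llm(prep["system_prompt"], prep["user_message"])
    # Step 3: close the loop so memory and emotional state advance.
    absorbed = post(
        f"/v1/proxy/chat/{prep['session_id']}/absorb",
        {"llm_response": reply},
    )
    return reply, absorbed
```

The function returns the raw LLM output alongside the absorb response; in the default envelope flow that raw output is the JSON envelope rather than plain prose.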

Quick start (Python)

pip install vilow-sdk openai

from vilow_sdk import VilowClient
from openai import OpenAI

vilow = VilowClient(api_key="vk_…")
oai = OpenAI()

def call_openai(system, user):
    r = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": user}],
    )
    # content may be None; fall back to an empty string like the TypeScript example
    return r.choices[0].message.content or ""

reply = vilow.chat.send(
    external_id="alice",
    character_id=42,
    user_message="how are you?",
    llm=call_openai,
    user_local_time="20:30",
)
print(reply)

Quick start (TypeScript)

npm i @vilow/sdk openai

import { VilowClient, type LLMCallable } from '@vilow/sdk';
import OpenAI from 'openai';

const vilow = new VilowClient({ apiKey: 'vk_…' });
const openai = new OpenAI();

const callOpenAI: LLMCallable = async (system, user) => {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: system },
      { role: 'user',   content: user },
    ],
  });
  return r.choices[0].message.content ?? '';
};

const reply = await vilow.chat.send({
  externalId: 'alice',
  characterId: 42,
  userMessage: 'how are you?',
  llm: callOpenAI,
  userLocalTime: '20:30',
});

Manual control

Need to inject your own logic between prepare and absorb (logging, streaming, retries)?

prep = vilow.chat.prepare(
    external_id="alice", character_id=42,
    user_message="how are you?",
)
# … your LLM call here, however you like.
# call_my_llm is a placeholder for your own function: (system, user) -> str
reply = call_my_llm(prep.system_prompt, prep.user_message)

vilow.chat.absorb(session_id=prep.session_id, llm_response=reply)
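As one example of logic between prepare and absorb, the sketch below forwards streamed chunks to the caller and absorbs the joined text once the stream ends. `stream_then_absorb` and its `absorb` callback are hypothetical names; the callback could wrap `vilow.chat.absorb(session_id=..., llm_response=...)` from the snippet above, and `chunks` is whatever iterable of text pieces your provider's streaming API yields.

```python
from typing import Callable, Iterable, Iterator


def stream_then_absorb(chunks: Iterable[str], session_id: str,
                       absorb: Callable[[str, str], object]) -> Iterator[str]:
    """Yield chunks as they arrive, then absorb the complete reply."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk  # forward to the user interface immediately
    # Reached only when the stream was consumed to the end without an
    # error, so a partial reply is never absorbed.
    absorb(session_id, "".join(parts))
```

If the consumer stops iterating early or the stream raises, nothing is absorbed, which leaves the decision about partial replies to your own error handling. Streaming user-visible prose this way probably fits best with envelope: false, since the default prompt asks the model for a JSON envelope.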

Using OpenAI tool calling (function calling)

If your assistant calls tools — flight search, weather lookup, RAG, your own DB — you run that loop yourself: your code, your OpenAI key, your tools. Vilow doesn't see or control mid-flight tool calls. We just need the final assistant text at the end so we can update memory and emotions.

By default prepare bakes a "respond with this JSON envelope" instruction into the styling prompt, which works great when the LLM only outputs prose — but it conflicts with OpenAI's tools mechanism (the model dumps tool arguments into content as JSON instead of using the tool_calls field, and tools never execute). The fix is one extra parameter on prepare:

POST /v1/proxy/chat/{external_id}/{character_id}/prepare
{
  "user_message": "what's the weather in Barcelona tomorrow?",
  "envelope":     false       // ← turn off the JSON-output instruction
}

With envelope: false the styling prompt no longer asks for JSON, so you can run a normal tools loop and post the final plain-prose reply to absorb. We then run an internal extraction pass on our side using the same persona context, and update memory + relationship + emotions automatically. The absorb response includes "extraction": "server_extracted" so you can confirm it ran:

// 1. prepare with envelope=false
const prep = await fetch(`${VILOW}/v1/proxy/chat/${user}/${char}/prepare`, {
  method: 'POST',
  headers: { 'X-API-Key': VK, 'Content-Type': 'application/json' },
  body: JSON.stringify({ user_message: msg, envelope: false }),
}).then(r => r.json());

// 2. your tools loop — your OpenAI key, your tools
const messages = [
  { role: 'system', content: prep.system_prompt },
  { role: 'user',   content: prep.user_message  },
];
let finalText = ''; // stays empty if the loop below hits its 5-iteration cap
for (let i = 0; i < 5; i++) {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages,
    tools: MY_TOOLS,
    tool_choice: 'auto',
  });
  const m = r.choices[0].message;
  if (m.tool_calls?.length) {
    messages.push(m);
    for (const tc of m.tool_calls) {
      const result = await runTool(tc.function.name, JSON.parse(tc.function.arguments));
      messages.push({ role: 'tool', tool_call_id: tc.id, content: JSON.stringify(result) });
    }
    continue;
  }
  finalText = m.content || '';
  break;
}

// 3. absorb the plain prose — Vilow extracts envelope server-side
const ab = await fetch(`${VILOW}/v1/proxy/chat/${prep.session_id}/absorb`, {
  method: 'POST',
  headers: { 'X-API-Key': VK, 'Content-Type': 'application/json' },
  body: JSON.stringify({ llm_response: finalText }),
}).then(r => r.json());
// ab.extraction === "server_extracted"
// ab.facts_extracted, ab.relationship — populated as in envelope mode

Don't roll your own extraction layer. Wrapping the reply in a hand-built JSON shell with zero deltas (just to make absorb "happy") silently disables the personality engine — trust, friendship, and emotions stop changing, no facts get extracted, the character freezes. Either stick with default envelope: true for a non-tools LLM, or use envelope: false + plain prose and let Vilow extract.

Cost note: server-side extraction runs one LLM call on Vilow's side per turn prepared with envelope: false (a fast Grok call, not your provider). It's billed under your normal proxy-mode message quota, so there are no surprise charges.

What we see — and what we don't

Item	Vilow sees it?
User message (you send it in `prepare`)	Yes
LLM's reply (you send it in `absorb`)	Yes
Character / user IDs	Yes (they're ours)
Your OpenAI / Anthropic / Grok / local key	No — never
Which model / provider you used	No
Your LLM cost or token counts	No
Mid-flight tool calls / function calls in your stack	No

Vilow API key location. The vk_… token you pass to the SDK is for our API. Keep it on your backend — never embed it in browser JS or mobile apps. If you build a web client, route requests through your own server that holds the key.
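A rough sketch of the server-side half of that advice: the backend reads the key from its own environment and copies only the fields it expects from the browser payload. The function name, the `VILOW_API_KEY` variable name, and the `base_url` argument are assumptions for illustration; only the path and the `X-API-Key` header come from this page.

```python
import os


def build_prepare_request(base_url: str, external_id: str, character_id: int,
                          browser_payload: dict) -> dict:
    """Build a prepare request on the backend; the key never leaves the server."""
    # Hypothetical environment variable name; use whatever your secret store provides.
    api_key = os.environ["VILOW_API_KEY"]
    return {
        "url": f"{base_url}/v1/proxy/chat/{external_id}/{character_id}/prepare",
        "headers": {"X-API-Key": api_key, "Content-Type": "application/json"},
        # Copy only the expected field; anything else the client sent is dropped.
        "json": {"user_message": str(browser_payload["user_message"])},
    }
```

The returned dictionary can be handed to the HTTP client of your choice; the browser only ever talks to your own route.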

What the styling prompt contains

Each prepare returns a system_prompt assembled from the character's stored state. The blocks are stable across calls so you can cache or post-process if needed:

Character — name, gender (in natural prose), persona, backstory, custom traits.
Personality — a 4–6 line description, written in tendency language ("tends to", "generally"). Generated once from the Big Five vector; not the raw scores.
How personality interacts with current state — the precedence rule so your LLM knows that an optimist who is sad is sad.
Current state — local time bucket, life event in progress, dominant emotions, dominant needs. All in prose, no numbers.
What you know about this user — up to 3 cherry-picked facts from memory.
Recent dialogue — last few turns of conversation.
Relationship — duration of contact and warmth, in prose.
Style for this reply — language, length, tone hints.
Don'ts — boundaries (don't invent shared past, don't pile up questions, AI-disclosure rule).
User: … — the user's actual message, ready to feed your LLM.
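If you want to post-process individual blocks, a splitter along these lines may be enough. It assumes every block starts with a `# ` heading line, which is inferred only from the `"# Character\n…"` sample in the endpoint reference and is not a documented guarantee; `split_blocks` is a hypothetical helper.

```python
def split_blocks(system_prompt: str) -> dict[str, str]:
    """Split a styling prompt into {heading: body}, assuming '# ' heading lines."""
    blocks: dict[str, str] = {}
    title = None
    for line in system_prompt.splitlines():
        if line.startswith("# "):
            # A new block begins; remember its heading text.
            title = line[2:].strip()
            blocks[title] = ""
        elif title is not None:
            # Body line of the current block (text before the first heading is ignored).
            blocks[title] += line + "\n"
    return {name: body.strip() for name, body in blocks.items()}
```

Check the output against a real prompt from your account before relying on it, since the heading format is an assumption.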

Endpoint reference

POST `/v1/proxy/chat/{external_id}/{character_id}/prepare`

{
  "user_message":     "как дела?",
  "user_local_time":  "20:30",         // optional
  "language":         "ru",             // optional override
  "disclose_ai":      true,             // optional, default true
  "envelope":         true              // optional, default true. set false
                                        // when running OpenAI tool calling
                                        // (see "Using OpenAI tool calling")
}

→ 200 OK
{
  "session_id":     "fd1a8c...",
  "system_prompt":  "# Character\nYou are Anna, a woman.\n…",
  "user_message":   "как дела?",
  "expires_at":     "2026-04-30T20:34:11Z",
  "state_version":  1
}
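Because the response carries `expires_at`, a client can check the deadline locally before posting to absorb. The helper below is a sketch with a hypothetical name; it assumes the timestamp is ISO 8601 in UTC with a trailing `Z`, as in the sample above.

```python
from datetime import datetime, timezone
from typing import Optional


def is_expired(expires_at: str, now: Optional[datetime] = None) -> bool:
    """Return True if the session deadline has passed."""
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    deadline = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= deadline
```

When this returns True, calling prepare again is cheaper than waiting for the absorb call to be rejected.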

POST `/v1/proxy/chat/{session_id}/absorb`

{
  "llm_response":     "Привет... день был тяжёлый. Сама как?",
  "idempotency_key":  "your-uuid"      // optional, prevents double-absorb on retry
}

→ 200 OK
{
  "session_id":      "fd1a8c...",
  "status":          "absorbed",
  "extraction":      "envelope",       // "envelope" | "server_extracted" | "plain"
  "facts_extracted": 2,
  "relationship":    { "trust": 0.32, "friendship": 0.41, "stage": "warming" }
}

Extraction modes:

envelope — your LLM returned a JSON envelope (default flow with envelope: true on prepare).
server_extracted — you used envelope: false on prepare and posted plain prose; Vilow ran an internal extraction call to derive deltas + facts.
plain — extraction couldn't be performed (envelope disabled and the server-side extraction call failed). The visible reply is still saved, but no memory/relationship updates happened for this turn.
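Since a plain result means the turn was saved without memory or relationship updates, it may be worth checking the field on every absorb response. A minimal sketch, with `memory_updated` as a hypothetical helper name:

```python
def memory_updated(absorb_response: dict) -> bool:
    """Return True if this turn updated memory and relationship state."""
    mode = absorb_response.get("extraction")
    if mode in ("envelope", "server_extracted"):
        return True
    if mode == "plain":
        # Reply was saved, but no deltas or facts were derived for this turn.
        return False
    # Anything else is not one of the three documented modes.
    raise ValueError(f"unexpected extraction mode: {mode!r}")
```

Logging or counting the False cases gives an early signal that server-side extraction is failing.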

Status codes

401 — bad/missing API key.
402 — quota or balance limit (body.code says which).
404 — character or session not found.
409 — session already absorbed (use a different session_id or send the original idempotency_key for a safe retry).
410 — session expired (TTL is 24 h; just call prepare again).
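One way to keep this handling in a single place is a small lookup from status code to next step. The `Action` names and the `UNEXPECTED` fallback are illustrative choices, not part of the API; the mapping itself follows the list above.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    CHECK_API_KEY = "check_api_key"
    CHECK_QUOTA_OR_BALANCE = "check_quota_or_balance"
    CHECK_IDS = "check_ids"
    ALREADY_ABSORBED = "already_absorbed"
    PREPARE_AGAIN = "prepare_again"
    UNEXPECTED = "unexpected"


_ACTIONS = {
    200: Action.PROCEED,
    401: Action.CHECK_API_KEY,           # bad or missing API key
    402: Action.CHECK_QUOTA_OR_BALANCE,  # body.code says which limit was hit
    404: Action.CHECK_IDS,               # character or session not found
    409: Action.ALREADY_ABSORBED,        # session was absorbed earlier
    410: Action.PREPARE_AGAIN,           # session expired; start a new one
}


def next_action(status: int) -> Action:
    """Map an HTTP status from prepare/absorb to a suggested next step."""
    # Codes not listed on this page fall through to UNEXPECTED.
    return _ACTIONS.get(status, Action.UNEXPECTED)
```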

Intimate mode is NOT available in proxy

Proxy mode is for work-grade assistants and product chatbots. Intimate / 18+ features are intentionally not exposed here, for three reasons:

Most public LLM providers (OpenAI, Anthropic) ban explicit content — using them via proxy with intimate flows would risk your account.
Personal beats / NSFW are a different product surface (consent gating, age verification) and are guaranteed only inside full mode where Vilow runs the LLM.
The intimate persona is generated by our shaping rules and stays on our side; it doesn't appear in a styling prompt regardless of consent state.

If a user explicitly asks for an intimate conversation, switch to full mode (POST /v1/chat/{user}/{character}/send) on our LLM. The same character keeps its memory, emotions, and relationship — see "Switching modes" below.

Switching modes — does the bot remember?

Yes. A character is a single row in our database — memory, emotions, trust/friendship, life events, promises, shared memories all live there. Both endpoints (full /v1/chat/…/send and proxy /v1/proxy/chat/…) write to the same row.

Use proxy mode in the morning → bot stores a new fact, mood updates.
Switch to full mode in the evening → bot recalls the morning fact, continues with the same mood.
Switch back to proxy → memory stays in sync.

Best practices

Use the SDK. chat.send() handles prepare → llm → absorb in one call. Manual prepare+absorb is for special cases.
Always pass user_local_time. Even rough HH:MM matters — the bot's behaviour shifts by time of day.
Set idempotency_key on retries. Network blips happen; without it a retried absorb may collide with the original.
Monitor unanswered prepares. If > 50% of your sessions never get an absorb, your integration is leaking — check error handling in the LLM call path.
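The idempotency advice can be sketched as a retry wrapper that generates the key once and reuses it on every attempt. `absorb_with_retry` and `post` are hypothetical names; `post(path, body)` stands in for your HTTP client, and treating `ConnectionError` as the retryable failure is an assumption about that client.

```python
import uuid
from typing import Callable


def absorb_with_retry(post: Callable[[str, dict], dict], session_id: str,
                      llm_response: str, attempts: int = 3) -> dict:
    """Post to absorb, retrying network failures with one idempotency key."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Generated once so every retry carries the same key.
    body = {"llm_response": llm_response, "idempotency_key": str(uuid.uuid4())}
    last_error = None
    for _ in range(attempts):
        try:
            return post(f"/v1/proxy/chat/{session_id}/absorb", body)
        except ConnectionError as exc:
            last_error = exc  # network blip; try again with the same key
    raise last_error
```

A real client would probably add a short backoff between attempts; it is left out here to keep the sketch focused on key reuse.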