CodeWaifu: VTuber Voice Assistant

So, after my first attempt at making this huge Electron app and kind of overcomplicating things, I've started from a clean slate and got some actual results.

New Architecture

This time I ditched everything that's not necessary. It's still a web app, but instead of Electron it's just a CLI app that serves a web frontend to localhost. This simplifies a lot of things because I don't need to deal with executables and any of that other stuff that doesn't really matter. A short overview of the stack:

tRPC
Express (with Vite middleware)
Vite
Mastra
React
shadcn
Three.js

This is actually a rather nice stack and lets me move quickly, especially with the help of AI coding tools. Mastra stands out because it abstracts cleanly over the native LLM APIs, makes it very easy to create agents, and, much more importantly, lets you interface with voice LLMs directly.

AI Coding Agent Integration

The stack above covers the basics but still leaves out the coding agent-specific changes. Here I'm still experimenting quite a bit. In general, just hooking up Claude 4 to an agent that has access to the filesystem and can run CLI commands gets you quite far, but the API pricing is far too expensive for me. Additionally, I'd like to play around with local models.

To actually get anywhere, I'm abstracting over the AI coding assistant, so the actual CodeWaifu assistant only gets access to a CLI coding agent. In the beginning it'll just be Claude Code, but on the side I'm experimenting with Qwen3 32B, which would be far cheaper and could actually be run locally on some (beefy) machines.
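To make the abstraction a bit more concrete, here is a minimal sketch of what such a layer could look like. Everything in it is hypothetical: `CodingAgent`, `CliCodingAgent`, and `CommandRunner` are names made up for illustration, and the `-p` flag for passing the prompt is an assumption that should be checked against whichever CLI is actually used. The process spawning is injected, so the example uses a fake runner instead of starting a real process.

```typescript
// Hypothetical abstraction; none of these names come from the CodeWaifu code.
interface CodingAgent {
  readonly name: string;
  // Sends a natural-language task and resolves with the agent's textual reply.
  runTask(prompt: string): Promise<string>;
}

// Runs a command and resolves with its stdout. Injected so it can be swapped or faked.
type CommandRunner = (command: string, args: string[]) => Promise<string>;

// Wraps any CLI coding agent that accepts the prompt as command-line arguments.
class CliCodingAgent implements CodingAgent {
  readonly name: string;
  private readonly command: string;
  private readonly buildArgs: (prompt: string) => string[];
  private readonly run: CommandRunner;

  constructor(
    name: string,
    command: string,
    buildArgs: (prompt: string) => string[],
    run: CommandRunner,
  ) {
    this.name = name;
    this.command = command;
    this.buildArgs = buildArgs;
    this.run = run;
  }

  runTask(prompt: string): Promise<string> {
    return this.run(this.command, this.buildArgs(prompt));
  }
}

// Fake runner for demonstration; a real one would spawn the process and collect stdout.
const calls: Array<{ command: string; args: string[] }> = [];
const fakeRunner: CommandRunner = async (command, args) => {
  calls.push({ command, args });
  return `ran ${command}`;
};

// Assumption: the CLI takes the prompt via a "-p" flag; adjust for the actual tool.
const claudeCode = new CliCodingAgent(
  "claude-code",
  "claude",
  (prompt) => ["-p", prompt],
  fakeRunner,
);
```

Swapping in a locally hosted model would then only mean constructing another `CodingAgent`, without the voice assistant needing to know which one it is talking to.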

Avatar Creation and Voice Models

In general, I've got to say that VRoid Studio was really nice for creating the avatar. It also exports to VRM, which is a standard format for VTuber avatars and can be easily rendered in Three.js, including animations!

So far the voice models still seem somewhat disappointing. I still have to experiment with the various models and find one that actually allows for somewhat natural conversation with workable tool calling abilities. I've gotten the furthest by using Whisper as the STT stage of an STT → LLM → TTS pipeline, which, while not really natural, worked reasonably well.
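The shape of such an STT → LLM → TTS pipeline can be sketched in a few lines. This is only an illustration under assumptions: the three stage types and `runVoiceTurn` are hypothetical names, and the stages are stubbed out rather than wired to Whisper or any real model.

```typescript
// All three stage types are hypothetical placeholders, not a real library's API.
type SpeechToText = (audio: Uint8Array) => Promise<string>;
type ChatModel = (userText: string) => Promise<string>;
type TextToSpeech = (text: string) => Promise<Uint8Array>;

interface VoiceTurn {
  heard: string; // transcript produced by the STT stage
  reply: string; // text produced by the LLM
  audio: Uint8Array; // synthesized speech for the reply
}

// Chains the three stages for a single conversational turn.
async function runVoiceTurn(
  audio: Uint8Array,
  stt: SpeechToText,
  llm: ChatModel,
  tts: TextToSpeech,
): Promise<VoiceTurn> {
  const heard = await stt(audio);
  const reply = await llm(heard);
  const spoken = await tts(reply);
  return { heard, reply, audio: spoken };
}

// Stub stages so the sketch runs without any model; real ones would call STT, LLM, and TTS services.
const fakeStt: SpeechToText = async (audio) => `heard ${audio.length} bytes`;
const fakeLlm: ChatModel = async (text) => `You said: ${text}`;
const fakeTts: TextToSpeech = async (text) => new Uint8Array(text.length);
```

Because each stage waits for the previous one to finish, the delays add up, which is probably a big part of why this setup doesn't feel like natural conversation.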

Animation and Expression Control

Apart from that, as soon as I get this settled I'll have to experiment with which tools to provide to the avatar, since I'd like the LLM to have some control over the animation and expression of the avatar. Currently I'm thinking of building a finite state machine where normal code handles the actual animations and the interpolation between states, while the LLM decides which state to switch to. It might be tricky to actually time it right, though, because it'd be great if animations could change mid-conversation in a sort of natural fashion. That could make the assistant quite lively.
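A rough sketch of that split could look like this. The state names, the transition table, and the half-second cross-fade are all made up for illustration; the idea is just that `request` is what an LLM tool call would hit, while `update` runs every frame in normal code and produces a blend weight for the renderer.

```typescript
// Hypothetical state names and transition table, purely for illustration.
type AvatarState = "idle" | "listening" | "thinking" | "talking" | "happy";

const transitions: Record<AvatarState, readonly AvatarState[]> = {
  idle: ["listening", "happy"],
  listening: ["thinking", "idle"],
  thinking: ["talking", "idle"],
  talking: ["idle", "happy", "listening"],
  happy: ["idle", "talking"],
};

interface Blend {
  from: AvatarState;
  to: AvatarState;
  weight: number; // 0 = fully the previous pose, 1 = fully the current pose
}

class AvatarStateMachine {
  private current: AvatarState = "idle";
  private previous: AvatarState = "idle";
  private weight = 1;

  get state(): AvatarState {
    return this.current;
  }

  // Called from an LLM tool call; rejects transitions the table does not allow.
  request(next: AvatarState): boolean {
    if (transitions[this.current].indexOf(next) === -1) return false;
    this.previous = this.current;
    this.current = next;
    this.weight = 0; // restart the cross-fade
    return true;
  }

  // Called every frame by normal code; advances the cross-fade between poses.
  update(deltaSeconds: number, fadeSeconds: number = 0.5): Blend {
    this.weight = Math.min(1, this.weight + deltaSeconds / fadeSeconds);
    return { from: this.previous, to: this.current, weight: this.weight };
  }
}
```

Keeping the table in code means the LLM can ask for anything, but the avatar only ever performs transitions that have an animation path defined for them.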

Mobile Development Use Case

One other thing I'd like to try out and use CodeWaifu for is coding on the go. So far I still don't have a good setup for feeding prompts to Claude Code with just my phone. It would be great if I could just go on a webpage and talk to the avatar, which then prompts Claude Code for me. This is mainly because I'd like to experiment with coding while taking a walk or going to the gym.

Voice-Only Code Review Challenge

The main problem here will probably be that I need to think of a good way for CodeWaifu to explain what's been done using just voice. While Claude Code is quite competent, it still makes mistakes quite often, and for me it only really works if I at least casually glance over everything to make sure it doesn't paint itself into a corner or make terrible architectural decisions.

For this I not only need the executive summary but actually need to be able to understand what happens in a file using voice only. But maybe it's enough to feed an entire file into an LLM and let it give a description of the functionality contained using normal language. Far too often have I heard things like "function, trim, open parenthesis, string, colon, string, close parenthesis, bracket open...", instead of something like "function trim that has a single argument, string of type string".

We'll see. If this fails, using Tree-sitter to parse the syntax and then generating a sort of botched English description that an LLM can fix up might also be a workable solution.
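To give an idea of what the "botched English" step might produce, here is a small sketch. It does not use Tree-sitter's actual API: `FunctionNode` and `ParamNode` are hypothetical, simplified stand-ins for whatever a parser would hand over, and `describeFunction` is a made-up helper.

```typescript
// Hypothetical, simplified node shapes; a real parser's tree would look different.
interface ParamNode {
  name: string;
  type?: string;
}

interface FunctionNode {
  name: string;
  params: ParamNode[];
  returnType?: string;
}

// Turns a function signature into a rough spoken description.
function describeFunction(fn: FunctionNode): string {
  const count = fn.params.length;
  const countText =
    count === 0 ? "no arguments" : count === 1 ? "a single argument" : `${count} arguments`;
  const paramText = fn.params
    .map((p) => (p.type ? `${p.name} of type ${p.type}` : p.name))
    .join(", ");

  let text = `function ${fn.name} that has ${countText}`;
  if (count > 0) text += `, ${paramText}`;
  if (fn.returnType) text += `, and returns ${fn.returnType}`;
  return text;
}

const spoken = describeFunction({
  name: "trim",
  params: [{ name: "string", type: "string" }],
});
// spoken === "function trim that has a single argument, string of type string"
```

The output is stiff, but it is already much easier to follow by ear than a token-by-token readout, and an LLM could smooth it out from there.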

Adiós, べン

Ben's blog