The Creature Listener is what gives my creatures the ability to listen and talk back. It’s a standalone C++ application that runs on a Raspberry Pi 5 sitting near each creature, and it handles the entire conversational pipeline — from hearing a wake word, to understanding what was said, to coming up with a response, to making the creature actually say it out loud with lip-synced animation.
The code is on GitHub, like everything else. It’s written in C++20 and packaged as a Debian package for easy deployment.
This is probably the most ambitious piece of the whole system. Getting a creature to respond conversationally in real time required stitching together a lot of different technologies, and making them all work together on a Pi was a fun challenge. But when it all comes together and you say “Hey Beaky” and she responds with something clever? Totally worth it. 💜
How It Works
The listener runs as a state machine that loops through a conversation flow:
- LISTENING — Waiting for a wake word (or a button press)
- RECORDING — Capturing what the person is saying, using voice activity detection to know when they’ve stopped talking
- TRANSCRIBING — Converting the recorded audio to text
- THINKING — Sending the transcript to a local LLM to generate a response
- SPEAKING — Streaming the response to the Creature Server, which handles text-to-speech, lip sync generation, and playback
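The loop above can be sketched as a simple state enum with a happy-path transition function. This is my illustration of the idea, not the actual code — the real listener's state names and transitions may differ:

```cpp
#include <cassert>

// Hypothetical sketch of the conversation loop's states.
enum class State { Listening, Recording, Transcribing, Thinking, Speaking };

// Advance to the next state along the happy path of a conversation turn.
// After the creature finishes speaking, we loop back to Listening.
State nextState(State s) {
    switch (s) {
        case State::Listening:    return State::Recording;    // wake word heard
        case State::Recording:    return State::Transcribing; // VAD detected silence
        case State::Transcribing: return State::Thinking;     // transcript ready
        case State::Thinking:     return State::Speaking;     // LLM responded
        case State::Speaking:     return State::Listening;    // playback finished
    }
    return State::Listening;
}
```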
The whole thing is pipelined so you don’t have to wait for the entire response to be generated before the creature starts talking. As soon as the LLM produces its first sentence, it gets sent to the server and the creature starts speaking while the rest of the response is still being generated. Pipelining cut the gap between “done talking” and “creature responds” from about 15 seconds down to around 2 seconds. That matters a lot when you’re trying to have a natural conversation!
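The sentence-level streaming can be sketched like this: as tokens stream in from the LLM, flush each completed sentence to a callback immediately instead of waiting for the full response. The class name and flush callback here are my own illustration, not the real API:

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of sentence-level pipelining: each completed sentence is flushed
// to a callback (e.g. "send to the Creature Server for TTS") as soon as
// its terminator arrives, while later tokens are still being generated.
class SentenceStreamer {
public:
    explicit SentenceStreamer(std::function<void(const std::string&)> flush)
        : flush_(std::move(flush)) {}

    // Feed one streamed token (or text fragment) from the LLM.
    void feed(const std::string& token) {
        buffer_ += token;
        // Flush every complete sentence sitting in the buffer.
        while (true) {
            auto pos = buffer_.find_first_of(".!?");
            if (pos == std::string::npos) break;
            flush_(buffer_.substr(0, pos + 1));
            buffer_.erase(0, pos + 1);
        }
    }

    // Flush any trailing text once the stream ends.
    void finish() {
        if (!buffer_.empty()) { flush_(buffer_); buffer_.clear(); }
    }

private:
    std::function<void(const std::string&)> flush_;
    std::string buffer_;
};
```

A real version would also need to avoid splitting on periods inside abbreviations or numbers, but the core idea is just this buffer-and-flush loop.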
Wake Word Detection
Each creature has its own wake word. Beaky responds to “Hey Beaky” and Mango responds to “Hey Mango.” I’m using LOWWI (Lightweight Open Wake Word Implementation) for this, which runs openWakeWord models via ONNX Runtime. It all runs locally on the Pi with no cloud services needed.
Training a custom wake word model is surprisingly straightforward. You generate synthetic samples using text-to-speech services, mix them with background noise, and train a small neural network. The resulting model is tiny and runs with almost no CPU overhead.
I’ll be honest — wake word detection with a cheap USB microphone sitting next to a speaker that’s playing the creature’s voice is hard. I ended up adding a cooldown timer after each conversation turn so the creature doesn’t hear itself talking and try to respond to its own voice. (That was a fun bug to debug. 😅)
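The cooldown idea is simple enough to sketch: after playback ends, ignore wake-word detections for a short window. The names and the window length here are illustrative, not the real implementation:

```cpp
#include <chrono>

// Sketch of the self-hearing guard: after the creature stops speaking,
// ignore wake-word hits for a cooldown period so it doesn't try to
// respond to its own voice coming out of the speaker.
class WakeWordGate {
public:
    using Clock = std::chrono::steady_clock;

    explicit WakeWordGate(std::chrono::milliseconds cooldown)
        : cooldown_(cooldown) {}

    // Call when playback of the creature's response finishes.
    void onSpeechFinished(Clock::time_point now) { lastSpoke_ = now; }

    // Should a wake-word detection at `now` be acted on?
    bool shouldTrigger(Clock::time_point now) const {
        return now - lastSpoke_ >= cooldown_;
    }

private:
    std::chrono::milliseconds cooldown_;
    Clock::time_point lastSpoke_{}; // epoch: no speech yet
};
```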
Speech-to-Text
For transcription I’m using whisper.cpp. It can run locally on the Pi, but the Pi 5 is a bit slow for this — transcribing a clip took longer than the clip itself lasted. So I added the option to offload STT to the Creature Server, which has much beefier hardware and can transcribe in a fraction of the time.
One thing I learned the hard way is that Whisper hallucinates on silence. If you feed it a recording that’s mostly quiet, it’ll confidently transcribe things like “Thank you for watching!” or “(speaking in foreign language)” that nobody actually said. I ended up building a hallucination filter that catches common patterns and throws them out.
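A filter like that can be as simple as a case-insensitive substring match against known offenders. The phrase list below is my illustration — the real filter's patterns may differ:

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

// Lowercase a copy of the input for case-insensitive matching.
static std::string lowered(std::string_view s) {
    std::string out(s);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Sketch of the hallucination filter: reject transcripts matching phrases
// Whisper commonly invents on near-silent audio, and reject empty ones.
bool looksLikeHallucination(std::string_view transcript) {
    static const std::array<std::string_view, 4> patterns = {
        "thank you for watching",
        "thanks for watching",
        "subscribe to",
        "(speaking in foreign language)",
    };
    const std::string t = lowered(transcript);
    if (t.find_first_not_of(" \t\n") == std::string::npos) return true; // all whitespace
    for (auto p : patterns)
        if (t.find(lowered(p)) != std::string::npos) return true;
    return false;
}
```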
The LLM
The responses are generated by a local LLM running on llama.cpp’s server. I’m currently using Mistral Nemo, which is small enough to run well on consumer hardware but smart enough to be genuinely funny. Each creature has its own system prompt that gives it a personality — Beaky is a sassy macaw, Mango is sweet and enthusiastic.
The listener maintains a conversation history so creatures remember what you’ve been talking about within a session. If you tell Beaky a joke and then say “tell me another one,” she knows what you mean.
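A bounded history like that might look something like this sketch — keep the system prompt (the personality) pinned, and drop the oldest turns once the window fills. The structure and limits are illustrative, not the real code:

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// A chat message in the usual role/content shape.
struct Message { std::string role, content; };

// Sketch of a bounded conversation history: the system prompt always
// leads the request, while old user/assistant turns age out.
class History {
public:
    History(std::string systemPrompt, std::size_t maxTurns)
        : system_{"system", std::move(systemPrompt)}, maxTurns_(maxTurns) {}

    void add(std::string role, std::string content) {
        turns_.push_back({std::move(role), std::move(content)});
        while (turns_.size() > maxTurns_) turns_.pop_front();
    }

    // Messages to send to the LLM: system prompt first, then recent turns.
    std::vector<Message> forRequest() const {
        std::vector<Message> out{system_};
        out.insert(out.end(), turns_.begin(), turns_.end());
        return out;
    }

private:
    Message system_;
    std::size_t maxTurns_;
    std::deque<Message> turns_;
};
```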
Home Assistant Integration
This is where it gets really fun. The LLM has access to tools that let it query my Home Assistant server. So you can ask Beaky “what’s the temperature outside?” and she’ll actually check the sensor and tell you. Or ask if the front door is locked. The creatures can answer questions about the real state of the house!
Under the hood this uses the OpenAI function calling format. When the LLM decides it needs to check a sensor, it emits a tool call, the listener executes it against the Home Assistant REST API, and feeds the result back to the LLM so it can formulate a natural response. (Getting this to work reliably with a local model took some iteration, but it works great now.)
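The dispatch step in the middle of that loop can be sketched as a registry mapping tool names to handlers. The real listener speaks the OpenAI function-calling JSON format and calls the Home Assistant REST API; both are stubbed out here, and all names are my own:

```cpp
#include <functional>
#include <map>
#include <string>

// Sketch of tool dispatch: the LLM emits a tool call (name + JSON args),
// the listener runs the matching handler, and the result string is fed
// back into the conversation as a tool message for the final answer.
class ToolRegistry {
public:
    using Tool = std::function<std::string(const std::string& args)>;

    void add(std::string name, Tool fn) {
        tools_[std::move(name)] = std::move(fn);
    }

    // Execute a tool call the LLM requested.
    std::string call(const std::string& name, const std::string& args) const {
        auto it = tools_.find(name);
        if (it == tools_.end()) return "error: unknown tool " + name;
        return it->second(args);
    }

private:
    std::map<std::string, Tool> tools_;
};
```

Returning an error string for unknown tools (rather than throwing) matters here: the error goes back to the LLM as a tool result, so the model can recover gracefully instead of crashing the turn.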
Audio Capture
Audio is captured via PortAudio from whatever USB microphone is attached to the Pi. Most cheap USB microphones run at 48kHz natively, but the speech processing pipeline needs 16kHz, so the listener auto-detects the device’s native sample rate and decimates down as needed.
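Since 48 kHz is exactly 3× 16 kHz, the decimation can be sketched as averaging each group of three samples into one. Averaging is a crude low-pass; a production resampler would use a proper anti-aliasing filter, so treat this as the idea rather than the implementation:

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of 3:1 decimation (48 kHz -> 16 kHz): average every
// group of three input samples into one output sample. The averaging
// acts as a rough low-pass to tame aliasing.
std::vector<float> decimate48kTo16k(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() / 3);
    for (std::size_t i = 0; i + 2 < in.size(); i += 3)
        out.push_back((in[i] + in[i + 1] + in[i + 2]) / 3.0f);
    return out;
}
```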
Voice activity detection is energy-based with an adaptive noise floor. When the listener starts up, it spends about a second calibrating to the ambient noise level, and then uses that as a baseline to detect when someone is actually speaking. This works surprisingly well even in a room with a TV on in the background.
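The core of an energy-based VAD with an adaptive floor fits in a few lines. The margin and smoothing constants below are illustrative guesses, not the tuned values:

```cpp
#include <cmath>
#include <vector>

// Sketch of energy-based VAD with an adaptive noise floor: a frame counts
// as speech when its RMS energy exceeds the ambient floor by a margin.
// The floor only adapts on non-speech frames, so talking doesn't drag
// the baseline upward.
class Vad {
public:
    // Feed one audio frame; returns true if it looks like speech.
    bool isSpeech(const std::vector<float>& frame) {
        const float energy = rms(frame);
        const bool speech = energy > noiseFloor_ * margin_;
        if (!speech)
            noiseFloor_ = (1.0f - alpha_) * noiseFloor_ + alpha_ * energy;
        return speech;
    }

private:
    static float rms(const std::vector<float>& f) {
        float sum = 0.0f;
        for (float s : f) sum += s * s;
        return f.empty() ? 0.0f : std::sqrt(sum / f.size());
    }

    float noiseFloor_ = 0.01f; // starting guess, refined during calibration
    float margin_ = 3.0f;      // speech must be 3x the ambient level
    float alpha_ = 0.05f;      // smoothing factor for floor adaptation
};
```

The startup calibration described above amounts to feeding the first second of frames through this loop before trusting its output, so the floor settles on the room's actual ambient level.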
Observability
The whole conversational pipeline is traced end-to-end with OpenTelemetry. Each conversation turn creates a trace that follows the request from wake word detection through STT, LLM inference, and all the way to the Creature Server for TTS and playback. I send the traces to Honeycomb so I can see exactly where time is being spent and debug issues.
This has been invaluable for optimizing latency. When I can see that STT is taking 4 seconds on the Pi vs 200ms on the server, it makes the decision to offload pretty easy.
Deployment
The listener is packaged as a .deb and built automatically by GitHub Actions for both amd64 and arm64. It runs as a systemd service on each Pi, configured via a YAML file that specifies which creature it’s listening for, the LLM endpoint, wake word model, and other settings.
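A config for one creature might look something like this — the key names here are my illustration of the settings listed above, not the real schema:

```yaml
# Hypothetical example — key names are illustrative, not the actual schema.
creature: beaky
wake_word:
  model: hey_beaky.onnx
  threshold: 0.5
llm:
  endpoint: http://server.local:8080/v1
  model: mistral-nemo
stt:
  offload_to_server: true
creature_server: http://server.local:3000
```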
The goal is to have a listener running on a Pi near every creature in the house. Each one is independent — its own microphone, its own wake word, its own personality. But they all talk to the same Creature Server for the heavy lifting.
This is still a work in progress and I keep finding new things to improve. But hearing a creature respond to you in real time, with its beak moving in sync with the words, feels like actual magic. I love it. 😍