VOICE ASSISTANT

Christine Everhart waking up in the Stark mansion

"Good morning. It’s 7:00 a.m. The weather in Malibu is 72 degrees with scattered clouds. The surf conditions are fair with waist-to-shoulder high lines. High tide will be at 10:52 a.m."

I want it.

As soon as I heard that line, I knew I wanted a JARVIS.

There's only one problem--JARVIS is a bit beyond the technology of our time. But maybe I can build something useful. Something that I control. Something that does exactly what I want.

What do I want exactly?

I hate wake words.

"Hey Siri." *pause* (Did it actually work?) "Set a timer for 5 minutes."

It hardly seems like an intelligent assistant if you have to call it by a specific phrase to wake it up, right? I want to interact with something naturally--no push to talk either. I want a real assistant.

It should interact with smart home accessories. It should orchestrate things more holistically than piecemeal solutions. If it uses electricity, I want to control it to whatever extent possible with my voice.

I want to interact with it everywhere. In any room, at home, at the office, or anywhere else. I like the Rabbit R1 and Humane pin concepts, but I'm not going to carry around a pin.

It needs to be smart. I'm not in the mood to tell it to turn off the lights in the living room. Then the kitchen. Then the office. No one has time for that.

Let's recap:

  • No wake words
  • Smart home integration
  • Location awareness to contextualize actions and responses
  • Keeps track of mundane things (and reminds me) so I don't have to
  • Accessible via voice, app, or web browser
  • Conversational and functional
  • Handles commands more intelligently than Google Home/Siri/Alexa

What I'm working with

In the last couple of years I've started building all new things with Elixir. Phoenix LiveView makes building interactive applications easy, and the Elixir machine learning ecosystem has really grown up in the last 12 months.

Elixir offers Bumblebee for running pre-trained models out of the box. Using a Large Language Model (LLM) is going to be a critical part of this endeavor. I'm currently using llama-3-8b-instruct for interaction and function calling.
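For a rough idea of what that looks like, here's a minimal sketch of loading a chat model with Bumblebee and exposing it as an Nx.Serving. The repo name, options, and process name are illustrative rather than my exact setup:

    # Load a chat-tuned model from Hugging Face and build a text-generation serving.
    repo = {:hf, "meta-llama/Meta-Llama-3-8B-Instruct"}

    {:ok, model_info} = Bumblebee.load_model(repo)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
    {:ok, generation_config} = Bumblebee.load_generation_config(repo)

    serving =
      Bumblebee.Text.generation(model_info, tokenizer, generation_config,
        compile: [batch_size: 1, sequence_length: 1024],
        stream: true
      )

    # Started under the app's supervision tree:
    #   {Nx.Serving, serving: serving, name: Assistant.LLM}
    # and called from anywhere with:
    #   Nx.Serving.batched_run(Assistant.LLM, "Turn off the kitchen lights.")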

I haven't solved the HomePod-like device that will provide audio interaction around the house. I hope it can be done with a Raspberry Pi and a microphone HAT; the Jetson Nano is another, more expensive and less available, option. I plan on pairing each "HomePod" with a presence-detection device--if a human is alone in a room, the assistant can assume it is being talked to.

Starting Small

"Motivation follows action." -Adam Wathan

This is an ambitious project, one where I could easily get caught up planning the details for months. Or I could start with something simple that solves a need and iterate on its parts. I chose this as a starting point:

INPUT:  LiveView app (browser SpeechRecognition API)
          -> RASA for fast command processing
          -> LLM (OpenHermes 2 Mistral 7B) for chat / conversation
OUTPUT: LiveView chat (browser SpeechSynthesis API)
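To make that concrete, here's a rough sketch of the LiveView side. It assumes a client hook that pushes recognized speech up as a "transcript" event and reads anything sent back in a "speak" event aloud via the SpeechSynthesis API; the module names, event names, and the Assistant.Brain helper are illustrative placeholders, not the actual implementation:

    defmodule AssistantWeb.ChatLive do
      use AssistantWeb, :live_view

      def mount(_params, _session, socket) do
        {:ok, assign(socket, messages: [])}
      end

      # The "Speech" client hook pushes recognized text up as a "transcript" event.
      def handle_event("transcript", %{"text" => text}, socket) do
        # Assistant.Brain.respond/1 stands in for whatever produces a reply
        # (RASA intent handling or the LLM serving).
        reply = Assistant.Brain.respond(text)

        socket =
          socket
          |> update(:messages, &(&1 ++ [{:user, text}, {:assistant, reply}]))
          # The hook listens for "speak" and reads the reply aloud with SpeechSynthesis.
          |> push_event("speak", %{text: reply})

        {:noreply, socket}
      end

      def render(assigns) do
        ~H"""
        <ul id="chat" phx-hook="Speech">
          <li :for={{role, text} <- @messages}><%= role %>: <%= text %></li>
        </ul>
        """
      end
    end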

This architecture has a number of advantages for a first version:

  • Using the browser's SpeechRecognition and SpeechSynthesis APIs saves me from implementing Silero VAD, Whisper, and some sort of audio relay mechanism
  • I didn't think an LLM could parse commands as quickly as I wanted a response, so I used RASA to pick up commands. On user input, a dispatcher sent the text to both RASA and the LLM; if RASA came back with a confident result, the dispatcher cancelled the LLM task, which usually hadn't generated a token yet (see the sketch after this list).
  • With this first version more or less "solved," I could focus on implementing features that would be useful to the rest of the family.
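Here's a minimal sketch of that dispatch-and-race pattern using Task. The Assistant.Rasa and Assistant.LLM modules and the confidence threshold are stand-ins for the real clients:

    defmodule Assistant.Dispatcher do
      @confidence_threshold 0.8

      # Race the fast intent parser (RASA) against the LLM. If RASA is confident,
      # kill the LLM task before it has produced (much of) a response.
      def handle(text) do
        llm_task = Task.async(fn -> Assistant.LLM.respond(text) end)

        case Assistant.Rasa.parse(text) do
          {:ok, %{confidence: confidence, reply: reply}}
          when confidence >= @confidence_threshold ->
            Task.shutdown(llm_task, :brutal_kill)
            reply

          _ ->
            Task.await(llm_task, 30_000)
        end
      end
    end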

But it came with some big downsides.

I couldn't get RASA to work well. The most common failure mode I encountered was RASA detecting phantom commands and hijacking the LLM's response. It's likely that if I had invested more time into RASA it would have performed better, but I don't think it's part of my long-term plan.

I pulled RASA out and replaced OpenHermes 2 Mistral 7B with llama-3. I haven't yet found a prompt that allows function calling while maintaining the model's conversational ability. The closest I've come is an approach Hamel Husain posted in replicate-examples/cog-vllm-tools/predict.py, but it doesn't work very well with the 8B model. Until I solve this, users choose which "version" of the assistant they want to interact with--the conversational assistant or the function-calling assistant.
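The general shape of what I'm trying is a system prompt that lists the available tools and asks the model to reply with a JSON tool call when one applies, with the app parsing that out of the response. A simplified illustration (not the prompt from Hamel's example, and the tool list here is made up):

    defmodule Assistant.Tools do
      @system_prompt """
      You are a home assistant. When a tool applies, reply with a single JSON
      object such as {"tool": "set_timer", "arguments": {"minutes": 5}}.
      Available tools:
      - set_timer(minutes)
      - lights(room, state)
      Otherwise, just answer conversationally.
      """

      def system_prompt, do: @system_prompt

      # Interpret the model's reply as either a tool call or plain chat.
      def interpret(reply) do
        case Jason.decode(reply) do
          {:ok, %{"tool" => tool, "arguments" => args}} -> {:tool_call, tool, args}
          _ -> {:chat, reply}
        end
      end
    end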

Our assistant helps with weekly meal planning, runs kitchen timers, and watches the weather (how else could it tell me the weather at 7 a.m.?).

What's Next?

It's time to get a physical device working. Solving two-way audio (speech to text in, synthesized speech out) will be an interesting challenge. It could involve Rhasspy or Elixir's Membrane, or maybe I'll send the raw audio to Whisper running on whisper.cpp. My goals conflict with much of the available prior work, and I'm not sure exactly what solution I'll land on.