"Good morning. It’s 7:00 a.m. The weather in Malibu is 72 degrees with scattered clouds. The surf conditions are fair with waist-to-shoulder high lines. High tide will be at 10:52 a.m."
I want it.
As soon as I heard that line, I knew I wanted a JARVIS.
There's only one problem--JARVIS is a bit beyond the technology of our time. But maybe I can build something useful. Something that I control. Something that does exactly what I want.
I hate wake words.
"Hey Siri." *pause* (Did it actually work?) "Set a timer for 5 minutes."
It hardly seems like an intelligent assistant if you have to call it by a specific phrase to wake it up, right? I want to interact with something naturally--no push-to-talk either. I want a real assistant.
It should interact with smart home accessories. It should orchestrate things more holistically than piecemeal solutions. If it uses electricity, I want to control it to whatever extent possible with my voice.
I want to interact with it everywhere. In any room, at home, at the office, or anywhere else. I like the Rabbit R1 and Humane pin concepts, but I'm not going to carry around a pin.
It needs to be smart. I'm not in the mood to tell it to turn off the lights in the living room. Then the kitchen. Then the office. No one has time for that.
Let's recap:

- No wake words and no push-to-talk--just natural interaction
- Control over smart home accessories, orchestrated holistically
- Available everywhere: any room at home, at the office, or anywhere else
- Smart enough that I don't have to walk it through every room, light by light
In the last couple of years I've started building all new things with Elixir. Phoenix LiveView makes building interactive applications easy, and Elixir's machine learning ecosystem has really grown up in the last 12 months.
Elixir offers Bumblebee to run models out of the box. Using a Large Language Model (LLM) is going to be a critical part of this endeavor. I'm currently using llama-3-8b-instruct for interaction and function calling.
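For reference, loading and serving a chat model with Bumblebee looks roughly like this. This is a minimal sketch rather than my exact setup--the repo id, dtype, and generation options are illustrative, and the gated Llama 3 weights require a Hugging Face auth token.

```elixir
# Rough shape of serving a chat model with Bumblebee + Nx.Serving.
# The repo id and options below are illustrative, not the exact config.
repo = {:hf, "meta-llama/Meta-Llama-3-8B-Instruct", auth_token: System.fetch_env!("HF_TOKEN")}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 256)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 2048],
    defn_options: [compiler: EXLA]
  )

# In an application this usually runs under the supervision tree:
#   {Nx.Serving, serving: serving, name: Assistant.LLM, batch_timeout: 100}
# and gets called with Nx.Serving.batched_run(Assistant.LLM, prompt).
```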
I haven't solved the HomePod-like device that will provide audio interaction around the house. I hope it can be done with a Raspberry Pi and a microphone HAT; the Jetson Nano is another option, though it's more expensive and harder to find. I plan on pairing each "HomePod" with a presence detection device--if a human is alone in a room, the assistant can assume it is being talked to.
"Motivation follows action." -Adam Wathan
This is an ambitious project, one where I could easily get caught up planning the details for months. Or I could start with something simple that solves a need and iterate on its parts. I chose this as a starting point:
- Input: a LiveView app taking voice via the browser's SpeechRecognition API
- RASA: fast command processing
- LLM: OpenHermes 2 Mistral 7B for chat / conversation
- Output: the LiveView chat plus the browser's SpeechSynthesis API
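In this setup the browser does the listening and the talking through a small JS hook, and the LiveView just routes transcripts to either the fast intent matcher or the LLM. Here's a rough sketch of the server side--module and event names like `Assistant.Intents` and `"transcript"` are placeholders, not the real code:

```elixir
defmodule AssistantWeb.ChatLive do
  use AssistantWeb, :live_view

  def mount(_params, _session, socket) do
    {:ok, assign(socket, :messages, [])}
  end

  # A client-side hook runs the SpeechRecognition API and pushes each
  # final transcript to the server as a "transcript" event.
  def handle_event("transcript", %{"text" => text}, socket) do
    # Try fast intent matching first, fall back to the LLM for conversation.
    reply =
      case Assistant.Intents.match(text) do
        {:ok, command} -> Assistant.Commands.run(command)
        :nomatch -> Assistant.LLM.chat(text)
      end

    socket =
      socket
      |> update(:messages, &(&1 ++ [%{role: :user, text: text}, %{role: :assistant, text: reply}]))
      # The same hook listens for "speak" and hands the reply to SpeechSynthesis.
      |> push_event("speak", %{text: reply})

    {:noreply, socket}
  end
end
```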
This approach has a number of advantages for a first version:

- No custom hardware to build--everything runs in the browser and a LiveView app
- Speech recognition and synthesis are handled by the browser's built-in APIs
- Fast iteration in the stack I'm already building with

But it came with some big downsides.
I couldn't get RASA to work well. The most common failure mode I encountered was RASA detecting phantom commands and hijacking the LLM's response. It's likely that if I invested more time into RASA it would have performed better, but I don't think it's part of my long-term plan.
I pulled RASA out and replaced OpenHermes 2 Mistral 7B with llama-3-8b-instruct. I haven't yet found a prompt that allows function calling while maintaining the model's conversational ability. The closest I've come is an approach Hamel Husain posted in replicate-examples/cog-vllm-tools/predict.py, but it doesn't work very well with the 8B model. Until I solve this, users choose which "version" of the assistant they want to interact with--the conversational assistant or the function-calling assistant.
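The general idea behind that approach: describe the tools in the system prompt, ask the model to reply with a JSON object when it wants to call one, and treat anything that doesn't parse as plain conversation. A simplified sketch of the routing--the prompt wording and tool names here are made up, and the real Llama 3 chat template is more involved:

```elixir
defmodule Assistant.FunctionCalling do
  @moduledoc """
  Sketch of prompt-based function calling: list the tools in the system
  prompt, ask for a JSON object when a tool is wanted, and treat anything
  that doesn't parse as ordinary conversation.
  """

  @system_prompt """
  You can call a tool by replying with ONLY a JSON object, for example:
    {"tool": "set_timer", "arguments": {"minutes": 5}}
    {"tool": "get_weather", "arguments": {"location": "Malibu"}}
  If no tool is needed, reply conversationally in plain text.
  """

  # `generate` is any prompt -> text function, e.g. a wrapper around
  # Nx.Serving.batched_run/2.
  def handle(user_text, generate) do
    prompt = @system_prompt <> "\nUser: " <> user_text <> "\nAssistant:"
    output = generate.(prompt)

    case Jason.decode(String.trim(output)) do
      {:ok, %{"tool" => tool, "arguments" => args}} -> {:tool_call, tool, args}
      _not_json -> {:chat, output}
    end
  end
end
```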
Our assistant helps with weekly meal planning, sets kitchen timers, and watches the weather (how else could it tell me the weather at 7 a.m.?).
It's time to get a physical device working. Solving two-way audio will be an interesting challenge. It could involve Rhasspy or Elixir's Membrane, or maybe I'll send the raw audio to Whisper running on whisper.cpp. My goals conflict with much of the available prior work, and I'm not sure exactly what solution I'll land on.