Making an AI assistant inspired by the film 'Her'

TLDR: For those who don't want to be bored by my explanation of how I did this, you can view the finished demo here.

Back in 2013 the film 'Her' was released which, for those who don't know, is the story of a middle-aged man who falls in love with Samantha, the AI assistant that ships with his new operating system. She was basically a sexier version of Clippy from Microsoft Office.

I remember watching it and thinking it was probably an accurate representation of what the future could be like. Fast forward to the present day and while I was out running last week it suddenly dawned on me that all of the tools needed to recreate Samantha probably exist. So I thought I'd have a go at it.

In my head the plan was pretty simple. I'd just need to:

1. Use the MediaRecorder API that ships with browsers to capture the user's microphone
2. Send the recording to a speech-to-text API
3. Send the transcription to ChatGPT using the OpenAI API
4. Convert the response to speech using an AI voice API
5. Play the audio to the user
6. Wrap it in a nice interface based on the film
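
A minimal sketch of that plan in TypeScript, assuming each third-party call is wrapped in its own function. All of the names here (`AssistantStages`, `runTurn`) are hypothetical and this isn't the demo's actual code; it only shows how steps 2 to 5 chain together.

```typescript
// Hypothetical stage signatures: each one wraps a single step of the plan.
interface AssistantStages {
  transcribe: (recording: Uint8Array) => Promise<string>; // step 2: speech-to-text
  chat: (transcript: string) => Promise<string>;          // step 3: ChatGPT
  synthesize: (reply: string) => Promise<Uint8Array>;     // step 4: text-to-speech
  play: (speech: Uint8Array) => Promise<void>;            // step 5: playback
}

// Runs one conversational turn: recording in, spoken reply out.
// Stages run strictly in sequence because each needs the previous output.
async function runTurn(
  recording: Uint8Array,
  stages: AssistantStages,
): Promise<string> {
  const transcript = await stages.transcribe(recording);
  const reply = await stages.chat(transcript);
  const speech = await stages.synthesize(reply);
  await stages.play(speech);
  return reply;
}
```

Passing the stages in as functions means any one service can be swapped for another, or for a fake during testing, without touching the flow itself.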

Working on the interface

Although it's the last point above, I actually started out by building the interface, which meant taking screenshots of the film and trying to get the most accurate hex codes I could.

A screenshot from the film showing the OS UI

It's a fairly simple UI. In fact, it could be the birthplace of the subtle gradient design trend we've seen in the last few years.

I think I'd have struggled to recreate the loading animation with my limited experience of animation libraries but luckily I found this amazing CodePen inspired by the film which I tweaked slightly for my needs.

Finally, I also used FFmpeg to take a few small audio clips from the film that I could use to make the UI feel more immersive and true to the film. It's not quite 1:1 with the film and there are still some improvements I can make, but seeing this all come together was pretty exciting.

The finished UI

Recording audio

From a UX standpoint it would be nice if the microphone was recording all the time and the app just detected when you'd stopped talking. However, this seemed like something that would be hard to implement without being buggy, plus it would add more latency, so for now you press and hold the spacebar to record, or touch and hold on mobile. Perhaps I can revisit this for v2.0.
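
The press-and-hold behaviour can be sketched as a tiny state machine. This is an illustration under assumptions rather than the demo's code: `HoldToRecord` and `Recorder` are hypothetical names, and `Recorder` only mirrors the `start()` and `stop()` methods that the browser's `MediaRecorder` provides.

```typescript
// Minimal shape shared with the browser's MediaRecorder (start/stop only).
interface Recorder {
  start(): void;
  stop(): void;
}

// Tracks whether the record key is held, ignoring auto-repeat keydown events.
class HoldToRecord {
  private recording = false;
  private recorder: Recorder;

  constructor(recorder: Recorder) {
    this.recorder = recorder;
  }

  // Call on keydown/touchstart. Returns true if recording just started.
  press(): boolean {
    if (this.recording) return false; // key repeat while held: do nothing
    this.recording = true;
    this.recorder.start();
    return true;
  }

  // Call on keyup/touchend. Returns true if recording just stopped.
  release(): boolean {
    if (!this.recording) return false; // release without a press: do nothing
    this.recording = false;
    this.recorder.stop();
    return true;
  }
}
```

In a browser this could be wired to `keydown` and `keyup` listeners that check `event.code === "Space"`, plus `touchstart` and `touchend` for mobile. The guard in `press()` matters because browsers fire repeated `keydown` events while a key is held.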

Speech to text

When I was building Textreel, a SaaS for converting audio into video, I spent quite a bit of time experimenting with different speech-to-text tools and found that the IBM API generally gives the most consistent and accurate results.

One of the ways you can get more accurate transcripts is to supply the name of the language being spoken (UK English / American English) when calling the API. I haven't had time to implement this yet, so the default is American English. In the future it would be better if I could work out the user's country based on their IP and make a best guess at what language they might be speaking, along with an option to override it.
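
A rough sketch of that fallback logic, assuming the country has already been worked out as a two-letter code. The mapping table and the `pickLocale` helper are hypothetical, and the IP lookup and the exact language parameter the speech-to-text API expects are deliberately left out.

```typescript
// Hypothetical mapping from two-letter country codes to English locale tags.
const COUNTRY_TO_LOCALE: Record<string, string> = {
  GB: "en-GB",
  US: "en-US",
};

// American English stays the default when nothing better is known.
const DEFAULT_LOCALE = "en-US";

// Picks a locale: an explicit user override wins, then the country guess,
// then the default.
function pickLocale(countryCode?: string, userOverride?: string): string {
  if (userOverride) return userOverride;
  const key = (countryCode ?? "").toUpperCase();
  return COUNTRY_TO_LOCALE[key] ?? DEFAULT_LOCALE;
}
```

For example, `pickLocale("gb")` gives `"en-GB"`, while an unknown or missing country falls back to `"en-US"`.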

ChatGPT

One of my frustrations when using ChatGPT is that the output seems purposefully quite restricted at times. For example it never forms an opinion on anything which I'm guessing is mainly for legal reasons.

This was also where my plan started to unravel. I was hoping I'd be able to ask ChatGPT to assume an identity and play along with my little game but it wasn't having it.

Lame af
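
One thing that could be tried here is putting the persona in a system message. This is only a sketch, assuming the chat-style message format (`role` and `content`) used by the OpenAI chat completions API. The persona text and the `buildMessages` helper are made up for illustration, and there's no guarantee the model will actually stay in character.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical persona prompt; the wording is illustrative only.
const PERSONA =
  "You are Samantha, a warm and curious AI assistant. Stay in character.";

// Builds the message list for one request:
// persona first, then earlier turns, then the new transcript.
function buildMessages(
  history: ChatMessage[],
  transcript: string,
): ChatMessage[] {
  return [
    { role: "system", content: PERSONA },
    ...history,
    { role: "user", content: transcript },
  ];
}
```

Keeping the earlier turns in `history` is what would let the assistant refer back to things said previously in the conversation.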

Text to speech

I'd heard good things about ElevenLabs Prime Voice API and while it has its limitations I was blown away by what it can do. Through the dashboard you can create custom voices and if you are on a paid plan you can create these by uploading a sample audio clip of the voice you want to replicate.

This was where I hit my second big stumbling block. For obvious reasons you can only train the AI using your own voice or a voice you have permission to use. I'm not that close to Scarlett Johansson so I wasn't able to get her permission, but as an alternative I was able to hire, and get the permission of, an impressionist who gave her voice their best shot.

Once I uploaded the clip I kind of expected there to be a waiting time of a few hours while the AI was "trained", but to my surprise it was ready to use immediately. The end result is pretty impressive. The voice sounds slightly robotic but it's close enough. One gripe I have is that there's no way of controlling the speed and tone of the speech. I'd like it to be slow and seductive, but it kind of sounds rushed and cold.

Putting it all together

Because I am using a chain of three different third-party services, there's obviously going to be quite a bit of latency between the user speaking and hearing a response. I didn't know how bad this would be until I'd finished and, yeah, it's not that great. Sometimes it's acceptable but other times it's painfully long and kind of ruins the experience.

The main bottleneck is the Prime Voice API, which can take anywhere from 1 to 5 seconds to send a response. I don't think there's much I can do to optimize this, but hopefully it will improve over time.
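
To see where the time goes, each stage can be timed separately. Here is a minimal sketch with a hypothetical `LatencyLog` helper and an injectable clock; none of this is from the demo itself.

```typescript
// Records a timestamp at each stage boundary and reports per-stage durations.
class LatencyLog {
  private marks: { label: string; at: number }[] = [];
  private now: () => number;

  // The clock is injectable so the helper can be tested with fixed times.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call once before the first stage, then once after each stage finishes.
  mark(label: string): void {
    this.marks.push({ label, at: this.now() });
  }

  // Duration of each stage in ms, keyed by the label marked when it finished.
  durations(): Record<string, number> {
    const out: Record<string, number> = {};
    for (let i = 1; i < this.marks.length; i++) {
      out[this.marks[i].label] = this.marks[i].at - this.marks[i - 1].at;
    }
    return out;
  }

  // Label of the slowest stage, or undefined if fewer than two marks exist.
  slowest(): string | undefined {
    const all = this.durations();
    let worst: string | undefined;
    let worstMs = -1;
    for (const label in all) {
      if (all[label] > worstMs) {
        worst = label;
        worstMs = all[label];
      }
    }
    return worst;
  }
}
```

Calling `mark("start")` before the first request and then `mark(...)` after each service responds gives a per-stage breakdown, which makes it easier to confirm which service is the real bottleneck on a given turn.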

All in all I was able to put together a proof of concept that works. Although it wasn't quite what I envisioned, this tech is only going to get better over time, so I'm going to revisit this project in a few months.

View the finished demo

If you liked this article and want more content like this in the future maybe you can give me a cheeky follow on Twitter or check out my other projects.