Lesson 5.2 — Audio and Voice AI


Audio AI has arrived quietly, but its impact is significant. Accurate speech transcription, natural-sounding text-to-speech, and cloning of a person's voice were futuristic capabilities only a few years ago. Today they are available to anyone with a browser and, in many cases, a free account.

This lesson covers three distinct capabilities: transcription (turning speech into text), text-to-speech (turning text into spoken audio), and voice cloning (creating a synthetic version of a specific voice). For each, we'll cover practical tools, step-by-step guidance, and the ethical considerations that matter in this area.

Part 1: Transcription — Turning Speech into Text

Transcription is arguably the most immediately practical audio AI capability for most people. Accurate, fast, affordable transcription of meetings, interviews, lectures, podcasts, and voice memos saves enormous amounts of time.

Otter.ai: Step-by-Step Guide

Otter.ai is one of the most accessible transcription tools for beginners, with a generous free tier.

Setting up:

Go to otter.ai and create a free account
You can import audio files or record directly in the browser
For meetings, Otter offers calendar integration — it can join Zoom, Google Meet, or Microsoft Teams calls automatically

Transcribing a recording:

Click "Import" and upload your audio or video file (MP3, MP4, WAV, M4A)
Otter processes the file, typically in about one-quarter of the audio's running time or less
The transcript appears with timestamps; each speaker is detected and labelled
Click any word in the transcript to jump to that moment in the audio

Cleaning up the transcript:

Listen back for any errors — technical terms, proper nouns, and accented speech are most likely to be wrong
Click on any word to correct it
Use the "Speakers" panel to name each speaker

Generating a summary:

Click "Summary" in the sidebar
Otter will generate key points, action items, and a brief overview
This works well for structured meetings; it works less well for casual conversations

Free tier limits: 300 minutes per month. Paid plans start at around $10/month for 1,200 minutes and add features like advanced search and team sharing.
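The arithmetic behind these limits is easy to sketch. The short Python example below is only an illustration: it assumes the 300-minute allowance and the one-quarter processing estimate quoted in this lesson, the function names are hypothetical, and it ignores any other limits a plan may impose.

```python
FREE_MINUTES_PER_MONTH = 300  # free-tier allowance quoted in this lesson
PROCESSING_RATIO = 0.25       # "one-quarter of the audio length" estimate


def plan_uploads(recording_minutes, allowance=FREE_MINUTES_PER_MONTH):
    """Walk through recordings in order and keep those that still fit.

    Returns (accepted, skipped, minutes_left).
    """
    accepted, skipped = [], []
    remaining = allowance
    for minutes in recording_minutes:
        if minutes <= remaining:
            accepted.append(minutes)
            remaining -= minutes
        else:
            skipped.append(minutes)
    return accepted, skipped, remaining


def estimated_processing_minutes(audio_minutes, ratio=PROCESSING_RATIO):
    """Rough upper estimate of how long a file takes to process."""
    return audio_minutes * ratio


if __name__ == "__main__":
    # Hypothetical month: recording lengths in minutes.
    accepted, skipped, left = plan_uploads([60, 90, 45, 120, 30])
    print("Fits in the allowance:", accepted)
    print("Does not fit:", skipped)
    print("Minutes left:", left)
    print("Estimated wait for a 60-minute file:",
          estimated_processing_minutes(60), "minutes")
```

A check like this is mainly useful for deciding whether a month of meetings fits in the free tier or whether a paid plan is worth it.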

Other Transcription Tools Worth Knowing

Tool	Best for	Key strength	Cost
Otter.ai	Meetings, interviews	Real-time transcription, speaker labels	Free tier; from ~$10/month
Whisper (OpenAI)	Technical users, high accuracy	Excellent accuracy, open source, many languages	Free (self-hosted) or via API
Descript	Podcasters, video creators	Edit audio by editing transcript text	Free tier; from $12/month
Riverside.fm	Podcast recording + transcription	High-quality remote recording	From $15/month
Rev	High accuracy, human review option	Most accurate for complex audio	From $0.25/min (AI); $1.50/min (human)
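Of the tools in the table, Whisper is the one you run from code rather than a browser. The sketch below shows roughly what self-hosted use looks like in Python. It assumes the open-source package is installed (`pip install openai-whisper`) along with ffmpeg, and that the "base" model is accurate enough for your audio. The helper names `transcribe_file`, `format_timestamp`, and `format_segments` are illustrative, not part of Whisper.

```python
def transcribe_file(path, model_name="base"):
    """Transcribe an audio file locally and return Whisper's segments.

    Each segment is a dict that includes "start", "end" (in seconds)
    and "text".
    """
    # Imported here so the formatting helpers work without the package.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["segments"]


def format_timestamp(seconds):
    """Convert a number of seconds to HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Render segments as one timestamped line each."""
    lines = []
    for seg in segments:
        stamp = format_timestamp(seg["start"])
        lines.append(f"[{stamp}] {seg['text'].strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "interview.mp3" is a placeholder file name.
    print(format_segments(transcribe_file("interview.mp3")))
```

Larger models are generally more accurate but slower. Note that this sketch does not label speakers, which is one reason a hosted tool may still be the better fit for meetings.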

Part 2: Text-to-Speech — Turning Text into Voice

Text-to-speech (TTS) has gone from robotic and obviously synthetic to genuinely natural-sounding. Modern TTS is used in audiobooks, explainer videos, e-learning content, accessibility tools, and more.

Common use cases:

Creating voiceovers for videos without recording yourself
Making written content accessible to people with reading difficulties
Producing podcast-style content without a microphone
Listening back to your own writing to catch errors

Tools Overview

ElevenLabs is widely regarded as a leader for voice quality. Its voices are remarkably natural, with appropriate breath sounds, pacing variation, and emotional inflection.

Free tier: 10,000 characters per month (~6–7 minutes of audio)
Simple interface: paste text, choose a voice, click Generate
Voice library has hundreds of options including different accents and ages
Supports 29 languages
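Because the free tier is measured in characters, it can help to check a script's length before generating audio. The Python sketch below is a rough illustration only: it assumes every character in the script, including spaces and punctuation, counts toward the 10,000-character allowance, and the function name `quota_report` is hypothetical.

```python
FREE_CHARACTERS_PER_MONTH = 10_000  # free-tier allowance quoted in this lesson


def quota_report(script, used_this_month=0,
                 allowance=FREE_CHARACTERS_PER_MONTH):
    """Compare a script's length with the characters still available.

    Assumes every character (spaces and punctuation included) is counted.
    """
    needed = len(script)
    remaining = allowance - used_this_month
    return {
        "needed": needed,
        "remaining": remaining,
        "fits": needed <= remaining,
    }


if __name__ == "__main__":
    script = "Welcome to the course. In this video we look at audio AI."
    print(quota_report(script, used_this_month=9_500))
```

If a script does not fit, trimming it or splitting it across months is usually simpler than discovering the shortfall halfway through a voiceover.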

Google Text-to-Speech / Amazon Polly: These sound more robotic than ElevenLabs but are built into many platforms and are very reliable. They suit integration into apps and workflows better than high-quality content production.

Murf.ai: Designed specifically for content creators. Has a built-in studio interface for adding pauses, emphasis, and background music. Good for explainer videos.

A practical workflow for video voiceovers:

Write your script in a document
Paste into ElevenLabs, choose a voice that matches your content's tone
Download the MP3
Import into your video editor (DaVinci Resolve, CapCut, iMovie)
Sync the audio to your visuals

Part 3: Voice Cloning — Creating a Synthetic Version of a Voice

Voice cloning creates a model of a specific person's voice that can then speak any text provided. This is the most powerful and the most ethically charged capability in audio AI.

ElevenLabs Voice Cloning: How It Works

Record (or collect) 1–3 minutes of clean audio of the target voice
In ElevenLabs, go to "Voice Lab" → "Add a new voice" → "Instant Voice Clone"
Upload your audio samples
ElevenLabs creates a voice model
You can now type any text and have it spoken in that voice
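Before uploading, it can be worth confirming that a sample is the right length. The sketch below uses Python's standard `wave` module to measure a WAV file and compare it with the 1 to 3 minute guidance in the steps above. It assumes the sample is an uncompressed WAV file; the function names and the verdict wording are illustrative.

```python
import wave

MIN_SECONDS = 60    # 1 minute, lower bound from the guidance above
MAX_SECONDS = 180   # 3 minutes, upper bound from the guidance above


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_sample(path, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Return (duration_in_seconds, verdict) for a candidate voice sample."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        verdict = "too short"
    elif duration > max_seconds:
        verdict = "longer than needed"
    else:
        verdict = "ok"
    return duration, verdict


if __name__ == "__main__":
    # "my_voice.wav" is a placeholder file name.
    print(check_sample("my_voice.wav"))
```

This only checks length. Whether the audio is clean (no background noise, music, or other speakers) still needs a listen.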

Legitimate uses:

Cloning your own voice for content creation without re-recording
Producing consistent voiceovers across a long project
Accessibility — enabling someone with a speech impairment to communicate in their natural voice
Localising content to different languages while keeping the original speaker's voice characteristics

Ethical Rules for Audio AI

Voice cloning and synthesis raise significant ethical questions. Here are the rules to operate by:

Key takeaway: The same technology that enables genuinely useful applications also enables harmful ones. Using audio AI ethically requires explicit consent and clear transparency about synthetic content.

Situation	Ethical?	Rule
Cloning your own voice	Yes	No restriction
Cloning someone's voice with their explicit written consent	Yes	Document the consent
Creating a voiceover that sounds like a celebrity without permission	No	Potential legal liability; always wrong
Using synthetic voice in content without disclosing it	Grey area	Disclose where disclosure is expected (journalism, advertising)
Creating audio of someone saying something they didn't say	No	Could constitute defamation or fraud
Voice cloning a deceased person for family use	Ethically complex	Consent cannot be obtained; proceed with great care

Most jurisdictions are still developing law around AI-generated audio, but several high-profile cases have resulted in significant legal action. The practical rule: never use someone else's voice without their consent, and always disclose when audio is synthetic in contexts where that matters.

A Note on Deepfake Audio

Synthetic audio used to deceive, by making someone appear to have said something they didn't, is a growing problem. In 2024, a finance employee at a multinational was tricked into transferring about $25 million of company funds after a deepfake video call convinced them that colleagues were approving the transaction. The audio component was AI-generated.

This is not a theoretical risk. It is useful to know:

Voice deepfakes can be created with publicly available tools
Verification over a separate channel is essential for any high-stakes instruction
Emotion and urgency in an unexpected call are red flags, not reassurances

Practice Task

Try transcribing something this week. Upload a recording of a meeting, an interview, a voice memo, or even a podcast you've saved — anything with clear spoken audio. Use Otter.ai's free tier and then use the AI summary feature. Notice what the summary captures and what it misses. That gap is where your own judgment still matters.
