
Audio & Voice AI Tools

Lesson 5.2 — Audio and Voice AI


Audio AI has arrived quietly, but its impact is significant. Accurate speech transcription, natural-sounding text-to-speech, and the cloning of a person's vocal identity were futuristic capabilities only a few years ago. Today they are available to anyone with a browser and, in many cases, a free account.

This lesson covers three distinct capabilities: transcription (turning speech into text), text-to-speech (turning text into spoken audio), and voice cloning (creating a synthetic version of a specific voice). For each, we'll look at practical tools, step-by-step guidance, and the ethical considerations that matter most in this area.


Part 1: Transcription — Turning Speech into Text

Transcription is arguably the most immediately practical audio AI capability for most people. Accurate, fast, affordable transcription of meetings, interviews, lectures, podcasts, and voice memos saves enormous amounts of time.

Otter.ai: Step-by-Step Guide

Otter.ai is one of the most accessible transcription tools for beginners, with a generous free tier.

Setting up:

  1. Go to otter.ai and create a free account
  2. You can import audio files or record directly in the browser
  3. For meetings, Otter offers calendar integration — it can join Zoom, Google Meet, or Microsoft Teams calls automatically

Transcribing a recording:

  1. Click "Import" and upload your audio or video file (MP3, MP4, WAV, M4A)
  2. Otter processes the file, which typically takes about one-quarter of the audio's length or less
  3. The transcript appears with timestamps; each speaker is detected and labelled
  4. Click any word in the transcript to jump to that moment in the audio
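
To make steps 3 and 4 concrete, here is a minimal Python sketch of how a timestamped, speaker-labelled transcript might be represented and searched. The `Segment` structure, the helper names, and the sample data are hypothetical illustrations, not Otter's actual export format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    speaker: str   # label assigned to the detected speaker
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time offset as M:SS, for example 75 becomes '1:15'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def segment_at(segments: list[Segment], position: float) -> Segment:
    """Return the segment being spoken at `position` seconds.

    Assumes `segments` is non-empty and sorted by start time.
    """
    current = segments[0]
    for seg in segments:
        if seg.start <= position:
            current = seg
        else:
            break
    return current


# Hypothetical sample transcript
transcript = [
    Segment(0.0, "Speaker 1", "Thanks for joining, everyone."),
    Segment(4.5, "Speaker 2", "Happy to be here."),
    Segment(75.0, "Speaker 1", "Let's review the action items."),
]

for seg in transcript:
    print(f"[{format_timestamp(seg.start)}] {seg.speaker}: {seg.text}")
```

Jumping to a moment in the audio is the reverse lookup: `segment_at(transcript, 10.0)` returns the "Speaker 2" segment, because that is the last segment that starts at or before the 10-second mark.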

Cleaning up the transcript:

  1. Listen back for any errors — technical terms, proper nouns, and accented speech are most likely to be wrong
  2. Click on any word to correct it
  3. Use the "Speakers" panel to name each speaker

Generating a summary:

  • Click "Summary" in the sidebar
  • Otter will generate key points, action items, and a brief overview
  • This works well for structured meetings; it works less well for casual conversations

Free tier limits: 300 minutes per month. Paid plans start at around $10/month for 1,200 minutes and add features like advanced search and team sharing.
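
A quick way to judge whether the free tier is enough is to add up a typical month of recordings. The Python sketch below uses the 300- and 1,200-minute limits quoted above; the recording lengths and the `plan_needed` helper are made-up examples.

```python
FREE_MINUTES = 300     # free tier monthly limit quoted in this lesson
PAID_MINUTES = 1200    # paid plan monthly limit quoted in this lesson


def plan_needed(recording_minutes: list[float]) -> str:
    """Return which plan covers the total minutes recorded in a month."""
    total = sum(recording_minutes)
    if total <= FREE_MINUTES:
        return "free"
    if total <= PAID_MINUTES:
        return "paid"
    return "over paid limit"


# Hypothetical month: four 60-minute meetings plus two 45-minute interviews
month = [60, 60, 60, 60, 45, 45]
print(sum(month), plan_needed(month))  # 330 paid
```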


Other Transcription Tools Worth Knowing

| Tool | Best for | Key strength | Cost |
| --- | --- | --- | --- |
| Otter.ai | Meetings, interviews | Real-time transcription, speaker labels | Free tier; from ~$10/month |
| Whisper (OpenAI) | Technical users, high accuracy | Excellent accuracy, open source, many languages | Free (self-hosted) or via API |
| Descript | Podcasters, video creators | Edit audio by editing transcript text | Free tier; from $12/month |
| Riverside.fm | Podcast recording + transcription | High-quality remote recording | From $15/month |
| Rev | High accuracy, human review option | Most accurate for complex audio | From $0.25/minute AI; $1.50/minute human |
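
Per-minute pricing is easiest to compare with a short calculation. This Python sketch uses the Rev rates quoted in the last row of the table; the 45-minute interview is a hypothetical example, and current prices may differ.

```python
AI_RATE = 0.25      # dollars per minute, AI transcription (from the table)
HUMAN_RATE = 1.50   # dollars per minute, human transcription (from the table)


def transcription_cost(minutes: float, rate_per_minute: float) -> float:
    """Return the cost in dollars, rounded to the nearest cent."""
    return round(minutes * rate_per_minute, 2)


interview_minutes = 45  # hypothetical recording length
print(transcription_cost(interview_minutes, AI_RATE))     # 11.25
print(transcription_cost(interview_minutes, HUMAN_RATE))  # 67.5
```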

Part 2: Text-to-Speech — Turning Text into Voice

Text-to-speech (TTS) has gone from robotic and obviously synthetic to genuinely natural-sounding. Modern TTS is used in audiobooks, explainer videos, e-learning content, accessibility tools, and more.

Common use cases:

  • Creating voiceovers for videos without recording yourself
  • Making written content accessible to people with reading difficulties
  • Producing podcast-style content without a microphone
  • Listening back to your own writing to catch errors

Tools Overview

ElevenLabs is widely regarded as a leader in voice quality. Its voices are remarkably natural, with appropriate breath sounds, pacing variation, and emotional inflection.

  • Free tier: 10,000 characters per month (roughly 10 minutes of audio)
  • Simple interface: paste text, choose a voice, click Generate
  • Voice library has hundreds of options including different accents and ages
  • Supports 29 languages
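
Because the free tier is measured in characters, it helps to check a script's length before generating audio. The following Python sketch assumes that every character, including spaces and punctuation, counts against the limit. The `chars_per_minute` rate is not a published figure: it is something you would measure yourself from a short sample in your chosen voice.

```python
FREE_TIER_CHARACTERS = 10_000  # monthly free-tier limit quoted in this lesson


def fits_budget(script: str, used_so_far: int = 0) -> bool:
    """True if the script fits in the remaining monthly character budget.

    Assumes every character, including spaces and punctuation, counts.
    """
    return used_so_far + len(script) <= FREE_TIER_CHARACTERS


def estimate_minutes(script: str, chars_per_minute: float) -> float:
    """Estimate spoken duration from a rate measured on a short sample."""
    return len(script) / chars_per_minute


script = "Welcome to the course. " * 100  # hypothetical 2,300-character script
print(len(script), fits_budget(script))   # 2300 True
```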

Google Text-to-Speech / Amazon Polly: These tend to sound more robotic but are built into many platforms and are highly reliable. They suit integration into apps and workflows better than high-quality content production.

Murf.ai: Designed specifically for content creators. Has a built-in studio interface for adding pauses, emphasis, and background music. Good for explainer videos.

A practical workflow for video voiceovers:

  1. Write your script in a document
  2. Paste into ElevenLabs, choose a voice that matches your content's tone
  3. Download the MP3
  4. Import into your video editor (DaVinci Resolve, CapCut, iMovie)
  5. Sync the audio to your visuals

Part 3: Voice Cloning — Creating a Synthetic Version of a Voice

Voice cloning creates a model of a specific person's voice that can then speak any text provided. This is the most powerful and the most ethically charged capability in audio AI.

ElevenLabs Voice Cloning: How It Works

  1. Record (or collect) 1–3 minutes of clean audio of the target voice
  2. In ElevenLabs, go to "Voice Lab" → "Add a new voice" → "Instant Voice Clone"
  3. Upload your audio samples
  4. ElevenLabs creates a voice model
  5. You can now type any text and have it spoken in that voice
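
Before uploading, it can be worth confirming that a sample falls within the 1 to 3 minute range from step 1. The Python sketch below checks the length of an uncompressed WAV file with the standard-library `wave` module. The helper names are hypothetical, the check covers length only (not audio cleanliness), and `write_silence` exists purely so the demo has a file to inspect.

```python
import os
import tempfile
import wave

MIN_SECONDS = 60    # 1 minute, lower bound from step 1
MAX_SECONDS = 180   # 3 minutes, upper bound from step 1


def wav_duration_seconds(path: str) -> float:
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def sample_length_ok(path: str) -> bool:
    """True if the recording is between 1 and 3 minutes long."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS


def write_silence(path: str, seconds: int, rate: int = 8000) -> None:
    """Write a mono 16-bit silent WAV file; used here only for the demo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (seconds * rate))


with tempfile.TemporaryDirectory() as folder:
    demo = os.path.join(folder, "sample.wav")
    write_silence(demo, seconds=90)
    print(wav_duration_seconds(demo), sample_length_ok(demo))  # 90.0 True
```

For MP3 or M4A recordings, this approach would need a conversion to WAV first, since `wave` reads only uncompressed WAV files.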

Legitimate uses:

  • Cloning your own voice for content creation without re-recording
  • Producing consistent voiceovers across a long project
  • Accessibility — enabling someone with a speech impairment to communicate in their natural voice
  • Localising content to different languages while keeping the original speaker's voice characteristics

Ethical Rules for Audio AI

Voice cloning and synthesis open significant ethical questions. Here are the rules to operate by:

Key takeaway: The same technology that enables genuinely useful applications also enables harmful ones. Using audio AI ethically requires explicit consent and clear transparency about synthetic content.

| Situation | Ethical? | Rule |
| --- | --- | --- |
| Cloning your own voice | Yes | No restriction |
| Cloning someone's voice with their explicit written consent | Yes | Document the consent |
| Creating a voiceover that sounds like a celebrity without permission | No | Potential legal liability; always wrong |
| Using synthetic voice in content without disclosing it | Grey area | Disclose where disclosure is expected (journalism, advertising) |
| Creating audio of someone saying something they didn't say | No | Could constitute defamation or fraud |
| Voice cloning a deceased person for family use | Ethically complex | Consent cannot be obtained; proceed with great care |

Most jurisdictions are still developing law around AI-generated audio, but several high-profile cases have resulted in significant legal action. The practical rule: never use someone else's voice without their consent, and always disclose when audio is synthetic in contexts where that matters.
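
The practical rule can be expressed as a simple pre-publication checklist. The Python sketch below is a deliberately simplified illustration that covers only the consent and disclosure rules; the `VoiceProject` fields and the `problems` helper are hypothetical, and the harder cases (such as cloning a deceased person's voice) still need human judgment.

```python
from dataclasses import dataclass


@dataclass
class VoiceProject:
    is_own_voice: bool
    has_written_consent: bool = False   # documented consent from the speaker
    disclosure_expected: bool = False   # e.g. journalism or advertising
    discloses_synthetic: bool = False   # audience is told the audio is synthetic


def problems(project: VoiceProject) -> list[str]:
    """Return the consent and disclosure issues found, if any."""
    issues = []
    if not project.is_own_voice and not project.has_written_consent:
        issues.append("no documented consent from the voice's owner")
    if project.disclosure_expected and not project.discloses_synthetic:
        issues.append("synthetic audio is not disclosed")
    return issues


# Hypothetical project: a colleague's cloned voice in an advert, undisclosed
advert = VoiceProject(is_own_voice=False, has_written_consent=True,
                      disclosure_expected=True)
print(problems(advert))  # ['synthetic audio is not disclosed']
```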


A Note on Deepfake Audio

Synthetic audio used to deceive, by making someone appear to have said something they didn't, is a growing problem. In 2024, a finance employee at a multinational firm was tricked into transferring about $25 million after a deepfake video call convinced them that colleagues were approving the transaction. The voices on the call were AI-generated.

This is not a theoretical risk. Keep three things in mind:

  • Voice deepfakes can be created with publicly available tools
  • Verification over a separate channel is essential for any high-stakes instruction
  • Emotion and urgency in an unexpected call are red flags, not reassurances

Practice Task

Try transcribing something this week. Upload a recording of a meeting, an interview, a voice memo, or even a podcast you've saved — anything with clear spoken audio. Use Otter.ai's free tier and then use the AI summary feature. Notice what the summary captures and what it misses. That gap is where your own judgment still matters.