How to: Generate Synthetic text-to-speech audio (CSM - Sesame AI) and curate it in Labelbox

Hello Community :waving_hand:

As the artificial intelligence community and AI labs push toward true multi-modal capabilities, our interactions with AI need to become more advanced and feel more authentic. Labelbox provides a dedicated Audio editor that is feature-rich and covers speech-to-text, classification, and other audio annotation needs. Whether you want to fine-tune a model or build an ML pipeline from scratch, we have the tools to handle either your own data or data generated by a model (which is what we will cover in this post).

Requirements

Here is what you need to follow along with this guide:

Synthetic data generation script

Code to generate the inferences locally

A couple of notes here:
I installed CSM locally and import the package from its path (see the sys.path.append call in the main script below).

# Clone the repository using HTTPS
!git clone https://github.com/SesameAILabs/csm.git
%cd csm
!pip install -q -r requirements.txt

I made a few recordings for my own experimentation, hence the local files used as segments (context). They are not required, but CSM will randomize the generated voice if you don't provide them; see the sketch below.
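
For reference, once the generator from the main script below is loaded, a minimal no-context call looks like this (a sketch only; the prompt text is just an example):

# No reference segments passed: CSM will pick a randomized voice
audio = generator.generate(
    text="Hey, how are you doing?",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)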

Finally, you may need to authenticate to HF:

from huggingface_hub import login
import os

HF_TOKEN=os.environ.get("HF_AUTH")
login(token = HF_TOKEN)

Main script:

import sys
sys.path.append('/Users/me/Documents/python stuff/csm')
from generator import load_csm_1b, Segment
import torchaudio
import torch
import os
from tqdm import tqdm

# Use CUDA if you have a GPU, e.g. device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

# Initialize the generator
generator = load_csm_1b(device=device)

# Reference audio files and transcripts for context
speakers = [0, 0, 0]
transcripts = [
    "Lets try to use my voice as a template to see if that can work.",
    "Hey, how are you doing?",
    "Its me Paul, your helpful AI."
]
audio_paths = [
    "/Users/me/Documents/recording_reference/20250314 150814.wav",
    "/Users/me/Documents/recording_reference/20250317 133312.wav",
    "/Users/me/Documents/recording_reference/20250317 133410.wav"
]

# List of conversation starters
conversation_starters = [
    "Describe an interesting encounter or experience you had recently and how it made you feel.",
    "Talk about your favorite hobbies, sharing a story or memorable moment related to one of them.",
    "Discuss your dream vacation destination, including activities you'd like to do and places you'd visit.",
    "Share a piece of advice that has stuck with you, and explain how it has impacted your life.",
    "Elaborate on your favorite holiday, including traditions, memories, and why it's special to you.",
    "Recommend your favorite cuisine, a dish you love, and share a story or experience related to that cuisine.",
    "Discuss your favorite movie genre, a film you recommend, and why you think it's a must-watch.",
    "Talk about your favorite book genre, a recommended read, and how it has influenced your life.",
    "Share your favorite music genre, artist, or band, and describe a concert or memorable moment involving their music.",
    "Give an overview of your favorite TV show, highlighting its unique qualities and why it's worth watching.",
    "Describe an intriguing fact or piece of information you've learned and how it has affected your perspective.",
    "Explain why you prefer a particular season, sharing specific activities, events, or experiences related to that season.",
    "Discuss your favorite animal, its characteristics, and why it holds a special place in your heart.",
    "Share your favorite color, how it makes you feel, and any personal associations or memories tied to it.",
    "Describe your ideal way to spend a day off, mentioning specific activities, locations, or experiences that make it enjoyable."
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# Create output directory if it doesn't exist
output_dir = "conversation_prompts"
os.makedirs(output_dir, exist_ok=True)

# Load audio segments
print("Loading reference audio segments...")
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
print("Reference audio loaded successfully.")

# Generate audio for each conversation starter with tqdm progress bar
for i, prompt in enumerate(tqdm(conversation_starters, desc="Generating audio files")):
    # Generate filename
    output_filename = os.path.join(output_dir, f"prompt_{i+1}.wav")
    
    # Generate audio
    audio = generator.generate(
        text=prompt,
        speaker=0,
        context=segments,
        max_audio_length_ms=90_000,
    )
    
    # Save audio file
    torchaudio.save(output_filename, audio.unsqueeze(0).cpu(), generator.sample_rate)

print(f"All audio files generated successfully in: {os.path.abspath(output_dir)}")

Send the data to Labelbox

We then register the generated audio files in Labelbox so they can be annotated.
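
The import below assumes the generated WAV files have already been uploaded to the Azure container. If you still need that step, here is a minimal sketch (assuming the azure-storage-blob package and the same synthetic-audio-csm container; adjust paths as needed):

from azure.storage.blob import BlobServiceClient
import os

connect_str = os.environ.get('connect_str')
service = BlobServiceClient.from_connection_string(connect_str)
container_client = service.get_container_client('synthetic-audio-csm')

# Upload every generated WAV from the local output directory
local_dir = "conversation_prompts"
for filename in sorted(os.listdir(local_dir)):
    if filename.endswith(".wav"):
        with open(os.path.join(local_dir, filename), "rb") as f:
            container_client.upload_blob(name=filename, data=f, overwrite=True)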

Code to import the Azure-hosted audio file(s) into Labelbox

import labelbox as lb
from azure.storage.blob import BlobServiceClient, ContainerClient
import os

API_KEY = os.environ.get('LABELBOX')
client = lb.Client(api_key=API_KEY)

# Azure connection string
connect_str = os.environ.get('connect_str')
service = BlobServiceClient.from_connection_string(connect_str)
container_name = 'synthetic-audio-csm'
container_client = service.get_container_client(container_name)

# Retrieve the Labelbox IAM integration (we take the first one here; make sure it is the Azure integration for your org)
organization = client.get_organization()
iam_integration = organization.get_iam_integrations()[0]
print(iam_integration.name)

dataset = client.create_dataset(name=f"Azure_{container_name}",iam_integration=iam_integration)
blob_list = container_client.list_blobs()

uploads = []
for blob in blob_list:
    url = f"{service.url}{container_client.container_name}/{blob.name}"
    uploads.append(dict(row_data = url, global_key = blob.name))

task = dataset.create_data_rows(uploads)
task.wait_till_done()
print(task.errors)

Configure the Audio editor in Labelbox

Then, after creating your ontology, you can start labeling, including auto-transcription!


(AI generated)
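
If you prefer to do the project and ontology setup via the SDK, here is a minimal sketch. It is only an illustration: the ontology, classification, and project names are placeholders, it reuses the client and uploads variables from the import script above, and it assumes a recent SDK version where connect_ontology is available (older versions use setup_editor):

ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="transcription",
        )
    ]
)

ontology = client.create_ontology(
    "Synthetic audio ontology",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Audio,
)

project = client.create_project(
    name="Synthetic audio curation",
    media_type=lb.MediaType.Audio,
)
project.connect_ontology(ontology)

# Queue the Azure-backed data rows created earlier
project.create_batch(
    "csm-prompts-batch",
    global_keys=[upload["global_key"] for upload in uploads],
)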

* If you decide to follow along and use the open-source library from Sesame (which is great!), you will need access to Hugging Face and, specifically, to two gated models: sesame/csm-1b and meta-llama/Llama-3.2-1B.