Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI
What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology.
They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.
Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.
Key Takeaways:
- How Sesame achieves natural voice interactions through real-time speech generation.
- The impact of open-sourcing their speech model and what it means for AI research.
- The role of full-duplex modeling in improving AI responsiveness.
- How computational efficiency and system latency shape AI conversation quality.
- The growing role of natural language as a user interface in AI-driven experiences.
For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.
Follow everyone on X:
Ankit Kumar - https://x.com/_apkumar
Anjney Midha - https://x.com/anjneymidha
Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts here – https://a16z.com/ai/
Chapters:
0:00 - 00:51 | Intro
00:52 - 04:58 | Challenges Of Building
04:59 - 07:45 | Q + A: What Was Done To Bridge Transcription And Text Processing?
07:46 - 09:57 | How Is Sesame So Much Better Than Others?
09:58 - 12:42 | Challenges In Making AI Accessible To All
12:43 - 14:10 | Great Researchers Prioritize User Experience
14:11 - 15:47 | What Is Good Taste In ML?
15:48 - 17:45 | Problems That Can Be Solved That Add Value To The World
17:46 - 26:25 | Open Source Audio For Speech Generation
26:26 - 34:00 | Contextual Speech vs Text to Speech, Differences
34:01 - 35:50 | Value Proposition Of Glasses With No Friction
35:51 - 38:00 | General Purpose API vs Open Source Model
38:01 - 40:47 | Creating High Quality APIs
40:48 - 45:54 | Companions And How Sesame Will Handle Context Retention In Long Conversations
45:55 - 46:59 | Talent: What It Takes To Become A Part Of The Sesame Team
47:00 - 54:37 | How Scaling Laws For Speech Differ From Text
54:38 - 58:33 | How An Organic Conversation Can Be Preserved Using A Voice Companion
58:34 - 1:03:52 | App Building Technology: Roadmap
1:03:53 - 1:09:09 | Architectures and Transformers
1:09:10 - 1:15:56 | The Focus On Personality, And The Differences In Products
1:15:57 - 1:25:25 | New AI Interface: Interacting With AI Companion
1:25:26 - 1:26:56 | Companion Challenges
1:26:57 - 1:29:22 | Computing Interface Of The Future
1:29:23 - 1:31:45 | Focused Product Experience Built By Small Teams
1:31:46 - 1:36:13 | Join Sesame If You Want To Make A Consumer Product People Love