

We took inspiration from our customer inbox to identify videos that could test Whisper's limits. Over the last year, we'd been using Google to create transcriptions for our customers, and we kept a log of whenever its Speech-to-Text performed particularly poorly. These videos shared common characteristics:

- background noise, such as ambient room noise, outside noise, or music playing in the background
- a speaker performing music (singing, rapping, spoken-word poetry)
- people speaking English with an accent

We assembled ~20 videos that displayed one or more of these characteristics and compared the performance of the two contenders. We were blown away by how much Whisper outperformed our existing system in transcription accuracy. In one instance, we watched Whisper transcribe Eminem's "Godzilla" perfectly - a feat considering the song holds the Guinness World Record for the Fastest Rap in a No. 1 Single, with 224 words packed into 31 seconds. Google's Speech-to-Text was nowhere close to transcribing it. If you're doubting Whisper because Eminem's lyrics are publicly available, we've got a video for you: Whisper accurately transcribes even the most complicated freestyle rap, as demonstrated by a video of rapper Mac Lethal packing 400 words into a minute.

Architecture of Whisper's production deployment

Captions allows users to style their transcriptions to best reflect their personal brand and message. As part of the customization options, users can choose to display spoken words one word at a time, or to set images, sounds, emojis, and font colors on specific words.
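Word-level styling like this presupposes per-word timing coming out of the transcription pipeline. A minimal sketch of how one-word-at-a-time, styled captions could be derived from timed words - the field names and the shape of the style map are illustrative assumptions, not Captions' actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimedWord:
    # A single transcribed word with its position on the audio timeline.
    text: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class StyledCaption:
    # One caption "frame": a word shown on its own, with optional styling.
    word: TimedWord
    color: Optional[str] = None
    emoji: Optional[str] = None

def build_word_by_word_captions(words, styles):
    """Map each timed word to a one-word caption, applying per-word styles.

    `styles` is keyed by lowercase word text, e.g.
    {"fire": {"color": "#ff4500", "emoji": "🔥"}}.
    """
    captions = []
    for w in words:
        s = styles.get(w.text.lower(), {})
        captions.append(StyledCaption(word=w, color=s.get("color"), emoji=s.get("emoji")))
    return captions

words = [TimedWord("this", 0.0, 0.2), TimedWord("is", 0.2, 0.3), TimedWord("fire", 0.3, 0.6)]
caps = build_word_by_word_captions(words, {"fire": {"color": "#ff4500", "emoji": "🔥"}})
```

Rendering then becomes a matter of showing each caption during its word's `[start, end)` window.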

The application listens on a Pub/Sub topic. Each message in the topic represents a transcription request, carrying the location of an audio file and a request ID. As messages stream in, the application runs Whisper to transcribe the audio files and writes the results to a database. Using Statsig's feature gates, we targeted internal users for our initial rounds of testing. Statsig is a comprehensive statistical tool that provides real-time analytics behind a simple user interface; it is our go-to tool for running A/B tests and analyzing outcomes. With the Whisper pipeline hooked up to the frontend, we rallied the team for a bug bash.
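In outline, the worker pairs a message parser with a transcription step and a database write. Below is a runnable sketch with the external pieces (transcriber, storage) injected as plain callables; the message field names (`request_id`, `audio_uri`) are assumptions, and in production the handler would be registered as a callback via google-cloud-pubsub's `SubscriberClient.subscribe` and guarded by a Statsig feature gate rather than driven directly:

```python
import json

def handle_message(raw_message: bytes, transcribe, save) -> str:
    """Process one transcription request.

    `raw_message` is the JSON payload of a Pub/Sub message; `transcribe`
    maps an audio location to text (Whisper in production); `save`
    persists (request_id, transcript) to the database.
    """
    request = json.loads(raw_message)  # e.g. {"request_id": ..., "audio_uri": ...}
    transcript = transcribe(request["audio_uri"])
    save(request["request_id"], transcript)
    return request["request_id"]

# Demo with stand-ins for Whisper and the database:
db = {}
msg = json.dumps({"request_id": "req-42", "audio_uri": "gs://bucket/clip.wav"}).encode()
done = handle_message(msg, transcribe=lambda uri: f"transcript of {uri}", save=db.__setitem__)
```

Keeping the transcriber and store injectable is what made it easy for us to swap engines behind the same request flow.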

Special Thanks: Statsig Team + Timothy Chan, James Kirk, Mike Vernal, Patrick Phipps, and Jessica Owen

OpenAI published a new artificial intelligence model that can transcribe speech with near-human accuracy. Known as Whisper, the model makes 50% fewer errors than its predecessors. We ran an A/B test on Statsig comparing the error rate of Whisper against Google's flagship Speech-to-Text API. The purpose of the test was to understand which platform produces fewer transcription errors on Captions' production workload, and how the change in AI model impacts the user experience. The results show that Whisper is the clear winner in transcription accuracy.

Introducing Captions

Captions is a new mobile creator studio that uses AI to help creators through the entire process of content creation, from scripting to recording to editing and sharing. Our app offers a transcript-based video editing interface that makes video editing as simple as text editing. Captions uses AI to transcribe videos in real time and offers an intuitive transcript editor for making any desired updates. We also produce professional-looking, word-by-word captions that are synced to voice, without the need for expensive human transcription services. When Captions returns inaccurate results, it creates additional work for our users and decreases the overall quality of their experience. We strive to minimize the number of corrections our users need to make to transcripts produced by our AI. To that end, we're always in pursuit of the most accurate Automatic Speech Recognition (ASR) model. We've been running Google's Speech-to-Text API in production for the last year, and it's been working well - but we wanted to see what Whisper could do.

Our journey kicks off with a proof of concept. OpenAI offers installation instructions for the Whisper Python library. We converted those instructions into a Dockerfile and used the Google Cloud Console to provision an NVIDIA A100 machine to run our containerized application.
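For reference, a Dockerfile distilled from OpenAI's published install steps might look roughly like the following; the base image, version pins, and `worker.py` entrypoint are our illustrative assumptions, not Captions' actual configuration:

```dockerfile
# CUDA-enabled base image so Whisper can use the A100 GPU (version is illustrative).
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Whisper needs Python plus ffmpeg for audio decoding.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Install Whisper per OpenAI's instructions (the pip package is openai-whisper).
RUN pip3 install --no-cache-dir -U openai-whisper

# Hypothetical entrypoint: the worker that pulls requests and transcribes them.
COPY . /app
WORKDIR /app
CMD ["python3", "worker.py"]
```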
