In the second post of the OpenAI blog post series, I’m going to build another feature powered by the OpenAI transcription API. The feature is simple: we’re building an audiofile upload with ActiveStorage. After the file has been uploaded successfully, we’re going to use the OpenAI transcription endpoint to get the audiofile’s text content.
The transcription endpoint
Let’s start with the OpenAI API part. The API has an endpoint that expects an audiofile and returns a transcription of the text spoken in the audiofile. Let’s use the ruby-openai gem again to talk to the transcription endpoint with a sample file.
I’m going to assume you have already set up the ruby-openai gem. You can read my previous post about Server-Sent Events and the Chat API for a basic setup.
I created a simple class that expects a file and initializes an OpenAI::Client; calling #transcribe fires off the HTTP request to the API.
class AudioTranscriber
  def initialize(file)
    @file = file
    @client = OpenAI::Client.new
  end

  def transcribe(response_format: :verbose_json)
    @client.audio.transcribe(
      parameters: {
        model: "whisper-1",
        file: @file,
        response_format: response_format
      }
    )
  end
end
Notice how I’m setting the response_format to :verbose_json. This option lets the API respond with a more detailed response than the default :json format, which would basically only return the transcribed text. Take a look at the :verbose_json response for my sample audiofile versus the :json format:
# test in the rails console
file = File.open('sample_audio.m4a')
AudioTranscriber.new(file).transcribe
# :verbose_json
{
  "task" => "transcribe",
  "language" => "german",
  "duration" => 3.82,
  "text" => "Das ist ein Test.",
  "segments" => [
    {
      "id" => 0,
      "seek" => 0,
      "start" => 0.0,
      "end" => 2.0,
      "text" => " Das ist ein Test.",
      "tokens" => [50364, 2846, 1418, 1343, 9279, 13, 50464],
      "temperature" => 0.0,
      "avg_logprob" => -0.47552117705345154,
      "compression_ratio" => 0.68,
      "no_speech_prob" => 0.03590516746044159
    }
  ]
}

# :json
{
  "text" => "Das ist ein Test."
}
The API transcribed my audiofile, detected the spoken language and split the audio into segments with markers for where each segment starts and ends. This endpoint is straightforward and super easy to work with.
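Since the :verbose_json response already contains per-segment timestamps, a small illustrative sketch (not part of the API, just reshaping the hash shown above) can turn the segments into timestamped lines:
# Illustrative only: build timestamped lines from the :verbose_json segments
# of the response shown above.
response = AudioTranscriber.new(File.open("sample_audio.m4a")).transcribe

response["segments"].map do |segment|
  format("[%.1fs - %.1fs]%s", segment["start"], segment["end"], segment["text"])
end
# => ["[0.0s - 2.0s] Das ist ein Test."]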
In the next section we’re going to integrate the AudioTranscriber with ActiveStorage.
ActiveStorage analyzers
Let’s quickly look at the basic direct file upload flow of ActiveStorage with a focus on ActiveStorage::Analyzers.
When you upload a file with ActiveStorage, it creates an ActiveStorage::Attachment join model with one id pointing to your original model and another pointing to an ActiveStorage::Blob record. If you are using the direct upload functionality, the blob record has already been created before the file starts uploading to the remote storage.
After submitting a form with an ActiveStorage-connected file upload in the browser, the Attachment record is created and connected to both the original model and the previously created blob record. A callback on Attachment enqueues an ActiveStorage::AnalyzeJob, which calls #analyze on the blob record.
The #analyze method makes use of the different analyzer classes. To analyze the blob, ActiveStorage looks through the ActiveStorage.analyzers array and finds an analyzer class that accepts the current blob. This is the “Chain of Responsibility” pattern, where each analyzer decides whether it is going to analyze the given blob. The code for finding the relevant analyzer is straightforward: ActiveStorage.analyzers.detect { |klass| klass.accept?(self) }.
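If you are curious how this looks inside Rails, here is a heavily simplified, paraphrased sketch of the relevant ActiveStorage::Blob methods (not the exact source; details vary between Rails versions):
# Heavily simplified sketch of what happens inside ActiveStorage::Blob when the
# AnalyzeJob runs (paraphrased from the Rails source; details vary by version).
class ActiveStorage::Blob
  def analyze
    update! metadata: metadata.merge(extract_metadata_via_analyzer)
  end

  def analyzed?
    metadata[:analyzed]
  end

  private

  def extract_metadata_via_analyzer
    analyzer.metadata.merge(analyzed: true)
  end

  def analyzer
    analyzer_class.new(self)
  end

  def analyzer_class
    # "Chain of Responsibility": the first analyzer that accepts the blob wins
    ActiveStorage.analyzers.detect { |klass| klass.accept?(self) } || ActiveStorage::Analyzer::NullAnalyzer
  end
end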
Once the matching analyzer class is found, its only job is to return a hash of metadata that will be added to the blob’s metadata column.
Adding a custom analyzer step by step
Let’s make this more concrete and see what happens step by step.
Let’s create a simple Podcast model that has one attached audiofile, using a generator:
$ rails generate scaffold podcast title audio_file:attachment
# migrate
$ rails db:migrate
We now have a basic scaffold and a Podcast model that declares has_one_attached :audio_file.
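For reference, the relevant part of the generated model is simply:
# app/models/podcast.rb (generated by the scaffold above)
class Podcast < ApplicationRecord
  has_one_attached :audio_file
end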
Let’s update our _form.html.erb partial to use direct uploads.
<%= form.file_field :audio_file, direct_upload: true %>
Make sure you actually include the ActiveStorage JavaScript, otherwise your direct uploads will fail. In my sample app I’m using the importmap-rails gem, so I added
# Add this to config/importmap.rb
pin "@rails/activestorage", to: "activestorage.esm.js"
and
// e.g. in app/javascript/application.js
import { start } from "@rails/activestorage"
start()
for the ActiveStorage JavaScript to work.
Next, go to localhost:3000/podcasts/new, add an audiofile to the file input, and submit the form. If we watch the logs carefully, we see the following steps happening in this order:
1. A POST request is sent to /rails/active_storage/direct_uploads, which creates a blob record with metadata: nil
2. A POST request is sent to /podcasts, which creates a Podcast record, and also an ActiveStorage::Attachment record that references our newly created podcast record and our previously created ActiveStorage::Blob record
3. An ActiveStorage::AnalyzeJob is enqueued, which downloads the audiofile from the storage service and writes to the blob’s metadata column after analyzing the downloaded file (you can re-run this step by hand, see the console sketch below)
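If you want to poke at step 3 yourself, you can trigger the analysis by hand from the Rails console instead of waiting for the job. A quick sketch, assuming you just created a podcast through the form:
# Quick console sketch (assumes a podcast with an uploaded audiofile exists):
# re-run step 3 by hand instead of waiting for ActiveStorage::AnalyzeJob.
blob = Podcast.last.audio_file.blob

blob.analyzed? # => true once an analyzer has written its metadata
blob.analyze   # downloads the file and runs the matching analyzer inline
blob.metadata  # the analyzer's hash, merged into the blob's metadata column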
Now that we understand the basic flow, let’s modify step 3 to work with a custom analyzer.
Writing a custom analyzer
Instead of relying on the audio analyzer that ships with ActiveStorage as a default, I’m going to create an analyzer that talks with the OpenAI API and adds the language and spoken text to the blob’s metadata.
Add the following code to an initializer:
Rails.application.config.active_storage.analyzers.delete ActiveStorage::Analyzer::AudioAnalyzer
Rails.application.config.active_storage.analyzers.push(AudioWhisperAnalyzer)
We’re modifying the analyzers configuration: we delete the default ActiveStorage AudioAnalyzer and push our own AudioWhisperAnalyzer.
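A quick way to verify the change (once the analyzer class from the next section exists) is to inspect the analyzer chain in the Rails console. The exact default analyzers depend on your Rails version, but it should look roughly like this:
# The default analyzer list varies by Rails version; the important part is that
# AudioAnalyzer is gone and our AudioWhisperAnalyzer sits at the end of the chain.
ActiveStorage.analyzers
# => [ActiveStorage::Analyzer::ImageAnalyzer::Vips,
#     ActiveStorage::Analyzer::ImageAnalyzer::ImageMagick,
#     ActiveStorage::Analyzer::VideoAnalyzer,
#     AudioWhisperAnalyzer]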
To create our own analyzer we have to inherit from ActiveStorage::Analyzer and implement the following two methods:
- .accept?(blob) is a class method that decides if our analyzer chooses to handle the blob (Chain of Responsibility). The first analyzer that returns something truthy here will take over and run its analysis.
- #metadata returns a hash that will get merged into the blob’s metadata column.
class AudioWhisperAnalyzer < ActiveStorage::Analyzer
  # accept? is a class method on ActiveStorage::Analyzer, which is why the
  # chain calls klass.accept?(self) on the class itself
  def self.accept?(blob)
    blob.audio?
  end

  def metadata
    { some: 'metadata' }
  end
end
If you upload an audiofile now and access its metadata with Podcast.first.audio_file.blob.metadata, you’ll see the some: 'metadata' value.
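In the console that looks something like this; note that ActiveStorage mixes a few bookkeeping keys of its own into the same hash, and the exact set can vary:
# Rough console output: our value plus ActiveStorage's own bookkeeping keys.
Podcast.first.audio_file.blob.metadata
# => { "identified" => true, "some" => "metadata", "analyzed" => true }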
As you can probably guess by now, this is the perfect place for us to use our AudioTranscriber, and that’s exactly what I’m going to do. Because I still want to keep most of the original ActiveStorage::Analyzer::AudioAnalyzer’s functionality, I’m going to subclass from it instead of ActiveStorage::Analyzer.
class AudioWhisperAnalyzer < ActiveStorage::Analyzer::AudioAnalyzer
  def metadata
    super.merge(text: @text, language: @language).compact
  end

  private

  def probe_from(file)
    super.tap do
      response = AudioTranscriber.new(file).transcribe
      @text = response["text"]
      @language = response["language"]
    end
  end
end
I recommend reading the complete implementation of ActiveStorage::Analyzer::AudioAnalyzer. In a nutshell it does the following (a condensed sketch follows the list):
- It downloads the uploaded file from the service with #download_blob_to_tempfile
- It runs ffprobe against the Tempfile and parses its output
- The output is returned as a hash in #metadata
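Here is that condensed sketch, paraphrased rather than copied from the Rails source (the returned metadata keys differ between versions):
# Paraphrased outline of ActiveStorage::Analyzer::AudioAnalyzer (not the exact
# Rails source). Overriding #probe_from lets us piggyback on the tempfile that
# has already been downloaded for ffprobe.
class ActiveStorage::Analyzer::AudioAnalyzer < ActiveStorage::Analyzer
  def metadata
    # duration, bit_rate etc. are read from the memoized ffprobe output
    { duration: duration, bit_rate: bit_rate }.compact
  end

  private

  def probe
    # downloads the blob once and memoizes the parsed ffprobe output
    @probe ||= download_blob_to_tempfile { |file| probe_from(file) }
  end

  def probe_from(file)
    # shells out to ffprobe against the tempfile and parses its JSON output
  end
end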
My code is basically hooking into this analysis with #tap and transcribing the file with my AudioTranscriber class. After receiving the audiofile’s transcription and language, I add both to the metadata with #merge.
Uploaded audiofiles will now be analyzed with our AudioWhisperAnalyzer and we can access the metadata with @podcast.audio_file.blob.metadata.slice("language", "text") in our Rails app.
Roadblocks
I bumped into some roadblocks when implementing this feature:
The first one was that only the first matching ActiveStorage::Analyzer analyzes a blob, even if more analyzers would have matched. I think one improvement for the analyzers API could be to allow multiple analyzers to analyze a file. This would have made the AudioWhisperAnalyzer implementation more straightforward than subclassing and extending an existing analyzer.
The second roadblock was that the ruby-openai gem did not work with Tempfiles, but I created a PR for ruby-openai and @alexrudall helped me out in record time and released a fixed version. If you want to learn something new about Tempfiles, you can read my comment here.
Recap
The OpenAI transcription endpoint is a super simple way to transcribe text from audiofiles.
ActiveStorage allows us to implement custom file analyzers. They run asynchronously in the background and analyze a file, adding values to the metadata column on ActiveStorage::Blob.
In the end we did not write a lot of code, partly because I skipped error handling. As promised, I’m going to release another post where I’ll talk about ways to handle errors for the OpenAI API.
How do you like the idea and post? Please reach out to me via email and follow me on X/Twitter for updates.
In my next post I’m going to show you how to use the image-edit OpenAI API to edit images, so stay tuned 🤗