In the second post of the OpenAI blog post series, I’m going to build another feature powered by the OpenAI transcription API. The feature is simple: we’re building an audiofile upload with ActiveStorage. After the file has been uploaded successfully, we’re going to use the OpenAI transcription endpoint to get the audiofile’s text content.
The transcription endpoint
Let’s start with the OpenAI API part. The API has an endpoint that expects an audiofile and returns a transcription of the text spoken in the audiofile. Let’s use the ruby-openai gem again to talk to the transcription endpoint with a sample file.
I’m going to assume you have already set up the ruby-openai gem. You can read my previous post about Server-Sent Events and the Chat API for a basic setup.
I created a simple class that expects a file and initializes an OpenAI::Client; calling #transcribe fires off the HTTP request to the API.
class AudioTranscriber
  def initialize(file)
    @file = file
    @client = OpenAI::Client.new
  end

  def transcribe(response_format: :verbose_json)
    @client.audio.transcribe(
      parameters: {
        model: "whisper-1",
        file: @file,
        response_format: response_format
      }
    )
  end
end
Notice how I’m setting the response_format to :verbose_json. This option lets the API respond with a more detailed response than the default :json format, which would basically only return the transcribed text. Take a look at the :verbose_json response for my sample audiofile versus the :json format:
# test in the rails console
file = File.open('sample_audio.m4a')
AudioTranscriber.new(file).transcribe
# :verbose_json
{
  "task" => "transcribe",
  "language" => "german",
  "duration" => 3.82,
  "text" => "Das ist ein Test.",
  "segments" => [
    {
      "id" => 0,
      "seek" => 0,
      "start" => 0.0,
      "end" => 2.0,
      "text" => " Das ist ein Test.",
      "tokens" => [50364, 2846, 1418, 1343, 9279, 13, 50464],
      "temperature" => 0.0,
      "avg_logprob" => -0.47552117705345154,
      "compression_ratio" => 0.68,
      "no_speech_prob" => 0.03590516746044159
    }
  ]
}

# :json
{
  "text" => "Das ist ein Test."
}
The API transcribed my audiofile, detected the spoken language and split the audio into segments with markers for where each segment starts and ends. This endpoint is straightforward and super easy to work with.
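Since the :verbose_json response already contains per-segment timestamps, a small illustrative sketch (not part of the API, just reshaping the hash shown above) can turn the segments into timestamped lines:
# Illustrative only: build timestamped lines from the :verbose_json segments
# of the response shown above.
response = AudioTranscriber.new(File.open("sample_audio.m4a")).transcribe

response["segments"].map do |segment|
  format("[%.1fs - %.1fs]%s", segment["start"], segment["end"], segment["text"])
end
# => ["[0.0s - 2.0s] Das ist ein Test."]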
In the next section we’re going to integrate the AudioTranscriber with ActiveStorage.
ActiveStorage analyzers
Let’s quickly look at the basic direct file upload flow of ActiveStorage with a focus on ActiveStorage::Analyzers.
When you upload a file with ActiveStorage, it creates an ActiveStorage::Attachment join model with one id pointing to your original model and another pointing to an ActiveStorage::Blob record. If you are using the direct upload functionality, the blob record has already been created before the file starts uploading to the remote storage.
After submitting a form with an ActiveStorage-connected file upload in the browser, the Attachment record is created and connected to both the original model and the previously created blob record. A callback on Attachment enqueues an ActiveStorage::AnalyzeJob, which calls #analyze on the blob record.
The #analyze method makes use of the different analyzer classes. To analyze the blob, ActiveStorage looks through the ActiveStorage.analyzers array and finds an analyzer class that accepts the current blob. This is the “Chain of Responsibility” pattern, where each analyzer decides whether it is going to analyze the given blob. The code for finding the relevant analyzer is straightforward: ActiveStorage.analyzers.detect { |klass| klass.accept?(self) }.
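If you are curious how this looks inside Rails, here is a heavily simplified, paraphrased sketch of the relevant ActiveStorage::Blob methods (not the exact source; details vary between Rails versions):
# Heavily simplified sketch of what happens inside ActiveStorage::Blob when the
# AnalyzeJob runs (paraphrased from the Rails source; details vary by version).
class ActiveStorage::Blob
  def analyze
    update! metadata: metadata.merge(extract_metadata_via_analyzer)
  end

  def analyzed?
    metadata[:analyzed]
  end

  private

  def extract_metadata_via_analyzer
    analyzer.metadata.merge(analyzed: true)
  end

  def analyzer
    analyzer_class.new(self)
  end

  def analyzer_class
    # "Chain of Responsibility": the first analyzer that accepts the blob wins
    ActiveStorage.analyzers.detect { |klass| klass.accept?(self) } || ActiveStorage::Analyzer::NullAnalyzer
  end
end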
Once the matching analyzer class is found, its only job is to return a hash of metadata that will be added to the blob’s metadata column.
Adding a custom analyzer step by step
Let’s make this more concrete and see what happens step by step.
Let’s create a simple Podcast model that has one attached audiofile, using a generator:
$ rails generate scaffold podcast title audio_file:attachment
# migrate
$ rails db:migrate
We now have a basic scaffold and a Podcast model that declares has_one_attached :audio_file.
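For reference, the relevant part of the generated model is simply:
# app/models/podcast.rb (generated by the scaffold above)
class Podcast < ApplicationRecord
  has_one_attached :audio_file
end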
Let’s update our _form.html.erb partial to use direct uploads.
<%= form.file_field :audio_file, direct_upload: true %>
Make sure you actually include the ActiveStorage JavaScript, otherwise your direct uploads will fail. In my sample app I’m using the importmap-rails gem, so I added
# Add this to config/importmap.rb
pin "@rails/activestorage", to: "activestorage.esm.js"
and
// e.g. in app/javascript/application.js
import { start } from "@rails/activestorage"
start()
for the ActiveStorage JavaScript to work.
Next, go to localhost:3000/podcasts/new, add an audiofile to the file input, and submit the form. If we watch the logs carefully, we see the following steps happening in this order:
1. A POST request is sent to /rails/active_storage/direct_uploads, which creates a blob record with metadata: nil
2. A POST request is sent to /podcasts, which creates a Podcast record, and also an ActiveStorage::Attachment record that references our newly created podcast record and our previously created ActiveStorage::Blob record
3. An ActiveStorage::AnalyzeJob is enqueued, which downloads the audiofile from the storage service and writes to the blob’s metadata column after analyzing the downloaded file (you can re-run this step by hand, see the console sketch below)
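If you want to poke at step 3 yourself, you can trigger the analysis by hand from the Rails console instead of waiting for the job. A quick sketch, assuming you just created a podcast through the form:
# Quick console sketch (assumes a podcast with an uploaded audiofile exists):
# re-run step 3 by hand instead of waiting for ActiveStorage::AnalyzeJob.
blob = Podcast.last.audio_file.blob

blob.analyzed? # => true once an analyzer has written its metadata
blob.analyze   # downloads the file and runs the matching analyzer inline
blob.metadata  # the analyzer's hash, merged into the blob's metadata column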
Now that we understand the basic flow, let’s modify step 3 to work with a custom analyzer.
Writing a custom analyzer
Instead of relying on the audio analyzer that ships with ActiveStorage as a default, I’m going to create an analyzer that talks with the OpenAI API and adds the language and spoken text to the blob’s metadata.
Add the following code to an initializer:
Rails.application.config.active_storage.analyzers.delete ActiveStorage::Analyzer::AudioAnalyzer
Rails.application.config.active_storage.analyzers.push(AudioWhisperAnalyzer)
We’re modifying the analyzers configuration: we delete the default ActiveStorage AudioAnalyzer and push our own AudioWhisperAnalyzer.
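A quick way to verify the change (once the analyzer class from the next section exists) is to inspect the analyzer chain in the Rails console. The exact default analyzers depend on your Rails version, but it should look roughly like this:
# The default analyzer list varies by Rails version; the important part is that
# AudioAnalyzer is gone and our AudioWhisperAnalyzer sits at the end of the chain.
ActiveStorage.analyzers
# => [ActiveStorage::Analyzer::ImageAnalyzer::Vips,
#     ActiveStorage::Analyzer::ImageAnalyzer::ImageMagick,
#     ActiveStorage::Analyzer::VideoAnalyzer,
#     AudioWhisperAnalyzer]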
To create our own analyzer we have to inherit from ActiveStorage::Analyzer and implement the following two methods:
- .accept?(blob) is a class method that decides if our analyzer chooses to handle the blob (Chain of Responsibility). The first analyzer that returns something truthy here will take over and run its analysis.
- #metadata returns a hash that will get merged into the blob’s metadata column.
class AudioWhisperAnalyzer < ActiveStorage::Analyzer
  # accept? is a class method on ActiveStorage::Analyzer, which is why the
  # chain calls klass.accept?(self) on the class itself
  def self.accept?(blob)
    blob.audio?
  end

  def metadata
    { some: 'metadata' }
  end
end
If you upload an audiofile now and access its metadata with Podcast.first.audio_file.blob.metadata, you’ll see the some: 'metadata' value.
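In the console that looks something like this; note that ActiveStorage mixes a few bookkeeping keys of its own into the same hash, and the exact set can vary:
# Rough console output: our value plus ActiveStorage's own bookkeeping keys.
Podcast.first.audio_file.blob.metadata
# => { "identified" => true, "some" => "metadata", "analyzed" => true }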
As you can probably guess by now, this is the perfect place for us to use our AudioTranscriber, and that’s exactly what I’m going to do. Because I still want to keep most of the original ActiveStorage::Analyzer::AudioAnalyzer’s functionality, I’m going to subclass from it instead of ActiveStorage::Analyzer.
class AudioWhisperAnalyzer < ActiveStorage::Analyzer::AudioAnalyzer
  def metadata
    super.merge(text: @text, language: @language).compact
  end

  private

  def probe_from(file)
    super.tap do
      response = AudioTranscriber.new(file).transcribe
      @text = response["text"]
      @language = response["language"]
    end
  end
end
I recommend reading the complete implementation of ActiveStorage::Analyzer::AudioAnalyzer. In a nutshell it does the following (a condensed sketch follows the list):
- It downloads the uploaded file from the service with #download_blob_to_tempfile
- It runs ffprobe against the Tempfile and parses its output
- The output is returned as a hash in #metadata
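Here is that condensed sketch, paraphrased rather than copied from the Rails source (the returned metadata keys differ between versions):
# Paraphrased outline of ActiveStorage::Analyzer::AudioAnalyzer (not the exact
# Rails source). Overriding #probe_from lets us piggyback on the tempfile that
# has already been downloaded for ffprobe.
class ActiveStorage::Analyzer::AudioAnalyzer < ActiveStorage::Analyzer
  def metadata
    # duration, bit_rate etc. are read from the memoized ffprobe output
    { duration: duration, bit_rate: bit_rate }.compact
  end

  private

  def probe
    # downloads the blob once and memoizes the parsed ffprobe output
    @probe ||= download_blob_to_tempfile { |file| probe_from(file) }
  end

  def probe_from(file)
    # shells out to ffprobe against the tempfile and parses its JSON output
  end
end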
My code is basically hooking into this analysis with #tap and transcribing the file with my AudioTranscriber class. After receiving the audiofile’s transcription and language, I add both to the metadata with #merge.
Uploaded audiofiles will now be analyzed with our AudioWhisperAnalyzer and we can access the metadata with @podcast.audio_file.blob.metadata.slice("language", "text") in our Rails app.
Roadblocks
I bumped into some roadblocks when implementing this feature:
The first one was that only the first matching ActiveStorage::Analyzer analyzes a blob, even if more analyzers would have matched. I think one improvement for the analyzers API could be to allow multiple analyzers to analyze a file. This would have made the AudioWhisperAnalyzer implementation more straightforward than subclassing and extending an existing analyzer.
The second roadblock was that the ruby-openai gem did not work with Tempfiles, but I created a PR for ruby-openai and @alexrudall helped me out in record time and released a fixed version. If you want to learn something new about Tempfiles, you can read my comment here.
Recap
The OpenAI transcription endpoint is a super simple way to transcribe text from audiofiles.
ActiveStorage allows us to implement custom file analyzers. They run asynchronously in the background and analyze a file, adding values to the metadata column on ActiveStorage::Blob.
In the end we did not write a lot of code, partly because I skipped error handling. As promised, I’m going to release another post where I’ll talk about ways to handle errors for the OpenAI API.
How do you like the idea and post? Please reach out to me via email and follow me on X/Twitter for updates.
In my next post I’m going to show you how to use the image-edit OpenAI API to edit images, so stay tuned 🤗