Subclippy: Edit Videos with Python and AI

[Image: DALL·E 3 — a robot editing film in a vintage-style room]

I've recently spun up a YouTube channel and – my oh my – video editing can be such a grind. And so, being the developer that I am, I thought, "instead of spending two hours editing, why don't I spend two weeks writing a script to do it for me?"

The result is subclip.py – a Python script that uses AssemblyAI and GPT-4 to auto-edit long videos into sub-clips. At a high level, here's how Subclippy works:

  1. Download video and audio from YouTube with pytube
  2. Transcribe the audio with AssemblyAI
  3. Select words/phrases for extraction (manually, or with GPT-4)
  4. Turn phrases into timestamps
  5. Clip videos with FFmpeg

For example, here's a 70 second recap of Jason Liu's talk from AI Engineer Summit, Pydantic is All You Need (17 minutes full-length). The highlights were picked by GPT-4.


In the rest of this post I'll explain how you can use Python, pytube, AssemblyAI, and GPT-4 to auto-clip videos. I'll cover the concepts here, but for the full integrated code, check out this GitHub repo.

Mise en place

Before you use subclip.py, you'll want to create some directories for all the files that get created during this process. Those are:

  • The source video (containing audio and video)
  • An audio file
  • A video_only file in the case of 1080p streams
  • A JSON file to store transcript data
  • A video for every clip generated
  • A final video that stitches all the clips together

To prepare your directory you can:

mkdir source_video audio data clips rendered

You'll need to install FFmpeg. On macOS, that's easiest done with Homebrew:

brew install ffmpeg

Also install three Python packages:

pip install pytube assemblyai openai

Download video (and audio) from YouTube with pytube

To start our auto-edit process, we need two files saved locally:

  • a source video (with audio)
  • an audio-only file extracted from the video that we will send to AssemblyAI for transcription

If you're working with your own videos, you can skip this step. If you want to edit YouTube videos, pytube makes it easy to download audio and video from YouTube with Python.

Let's start with the easiest case: downloading a 720p video stream:

def video_filename():
    return f"source_videos/{BASE_FILENAME}.mp4"
    

def download_720p(url=YOUTUBE_URL):
    yt = YouTube(url)
    # progressive=True selects the legacy streams that bundle audio and video
    video = yt.streams.filter(progressive=True, file_extension="mp4", res="720p").first()
    video.download(filename=video_filename())

Once you have the video, you can extract the audio with FFmpeg:

def audio_filename():
    return f"audio/{BASE_FILENAME}.mp3"

def extract_audio(infile=video_filename(), outfile=audio_filename()):
    command = f"ffmpeg -i {infile} -vn -acodec libmp3lame {outfile}"
    subprocess.run(command, shell=True)

Things get more complicated if you want 1080p...

YouTube's 1080p stream doesn't include audio

YouTube's 1080p stream is video only. From the pytube docs:

You may notice that some streams listed have both a video codec and audio codec, while others have just video or just audio, this is a result of YouTube supporting a streaming technique called Dynamic Adaptive Streaming over HTTP (DASH).
In the context of pytube, the implications are for the highest quality streams; you now need to download both the audio and video tracks and then post-process them with software like FFmpeg to merge them.
The legacy streams that contain the audio and video in a single file (referred to as “progressive download”) are still available, but only for resolutions 720p and below.

There are two reasons we need the audio stream:

  1. To send extracted audio to AssemblyAI
  2. For the created clips and final video

That means that if you want to work with 1080p, you need to acquire a separate audio stream and merge the video and audio together to create the source video.

One method is to download an audio stream on its own:

def download_audio(url=YOUTUBE_URL):
    yt = YouTube(url)
    audio = yt.streams.filter(type="audio")[27]  # see the note on index 27 below
    audio.download(filename=audio_filename())

But there's a problem! Mr. Beast, for example, has a robust international dubbing operation and may include up to thirty (!) audio streams for a single video. Unfortunately, there's no way to filter by language, so you have to download every audio stream and listen to them* to figure out which one to use. In the case of his grocery store video, the English stream is twenty-eighth! (*I suppose you could send all the audio streams to AssemblyAI and use language detection, but I digress...)

Alternatively, since you know that the 720p stream includes audio, you can just download the 720p version, extract the audio, and then combine that audio with the video-only 1080p stream.
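FFmpeg can do the merge by copying the video stream untouched and re-encoding the mp3 audio to AAC. Here's a hedged sketch of what the merge_audio_and_video() helper used below might look like (the codec choices, filenames, and run flag are my own assumptions, not the repo's exact code):

```python
import subprocess

def merge_audio_and_video(video_in="source_videos/clip_video_only.mp4",
                          audio_in="audio/clip.mp3",
                          outfile="source_videos/clip.mp4",
                          run=True):
    # Copy the 1080p video stream as-is and encode the mp3 to AAC,
    # muxing both into a single mp4. -shortest trims any length mismatch.
    command = [
        "ffmpeg", "-i", video_in, "-i", audio_in,
        "-c:v", "copy", "-c:a", "aac", "-shortest", outfile,
    ]
    if run:
        subprocess.run(command, check=True)
    return command
```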

With that in mind, here's a script that will result in a source video (with audio) in either 720p or 1080p, along with extracted audio:

import os

from pytube import YouTube

YOUTUBE_URL = "https://www.youtube.com/watch?v=tnTPaLOaHz8"
BASE_FILENAME = "mrbeast_grocery"

def video_filename():
    return f"source_videos/{BASE_FILENAME}.mp4"


def audio_filename():
    return f"audio/{BASE_FILENAME}.mp3"


def video_only_filename():
    return f"source_videos/{BASE_FILENAME}_video_only.mp4"
    

def download_1080p(url=YOUTUBE_URL):
    yt = YouTube(url)
    video = yt.streams.filter(file_extension="mp4", res="1080p").first()
    video.download(filename=video_only_filename())
    merge_audio_and_video()


def download_720p(url=YOUTUBE_URL):
    yt = YouTube(url)
    video = yt.streams.filter(progressive=True, file_extension="mp4", res="720p").first()
    video.download(filename=video_filename())


def download_video(res="720p"):
    download_720p()
    extract_audio()
    if res == "1080p":
        download_1080p()

if __name__ == "__main__":
    YOUTUBE_URL = "https://www.youtube.com/watch?v=tnTPaLOaHz8"
    BASE_FILENAME = "mrbeast_grocery"
    
    if not os.path.exists(video_filename()):
        download_video(res="720p")

Notice that we avoid re-downloading the video. You'll probably be running this script over and over as you iterate on your editing settings. If you can avoid unnecessarily repeating work, you'll save a lot of time (when downloading and editing) and money (when hitting APIs). Also, working with 1080p is a lot slower than 720p, so you probably want to do your iterating on the lower-res version and switch to high-res once you have everything dialed in.

With video and audio in place, we can now get a transcription.

Transcribe audio with AssemblyAI

AssemblyAI provides state-of-the-art transcription via a developer-friendly API. In the spirit of transparency: AssemblyAI is also a client of mine, and this blog post came about while I was creating paid video content for them.

To get started, sign up for an AssemblyAI account, find your API key on the dashboard, and set it as an environment variable in your terminal:

export AAI_API_KEY=xxxxxxxxxxxxxxx

This code turns an audio file into a transcript object:

import os
import assemblyai as aai

def transcribe():
    aai.settings.api_key = os.environ.get("AAI_API_KEY")
    config = aai.TranscriptionConfig(
        speaker_labels=True, auto_highlights=True
    )
    transcriber = aai.Transcriber(config=config)
    transcript = transcriber.transcribe(audio_filename())
    return transcript

A few notes:

  • transcript.text stores the raw text of the transcript.
  • transcript.words stores a list of word objects, each with text, start, end, and confidence values. start and end are timestamps measured in milliseconds. We'll use those later.
  • speaker_labels=True turns on speaker detection, which is helpful for podcasts and interviews. It creates a transcript.utterances list with speakers indicated as Speaker A, Speaker B, etc. This is how I retrieved all the questions Tim Ferriss asked, but not his guest's answers.
  • auto_highlights=True creates a transcript.auto_highlights list of the most important phrases in the transcript. Each phrase has a frequency count and the start and end timestamps of each occurrence.
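As a quick sketch of what speaker labels enable, here's how you might pull out one speaker's lines once utterances are stored as dictionaries (the helper name and sample data are mine):

```python
def utterances_for_speaker(utterances, speaker="A"):
    # Keep only the utterances attributed to one speaker, e.g. just the
    # host's questions in an interview.
    return [u for u in utterances if u["speaker"] == speaker]

utterances = [
    {"speaker": "A", "text": "What books do you recommend?", "start": 0, "end": 2100},
    {"speaker": "B", "text": "Anything by Le Guin.", "start": 2200, "end": 4000},
]
print(utterances_for_speaker(utterances, "A"))
```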

Another note about working with AssemblyAI: it gets expensive if you're not careful. To be clear – transcribing an audio file with AssemblyAI once is great value at ~$0.65 per hour. But if you unnecessarily re-transcribe every time you run your script, you're going to burn through credit fast.

To avoid repeatedly transcribing the same audio file, you'll want to store the transcript.id and use aai.Transcript.get_by_id(transcript_id) to fetch the data on subsequent runs.

"Does that mean I need to set up a datastore?" Well, no and yes.

You can find a log of all your transcripts on your AssemblyAI dashboard under Async Transcription. So, if you lose track of a transcript ID, you can always pop over there and copy-paste.

But yeah, at some point you'll want to store the transcript id and, while you're at it, you might as well save the transcript data you care about. For my purposes, I simply write a json file to the data/ directory.

import json

def data_filename():
    return f"data/{BASE_FILENAME}.json"


def write_data(data):
    with open(data_filename(), 'w') as f:
        json.dump(data, f, indent=4)


def load_data():
    with open(data_filename(), 'r') as f:
        return json.load(f)
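Here's a minimal sketch of that caching decision, assuming the transcript id is saved in the JSON data file (the helper name is my own): if a saved id turns up, the caller can fetch the transcript with aai.Transcript.get_by_id() instead of paying for a new transcription.

```python
import json
import os

def get_transcript_id(data_file):
    # Return a saved transcript id if the data file exists, else None,
    # so the caller can choose between fetching and re-transcribing.
    if os.path.exists(data_file):
        with open(data_file) as f:
            return json.load(f).get("transcript_id")
    return None
```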

I extract the data I care about from the transcript object, with a few modifications:

def get_transcript_data(transcript): 
    data = {}
    data['youtube_url'] = YOUTUBE_URL
    data['transcript_id'] = transcript.id
    data['transcript'] = transcript.text
    data['duration'] = transcript.audio_duration
    
    data['utterances'] = []
    for utterance in transcript.utterances:
        data['utterances'].append({
            "start": utterance.start,
            "end": utterance.end,
            "speaker": utterance.speaker,
            "duration": int(utterance.end) - int(utterance.start),
            "text": utterance.text,
        })
    
    data['words'] = []
    for word in transcript.words:
        data['words'].append({
            "start": word.start,
            "end": word.end,
            "text": word.text.lower().replace(".", "").replace("!", "").replace("?", ""),
            "confidence": word.confidence,
        })
    
    data['highlights'] = []
    for result in transcript.auto_highlights.results:
        timestamps = []
        for t in result.timestamps:
            timestamps.append({"start": t.start, "end": t.end})
        
        data['highlights'].append({
            "text": result.text,
            "count": result.count,
            "rank": result.rank,
            "timestamps": timestamps,
        })

    return data

To put it all together:

YOUTUBE_URL = "https://www.youtube.com/watch?v=yj-wSRJwrrc"
BASE_FILENAME = "pydantic_is_all_you_need"

if __name__ == "__main__":    
    if not os.path.exists(video_filename()):
        download_video(res="720p")

    if not os.path.exists(data_filename()):
        transcript = transcribe()
        data = get_transcript_data(transcript)
        write_data(data)
    else:
        data = load_data()

Now we have a transcript and metadata stored in data, which has all the text and timestamps we need to decide which parts of the video to clip.

Select words and phrases for extraction

I've experimented with a few strategies to identify phrases for extraction, ranging from fully manual to letting GPT-4 decide what to cut.

Supercuts of specific words

Here's every mention of "money|cash|dollars" in Mr. Beast's grocery store video:


This is the most straightforward approach. AssemblyAI's transcript object has transcript.words – a list of word objects. Each word has:

  • word.text - the text of the word (including any punctuation attached to it – important to know when matching)
  • word.start - when the word starts, measured in milliseconds
  • word.end - when the word ends, measured in milliseconds

Once converted to my data dictionary, a word looks like this:

[
...
{
    "start": 26764,
    "end": 26886,
    "text": "right",
    "confidence": 0.99968,
},
...
]

This function takes data and needles (either a string or a list of strings) and returns word dictionaries that match the needles.

def find_words(data, needles):
    # accept a single needle or a list of them
    if isinstance(needles, str):
        needles = [needles]
    needles = [n.lower() for n in needles]
    found = []
    for w in data['words']:
        if w['text'].lower() in needles:
            found.append(w)
    return found

For the first video example, where we extracted every mention of money from the Mr. Beast video, the function call looked like this:

data = load_data()
needles = ["money", "$10,000", "dollars", "$450,000", "cash"]
words = find_words(data, needles)

Rapper's Delight

A few years ago the Jimmy Fallon show produced this beautiful cut of Brian Williams (ft. Lester Holt) "performing" Rapper's Delight.

The idea is straightforward: for a list of words, extract a clip of each word, and string them together. Here's one made with subclip.py from Bluey clips, saying "I love you and can't wait to see you soon."


As you can tell, editing a video with this technique gives you a very good start, but to get anywhere near the level of polish demonstrated in the Fallon video you'll need to clean it up manually.

Since we're more or less looking for a single instance of specific words, we can create a Rapper's Delight video with a slight modification of our previous code:

def get_words_to_make_phrase(data, phrase):
    word_list = []
    phrase = phrase.lower()
    for w in phrase.split(" "):
        words = find_words(data, w)
        if not words:
            raise Exception("Could not find word: ", w)
        # iterate over the matches and keep the one with the highest confidence
        max_confidence = 0
        best_word = None
        for word in words:
            if word["confidence"] > max_confidence:
                max_confidence = word["confidence"]
                best_word = word
        word_list.append(best_word)
    return word_list

Similar to find_words() above, this returns a list of "words," each word being a dictionary that contains values for text, start, end, and confidence. A couple of notes:

  • It's possible that a word you want doesn't show up in the transcript. We raise an error if so.
  • There may be multiple occurrences of a word you're looking for. We use the one with the highest confidence.
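That "highest confidence wins" selection can also be written with max(), which avoids the bookkeeping variables entirely; a small sketch with made-up word data:

```python
# Two transcript matches for the same word; keep the most confident one.
words = [
    {"text": "love", "confidence": 0.91, "start": 100, "end": 300},
    {"text": "love", "confidence": 0.99, "start": 900, "end": 1100},
]
best_word = max(words, key=lambda w: w["confidence"])
print(best_word["start"])  # 900
```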

Auto-highlights

In the previous examples, I hardcoded words/phrases for extraction. You can alternatively use AssemblyAI's auto_highlights to identify key phrases from the video, which are stored in data['highlights']:

"highlights": [
        {
            "text": "language models",
            "count": 14,
            "rank": 0.07,
            "timestamps": [
                {
                    "start": 28892,
                    "end": 29922
                },
                {
                    "start": 45792,
                    "end": 46662
                },
                {
                    "start": 64132,
                    "end": 64778
                },

The tricky part here is that we're now looking for phrases rather than single words, so matching gets a little more complicated. The nice thing is that AssemblyAI gives us the timestamps for the whole phrase, which makes clipping easier later on.

Select phrases with GPT-4

The extraction we've done so far has long been possible with a transcript and timestamps, using a tool like videogrep. More exciting to me: you can use an LLM to identify important/interesting/funny quotes from the transcript.

I'll use the GPT-4 API here. With its recently increased token limits, you can cram a surprisingly large transcript into the context. Crucially, you need GPT-4 to return exact quotes so that you can hunt them down in the transcript. I've also turned on JSON mode so that I can easily add the responses back to my data.

import json

from openai import OpenAI

def ask_gpt(transcript_text, prompt=""):
    MODEL = "gpt-4-1106-preview"
    client = OpenAI()

    # Double braces so the f-string emits literal JSON braces.
    sys_msg = f"""
{prompt}
Return results in JSON in this format:
{{"phrases": ["What is your name?"]}}
"""

    messages = [
        {"role": "system", "content": sys_msg},
    ]

    messages.append({"role": "user", "content": transcript_text})
    print("Asking GPT...", messages)
    response = client.chat.completions.create(
        model=MODEL,
        response_format={"type": "json_object"},
        messages=messages,
    )
    str_response = response.choices[0].message.content
    data = json.loads(str_response)
    return data

A prompt might look something like this:

    prompt = """
            This is a transcript from a youtube video.
            Extract the most interesting quotes from this transcript.
            Each quote should be 100-300 words. 
            A subclip of the video will be created for each quote.
            that subclip will be posted to tiktok.
            please create subclips that are interesting and will go viral.
            The quote should have enough context to stand on its own. 
            extract only exact quotes. 
            do not modify the quote in any way. 
            """

To pull it all together:

def get_phrases(data, prompt=None):
    if not data.get("phrases"):
        data["phrases"] = []
    new_phrases = ask_gpt(data["transcript"], prompt=prompt)
    for p in new_phrases["phrases"]:
        data["phrases"].append({"text": p})

    write_data(data)
    return data

This saves the phrases to disk and makes sure I don't repeatedly spend OpenAI credits if I don't have to.

Also notice that for each phrase, I'm creating a dictionary with a key of text. This is so that I can add on keys of start and end in the next step...

Moving on

We just looked at four strategies to identify clips from a transcript.

Turn phrases into timestamps

In the first two examples above, creating timestamps is pretty easy – we have a list of "words", and each word has an associated start and end:

"words": [
        {
            "start": 16970,
            "end": 17086,
            "confidence": 0.7752,
            "text": "be"
        },
        {
            "start": 17108,
            "end": 17198,
            "confidence": 1.0,
            "text": "one"
        },
        {
            "start": 17204,
            "end": 17278,
            "confidence": 1.0,
            "text": "of"
        }
        ...
        
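Once you've matched a list of words, reducing them to bare timestamps (the shape the slicing code later in the post expects) is a one-liner; a small sketch with a hypothetical helper:

```python
def timestamps_from_words(words):
    # Reduce word dictionaries to just the start/end pairs used for clipping.
    return [{"start": w["start"], "end": w["end"]} for w in words]

words = [
    {"start": 16970, "end": 17086, "confidence": 0.7752, "text": "be"},
    {"start": 17108, "end": 17198, "confidence": 1.0, "text": "one"},
]
print(timestamps_from_words(words))
# [{'start': 16970, 'end': 17086}, {'start': 17108, 'end': 17198}]
```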

Similarly, it's easy to get timestamps for auto-highlights, because AssemblyAI returns them to you in the transcript data.

def get_timestamps_for_highlights(data):
    timestamps = []
    for h in data["highlights"]:
        for t in h["timestamps"]:
            timestamps.append(
                {
                    "start": t.get("start"),
                    "end": t.get("end"),
                }
            )
    return timestamps

It's trickier to find the start and end for an arbitrary multi-word phrase. Here you need to scan through the full words list and find the sequence of words that match your phrase.

This function searches data['words'] for a phrase and returns a tuple of (start, end), measured in milliseconds. It raises an exception if it can't find an exact match. Comparisons are performed on lowercased words stripped of punctuation.

def find_exact_stamp(data, phrase):
    # Clean up the phrase text.
    phrase_text = phrase['text'].lower().translate(str.maketrans('', '', ',.!?'))
    phrase_words = phrase_text.split()

    # Iterate through words in data to find the matching phrase.
    for i in range(len(data['words']) - len(phrase_words) + 1):
        if all(data['words'][i + j]['text'] == phrase_words[j] for j in range(len(phrase_words))):
            phrase_start = int(data['words'][i]['start'])
            phrase_end = int(data['words'][i + len(phrase_words) - 1]['end'])

            if phrase_end < phrase_start:
                raise Exception(f"ERROR: End time {phrase_end} is less than start time {phrase_start} for phrase:\n{phrase_text}")

            return phrase_start, phrase_end

    # Phrase not found.
    raise Exception(f"ERROR: Could not find exact stamp for phrase:\n{phrase_text}")

We will often be searching for multiple phrases, and it's helpful to save the start and end to data for debugging. You have a choice on what to do if you don't find a phrase – LLMs aren't always precise in returning exact quotes. Currently I'm opting to print the failure and continue with the rest of the phrases until I figure out a better option.

def get_timestamps_for_phrases(data):
    found_phrases = []
    for p in data["phrases"]:
        try:
            start, end = find_exact_stamp(data, p)
            p["start"] = int(start)
            p["end"] = int(end)
            found_phrases.append(p)
        except Exception:
            print("Could not find exact stamp for phrase: ", p["text"])
    data["phrases"] = found_phrases
    return data

Now you have one of:

  • a list of "words" (dictionaries with start and end keys)
  • a list of "phrases" (ditto)
  • a list of timestamps

Each word or phrase is a dictionary with text, start, and end fields. Let's use those timestamps to make some subclips.

Clip videos using FFmpeg

This function will slice a single video using FFmpeg.

import subprocess

def slice_video(source, start, end, buffer=50, filename=None):
    if not filename:
        raise Exception("Filename is required")
    # AssemblyAI timestamps are milliseconds; FFmpeg wants seconds.
    start = (start - buffer) / 1000
    end = (end + buffer) / 1000
    if start < 0:
        start = 0
    print("Slicing video from", start, "to", end, "into", filename)
    command = ['ffmpeg', '-i', source, '-ss', str(start), '-to', str(end), '-reset_timestamps', '1', filename]
    subprocess.run(command, check=True)

Timestamps returned by AssemblyAI are quite precise, and it can sometimes be useful to add a buffer, either to grab some whitespace, or to add auditory context. Also, FFmpeg expects timestamps expressed in seconds, not milliseconds.

Once you can slice a single video, you can slice a list of timestamps:

def slice_by_timestamps(timestamps=[], buffer=50):
    source_video = f"source_videos/{BASE_FILENAME}.mp4"
    for counter, t in enumerate(timestamps):
        ipad = str(counter).zfill(3)
        outfile = f"clips/{BASE_FILENAME}_{ipad}.mp4"
        slice_video(source_video, t['start'], t['end'], buffer=buffer, filename=outfile)

Or you can slice a list of word dictionaries:

def clip_filename(i):
    return f"clips/{BASE_FILENAME}_{str(i).zfill(3)}.mp4"

def slice_by_words(words, buffer=50):
    for i, w in enumerate(words):
        slice_video(
            video_filename(),
            w["start"],
            w["end"],
            buffer=buffer,
            filename=clip_filename(i),
        )

Once you have clips, you can stitch them all together in a final video:

def rendered_filename():
    return f"rendered/{BASE_FILENAME}.mp4"


def stitch_clips():
    import os
    import subprocess

    clips_dir = "clips/"
    clips = [
        clips_dir + clip
        for clip in os.listdir(clips_dir)
        if clip.endswith(".mp4") and clip.startswith(BASE_FILENAME)
    ]
    clips.sort()

    with open("file_list.txt", "w") as f:
        for clip in clips:
            f.write(f"file '{clip}'\n")

    subprocess.run(
        ["ffmpeg", "-f", "concat", "-i", "file_list.txt", "-c", "copy", rendered_filename()]
    )
    os.remove("file_list.txt")

To pull it all together for our simplest example, single words:

def clip_and_stitch_from_needles(data, needles=""):
    word_list = []
    for needle in needles.split(" "):
        words = find_words(data, needle)
        word_list.extend(words)

    # sort word_list by word['start']
    word_list.sort(key=lambda x: int(x["start"]))
    slice_by_words(word_list, buffer=100)
    stitch_clips()

For Rapper's Delight:

def clip_and_stitch_from_phrase(data, phrase):
    words = get_words_to_make_phrase(data, phrase)
    slice_by_words(words, buffer=50)
    stitch_clips()

For auto-highlights:

def clip_and_stitch_from_highlights(data):
    timestamps = get_timestamps_for_highlights(data)
    slice_by_timestamps(timestamps, buffer=50)
    stitch_clips()

And to generate subclips with GPT-4:

def clip_and_stitch_from_prompt(data, prompt=None):
    if not data.get("phrases"):
        data["phrases"] = []
        data = get_phrases(data, prompt=prompt)
        write_data(data)

    data = get_timestamps_for_phrases(data)
    write_data(data)

    slice_by_timestamps(data["phrases"], buffer=150)
    stitch_clips()

Next Steps

This feels like scratching the surface of what's possible. There are a few applications I can think of, but I'd love to hear the internet's take on others:

  • Organizing archives of video content: podcasts, lectures, newscasts
  • Creating short vertical videos from longform content for social media
  • Creating recaps of your favorite podcast

If you enjoyed this tutorial please subscribe to this newsletter or my YouTube channel. I'd also love it if you shared this on HackerNews, Reddit, or X.
