Core Theme
The Byte Pair Encoding (BPE) algorithm is a subword tokenization method that starts with character-level units and iteratively merges the most frequent adjacent pairs to build a vocabulary, enabling efficient representation of text for language models.
You are in the right place if you want to understand what the Byte Pair Encoding (BPE) subword tokenization algorithm is, how to train it, and how a text is tokenized with this algorithm.
The BPE algorithm was initially proposed as a text compression algorithm
but it is also very well suited as a tokenizer for your language models.
The idea of BPE is to divide words into sequences of "subword units", which are units that appear frequently in a reference corpus, that is, the corpus used to train it.
How is a BPE tokenizer trained? First of all, we have to get a corpus of texts. We will not train our tokenizer on this raw text: we will first normalize it and then pre-tokenize it. Since the pre-tokenization divides the text into a list of words, we can represent our corpus in another way by gathering identical words together and maintaining a count for each of them, shown here in blue.
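Here is a minimal Python sketch of this word-counting step; the raw texts and the resulting counts below are illustrative assumptions, not the video's exact corpus:

```python
from collections import Counter

# Toy corpus; in practice the text would first be normalized and then
# pre-tokenized into words (here we simply split on whitespace).
corpus = ["hug hug hugs", "hugging face", "hugger hugging"]

# Gather identical words together and maintain a count for each of them.
word_freqs = Counter()
for text in corpus:
    word_freqs.update(text.split())

print(word_freqs)
# Counter({'hug': 2, 'hugging': 2, 'hugs': 1, 'face': 1, 'hugger': 1})
```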
To understand how the training works, we consider this toy corpus composed of the following words:
huggingface, hugging, hug, hugger, etc. BPE is an algorithm that starts with an initial vocabulary
and then increases it to the desired size.
To build the initial vocabulary, we start by separating each word of the corpus into the list of elementary units that compose it, here the characters. We could also have chosen bytes as elementary units, but it would have been less visual. We list all the characters that appear in the corpus; they constitute our initial vocabulary!
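Continuing the sketch, the initial vocabulary can be built by splitting each word into characters and collecting the distinct ones, reusing the hypothetical `word_freqs` from above:

```python
# Split every word into its elementary units (characters here) and
# collect the alphabet that forms the initial vocabulary.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({char for word in word_freqs for char in word})

print(splits["hugs"])  # ['h', 'u', 'g', 's']
print(vocab)  # ['a', 'c', 'e', 'f', 'g', 'h', 'i', 'n', 'r', 's', 'u']
```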
Let's now see how to increase it. We return to our split corpus: we go through the words one by one and count all the occurrences of token pairs. The first pair is composed of the tokens "h" and "u", the second of "u" and "g", and we continue like that until we have the complete list.
Once we know all the pairs and their frequency of appearance, we choose the one that appears most frequently: here it is the pair composed of the tokens "l" and "e". We record our first merge rule and add the new token to our vocabulary.
We can then apply this merge rule to our splits: you can see that we have merged every pair of tokens composed of "l" and "e".
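Here is one way this pair-counting and merging step could look in Python. The helpers `count_pairs` and `apply_merge` are hypothetical names, and note that with the toy counts sketched above the most frequent pair would be ('h', 'u') rather than the video's ('l', 'e'):

```python
def count_pairs(splits, word_freqs):
    """Count every adjacent pair of tokens, weighted by word frequency."""
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for a, b in zip(tokens, tokens[1:]):
            pair_freqs[(a, b)] += freq
    return pair_freqs

def apply_merge(pair, splits):
    """Merge every occurrence of `pair` inside the current splits."""
    a, b = pair
    for tokens in splits.values():
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i : i + 2] = [a + b]  # replace the pair by one token
            else:
                i += 1
    return splits

# Pick the most frequent pair, record it as a merge rule, apply it.
best_pair, _ = count_pairs(splits, word_freqs).most_common(1)[0]
splits = apply_merge(best_pair, splits)
```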
And now we just have to repeat the same steps with our new splits: we calculate the frequency of occurrence of each pair of tokens, we select the pair with the highest frequency, we record it in our merge rules, we add the new token to the vocabulary, and then we merge all the pairs composed of the tokens "le" and "a" in our splits.
And we can repeat this operation until we reach the desired vocabulary size.
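Putting the pieces together, a minimal training loop could look like the sketch below, reusing the hypothetical `count_pairs` and `apply_merge` helpers from above; the target vocabulary size is an arbitrary choice:

```python
def train_bpe(word_freqs, vocab_size):
    """Grow the vocabulary one merge rule at a time until the target size."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({char for word in word_freqs for char in word})
    merges = []  # ordered list of merge rules
    while len(vocab) < vocab_size:
        pair_freqs = count_pairs(splits, word_freqs)
        if not pair_freqs:
            break  # every word is a single token, nothing left to merge
        best_pair, _ = pair_freqs.most_common(1)[0]
        merges.append(best_pair)
        vocab.append(best_pair[0] + best_pair[1])
        splits = apply_merge(best_pair, splits)
    return vocab, merges

vocab, merges = train_bpe(word_freqs, vocab_size=16)
```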
Here we stopped when our vocabulary reached 21 tokens. We can see that the words of our corpus are now divided into far fewer tokens than at the beginning of training: our algorithm has learned the roots "hug" and "learn" as well as the verbal ending "ing".
Now that we have learned our vocabulary and our merge rules, we can tokenize new texts.
For example, if we want to tokenize the word "hugs": first we divide it into elementary units, so that it becomes a sequence of characters. Then we go through our merge rules in order and apply each one we can: here we can merge the letters "h" and "u", and then merge the two tokens "hu" and "g" to get the new token "hug". When we reach the end of our merge rules, the tokenization is finished.
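In code, tokenizing a new word amounts to splitting it into characters and replaying the learned merge rules in training order; here is a sketch using the hypothetical `merges` list produced by the training loop above:

```python
def tokenize(word, merges):
    """Tokenize a word by replaying the merge rules in the order learned."""
    tokens = list(word)  # start from elementary units (characters)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i : i + 2] = [a + b]
            else:
                i += 1
    return tokens

print(tokenize("hugs", merges))  # ['hug', 's'] with the merges learned above
```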
And that's it! I hope that the BPE algorithm no longer holds any secrets for you!