Core Theme
The Byte Pair Encoding (BPE) algorithm is a subword tokenization method that starts with character-level units and iteratively merges the most frequent adjacent pairs to build a vocabulary, enabling efficient representation of text for language models.
You are in the right place if you want to understand what the Byte Pair Encoding (BPE) subword tokenization algorithm is, how to train it, and how a text is tokenized with this algorithm.
The BPE algorithm was initially proposed as a text compression algorithm
but it is also very well suited as a tokenizer for your language models.
The idea of BPE is to divide words into sequences of "subword units", which are units that appear frequently in a reference corpus, that is, the corpus used to train it.
How is a BPE tokenizer trained? First of all, we have to get a corpus of texts. We will not train our tokenizer on this raw text: we will first normalize it and then pre-tokenize it. Since the pre-tokenization divides the text into a list of words, we can represent our corpus in another way by gathering identical words together and maintaining a count for each of them, shown here in blue.
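Here is a minimal Python sketch of this word-counting step; the raw texts and the resulting counts below are illustrative assumptions, not the video's exact corpus:

```python
from collections import Counter

# Toy corpus; in practice the text would first be normalized and then
# pre-tokenized into words (here we simply split on whitespace).
corpus = ["hug hug hugs", "hugging face", "hugger hugging"]

# Gather identical words together and maintain a count for each of them.
word_freqs = Counter()
for text in corpus:
    word_freqs.update(text.split())

print(word_freqs)
# Counter({'hug': 2, 'hugging': 2, 'hugs': 1, 'face': 1, 'hugger': 1})
```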
To understand how the training works, we consider this toy corpus composed of the following words:
huggingface, hugging, hug, hugger, etc. BPE is an algorithm that starts with an initial vocabulary
and then increases it to the desired size.
To build the initial vocabulary, we start by separating each word of the corpus into the list of elementary units that compose it, here the characters. We could also have chosen bytes as elementary units, but it would have been less visual. We list all the characters that appear in the corpus; they constitute our initial vocabulary!
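Continuing the sketch, the initial vocabulary can be built by splitting each word into characters and collecting the distinct ones, reusing the hypothetical `word_freqs` from above:

```python
# Split every word into its elementary units (characters here) and
# collect the alphabet that forms the initial vocabulary.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({char for word in word_freqs for char in word})

print(splits["hugs"])  # ['h', 'u', 'g', 's']
print(vocab)  # ['a', 'c', 'e', 'f', 'g', 'h', 'i', 'n', 'r', 's', 'u']
```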
Let's now see how to increase it. We return to our split corpus: we go through the words one by one and count all the occurrences of token pairs. The first pair is composed of the tokens "h" and "u", the second of "u" and "g", and we continue like that until we have the complete list.
Once we know all the pairs and their frequency of appearance, we choose the one that appears most frequently: here it is the pair composed of the tokens "l" and "e". We record our first merge rule and add the new token to our vocabulary.
We can then apply this merge rule to our splits: you can see that we have merged every pair of tokens composed of "l" and "e".
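Here is one way this pair-counting and merging step could look in Python. The helpers `count_pairs` and `apply_merge` are hypothetical names, and note that with the toy counts sketched above the most frequent pair would be ('h', 'u') rather than the video's ('l', 'e'):

```python
def count_pairs(splits, word_freqs):
    """Count every adjacent pair of tokens, weighted by word frequency."""
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for a, b in zip(tokens, tokens[1:]):
            pair_freqs[(a, b)] += freq
    return pair_freqs

def apply_merge(pair, splits):
    """Merge every occurrence of `pair` inside the current splits."""
    a, b = pair
    for tokens in splits.values():
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i : i + 2] = [a + b]  # replace the pair by one token
            else:
                i += 1
    return splits

# Pick the most frequent pair, record it as a merge rule, apply it.
best_pair, _ = count_pairs(splits, word_freqs).most_common(1)[0]
splits = apply_merge(best_pair, splits)
```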
And now we just have to repeat the same steps with our new splits: we calculate the frequency of occurrence of each pair of tokens, we select the pair with the highest frequency, we record it in our merge rules, we add the new token to the vocabulary, and then we merge all the pairs composed of the tokens "le" and "a" in our splits.
And we can repeat this operation until we reach the desired vocabulary size.
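Putting the pieces together, a minimal training loop could look like the sketch below, reusing the hypothetical `count_pairs` and `apply_merge` helpers from above; the target vocabulary size is an arbitrary choice:

```python
def train_bpe(word_freqs, vocab_size):
    """Grow the vocabulary one merge rule at a time until the target size."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({char for word in word_freqs for char in word})
    merges = []  # ordered list of merge rules
    while len(vocab) < vocab_size:
        pair_freqs = count_pairs(splits, word_freqs)
        if not pair_freqs:
            break  # every word is a single token, nothing left to merge
        best_pair, _ = pair_freqs.most_common(1)[0]
        merges.append(best_pair)
        vocab.append(best_pair[0] + best_pair[1])
        splits = apply_merge(best_pair, splits)
    return vocab, merges

vocab, merges = train_bpe(word_freqs, vocab_size=16)
```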
Here we stopped when our vocabulary reached 21 tokens. We can see that the words of our corpus are now divided into far fewer tokens than at the beginning of training: our algorithm has learned the roots "hug" and "learn" as well as the verbal ending "ing".
Now that we have learned our vocabulary and our merge rules, we can tokenize new texts.
For example, if we want to tokenize the word "hugs": first we divide it into elementary units, so that it becomes a sequence of characters. Then we go through our merge rules in order and apply each one we can: here we can merge the letters "h" and "u", and then merge the two tokens "hu" and "g" to get the new token "hug". When we reach the end of our merge rules, the tokenization is finished.
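In code, tokenizing a new word amounts to splitting it into characters and replaying the learned merge rules in training order; here is a sketch using the hypothetical `merges` list produced by the training loop above:

```python
def tokenize(word, merges):
    """Tokenize a word by replaying the merge rules in the order learned."""
    tokens = list(word)  # start from elementary units (characters)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i : i + 2] = [a + b]
            else:
                i += 1
    return tokens

print(tokenize("hugs", merges))  # ['hug', 's'] with the merges learned above
```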
And that's it! I hope that the BPE algorithm no longer holds any secrets for you!