The Byte Pair Encoding (BPE) algorithm is a subword tokenization method that starts with character-level units and iteratively merges the most frequent adjacent pairs to build a vocabulary, enabling efficient representation of text for language models.
You are in the right place if you want to understand what the Byte Pair Encoding subword tokenization algorithm is, how to train it,
and how it tokenizes a text.
The BPE algorithm was initially proposed as a text compression algorithm
but it is also very well suited as a tokenizer for your language models.
The idea of BPE is to divide words into a sequence of "subword units" which are units
that appear frequently in a reference corpus - that is, the corpus we used to train it.
How is a BPE tokenizer trained? First of all, we have to get a corpus of texts. We will not
train our tokenizer on this raw text: we first normalize it, then pre-tokenize it. Since
pre-tokenization divides the text into a list of words, we can represent our corpus in another
way by grouping identical words together and keeping a counter for each, here represented in blue.
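This counting step can be sketched with a few lines of Python. The corpus below is a hypothetical stand-in for the video's toy example, and the pre-tokenization is reduced to a simple whitespace split:

```python
from collections import Counter

# Hypothetical toy corpus; real pre-tokenization would also handle
# punctuation and normalization.
corpus = ["hug hug hug hugs", "hugging face", "learn hugging", "learner hugger"]

word_counts = Counter()
for text in corpus:
    # Pre-tokenization here is just a whitespace split.
    word_counts.update(text.split())

print(word_counts)  # each distinct word with its frequency
```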
To understand how the training works, we consider this toy corpus composed of the following words:
huggingface, hugging, hug, hugger, etc. BPE is an algorithm that starts with an initial vocabulary
and then increases it to the desired size.
To build the initial vocabulary, we start by separating each word of the corpus
into a list of the elementary units that compose it: here, the characters.
We could also have chosen bytes as elementary units, but it would have been less visual.
We then list all the characters that appear; these constitute our initial vocabulary!
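A minimal sketch of this step, using hypothetical word counts in place of the video's toy corpus:

```python
# Hypothetical word frequencies standing in for the toy corpus.
word_counts = {"hug": 10, "hugs": 5, "hugging": 5, "learn": 4, "learner": 3}

# Each word is split into a list of single-character tokens.
splits = {word: list(word) for word in word_counts}

# The initial vocabulary is every distinct character in the corpus.
initial_vocab = sorted({ch for word in word_counts for ch in word})
print(initial_vocab)
```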
Let's now see how to increase it. We return to our split corpus: we go through the words
one by one and count all the occurrences of token pairs. The first pair is composed of the tokens "h"
and "u", the second of "u" and "g", and we continue like that until we have the complete list.
Once we know all the pairs and their frequency of appearance, we will choose the one that
appears the most frequently: here it is the pair composed of the letters 'l' and 'e'.
We note our first merging rule and we add the new token to our vocabulary.
We can then apply this merging rule to our splits:
you can see that we have merged all the pairs of tokens composed of the tokens "l" and "e".
And now we just have to reproduce the same steps with our new splits:
we calculate the frequency of occurrence of each pair of tokens,
we select the pair with the highest frequency, we note it in our merge rules,
we add the new token to the vocabulary,
and then we merge all the pairs composed of the tokens "le" and "a" in our splits.
And we can repeat this operation until we reach the desired vocabulary size.
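The whole training loop described above can be sketched as follows. The word counts and target vocabulary size are hypothetical; the logic is the one just walked through: count adjacent pairs, merge the most frequent, record the rule, repeat.

```python
from collections import Counter

# Hypothetical toy word frequencies.
word_counts = {"hug": 10, "hugs": 5, "hugging": 5, "learn": 4, "learner": 3}
splits = {w: list(w) for w in word_counts}
vocab = sorted({c for w in word_counts for c in w})
merges = []

target_vocab_size = 15  # arbitrary stopping point for this sketch
while len(vocab) < target_vocab_size:
    # Count every adjacent pair of tokens, weighted by word frequency.
    pair_counts = Counter()
    for word, count in word_counts.items():
        tokens = splits[word]
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += count
    if not pair_counts:
        break
    # Pick the most frequent pair, record the merge rule, grow the vocab.
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    vocab.append(best[0] + best[1])
    # Apply the new merge rule to every split.
    for tokens in splits.values():
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == best:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1

print(merges)
```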
Here we stopped when our vocabulary reached 21 tokens. We can see that the words of our
corpus are now divided into far fewer tokens than at the beginning of the training:
our algorithm has learned the radicals "hug" and "learn", and also the verbal ending "ing".
Now that we have learned our vocabulary and our merging rules, we can tokenize new texts.
For example, if we want to tokenize the word
hugs: first we'll divide it into elementary units, so it becomes a sequence of characters.
Then we'll go through our merge rules until we find one that we can apply.
Here we can merge the letters "h" and "u". And here we can merge two tokens to get the new token "hug".
When we get to the end of our merge rules, the tokenization is finished.
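The tokenization of a new word can be sketched like this: split the word into characters, then apply each learned merge rule in training order. The merge list below is a hypothetical two-rule example.

```python
def tokenize(word, merges):
    """Tokenize a word by applying BPE merge rules in training order."""
    tokens = list(word)  # start from elementary units (characters)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the matching pair in place
            else:
                i += 1
    return tokens

# Hypothetical learned merge rules.
merges = [("h", "u"), ("hu", "g")]
print(tokenize("hugs", merges))  # → ['hug', 's']
```

Note that the rules must be applied in the order they were learned, since later rules (like `("hu", "g")`) depend on tokens produced by earlier ones.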
And that's it! I hope the BPE algorithm no longer holds any secrets for you!