The Byte Pair Encoding (BPE) algorithm is a subword tokenization method that starts with character-level units and iteratively merges the most frequent adjacent pairs to build a vocabulary, enabling efficient representation of text for language models.
You are in the right place if you want to understand what the Byte Pair Encoding subword tokenization algorithm is, how to train it,
and how it tokenizes a text.
The BPE algorithm was initially proposed as a text compression algorithm
but it is also very well suited as a tokenizer for your language models.
The idea of BPE is to divide words into a sequence of "subword units" which are units
that appear frequently in a reference corpus - that is, the corpus we used to train it.
How is a BPE tokenizer trained? First of all, we have to get a corpus of texts. We will not
train our tokenizer on this raw text: we first normalize it, then pre-tokenize it. Since
pre-tokenization divides the text into a list of words, we can represent our corpus in another
way by grouping identical words together and keeping a counter for each, here represented in blue.
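This counting step can be sketched with a few lines of Python. The corpus below is a hypothetical stand-in for the video's toy example, and the pre-tokenization is reduced to a simple whitespace split:

```python
from collections import Counter

# Hypothetical toy corpus; real pre-tokenization would also handle
# punctuation and normalization.
corpus = ["hug hug hug hugs", "hugging face", "learn hugging", "learner hugger"]

word_counts = Counter()
for text in corpus:
    # Pre-tokenization here is just a whitespace split.
    word_counts.update(text.split())

print(word_counts)  # each distinct word with its frequency
```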
To understand how the training works, we consider this toy corpus composed of the following words:
huggingface, hugging, hug, hugger, etc. BPE is an algorithm that starts with an initial vocabulary
and then increases it to the desired size.
To build the initial vocabulary, we start by separating each word of the corpus
into a list of the elementary units that compose it: here, the characters.
We could also have chosen bytes as elementary units, but it would have been less visual.
We then list all the characters that appear; these constitute our initial vocabulary!
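A minimal sketch of this step, using hypothetical word counts in place of the video's toy corpus:

```python
# Hypothetical word frequencies standing in for the toy corpus.
word_counts = {"hug": 10, "hugs": 5, "hugging": 5, "learn": 4, "learner": 3}

# Each word is split into a list of single-character tokens.
splits = {word: list(word) for word in word_counts}

# The initial vocabulary is every distinct character in the corpus.
initial_vocab = sorted({ch for word in word_counts for ch in word})
print(initial_vocab)
```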
Let's now see how to increase it. We return to our split corpus: we go through the words
one by one and count all the occurrences of token pairs. The first pair is composed of the tokens "h"
and "u", the second of "u" and "g", and we continue like that until we have the complete list.
Once we know all the pairs and their frequency of appearance, we will choose the one that
appears the most frequently: here it is the pair composed of the letters 'l' and 'e'.
We note our first merging rule and we add the new token to our vocabulary.
We can then apply this merging rule to our splits:
you can see that we have merged all the pairs of tokens composed of the tokens "l" and "e".
And now we just have to reproduce the same steps with our new splits:
we calculate the frequency of occurrence of each pair of tokens,
we select the pair with the highest frequency, we note it in our merge rules,
we add the new token to the vocabulary,
and then we merge all the pairs composed of the tokens "le" and "a" in our splits.
And we can repeat this operation until we reach the desired vocabulary size.
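The whole training loop described above can be sketched as follows. The word counts and target vocabulary size are hypothetical; the logic is the one just walked through: count adjacent pairs, merge the most frequent, record the rule, repeat.

```python
from collections import Counter

# Hypothetical toy word frequencies.
word_counts = {"hug": 10, "hugs": 5, "hugging": 5, "learn": 4, "learner": 3}
splits = {w: list(w) for w in word_counts}
vocab = sorted({c for w in word_counts for c in w})
merges = []

target_vocab_size = 15  # arbitrary stopping point for this sketch
while len(vocab) < target_vocab_size:
    # Count every adjacent pair of tokens, weighted by word frequency.
    pair_counts = Counter()
    for word, count in word_counts.items():
        tokens = splits[word]
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += count
    if not pair_counts:
        break
    # Pick the most frequent pair, record the merge rule, grow the vocab.
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    vocab.append(best[0] + best[1])
    # Apply the new merge rule to every split.
    for tokens in splits.values():
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == best:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1

print(merges)
```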
Here we stopped when our vocabulary reached 21 tokens. We can see that the words of our
corpus are now divided into far fewer tokens than at the beginning of the training:
our algorithm has learned the radicals "hug" and "learn", and also the verbal ending "ing".
Now that we have learned our vocabulary and our merging rules, we can tokenize new texts.
For example, if we want to tokenize the word
hugs: first we'll divide it into elementary units, so it becomes a sequence of characters.
Then we'll go through our merge rules until we find one that we can apply.
Here we can merge the letters "h" and "u". And here we can merge two tokens to get the new token "hug".
When we get to the end of our merge rules, the tokenization is finished.
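The tokenization of a new word can be sketched like this: split the word into characters, then apply each learned merge rule in training order. The merge list below is a hypothetical two-rule example.

```python
def tokenize(word, merges):
    """Tokenize a word by applying BPE merge rules in training order."""
    tokens = list(word)  # start from elementary units (characters)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the matching pair in place
            else:
                i += 1
    return tokens

# Hypothetical learned merge rules.
merges = [("h", "u"), ("hu", "g")]
print(tokenize("hugs", merges))  # → ['hug', 's']
```

Note that the rules must be applied in the order they were learned, since later rules (like `("hu", "g")`) depend on tokens produced by earlier ones.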
And that's it! I hope the BPE algorithm no longer holds any secrets for you!