Notes on NLP – AV/IT Blog

Tokenization

Tokenization is the process of breaking down raw text data into smaller, meaningful units called tokens. These tokens are used as the basic building blocks for natural language processing (NLP) tasks such as language modeling, text classification, and machine translation. Tokenization can be performed using various techniques, such as whitespace tokenization, regular expression tokenization, and subword tokenization. The choice of tokenization technique depends on the specific use case and the language being processed. The resulting tokens are then used to create a vocabulary that maps each token to a unique integer index, which is used by machine learning algorithms to process text data.

Tokenization Methods

Word-based tokenization: This method splits text into words based on spaces and punctuations. It’s the most basic form of tokenization and is used in many NLP tasks. However, it can be problematic for languages where there are no clear word boundaries, such as Chinese or Japanese.
Subword-based tokenization: This method splits text into subwords based on their frequency of occurrence in the training data. Subwords are parts of words that frequently occur together, such as prefixes, suffixes, and common word fragments. This method can handle words that are not in the dictionary, and it’s commonly used in transformer-based models like BERT and GPT-2.
Character-based tokenization: This method splits text into individual characters. It’s useful for languages where there are no clear word boundaries, but it can produce longer sequences than word-based or subword-based tokenization.
Byte-pair encoding (BPE) tokenization: This method is similar to subword-based tokenization, but instead of using pre-defined subwords, it learns the subwords on the fly based on the input text. It starts by treating each character as a separate token and then iteratively merges the most frequent pairs of tokens until a maximum vocabulary size is reached.
Sentencepiece tokenization: This method is similar to BPE tokenization, but it uses a more sophisticated algorithm that takes into account the likelihood of a sequence of tokens occurring together. It can handle languages with complex writing systems like Chinese and Japanese.
Unigram tokenization: This method is similar to BPE tokenization, but it uses a different algorithm that learns the subwords based on the likelihood of a sequence of characters occurring together. It’s faster and produces smaller vocabularies than BPE, but it may not perform as well on languages with complex writing systems.

Building Dictionary Using Tokenizer Function of HuggingFace

from tokenizers import Tokenizer, trainers, pre_tokenizers, decoders, models

# Initialize a tokenizer with Byte-Pair Encoding
tokenizer = Tokenizer(models.BPE())

# Set special tokens
tokenizer.add_special_tokens(["<PAD>", "<UNK>", "<BOS>", "<EOS>"])

# Initialize a trainer with default settings and set max vocab size to 150
trainer = trainers.BpeTrainer(special_tokens=["<PAD>", "<UNK>", "<BOS>", "<EOS>"], vocab_size=150)

tokenizer.train(["sentences.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")

# Encode a sentence
encoded = tokenizer.encode("This is an example sentence.")
print(encoded.tokens)
# return ['T', 'hi', 's ', 'is ', 'an', ' ', 'e', 'x', 'a', 'mp', 'l', 'e s', 'en', 't', 'enc', 'e', '.']

print(len(tokenizer.get_vocab()))
# return 150

from tokenizers import Tokenizer, trainers, pre_tokenizers, decoders, models

# Initialize a tokenizer with Byte-Pair Encoding

tokenizer = Tokenizer(models.BPE())

# Set special tokens

tokenizer.add_special_tokens(["<PAD>", "<UNK>", "<BOS>", "<EOS>"])

# Initialize a trainer with default settings and set max vocab size to 150

trainer = trainers.BpeTrainer(special_tokens=["<PAD>", "<UNK>", "<BOS>", "<EOS>"], vocab_size=150)

tokenizer.train(["sentences.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")

# Encode a sentence

encoded = tokenizer.encode("This is an example sentence.")

print(encoded.tokens)

# return ['T', 'hi', 's ', 'is ', 'an', ' ', 'e', 'x', 'a', 'mp', 'l', 'e s', 'en', 't', 'enc', 'e', '.']

print(len(tokenizer.get_vocab()))

# return 150

Sentencepiece tokenization: This method is similar to BPE tokenization, but it uses a more sophisticated algorithm that takes into account the likelihood of a sequence of tokens occurring together. It can handle languages with complex writing systems like Chinese and Japanese.
Unigram tokenization: This method is similar to BPE tokenization, but it uses a different algorithm that learns the subwords based on the likelihood of a sequence of characters occurring together. It's faster and produces smaller vocabularies than BPE, but it may not perform as well on languages with complex writing systems.

Sentencepiece tokenization: This method is similar to BPE tokenization, but it uses a more sophisticated algorithm that takes into account the likelihood of a sequence of tokens occurring together. It can handle languages with complex writing systems like Chinese and Japanese.

Unigram tokenization: This method is similar to BPE tokenization, but it uses a different algorithm that learns the subwords based on the likelihood of a sequence of characters occurring together. It's faster and produces smaller vocabularies than BPE, but it may not perform as well on languages with complex writing systems.