
How To Tokenize Custom Text Data Using the BERT Tokenizer

In this post, we’ll build a simple Python script for tokenizing custom text data using the BERT tokenizer from TensorFlow Text. It’s an indispensable tool for getting the best possible results on natural language processing tasks.

Its main advantage is that it interpolates between word-based and character-based tokenization. Instead of treating every word as an atomic token, it splits rare words into smaller wordpieces, so we end up with a compact vocabulary that can still represent words it has never seen.
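
To make that concrete, here’s a tiny self-contained sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers perform. The vocabulary and words below are made up purely for illustration; the real vocabulary is learned from data, as we’ll do later in this post.

# toy vocabulary; '##' marks a piece that continues a word
toy_vocab = {'un', 'play', '##play', '##ing', '##ful', '[UNK]'}

def wordpiece_split(word, vocab):
    """Split one word into subword tokens, longest known match first."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_split('playful', toy_vocab))    # ['play', '##ful']
print(wordpiece_split('unplaying', toy_vocab))  # ['un', '##play', '##ing']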

Setup

First of all, we’ll need a dataset to work with, so we’re going to download the Human Conversation training data from Kaggle. We’ll do it through the Kaggle API and let our Python code do all the work.

Before we get ahead of ourselves, though, we need to import all the necessary Python packages into our script.

In case you haven’t installed everything yet, here is a list of pip commands for all the packages we’ll need. Keep in mind that the tensorflow-text version has to match your tensorflow version.

pip install pandas
pip install kaggle
pip install tensorflow
pip install tensorflow-text
Then, at the top of the script, we import everything we’ll be using.

import os
import pandas as pd
import tensorflow as tf
import tensorflow_text as text
from kaggle.api.kaggle_api_extended import KaggleApi
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

Next, we’re going to download the dataset and save it inside a data folder in the project directory. Note that the Kaggle API authenticates with the credentials in your kaggle.json file, so make sure that’s set up before running the script.

ROOT = os.path.dirname(__file__)

api = KaggleApi()
api.authenticate()

api.dataset_download_files(
    'projjal1/human-conversation-training-data',
    path=os.path.join(ROOT, 'data'),
    unzip=True
)

Preparing the dataset

Before we can start tokenizing our data, we need to wrangle it a little bit. The download gives us a .txt file containing a conversation between two people, where each line is prefixed with either Human 1 or Human 2.

We’ll need to separate these lines by speaker, and it’s important that we don’t shuffle the data in the process.

with open(os.path.join(ROOT, 'data', 'human_chat.txt'), 'r') as data_file:
    data = data_file.readlines()
    human1, human2 = [], []

    # split the conversation by speaker while preserving the original order
    for line in data:
        if line.startswith('Human 1'):
            human1.append(line.replace('Human 1:', '').replace('\n', ''))
        else:
            human2.append(line.replace('Human 2:', '').replace('\n', ''))

    # both DataFrame columns below need to have the same length
    total_lines = min(len(human1), len(human2))
    
train_examples = pd.DataFrame()
train_examples['human_1'] = human1[:total_lines]
train_examples['human_2'] = human2[:total_lines]

print(train_examples)

train_human1 = tf.data.Dataset.from_tensor_slices(train_examples['human_1'].values)
train_human2 = tf.data.Dataset.from_tensor_slices(train_examples['human_2'].values)

Creating BERT tokenizer instances

Now that we have our data ready, we need to build a wordpiece token vocabulary for each of the two speakers. We’ll save each vocabulary into a .txt file inside the data folder.

Once the vocabularies are ready, we create the BertTokenizer instances from them.

# the tokenizer will lowercase all input text
bert_tokenizer_params = dict(lower_case=True)
# special tokens that get reserved slots at the start of the vocabulary
reserved_tokens = ['[PAD]', '[UNK]', '[START]', '[END]']

bert_vocab_args = dict(
    # target size of the generated vocabulary
    vocab_size=8000,
    reserved_tokens=reserved_tokens,
    # arguments passed on to text.BertTokenizer
    bert_tokenizer_params=bert_tokenizer_params,
    # arguments for the wordpiece learning algorithm (defaults are fine here)
    learn_params={}
)

def write_vocab_file(filepath, vocab):
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)

human1_vocab = bert_vocab.bert_vocab_from_dataset(
    train_human1.batch(1000).prefetch(2),
    **bert_vocab_args
)

human1_vocab_path = os.path.join(ROOT, 'data', 'human1_vocab.txt')
write_vocab_file(human1_vocab_path, human1_vocab)

human2_vocab = bert_vocab.bert_vocab_from_dataset(
    train_human2.batch(1000).prefetch(2),
    **bert_vocab_args
)

human2_vocab_path = os.path.join(ROOT, 'data', 'human2_vocab.txt')
write_vocab_file(human2_vocab_path, human2_vocab)

human1_tokenizer = text.BertTokenizer(human1_vocab_path, **bert_tokenizer_params)
human2_tokenizer = text.BertTokenizer(human2_vocab_path, **bert_tokenizer_params)
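
If you’re curious what ends up in those vocabularies, an optional sanity check is to read one of the saved files back. Since write_vocab_file writes one token per line, each line is a single vocabulary entry; the reserved tokens should appear first, followed by single characters and the learned wordpieces.

with open(human1_vocab_path, 'r') as f:
    vocab_entries = [line.strip() for line in f]

print(len(vocab_entries))    # total number of tokens in the vocabulary
print(vocab_entries[:10])    # the reserved tokens come first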

Demonstrating the BERT tokenizer

Alright, with our tokenizers ready, we could use them to train natural language processing models. Here, we’re just going to show how they process text: converting it into numerical token IDs, looking those IDs up in the vocabulary, and turning them back into words.

# demonstrate the tokenizer on the first batch of 3 sentences
for p in train_human1.batch(3).take(1):
    # print the raw input sentences
    for ex in p:
        print(ex.numpy())

    # tokenize returns a ragged tensor of shape (batch, words, wordpieces);
    # merging the last two dimensions gives one flat token list per sentence
    token_batch = human1_tokenizer.tokenize(p)
    token_batch = token_batch.merge_dims(-2, -1)

    # print the numerical token ids for each sentence
    for ex in token_batch.to_list():
        print(ex)

    # look up the wordpiece text for each token id
    txt_tokens = tf.gather(human1_vocab, token_batch)
    print(tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1))
    # detokenize reassembles the wordpieces back into full words
    words = human1_tokenizer.detokenize(token_batch)
    print(tf.strings.reduce_join(words, separator=' ', axis=-1))

Here’s the output. Notice that pieces prefixed with ## are wordpiece continuations of the previous token, and detokenize merges them back into whole words:

b' Hi!'
b' one where I get to meet lots of different people.'
b' Hard to keep a count. Maybe 25.'
[79, 4]
[152, 148, 38, 158, 66, 120, 154, 145, 65, 72, 33, 175, 179, 179, 88, 150, 238, 12]
[317, 66, 40, 73, 73, 126, 30, 32, 110, 313, 77, 12, 42, 93, 300, 16, 348, 12]
tf.Tensor(
[b'hi !'
 b'one where i get to me ##et lot ##s of d ##i ##f ##f ##er ##ent people .'
 b'hard to k ##e ##e ##p a c ##o ##un ##t . m ##ay ##be 2 ##5 .'], shape=(3,), dtype=string)
tf.Tensor(
[b'hi !' b'one where i get to meet lots of different people .'
 b'hard to keep a count . maybe 25 .'], shape=(3,), dtype=string)
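
As a quick extra check (this isn’t part of the output above), the same tokenizer handles sentences it has never seen, falling back to smaller wordpieces or [UNK] where needed. Here’s a minimal sketch, reusing the same calls from the snippet above with a made-up sentence:

sample = tf.constant(['This sentence was not in the dataset!'])
sample_tokens = human1_tokenizer.tokenize(sample).merge_dims(-2, -1)
print(sample_tokens.to_list())
print(tf.strings.reduce_join(human1_tokenizer.detokenize(sample_tokens), separator=' ', axis=-1))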

Here is also the entire code of this project:

import os
import pandas as pd
import tensorflow as tf
import tensorflow_text as text
from kaggle.api.kaggle_api_extended import KaggleApi
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

ROOT = os.path.dirname(__file__)

api = KaggleApi()
api.authenticate()

api.dataset_download_files(
    'projjal1/human-conversation-training-data',
    path=os.path.join(ROOT, 'data'),
    unzip=True
)

with open(os.path.join(ROOT, 'data', 'human_chat.txt'), 'r') as data_file:
    data = data_file.readlines()
    human1, human2 = [], []

    for line in data:
        if line.startswith('Human 1'):
            human1.append(line.replace('Human 1:', '').replace('\n', ''))
        else:
            human2.append(line.replace('Human 2:', '').replace('\n', ''))

    total_lines = min(len(human1), len(human2))
    
train_examples = pd.DataFrame()
train_examples['human_1'] = human1[:total_lines]
train_examples['human_2'] = human2[:total_lines]

print(train_examples)

train_human1 = tf.data.Dataset.from_tensor_slices(train_examples['human_1'].values)
train_human2 = tf.data.Dataset.from_tensor_slices(train_examples['human_2'].values)

bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ['[PAD]', '[UNK]', '[START]', '[END]']

bert_vocab_args = dict(
    vocab_size=8000,
    reserved_tokens=reserved_tokens,
    bert_tokenizer_params=bert_tokenizer_params,
    learn_params={}
)

def write_vocab_file(filepath, vocab):
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)

human1_vocab = bert_vocab.bert_vocab_from_dataset(
    train_human1.batch(1000).prefetch(2),
    **bert_vocab_args
)

human1_vocab_path = os.path.join(ROOT, 'data', 'human1_vocab.txt')
write_vocab_file(human1_vocab_path, human1_vocab)

human2_vocab = bert_vocab.bert_vocab_from_dataset(
    train_human2.batch(1000).prefetch(2),
    **bert_vocab_args
)

human2_vocab_path = os.path.join(ROOT, 'data', 'human2_vocab.txt')
write_vocab_file(human2_vocab_path, human2_vocab)

human1_tokenizer = text.BertTokenizer(human1_vocab_path, **bert_tokenizer_params)
human2_tokenizer = text.BertTokenizer(human2_vocab_path, **bert_tokenizer_params)

# demonstrate tokenizer
for p in train_human1.batch(3).take(1):
    for ex in p:
        print(ex.numpy())

    token_batch = human1_tokenizer.tokenize(p)
    token_batch = token_batch.merge_dims(-2, -1)

    for ex in token_batch.to_list():
        print(ex)

    txt_tokens = tf.gather(human1_vocab, token_batch)
    print(tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1))
    words = human1_tokenizer.detokenize(token_batch)
    print(tf.strings.reduce_join(words, separator=' ', axis=-1))

Conclusion

To conclude, we made a simple Python script that preprocesses a conversation dataset and tokenizes it with the BERT tokenizer. I learned a lot while working on this project and I hope you find it helpful as well.
