
How to Load Text into Machine Learning Model

Today, we’ll learn how to load and preprocess text data so a machine learning model can accept it and learn from it efficiently.

Whenever we build such models, we need to prepare the data. For a model to accept text, we first need to convert it into numbers, because machine learning models are, in essence, complex formulas with adjustable variables.

In this tutorial, we’re going to work with Google’s TensorFlow, an open-source machine learning framework. It’s widely used for building machine learning models, including ones that process text data.

Prepare text data for a machine learning model

As mentioned before, we need to convert text into numbers that a model can consume. We can go about this in a few different ways, such as one-hot encoding, bag of words, and word embeddings. The toy sketch below illustrates the first two.
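Here is a tiny sketch, with a made-up vocabulary and sentence purely for illustration, that shows how the same sentence looks under one-hot encoding versus bag of words:

# A toy vocabulary and sentence, purely for illustration.
vocab = ['i', 'love', 'machine', 'learning']
sentence = ['i', 'love', 'learning', 'learning']

# One-hot encoding: each token becomes a vector with a single 1.
one_hot = [[1 if token == word else 0 for word in vocab] for token in sentence]
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]]

# Bag of words: one vector per sentence, counting word occurrences.
bag_of_words = [sentence.count(word) for word in vocab]
# [1, 1, 0, 2]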

First of all, we’ll need to import all the necessary Python libraries.

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

# These two packages ship separately from TensorFlow; install them with
# `pip install tensorflow-datasets tensorflow-text` if they are missing.
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

Step 1: Load the text data into TensorFlow machine learning library

We’re going to demonstrate this process with a multi-class classification model. We’ll work with a Stack Overflow example dataset and predict which programming language a question is about.

First of all, we’re going to download the dataset archive by using tf.keras.utils.get_file.

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

# get_file returns the path to the downloaded archive; the extracted
# 'train' and 'test' folders live next to it, so move up to the parent.
dataset_dir = pathlib.Path(dataset_dir).parent
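If we want to make sure the download worked, we can list the extracted directory; we should see the 'train' and 'test' folders. This check is optional:

# List the extracted contents to confirm 'train' and 'test' exist.
for item in sorted(dataset_dir.iterdir()):
  print(item.name)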

Next, we need to define and split this dataset into training, validation and testing sets.

We’re going to use the text data in the ‘train’ folder for both training and validation by splitting it: 80% of the data goes to training and 20% to validation.

Because datasets tend to be large, we usually split them into batches for better performance, and in some cases so that the computer is able to load and process them at all.

train_dir = dataset_dir/'train'

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
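Before moving on, it can be useful to peek at a few raw examples and at the class names the loader inferred from the sub-folder names. This inspection step is optional:

# Class names come from the sub-folder names inside 'train'.
print(raw_train_ds.class_names)

# Show a snippet of a few questions together with their integer labels.
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print('Question:', text_batch.numpy()[i][:100])
    print('Label:', label_batch.numpy()[i])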

For testing, we’re going to use the text data in the ‘test’ folder.

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

Step 2: Preprocess text data

The second step in this process is to convert the text data into numerical data in order to prepare it for a model. To achieve that, we’re going to standardize, tokenize and vectorize the data.

But what does it all mean anyway?

Firstly, standardization is a process that removes punctuation or HTML elements in order to simplify the data.

Secondly, tokenization takes care of splitting sentences into individual words or tokens.

And thirdly, vectorization is a process that converts these tokens into numbers, which we can use to train a neural network.
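To see what standardization and tokenization do in isolation, here is a small sketch using TensorFlow’s string operations; the sample sentence and the regular expressions are our own, for illustration only:

# A raw sentence with punctuation and an HTML tag, for illustration.
sample = tf.constant('Hello, <br/>World! How are you?')

# Standardization: lowercase, strip HTML break tags, drop punctuation.
lowered = tf.strings.lower(sample)
no_html = tf.strings.regex_replace(lowered, '<br */?>', ' ')
cleaned = tf.strings.regex_replace(no_html, '[^a-z ]', '')

# Tokenization: split the cleaned sentence on whitespace.
tokens = tf.strings.split(cleaned)
print(tokens)  # [b'hello', b'world', b'how', b'are', b'you']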

Conveniently, all of the tasks we mentioned above are handled by a single layer, tf.keras.layers.TextVectorization. We then need to adapt this layer to our dataset so it can learn the vocabulary.

VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Build the vocabulary by fitting the layer on the training text only.
train_text = raw_train_ds.map(lambda text, labels: text)
int_vectorize_layer.adapt(train_text)

def int_vectorize_text(text, label):
  # TextVectorization expects a batch of strings, so add an inner dimension.
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)
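To sanity-check the result, we can vectorize a single question and map a few of the integer ids back to tokens through the layer’s learned vocabulary:

# Vectorize the first question of the first raw batch.
text_batch, label_batch = next(iter(raw_train_ds))
int_text, int_label = int_vectorize_text(text_batch[0], label_batch[0])
print('Vectorized question:', int_text[0][:10])

# get_vocabulary() returns tokens ordered by frequency.
vocab = int_vectorize_layer.get_vocabulary()
print('Id 1 ->', vocab[1])  # '[UNK]', the out-of-vocabulary token
print('Id 2 ->', vocab[2])  # the most frequent token in the corpus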

And for the final step in preprocessing, we configure our datasets for performance: cache() keeps the data in memory after the first epoch, and prefetch() overlaps data preprocessing with model execution.

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
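With that, the data is ready for a model to consume. As a final check that everything fits together, here is a minimal sketch of a model that trains on these integer sequences; the layer sizes and epoch count are arbitrary choices for illustration, not tuned values:

# Embed the integer tokens, average-pool over the sequence, classify.
model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE + 1, 64, mask_zero=True),
    layers.GlobalAveragePooling1D(),
    layers.Dense(4)])  # one output per programming language class

model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)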

Conclusion

Overall, loading text data for machine learning models is quite a straightforward process with TensorFlow.

I hope this tutorial helped you gain a better understanding of how to import text data, and maybe even inspired you to learn more about machine learning.
