How to Use Extractive Question Answering Models in Python
In this post, we’re going to write a Python script that demonstrates how extractive question answering models work. We’ll use a sample dataset from Kaggle containing various product descriptions.
We’re going to use the transformers library to get the question answering model. You might come across some issues setting it up, but don’t worry, I’ll do my best to guide you through the process.
In order to make the model run on your GPU, you’ll need CUDA installed. This part used to be quite a pain to figure out, but these days you can simply install the CUDA libraries alongside TensorFlow with a single pip command.
pip3 install tensorflow[and-cuda]
In case you run into an exception while importing pipeline from transformers, try pinning a specific version.
pip install transformers==4.11.3
For reference, I’m using Python 3.9.16, in case the interpreter version makes a difference in getting this to work.
Download and import dataset
As we mentioned before, we’re going to download and use a dataset from Kaggle, and we’ll handle this step within the script itself. In case you’re not familiar with how the Kaggle API works, you should check out my guide on downloading Kaggle datasets.
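One quick note on authentication: the Kaggle client looks for your API token in ~/.kaggle/kaggle.json. If you haven’t set it up yet, a typical setup looks like this (the Downloads path is just an example, use wherever your browser saved the token):

```shell
# Place the API token downloaded from your Kaggle account settings
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
# Restrict permissions, otherwise the client prints a warning
chmod 600 ~/.kaggle/kaggle.json
```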
But before we get to coding the method for fetching the data, we need to import all the modules and methods we’ll use throughout the project.
import os
import pandas as pd
from transformers import pipeline
from kaggle.api.kaggle_api_extended import KaggleApi
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
Now that we’ve handled the imports, we’re ready to define a method that downloads the product descriptions dataset and loads it into a dataframe object using pandas.
ROOT = os.path.dirname(__file__)

def get_data():
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_file(
        'cclark/product-item-data',
        file_name='sample-data.csv',
        path=ROOT
    )
    df = pd.read_csv(os.path.join(ROOT, 'sample-data.csv'))
    df.set_index('id', inplace=True)
    return df
Assessing text with extractive question answering
Alright, with our dataset ready, we can assess each product description with a question answering model. We’ll ask 4 questions about each description and save the output into a pandas dataframe. Note that we truncate each description to its first 512 characters to keep the context within the model’s limits.
And at the end we’ll calculate the average score values, so we get one confidence score for each product. To clarify, this confidence score indicates how well the extracted answer matches the question being asked.
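For reference, each call to the question-answering pipeline returns a dict with a confidence score, the character offsets of the answer span, and the extracted text. The values below are illustrative, not real model output:

```python
# Illustrative shape of a question-answering pipeline result
result = {
    'score': 0.6213,           # model confidence in the extracted span
    'start': 14,               # character offset where the answer starts in the context
    'end': 29,                 # character offset where the answer ends
    'answer': 'stainless steel',  # the extracted text span itself
}

# The span offsets always match the length of the extracted answer
assert result['end'] - result['start'] == len(result['answer'])
```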
def get_eqa_scores(data, questions):
    nlp = pipeline("question-answering")
    df_items = pd.DataFrame(columns=['product_id', 'question', 'answer', 'score'])
    for index, row in tqdm(data.iterrows()):
        text = row['description'][:512]
        for question in questions:
            result = nlp(question=question, context=text)
            item = {
                'product_id': index,
                'question': question,
                'answer': result['answer'],
                'score': round(result['score'], 4)
            }
            df_items = pd.concat([df_items, pd.DataFrame([item])], ignore_index=True)
    print(df_items.head(50))
    df_results = df_items.groupby('product_id').agg(
        avg_score=('score', 'mean')
    )
    return df_results
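If the groupby aggregation at the end feels opaque, the averaging it performs boils down to this minimal sketch in plain Python (the per-question scores for the two hypothetical products are made up for illustration):

```python
from statistics import mean

# Hypothetical per-question confidence scores for two product ids
scores = {
    1: [0.62, 0.11, 0.45, 0.30],
    2: [0.05, 0.02, 0.08, 0.04],
}

# One averaged confidence score per product, mirroring the groupby-mean
avg_score = {pid: round(mean(vals), 4) for pid, vals in scores.items()}
print(avg_score)  # {1: 0.37, 2: 0.0475}
```

A higher average suggests the description answers the kinds of questions a buyer might ask; a low one suggests the model had to guess.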
Putting it all together
And finally, we can call the methods above in the main block, along with the questions we prepared for each product.
if __name__ == '__main__':
    questions = [
        "What is this product for?",
        "Why will it benefit me?",
        "What is it made from?",
        "What is special about this product?"
    ]
    data = get_data()
    results = get_eqa_scores(data, questions)
    print(results.head(10))
Entire code for Extractive Question Answering Project
I’m also including a link to the GitHub repository, where you can check this project out. Here is the entire code for the project as well.
import os
import pandas as pd
from transformers import pipeline
from kaggle.api.kaggle_api_extended import KaggleApi
from tqdm import tqdm
import warnings

warnings.simplefilter('ignore')

ROOT = os.path.dirname(__file__)


def get_data():
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_file(
        'cclark/product-item-data',
        file_name='sample-data.csv',
        path=ROOT
    )
    df = pd.read_csv(os.path.join(ROOT, 'sample-data.csv'))
    df.set_index('id', inplace=True)
    return df


def get_eqa_scores(data, questions):
    nlp = pipeline("question-answering")
    df_items = pd.DataFrame(columns=['product_id', 'question', 'answer', 'score'])
    for index, row in tqdm(data.iterrows()):
        text = row['description'][:512]
        for question in questions:
            result = nlp(question=question, context=text)
            item = {
                'product_id': index,
                'question': question,
                'answer': result['answer'],
                'score': round(result['score'], 4)
            }
            df_items = pd.concat([df_items, pd.DataFrame([item])], ignore_index=True)
    print(df_items.head(50))
    df_results = df_items.groupby('product_id').agg(
        avg_score=('score', 'mean')
    )
    return df_results


if __name__ == '__main__':
    questions = [
        "What is this product for?",
        "Why will it benefit me?",
        "What is it made from?",
        "What is special about this product?"
    ]
    data = get_data()
    results = get_eqa_scores(data, questions)
    print(results.head(10))
Conclusion
To conclude, we made a simple Python project for gauging how relevant product descriptions are to potential buyers, using an extractive question answering model from Hugging Face. I learned a lot while working on this project, and I hope you will find it helpful as well.