How To Store Embeddings Into Pinecone Vector Database
In this post, we’ll build a simple Python script that stores embeddings in a Pinecone vector database and queries them back. We’ll also use web scraping to get the contents of the Twitch terms of service, which will let us get quick, clear answers from the database without scrolling through the entire TOS text.
In case you’re not familiar with Pinecone, it’s a service that lets us store vector embeddings in a vector database. In other words, we can store the numeric representations of our text as vectors in a database.
This allows us to look up and return concrete data, without the risk of the GPT model making up false information. On top of that, we only need to generate the embeddings for a given document once, which matters for larger documents, where generating embeddings can cost a lot of money.
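To illustrate why storing text as vectors enables this kind of lookup: semantically similar texts get embeddings that point in similar directions, so we can rank stored chunks by cosine similarity to a query vector. Here is a minimal sketch with tiny made-up vectors (real OpenAI embeddings have 1,536 dimensions; the chunk names and values below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", purely for illustration.
query_vector = [0.9, 0.1, 0.0]
stored_chunks = {
    'chunk about account sharing': [0.8, 0.2, 0.1],
    'chunk about payment terms': [0.1, 0.2, 0.9],
}

# Rank stored chunks by similarity to the query, highest first.
ranked = sorted(stored_chunks,
                key=lambda name: cosine_similarity(query_vector, stored_chunks[name]),
                reverse=True)
print(ranked[0])  # the chunk whose vector is most similar to the query
```

Pinecone performs this kind of similarity search for us at scale, so we never have to compute it by hand.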
Importing libraries and getting data
The first thing we need to do, as in any other Python project, is import the necessary libraries: Beautiful Soup for the web scraping task, httplib2 for fetching pages, tqdm for progress bars, Pinecone for connecting to the actual service, and OpenAI for generating embeddings.
In case you haven’t installed these packages yet, you can do so with the following pip commands:
pip install beautifulsoup4
pip install httplib2
pip install tqdm
pip install pinecone-client
pip install openai
import os
import json
import httplib2
from bs4 import BeautifulSoup
from tqdm import tqdm
import pinecone
import openai
from openai.embeddings_utils import get_embedding
To use the OpenAI and Pinecone APIs, we need their access tokens, which we can get from our accounts on those services. I made a simple function that fetches these tokens from a local JSON file, so they aren’t exposed inside the script in case you want to share it with other people.
ROOT = os.path.dirname(__file__)
def get_token(token_name):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
    token = auth_data[token_name]
    return token
openai.api_key = get_token('openai-token')
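For reference, the auth.json file this function expects looks something like the following, based on how the tokens are read later in the script (all values here are placeholders you’d replace with your own):

```json
{
    "openai-token": "your-openai-api-key",
    "pinecone-tokens": {
        "api_key": "your-pinecone-api-key",
        "environment": "your-pinecone-environment"
    }
}
```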
Next, we’ll write a simple function that fetches the HTML markup of the webpage we want to scrape.
def get_soup(url):
    http = httplib2.Http()
    status, response = http.request(url)
    soup = BeautifulSoup(response, 'html.parser')
    return soup
Now we’re ready to scrape the Twitch terms of service webpage and get the text inside the paragraph tags.
def extract_twitch_tos():
    url = 'https://www.twitch.tv/p/en/legal/terms-of-service/'
    soup = get_soup(url)
    text_chunks = []
    try:
        content = soup.find(class_='legal-content')
        p_tags = content.find_all('p')
        for p in p_tags:
            # Remove links so only the surrounding text remains
            for a in p.find_all('a'):
                a.extract()
            text = p.get_text(strip=True)
            # Skip very short paragraphs such as headings and notes
            if len(text.split()) >= 10:
                text_chunks.append(text)
    except AttributeError:
        # content is None when the expected class is not found on the page
        print('Scraping failed.')
        return
    return text_chunks
Generating embeddings and saving them in Pinecone vector database
In the following part of this tutorial, we’ll generate embedding vectors from the text we scraped above. For this task, we’ll use the text-embedding-ada-002 model, which is also one of the most popular ones.
def get_tos_embeddings():
    text_chunks = extract_twitch_tos()
    if not text_chunks:
        print('Failed to retrieve text chunks...')
        return
    chunks_with_embeddings = []
    for chunk in tqdm(text_chunks):
        embedding = get_embedding(chunk, engine='text-embedding-ada-002')
        chunks_with_embeddings.append({'text': chunk, 'embedding': embedding})
    return chunks_with_embeddings
Now that we have the vector embeddings, we need to store them in Pinecone. For this to work, you’ll need to grab the environment and API key values from the API Keys tab inside the Pinecone dashboard.
chunks_with_embeddings = get_tos_embeddings()
pinecone_tokens = get_token('pinecone-tokens')
pinecone.init(
    api_key=pinecone_tokens['api_key'],
    environment=pinecone_tokens['environment']
)
index_name = 'twitch-tos'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
index = pinecone.Index(index_name)
batch_size = 64
for i in tqdm(range(0, len(chunks_with_embeddings), batch_size)):
    i_end = min(i + batch_size, len(chunks_with_embeddings))
    data_batch = chunks_with_embeddings[i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    embeds = [item['embedding'] for item in data_batch]
    # Store each chunk's text as a single-element list in the metadata
    meta = [{'text': [item['text']]} for item in data_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    index.upsert(vectors=list(to_upsert))
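If you want to sanity-check what each batch contains before sending anything to Pinecone, the same batching pattern can be run on dummy data. The chunk texts and embeddings below are made up; the point is to show the (id, values, metadata) tuples that each upsert call receives:

```python
# The batching pattern from above, applied to dummy data so it's easy
# to inspect what ends up in each upsert call.
chunks_with_embeddings = [
    {'text': f'chunk {n}', 'embedding': [float(n)] * 4} for n in range(5)
]
batch_size = 2
batches = []
for i in range(0, len(chunks_with_embeddings), batch_size):
    i_end = min(i + batch_size, len(chunks_with_embeddings))
    data_batch = chunks_with_embeddings[i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    embeds = [item['embedding'] for item in data_batch]
    meta = [{'text': [item['text']]} for item in data_batch]
    # Each vector is an (id, values, metadata) tuple, the format
    # that pinecone-client accepts for Index.upsert.
    batches.append(list(zip(ids_batch, embeds, meta)))

print(len(batches))       # 5 chunks with batch_size 2 -> 3 batches
print(batches[-1][0][0])  # id of the final vector: '4'
```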
Fetching data from Pinecone vector database
Lastly, we need three functions: one that searches for data inside the vector database, one that constructs a prompt from the results, and one that uses a GPT model to give us the answer.
def search_docs(query):
    xq = openai.Embedding.create(input=query, model='text-embedding-ada-002')['data'][0]['embedding']
    res = index.query([xq], top_k=5, include_metadata=True)
    return res['matches']
def construct_prompt(query):
    matches = search_docs(query)
    chosen_text = []
    for match in matches:
        # The metadata text was stored as a single-element list
        chosen_text.append(match['metadata']['text'][0])
    prompt = ("Answer the question as truthfully as possible using the context "
              "below, and if the answer is not within the context, say 'I don't know'.")
    prompt += "\n\n"
    prompt += "Context: " + "\n".join(chosen_text)
    prompt += "\n\n"
    prompt += "Question: " + query
    prompt += "\n"
    prompt += "Answer:"
    return prompt
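To make the prompt format concrete, here is a standalone sketch that fills the same template with a couple of invented context snippets instead of real Pinecone matches (the snippet texts are made up for illustration):

```python
# Build the same prompt layout with made-up context, just to show
# what the GPT model ends up receiving.
chosen_text = [
    'You may not share access to your account with anyone else.',
    'You agree to follow the Community Guidelines.',
]
query = "What can't I do on Twitch?"

prompt = ("Answer the question as truthfully as possible using the context "
          "below, and if the answer is not within the context, say 'I don't know'.")
prompt += "\n\nContext: " + "\n".join(chosen_text)
prompt += "\n\nQuestion: " + query
prompt += "\nAnswer:"
print(prompt)
```

The model sees the retrieved chunks as plain text above the question, which is what keeps its answers grounded in the scraped document.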
def answer_question(query):
    prompt = construct_prompt(query)
    res = openai.Completion.create(
        prompt=prompt,
        model='text-davinci-003',
        max_tokens=500,
        temperature=0.0
    )
    return res.choices[0].text
Finally, we’re ready to start retrieving data from the Pinecone vector database by asking it questions. This will let us quickly find the information we’re looking for, without scrolling through the entire terms of service.
print(answer_question("What can't I do on Twitch?"))
For the question asked above, the GPT model returns the following answer.
You may not sell, rent, lease, transfer, share, or provide access to your account to anyone else, including without limitation, charging anyone for access to administrative rights on your account. You may not Simulcast on any other “Twitch-like Service” without advance written permission from Twitch. You agree that you will not violate Twitch’s Community Guidelines.
Here is also the code for the entire project.
import os
import json
import httplib2
from bs4 import BeautifulSoup
from tqdm import tqdm
import pinecone
import openai
from openai.embeddings_utils import get_embedding

ROOT = os.path.dirname(__file__)

def get_token(token_name):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
    token = auth_data[token_name]
    return token

openai.api_key = get_token('openai-token')

def get_soup(url):
    http = httplib2.Http()
    status, response = http.request(url)
    soup = BeautifulSoup(response, 'html.parser')
    return soup

def extract_twitch_tos():
    url = 'https://www.twitch.tv/p/en/legal/terms-of-service/'
    soup = get_soup(url)
    text_chunks = []
    try:
        content = soup.find(class_='legal-content')
        p_tags = content.find_all('p')
        for p in p_tags:
            # Remove links so only the surrounding text remains
            for a in p.find_all('a'):
                a.extract()
            text = p.get_text(strip=True)
            # Skip very short paragraphs such as headings and notes
            if len(text.split()) >= 10:
                text_chunks.append(text)
    except AttributeError:
        # content is None when the expected class is not found on the page
        print('Scraping failed.')
        return
    return text_chunks

def get_tos_embeddings():
    text_chunks = extract_twitch_tos()
    if not text_chunks:
        print('Failed to retrieve text chunks...')
        return
    chunks_with_embeddings = []
    for chunk in tqdm(text_chunks):
        embedding = get_embedding(chunk, engine='text-embedding-ada-002')
        chunks_with_embeddings.append({'text': chunk, 'embedding': embedding})
    return chunks_with_embeddings

chunks_with_embeddings = get_tos_embeddings()

pinecone_tokens = get_token('pinecone-tokens')
pinecone.init(
    api_key=pinecone_tokens['api_key'],
    environment=pinecone_tokens['environment']
)

index_name = 'twitch-tos'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
index = pinecone.Index(index_name)

batch_size = 64
for i in tqdm(range(0, len(chunks_with_embeddings), batch_size)):
    i_end = min(i + batch_size, len(chunks_with_embeddings))
    data_batch = chunks_with_embeddings[i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    embeds = [item['embedding'] for item in data_batch]
    # Store each chunk's text as a single-element list in the metadata
    meta = [{'text': [item['text']]} for item in data_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    index.upsert(vectors=list(to_upsert))

def search_docs(query):
    xq = openai.Embedding.create(input=query, model='text-embedding-ada-002')['data'][0]['embedding']
    res = index.query([xq], top_k=5, include_metadata=True)
    return res['matches']

def construct_prompt(query):
    matches = search_docs(query)
    chosen_text = []
    for match in matches:
        # The metadata text was stored as a single-element list
        chosen_text.append(match['metadata']['text'][0])
    prompt = ("Answer the question as truthfully as possible using the context "
              "below, and if the answer is not within the context, say 'I don't know'.")
    prompt += "\n\n"
    prompt += "Context: " + "\n".join(chosen_text)
    prompt += "\n\n"
    prompt += "Question: " + query
    prompt += "\n"
    prompt += "Answer:"
    return prompt

def answer_question(query):
    prompt = construct_prompt(query)
    res = openai.Completion.create(
        prompt=prompt,
        model='text-davinci-003',
        max_tokens=500,
        temperature=0.0
    )
    return res.choices[0].text

print(answer_question("What can't I do on Twitch?"))
Conclusion
To conclude, we made a simple script that scrapes Twitch’s terms of service page and saves vector embeddings of the scraped text into a Pinecone vector database. I learned a lot while working on this project and I hope it proves helpful to you as well.