Embeddings Use Cases

Introduction

When I first heard about text embeddings, I was perplexed. My tech lead handed me an 800-page Natural Language Processing textbook, but I did not finish reading it. Now that OpenAI models make embeddings easy to obtain, it’s time to learn them and put them to broad use on tasks that were hard to accomplish with human labor or simple data analysis.

Let’s start exploring:

Clustering
Search
Recommendations
Classification
Anomaly detection

Goals

The ultimate goal is to become familiar with the framework of leveraging embeddings to accomplish practical tasks. The best way to learn and improve is to first get your hands dirty with the data and tools. Optimization and deep dives follow naturally later.

Summarize key terminology, concepts, and usage of text embeddings
Build the basic framework and standardized IPython notebooks for each use case that could benefit from text embeddings
Brainstorm business use cases and ideas for improvements

What do you need to know about embeddings?

An embedding is a vector (list) of floating point numbers, representing a text string.
The distance between two embeddings (vectors) measures their relatedness: the smaller the distance, the more closely related the texts are, and vice versa.
Dimensionality reduction methods:
- t-SNE (t-distributed Stochastic Neighbor Embedding), a non-linear dimension reduction method. The algorithm calculates pairwise similarities in both the high-dimensional space and the low-dimensional space, then minimizes the difference between the two sets of similarities using an optimization method such as gradient descent. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008. Here’s a brief overview of what t-SNE does and how it works.
- PCA (Principal Component Analysis), a linear dimension reduction method that maps the data linearly from the high-dimensional space into a low-dimensional space while retaining as much of the variance of the data as possible.
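
To make the idea of distance between embeddings concrete, here is a minimal sketch using cosine distance. Cosine distance is one common choice; the text above does not say which metric is meant, so treat that as an assumption. The three short vectors are made-up stand-ins for real embeddings, and the helper names `cosine_similarity` and `cosine_distance` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Smaller values mean the two vectors point in more similar directions."""
    return 1.0 - cosine_similarity(a, b)


# Toy 3-dimensional vectors standing in for real embeddings (made up for illustration)
great_film = [0.9, 0.1, 0.0]
superb_movie = [0.8, 0.2, 0.1]
tax_form = [0.0, 0.1, 0.9]

print(cosine_distance(great_film, superb_movie))  # small: related texts
print(cosine_distance(great_film, tax_form))      # large: unrelated texts
```

With real embeddings the vectors are much longer (1536 values for text-embedding-ada-002), but the calculation is the same.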

Use Cases

Clustering

Can we identify clusters among movie reviews, and the theme of each cluster? Reading through every review to group them by hand would be difficult. With AI we can do it, and I’ll demo it here. Let’s use the Rotten Tomatoes dataset. To obtain text embeddings, let’s use OpenAI’s embeddings API with the recommended model, text-embedding-ada-002. Note: all the code and data are open to the public.

# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage
# Note: this walkthrough uses the legacy (pre-1.0) openai-python interface
import os
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]

import pandas as pd
import numpy as np
import tiktoken
import sys
from typing import List, Optional
from sklearn.manifold import TSNE
from ast import literal_eval

# Helper that requests the embedding for a single text string
def get_embedding(text: str, engine="text-similarity-davinci-001", **kwargs) -> List[float]:

    # Replace newlines, which can negatively affect performance
    text = text.replace("\n", " ")

    return openai.Embedding.create(input=[text], engine=engine, **kwargs)["data"][0]["embedding"]

# Embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
df = dataset['train'].to_pandas()
print("{x} rows in df.".format(x=len(df)))

This is the sample data. The label is either 1 for positive sentiment or 0 for negative sentiment for the movie reviews.

text	label
the rock is destined to be the 21st century’s …	1
the gorgeously elaborate continuation of " the…	1

Here, we get embeddings for the last 5,000 movie reviews in the training set.

top_n = 5000

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df['text'].apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)

# get embeddings and save them
df["embedding"] = df['text'].apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("./rotten_tomatoes_with_embeddings_{x}.csv".format(x=top_n))

Apply the k-means algorithm to the embeddings to identify clusters.

import numpy as np
from sklearn.cluster import KMeans

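# Reload the saved embeddings; the CSV stores each vector as a string,
# so parse it back into a list before stacking everything into a 2D array
df = pd.read_csv("./rotten_tomatoes_with_embeddings_{x}.csv".format(x=top_n), index_col=0)
matrix = np.vstack(df.embedding.apply(literal_eval).to_list())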

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
df['cluster'] = kmeans.labels_

df.groupby("cluster").label.mean().sort_values()

We can see that the mean label value varies by cluster, suggesting that the clustering separates sentiment, especially between cluster 2 (lowest), clusters 1, 0, and 4, and cluster 3 (highest).

Cluster	Mean
2	0.010241
1	0.091311
0	0.099548
4	0.113420
3	0.480874
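
Because the label is 0 or 1, the mean label of a cluster is simply the share of its reviews that are positive. The sketch below illustrates that reading on a tiny hand-made table (the eight rows are invented for illustration, not taken from the dataset), and also shows the raw counts per cluster, which are worth checking alongside the mean.

```python
import pandas as pd

# Hand-made toy data (an assumption for illustration): 8 reviews in 2 clusters
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "label":   [0, 0, 0, 1, 1, 1, 0, 1],
})

# Mean of a 0/1 label = fraction of positive reviews in the cluster
summary = toy.groupby("cluster").label.agg(positive_share="mean", n_reviews="size")
print(summary)

# Raw counts of negative (0) and positive (1) reviews per cluster
counts = pd.crosstab(toy["cluster"], toy["label"])
print(counts)
```

Here cluster 0 has one positive review out of four, so its mean label is 0.25, while cluster 1 has three out of four, giving 0.75.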

To visualize the clusters, use t-SNE to project the 1536-dimensional embeddings down to 2D.

from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px

tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200)
vis_dims2 = tsne.fit_transform(matrix)

# Define the color map for the clusters
color_map = {
    0: 'firebrick',  # Cluster 0
    1: 'orangered',  # Cluster 1
    2: 'gold',       # Cluster 2
    3: 'limegreen',  # Cluster 3
    4: 'maroon',     # Cluster 4
}

# Map the cluster labels to colors
df['color'] = df['cluster'].map(color_map)

# Convert cluster to a categorical type with the specified order
df['cluster'] = pd.Categorical(df['cluster'], categories=[0, 1, 2, 3, 4], ordered=True)

# Create a DataFrame from the t-SNE results
df_tsne = pd.DataFrame({'Dimension 1': vis_dims2[:, 0], 
                        'Dimension 2': vis_dims2[:, 1], 
                        'Cluster': df['cluster'],
                        'Color': df['color']})

# Create the 2D scatter plot using Plotly Express
fig = px.scatter(
    df_tsne, x='Dimension 1', y='Dimension 2',
    color='Cluster', labels={'Cluster': 'Cluster'},
    title='Clusters identified visualized in 2D using t-SNE',
    color_discrete_map=color_map # Apply the color map
)

# Shrink the markers so dense regions stay readable, then tidy up the legend
fig.update_traces(marker=dict(size=3), selector=dict(mode='markers'))
fig.update_layout(
    legend=dict(
        traceorder='normal',
        title_text='Cluster',
        title_font=dict(size=14)
    )
)


# Show the plot
fig.show()

2D visualization from t-SNE

We can identify the clusters by the differently colored dense cores.
2D visualization of clusters

2D visualization from PCA

It’s exploratory, so it’s worthwhile to see the PCA 2D results too. We label positive reviews green and negative reviews red.
2D visualization of clusters
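
The PCA projection is not shown in code here, so the following is a minimal sketch of how it could be produced with scikit-learn's `PCA`. It runs on synthetic data (two Gaussian blobs standing in for the real embedding matrix and labels, an assumption made so the example works without an API key); the names `toy_matrix`, `toy_labels`, and `coords` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the embedding matrix (an assumption for illustration):
# 100 "positive" and 100 "negative" points in 50 dimensions
rng = np.random.default_rng(42)
positive = rng.normal(loc=0.5, scale=1.0, size=(100, 50))
negative = rng.normal(loc=-0.5, scale=1.0, size=(100, 50))
toy_matrix = np.vstack([positive, negative])
toy_labels = np.array([1] * 100 + [0] * 100)

# Project linearly onto the two directions of greatest variance
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(toy_matrix)

print(coords.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component

# Color convention from the text: positive reviews green, negative reviews red
colors = np.where(toy_labels == 1, "green", "red")
```

With the real data, `toy_matrix` would be replaced by the embedding matrix and `toy_labels` by the `label` column, and `coords` could be plotted with the same scatter-plot code used for t-SNE.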

3D visualization from t-SNE

Sometimes it’s helpful to look for clusters in a 3D visualization, as there might be more than two major factors that matter in separating the movie reviews. You may wonder why the movie reviews in each cluster are grouped together. In the next section, we will see that the green cluster holds the positive reviews and the other four are quite negative.
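
As a rough sketch of the 3D projection, the same t-SNE settings used for the 2D plot can be reused with `n_components=3`. The example below runs on synthetic blobs rather than the real embeddings (an assumption so it works without an API key), and the names `toy_matrix`, `toy_clusters`, and `vis_dims3` are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the embedding matrix (an assumption for illustration):
# three Gaussian blobs of 50 points each in 20 dimensions
rng = np.random.default_rng(42)
centers = rng.normal(scale=5.0, size=(3, 20))
toy_matrix = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])
toy_clusters = np.repeat([0, 1, 2], 50)

# Same settings as the 2D projection, except for three output dimensions
tsne_3d = TSNE(n_components=3, perplexity=15, random_state=42, init="random", learning_rate=200)
vis_dims3 = tsne_3d.fit_transform(toy_matrix)

print(vis_dims3.shape)
```

The three columns of `vis_dims3` can then be passed to a 3D scatter plot, for example `plotly.express.scatter_3d`, colored by cluster in the same way as the 2D figure.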

Cluster theme

Furthermore, we can leverage the model text-davinci-003 to summarize the theme for each cluster of movie reviews on Rotten Tomatoes.

Let’s summarize the clusters, their mean sentiment, and their themes. Cluster 2 is very negative, expressing disappointment with the movies. In contrast, cluster 3 is the most positive (about half of its reviews are labeled positive), with praise. Finally, clusters 1, 0, and 4 are also closer to negative reviews, but their mean sentiment is not as low as that of cluster 2.

Cluster	Mean Sentiment	Theme
2	0.010241	Disappointed with the quality of the movie.
1	0.091311	The reviews are all negative and critical of the movie.
0	0.099548	All of the reviews are negative and express dissatisfaction with the product or experience.
4	0.113420	Disappointment with the quality of the product or experience.
3	0.480874	All of the reviews are positive and praise the movie for its unique qualities, such as its surreal sense of humor, technological finish, insightful writing, delicate performances, and character-driven storytelling.

Cluster 0 Theme:  All of the reviews are negative and express dissatisfaction with the product or experience.
1, now as a former gong show addict , i'll admit it , my only complaint i
0, there's just no currency in deriding james bond for being a clichéd , 
0, ecks this one off your must-see list .
0, skip this turd and pick your nose instead because you're sure to get m
0, this movie is about the worst thing chan has done in the united states
----------------------------------------------------------------------------------------------------
Cluster 1 Theme:  The reviews are all negative and critical of the movie.
0, the type of dumbed-down exercise in stereotypes that gives the [teen c
0, just not campy enough
0, not so much farcical as sour .
0, a one-trick pony whose few t&a bits still can't save itself from being
0, cuba gooding jr . valiantly mugs his way through snow dogs , but even 
----------------------------------------------------------------------------------------------------
Cluster 2 Theme:  Disappointed with the quality of the movie.
0, its generic villains lack any intrigue ( other than their funny accent
0, one of the most highly-praised disappointments i've had the misfortune
0, . . . with the candy-like taste of it fading faster than 25-cent bubbl
0, for all its impressive craftsmanship , and despite an overbearing seri
0, the script by vincent r . nebrida . . . tries to cram too many ingredi
----------------------------------------------------------------------------------------------------
Cluster 3 Theme:  All of the reviews are positive and praise the movie for its unique qualities, such as its surreal sense of humor, technological finish, insightful writing, delicate performances, and character-driven storytelling.
1, what elevates the movie above the run-of-the-mill singles blender is i
0, at least it's a fairly impressive debut from the director , charles st
1, insightfully written , delicately performed
1, one of those exceedingly rare films in which the talk alone is enough 
1, a stylish but steady , and ultimately very satisfying , piece of chara
----------------------------------------------------------------------------------------------------
Cluster 4 Theme:  Disappointment with the quality of the product or experience.
0, qualities that were once amusing are becoming irritating .
0, everything's serious , poetic , earnest and -- sadly -- dull .
1, what's infuriating about full frontal is that it's too close to real l
0, befuddled in its characterizations as it begins to seem as long as the
0, human nature talks the talk , but it fails to walk the silly walk that
----------------------------------------------------------------------------------------------------

# Sample a few reviews from each cluster and ask the model for their common theme
rev_per_cluster = n_clusters  # number of reviews sampled per cluster (here, 5)

for i in range(n_clusters):
    print(f"Cluster {i} Theme:", end=" ")

    reviews = "\n".join(
        df[df['cluster'] == i]
        .text.str.replace("Title: ", "")
        .str.replace("\n\nContent: ", ":  ")
        .sample(rev_per_cluster, random_state=42)
        .values
    )
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f'What do the following customer reviews have in common?\n\nCustomer reviews:\n"""\n{reviews}\n"""\n\nTheme:',
        temperature=0,
        max_tokens=64,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response["choices"][0]["text"].replace("\n", ""))

    sample_cluster_rows = df[df.cluster == i].sample(rev_per_cluster, random_state=42)
    for j in range(rev_per_cluster):
        print(sample_cluster_rows.label.values[j], end=", ")
        print(sample_cluster_rows.text.str[:70].values[j])

    print("-" * 100)

Insight

The Rotten Tomatoes movie reviews can be broadly separated into 2 categories: positive reviews (cluster 3) and negative reviews (clusters 2, 1, 0, and 4). With the 2D/3D visuals, we can zoom in and see that the green positive reviews can be distinguished from the other four negative review clusters.

Further Exploration

Other use cases and their application

Learning & Thought

Clustering Business Use Cases

[Assistant] Categorization tasks
- Categorize documents (Doc2Vec)
- Identify collaborators
- Discover mood patterns from notes & diary
- Organize bookmarks
[Personalization]
- Understand user behavior - recommendations, search, mood
[Operations]
- Identify fraudulent users
- Triage and tag tickets or reports
