Find Keyword Cannibalization Using OpenAI’s Text Embeddings With Examples

Jul 26, 2024

This new series of articles focuses on working with LLMs to scale your SEO tasks. We hope to help you integrate AI into SEO so you can level up your skills.

We hope you enjoyed the previous article and understand what vectors, vector distance, and text embeddings are.

Following this, it’s time to flex your “AI knowledge muscles” by learning how to use text embeddings to find keyword cannibalization.

We will start with OpenAI’s text embeddings and compare them.

Model | Dimensionality | Pricing | Notes
text-embedding-ada-002 | 1536 | $0.10 per 1M tokens | Great for most use cases.
text-embedding-3-small | 1536 | $0.002 per 1M tokens | Faster and cheaper, but less accurate.
text-embedding-3-large | 3072 | $0.13 per 1M tokens | More accurate for complex, long text-related tasks; slower.

(*Tokens can be thought of as words.)
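If you want to get a feel for the models before building the full script later in this article, a minimal sketch like the one below requests an embedding for a single title and lets you swap model names. It assumes the legacy openai Python package (version below 1.0), which is also what the full script uses; the API key and title are placeholders.

import openai

openai.api_key = "your-api-key-goes-here"  # placeholder - use your own key

# Swap the engine for "text-embedding-ada-002" or "text-embedding-3-large" to compare models
response = openai.Embedding.create(
    input=["14 Most Important Meta And HTML Tags You Need To Know For SEO"],
    engine="text-embedding-3-small"
)

vector = response["data"][0]["embedding"]
print(len(vector))  # 1536 for ada-002 and 3-small, 3072 for 3-large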

But before we start, you need to install Python and Jupyter on your computer.

Jupyter is a web-based tool for professionals and researchers. It allows you to perform complex data analysis and machine learning model development using any programming language.

Don’t worry – it’s really easy and takes little time to finish the installations. And remember, ChatGPT is your friend when it comes to programming.

In a nutshell:

  • Download and install Python.
  • Open your Windows command line or terminal on Mac.
  • Type these commands: pip install jupyterlab and pip install notebook.
  • Run Jupyter with this command: jupyter lab.

We will use Jupyter to experiment with text embeddings; you’ll see how fun it is to work with!

But before we start, you must sign up for OpenAI’s API and set up billing by topping up your balance.

OpenAI API billing settings

Once you’ve done that, set up email notifications to inform you when your spending exceeds a certain amount under Usage limits.

Then, get your API keys under Dashboard > API keys, which you should keep private and never share publicly.

OpenAI API keys
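A small, optional precaution while experimenting: rather than pasting the key directly into notebooks, you can read it from an environment variable. This is a minimal sketch assuming the same legacy openai package used later in this article; OPENAI_API_KEY is simply a variable name you set yourself in your shell.

import os
import openai

# Read the key from an environment variable instead of hardcoding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]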

Now, you have all the necessary tools to start playing with embeddings.

  • Open your computer’s command terminal and type jupyter lab.
  • You should see something like the image below pop up in your browser.
  • Click on Python 3 under Notebook.
Jupyter Lab

In the window that opens, you will write your code.

As a small task, let’s group similar URLs from a CSV. The sample CSV has two columns: URL and Title. Our script’s task will be to group URLs with similar semantic meanings based on the title so that we can consolidate those pages into one and fix keyword cannibalization issues.
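To make the input concrete, a hypothetical pages.csv could look like the rows below. The URLs and the last title are invented for illustration; the first two titles reappear in the comparison tables later in this article.

URL,Title
https://example.com/meta-html-tags-guide,14 Most Important Meta And HTML Tags You Need To Know For SEO
https://example.com/meta-tags-seo,Meta Tags: What You Need To Know For SEO
https://example.com/h1-tag-importance,How Important Is The H1 Tag For SEO?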

Here are the steps you need to follow:

Install the required Python libraries with the following command in your PC’s terminal (or in the Jupyter notebook):

pip install pandas openai scikit-learn numpy unidecode

The ‘openai’ library is required to interact with the OpenAI API to get embeddings, and ‘pandas’ is used for data manipulation and handling CSV file operations.

The ‘scikit-learn’ library is necessary for calculating cosine similarity, and ‘numpy’ is essential for numerical operations and handling arrays. Lastly, ‘unidecode’ is used to clean text.
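If cosine similarity still feels abstract, this tiny example shows what scikit-learn’s cosine_similarity returns. The vectors are made up and only three-dimensional, whereas real embeddings have 1,536 or 3,072 dimensions.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])

# Each cell [i, j] is the cosine similarity between vector i and vector j;
# the diagonal is always 1.0 (every vector is identical to itself).
print(cosine_similarity(vectors))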

Then, download the sample sheet as a CSV, rename the file to pages.csv, and upload it to your Jupyter folder where your script is located.

https://www.searchenginejournal.com/wp-content/uploads/2024/06/group-urls-using-open-ai-text-embedding.mp4

Set your OpenAI API key to the key you obtained in the step above, and copy-paste the code below into the notebook.

Run the code by clicking the play triangle icon at the top of the notebook.

import pandas as pd
import openai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import csv
from unidecode import unidecode

# Function to clean text
def clean_text(text: str) -> str:
    # First, replace known problematic (mojibake) sequences with their correct equivalents
    replacements = {
        'â€“': '–',   # en dash
        'â€™': '’',   # right single quotation mark
        'â€œ': '“',   # left double quotation mark
        'â€˜': '‘',   # left single quotation mark
        'â€”': '—'    # em dash
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    # Then, use unidecode to transliterate any remaining problematic Unicode characters
    text = unidecode(text)
    return text

# Load the CSV file with UTF-8 encoding from the root folder of the Jupyter project
df = pd.read_csv('pages.csv', encoding='utf-8')

# Clean the 'Title' column to remove unwanted symbols
df['Title'] = df['Title'].apply(clean_text)

# Set your OpenAI API key
openai.api_key = 'your-api-key-goes-here'

# Function to get embeddings
def get_embedding(text):
    response = openai.Embedding.create(input=[text], engine="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Generate embeddings for all titles
df['embedding'] = df['Title'].apply(get_embedding)

# Create a matrix of embeddings
embedding_matrix = np.vstack(df['embedding'].values)

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embedding_matrix)

# Define similarity threshold: titles with cosine similarity of 0.9 or higher are grouped together
similarity_threshold = 0.9

# Create a list to store groups
groups = []

# Keep track of visited indices
visited = set()

# Group similar titles based on the similarity matrix
for i in range(len(similarity_matrix)):
    if i not in visited:
        # Find all similar titles
        similar_indices = np.where(similarity_matrix[i] >= similarity_threshold)[0]

        # Log comparisons
        print(f"\nChecking similarity for '{df.iloc[i]['Title']}' (Index {i}):")
        print("-" * 50)
        for j in range(len(similarity_matrix)):
            if i != j:  # Ensure that a title is not compared with itself
                similarity_value = similarity_matrix[i, j]
                comparison_result = 'greater' if similarity_value >= similarity_threshold else 'less'
                print(f"Compared with '{df.iloc[j]['Title']}' (Index {j}): similarity = {similarity_value:.4f} ({comparison_result} than threshold)")

        # Add these indices to visited
        visited.update(similar_indices)

        # Add the group to the list
        group = df.iloc[similar_indices][['URL', 'Title']].to_dict('records')
        groups.append(group)
        print(f"\nFormed Group {len(groups)}:")
        for item in group:
            print(f" - URL: {item['URL']}, Title: {item['Title']}")

# Check if groups were created
if not groups:
    print("No groups were created.")

# Define the output CSV file
output_file = 'grouped_pages.csv'

# Write the results to the CSV file with UTF-8 encoding
with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Group', 'URL', 'Title']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for group_index, group in enumerate(groups, start=1):
        for page in group:
            cleaned_title = clean_text(page['Title'])  # Ensure no unwanted symbols in the output
            writer.writerow({'Group': group_index, 'URL': page['URL'], 'Title': cleaned_title})
            print(f"Writing Group {group_index}, URL: {page['URL']}, Title: {cleaned_title}")

print(f"Output written to {output_file}")

This code reads the CSV file ‘pages.csv,’ containing titles and URLs, which you can easily export from your CMS or get by crawling a client website using Screaming Frog.

Then, it cleans the titles of non-UTF characters, generates embedding vectors for each title using OpenAI’s API, calculates the similarity between the titles, groups similar titles together, and writes the grouped results to a new CSV file, ‘grouped_pages.csv.’
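Once the script has finished, an optional quick check like the one below (not part of the script above) shows only the groups that contain more than one URL – the keyword cannibalization candidates.

import pandas as pd

grouped = pd.read_csv('grouped_pages.csv', encoding='utf-8')

# Keep only groups with more than one URL - these are the candidates for consolidation
candidates = grouped.groupby('Group').filter(lambda g: len(g) > 1)
print(candidates)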

In the keyword cannibalization task, we use a similarity threshold of 0.9, which means that if the cosine similarity is less than 0.9, we consider the articles different. Visualized in a simplified two-dimensional space, this looks like two vectors with an angle of about 25 degrees between them.


In your case, you may want to use a different threshold, such as 0.85 (approximately 31 degrees between the vectors), and run it on a sample of your data to evaluate the results and the overall quality of matches. If it is unsatisfactory, you can increase the threshold to make it more strict and improve precision.
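The angle figures above come straight from the inverse cosine; you can check the angle for any threshold you are considering with two lines of NumPy.

import numpy as np

print(np.degrees(np.arccos(0.90)))  # ~25.8 degrees
print(np.degrees(np.arccos(0.85)))  # ~31.8 degrees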

You can install ‘matplotlib’ via the terminal:

pip install matplotlib

Then use the Python code below in a separate Jupyter notebook to visualize cosine similarities in two-dimensional space on your own. Try it; it’s fun!

import matplotlib.pyplot as plt
import numpy as np

# Define the angle for a cosine similarity of 0.9. Change here to your desired value.
theta = np.arccos(0.9)

# Define the vectors
u = np.array([1, 0])
v = np.array([np.cos(theta), np.sin(theta)])

# Define the 45 degree rotation matrix
rotation_matrix = np.array([
    [np.cos(np.pi/4), -np.sin(np.pi/4)],
    [np.sin(np.pi/4), np.cos(np.pi/4)]
])

# Apply the rotation to both vectors
u_rotated = np.dot(rotation_matrix, u)
v_rotated = np.dot(rotation_matrix, v)

# Plot the vectors
plt.figure()
plt.quiver(0, 0, u_rotated[0], u_rotated[1], angles='xy', scale_units='xy', scale=1, color='r')
plt.quiver(0, 0, v_rotated[0], v_rotated[1], angles='xy', scale_units='xy', scale=1, color='b')

# Set the plot limits to positive ranges only
plt.xlim(0, 1.5)
plt.ylim(0, 1.5)

# Add labels and a grid
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.title('Visualization of Vectors with Cosine Similarity of 0.9')

# Show the plot
plt.show()

I usually use 0.9 and higher for identifying keyword cannibalization issues, but you may need to set it to 0.5 when dealing with old article redirects, as old articles may not have fresher, nearly identical counterparts – only partially similar ones.

In the case of redirects, it may also be better to embed the meta description concatenated with the title, rather than the title alone.

So, it depends on the task you are performing. We will review how to implement redirects in a separate article later in this series.
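As a rough sketch of that idea, you could build the embedding input from the title and meta description before calling get_embedding(). It assumes your export also contains a ‘Meta Description’ column, which is not in the sample CSV used above, and it reuses df, clean_text(), and get_embedding() from the script.

# 'Meta Description' is a hypothetical column name - adjust it to match your own CSV export
df['Meta Description'] = df['Meta Description'].fillna('').apply(clean_text)

# Concatenate title and meta description, then embed the combined text
df['embedding'] = (df['Title'] + '. ' + df['Meta Description']).apply(get_embedding)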

Now, let’s review the results with the three models mentioned above and see how well they were able to identify close articles from our data sample of MCP’s articles.

Data sample

From the list, we can already see that the 2nd and 4th articles cover the same topic of ‘meta tags.’ The articles in the 5th and 7th rows are pretty much the same – discussing the importance of H1 tags in SEO – and can be merged.

The article in the 3rd row doesn’t have any similarities with any of the articles in the list but shares common words like “Tag” or “SEO.”

The article in the 6th row is again about H1, but not exactly the same as H1’s importance to SEO. Instead, it represents Google’s opinion on whether they should match.

The articles in the 8th and 9th rows are quite close but still different; they could be combined.

text-embedding-ada-002

Using ‘text-embedding-ada-002,’ we precisely found the 2nd and 4th articles with a cosine similarity of 0.92 and the 5th and 7th articles with a similarity of 0.91.

Screenshot from the Jupyter log showing cosine similarities

And it generated output with grouped URLs, using the same group number for similar articles (colors are applied manually for visualization purposes).

Output sheet with grouped URLs

For the 2nd and 3rd articles, which share the common words “Tag” and “SEO” but are unrelated, the cosine similarity was 0.86. This shows why a high similarity threshold of 0.9 or greater is necessary. If we set it to 0.85, the output would be full of false positives and could suggest merging unrelated articles.

text-embedding-3-small

Quite surprisingly, ‘text-embedding-3-small’ didn’t find any matches at our similarity threshold of 0.9 or higher.

For the 2nd and 4th articles, the cosine similarity was 0.76, and for the 5th and 7th articles, it was 0.77.

To understand this model better through experimentation, I’ve added a slightly modified version of the 1st row, with ‘15’ vs. ‘14,’ to the sample.

  1. “14 Most Important Meta And HTML Tags You Need To Know For SEO”
  2. “15 Most Important Meta And HTML Tags You Need To Know For SEO”
An example which shows text-embedding-3-small results

In contrast, ‘text-embedding-ada-002’ gave a cosine similarity of 0.98 between those versions.

Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.92
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.76

Here, we can see that this model is not quite a good fit for comparing titles.

text-embedding-3-large

This model’s dimensionality is 3072, which is twice that of ‘text-embedding-3-small’ and ‘text-embedding-ada-002,’ both at 1536 dimensions.

As it has more dimensions than the other models, we could expect it to capture semantic meaning with higher precision.

However, it gave the 2nd and 4th articles a cosine similarity of 0.70 and the 5th and 7th articles a similarity of 0.75.

I’ve tested it again with slightly modified versions of the first article, with ‘15’ vs. ‘14’ and without ‘Most Important’ in the title.

  1. “14 Most Important Meta And HTML Tags You Need To Know For SEO”
  2. “15 Most Important Meta And HTML Tags You Need To Know For SEO”
  3. “14 Meta And HTML Tags You Need To Know For SEO”
Title 1 | Title 2 | Cosine Similarity
14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.95
14 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.93
14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.70
15 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.86

So we can see that ‘text-embedding-3-large’ underperforms compared to ‘text-embedding-ada-002’ when we calculate cosine similarities between titles.

Note that the accuracy of ‘text-embedding-3-large’ increases with the length of the text, but ‘text-embedding-ada-002’ still performs better overall.

Another approach could be to strip stop words from the text. Removing these can sometimes help focus the embeddings on more meaningful words, potentially improving the accuracy of tasks like similarity calculations.

The best way to determine whether removing stop words improves accuracy for your specific task and dataset is to empirically test both approaches and compare the results.
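A minimal way to run that experiment is to strip a small stop-word list from the titles before embedding them. The list below is deliberately tiny and only illustrative – libraries such as NLTK or spaCy provide fuller lists – and the helper function name is our own.

# An illustrative, deliberately small stop-word list - extend it or use an NLTK/spaCy list instead
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "you"}

def strip_stop_words(title: str) -> str:
    # Keep only the words that are not in the stop-word set
    return " ".join(word for word in title.split() if word.lower() not in STOP_WORDS)

print(strip_stop_words("14 Most Important Meta And HTML Tags You Need To Know For SEO"))
# -> "14 Most Important Meta HTML Tags Need Know SEO"

# To test the idea, embed the stripped titles instead:
# df['embedding'] = df['Title'].apply(strip_stop_words).apply(get_embedding)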

Conclusion

With these examples, you have learned how to work with OpenAI’s embedding models and can already perform a wide range of tasks.

For similarity thresholds, you need to experiment with your own datasets and see which thresholds make sense for your specific task by running them on smaller samples of data and performing a human review of the output.

Please note that the code in this article is not optimal for large datasets, since you need to create text embeddings of the articles every time there is a change in your dataset in order to evaluate them against the other rows.

To make it efficient, we must use vector databases and store the embedding information there once it is generated. We will cover how to use vector databases very soon and will update the code sample here to use one.
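Until then, a simple stopgap is to cache embeddings locally so that unchanged titles are not sent to the API again on every run. This is a minimal sketch only: the cache file name and helper function are our own additions, and it reuses get_embedding() from the script above.

import json
import os

CACHE_FILE = 'embedding_cache.json'  # local file name chosen for this example

# Load the cache if it already exists, otherwise start with an empty dictionary
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, encoding='utf-8') as f:
        cache = json.load(f)
else:
    cache = {}

def get_embedding_cached(text):
    # Only call the API for titles we have not embedded before
    if text not in cache:
        cache[text] = get_embedding(text)  # get_embedding() as defined in the script above
        with open(CACHE_FILE, 'w', encoding='utf-8') as f:
            json.dump(cache, f)
    return cache[text]

# Then use it in place of get_embedding():
# df['embedding'] = df['Title'].apply(get_embedding_cached)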

More resources: 

  • Avoiding Keyword Cannibalization Between Your Paid and Organic Search Campaigns
  • How Do I Stop Keyword Cannibalization When My Products Are All Similar?
  • Leveraging Generative AI Tools For SEO

Featured Image: BestForBest/Shutterstock
