This new series of articles focuses on working with LLMs to scale your SEO tasks. We hope to help you integrate AI into SEO so you can level up your skills.
We hope you enjoyed the previous article and understand what vectors, vector distance, and text embeddings are.
Following this, it’s time to flex your “AI knowledge muscles” by learning how to use text embeddings to find keyword cannibalization.
We will start with OpenAI’s text embeddings and compare them.
| Model | Dimensionality | Pricing | Notes |
|---|---|---|---|
| text-embedding-ada-002 | 1536 | $0.10 per 1M tokens | Great for most use cases. |
| text-embedding-3-small | 1536 | $0.002 per 1M tokens | Faster and cheaper, but less accurate. |
| text-embedding-3-large | 3072 | $0.13 per 1M tokens | More accurate for complex, long text-related tasks; slower. |
(*Tokens can roughly be considered as words.)
But before we start, you need to install Python and Jupyter on your computer.
Jupyter is a web-based tool for professionals and researchers. It allows you to perform complex data analysis and machine learning model development using any programming language.
Don’t worry – it’s really easy and takes little time to finish the installations. And remember, ChatGPT is your friend when it comes to programming.
In a nutshell:
- Download and install Python.
- Open your command prompt on Windows or terminal on Mac.
- Type these commands: pip install jupyterlab and pip install notebook.
- Run Jupyter with this command: jupyter lab.
We will use Jupyter to experiment with text embeddings; you’ll see how fun it is to work with!
But before we start, you must sign up for OpenAI’s API and set up billing by funding your balance.
Once you’ve done that, set up email notifications under Usage limits to alert you when your spending exceeds a certain amount.
Then, get your API keys under Dashboard > API keys, which you should keep private and never share publicly.
Now, you have all the necessary tools to start playing with embeddings.
- Open your computer’s command terminal and type jupyter lab.
- You should see something like the image below pop up in your browser.
- Click on Python 3 under Notebook.
In the window that opens, you will write your code.
As a small task, let’s group similar URLs from a CSV. The sample CSV has two columns: URL and Title. Our script’s task will be to group URLs with similar semantic meanings based on the title so we can consolidate those pages into one and fix keyword cannibalization issues.
Here are the steps you need to follow:
Install the required Python libraries with the following commands in your PC’s terminal (or in the Jupyter notebook):
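For example, a typical install looks like this (the exact commands may vary depending on how your Python environment is set up):

```
pip install openai
pip install pandas
pip install scikit-learn
pip install numpy
pip install unidecode
```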
The ‘openai’ library is required to interact with the OpenAI API to get embeddings, and ‘pandas’ is used for data manipulation and handling CSV file operations.
The ‘scikit-learn’ library is necessary for calculating cosine similarity, and ‘numpy’ is essential for numerical operations and handling arrays. Lastly, ‘unidecode’ is used to clean text.
Then, download the sample sheet as a CSV, rename the file to pages.csv, and upload it to your Jupyter folder where your script is located.
Set your OpenAI API key to the key you obtained in the step above, and copy-paste the code below into the notebook.
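Here is a minimal sketch of such a script. It assumes your API key is pasted in directly, that pages.csv has ‘URL’ and ‘Title’ columns, and it uses a simple grouping loop; adapt the model name, file names, and threshold to your own setup.

```python
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from unidecode import unidecode

# Paste your OpenAI API key here (keep it private and never share it publicly).
client = OpenAI(api_key="YOUR_API_KEY")

EMBEDDING_MODEL = "text-embedding-ada-002"  # or "text-embedding-3-small" / "text-embedding-3-large"
SIMILARITY_THRESHOLD = 0.9                  # pairs at or above this cosine similarity are grouped

# Read the CSV with 'URL' and 'Title' columns.
df = pd.read_csv("pages.csv")

# Clean titles of non-UTF/accented characters.
df["Title"] = df["Title"].astype(str).apply(unidecode)

def get_embedding(text):
    """Request an embedding vector for a piece of text from the OpenAI API."""
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return response.data[0].embedding

# Generate an embedding for each title.
embeddings = np.array([get_embedding(title) for title in df["Title"]])

# Calculate pairwise cosine similarity between all titles.
similarity_matrix = cosine_similarity(embeddings)

# Assign the same group number to titles whose similarity meets the threshold.
group_ids = [-1] * len(df)
current_group = 0
for i in range(len(df)):
    if group_ids[i] == -1:
        group_ids[i] = current_group
        for j in range(i + 1, len(df)):
            if group_ids[j] == -1 and similarity_matrix[i, j] >= SIMILARITY_THRESHOLD:
                group_ids[j] = current_group
        current_group += 1

df["Group"] = group_ids

# Write the grouped results to a new CSV file.
df.sort_values("Group").to_csv("grouped_pages.csv", index=False)
print("Grouped results saved to grouped_pages.csv")
```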
Run the code by clicking the play (triangle) icon at the top of the notebook.
This code reads the CSV file ‘pages.csv,’ containing titles and URLs, which you can easily export from your CMS or get by crawling a client website using Screaming Frog.
Then, it cleans the titles of non-UTF characters, generates embedding vectors for each title using OpenAI’s API, calculates the similarity between the titles, groups similar titles together, and writes the grouped results to a new CSV file, ‘grouped_pages.csv.’
For the keyword cannibalization task, we use a similarity threshold of 0.9, which means that if the cosine similarity is less than 0.9, we will consider the articles to be different. Visualized in a simplified two-dimensional space, that corresponds to two vectors with an angle of about 25 degrees between them.
In your case, you may want to use a different threshold, such as 0.85 (approximately 31 degrees between them), and run it on a sample of your data to evaluate the results and the overall quality of the matches. If it is unsatisfactory, you can increase the threshold to make it stricter for better precision.
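If you want to check what angle a given threshold corresponds to, it is simply the arccosine of the similarity value; for example:

```python
import numpy as np

# Convert cosine similarity thresholds into the angle (in degrees) between two vectors.
for threshold in (0.9, 0.85, 0.5):
    angle = np.degrees(np.arccos(threshold))
    print(f"cosine similarity {threshold} ~ {angle:.1f} degrees")

# cosine similarity 0.9 ~ 25.8 degrees
# cosine similarity 0.85 ~ 31.8 degrees
# cosine similarity 0.5 ~ 60.0 degrees
```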
You can install ‘matplotlib’ via the terminal.
Then, use the Python code below in a separate Jupyter notebook to visualize cosine similarities in two-dimensional space on your own. Try it; it’s fun!
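As one possible version of that visualization (a minimal sketch assuming you simply want to draw two unit vectors separated by the angle implied by a similarity value):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cosine_similarity(similarity):
    """Draw two unit vectors separated by the angle implied by a cosine similarity value."""
    angle = np.arccos(similarity)

    v1 = np.array([1.0, 0.0])                      # reference vector along the x-axis
    v2 = np.array([np.cos(angle), np.sin(angle)])  # second vector rotated by that angle

    plt.figure(figsize=(5, 5))
    plt.quiver(0, 0, v1[0], v1[1], angles="xy", scale_units="xy", scale=1, color="blue")
    plt.quiver(0, 0, v2[0], v2[1], angles="xy", scale_units="xy", scale=1, color="red")
    plt.xlim(-0.2, 1.2)
    plt.ylim(-0.2, 1.2)
    plt.gca().set_aspect("equal")
    plt.title(f"Cosine similarity {similarity} ~ {np.degrees(angle):.1f} degrees")
    plt.grid(True)
    plt.show()

# A 0.9 threshold corresponds to roughly a 26-degree angle between the vectors.
plot_cosine_similarity(0.9)
```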
I usually use 0.9 and higher for identifying keyword cannibalization issues, but you may need to lower it to 0.5 when dealing with old article redirects, as an old article may not have a nearly identical, fresher counterpart – only a partially close one.
In the case of redirects, it may also be better to embed the meta description concatenated with the title, rather than the title alone.
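As a hypothetical example, if your export also contains a ‘Meta Description’ column, that concatenation could look like this before you generate the embeddings:

```python
import pandas as pd

# Assumes your export has 'Title' and 'Meta Description' columns; adjust the names to your data.
df = pd.read_csv("pages.csv")

# Combine the title and meta description into one text field to give the embedding more context.
df["Text For Embedding"] = (
    df["Title"].fillna("") + ". " + df["Meta Description"].fillna("")
).str.strip()
```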
So, it depends on the task you are performing. We will review how to implement redirects in a separate article later in this series.
Now, let’s review the results for the three models mentioned above and see how well they were able to identify close articles from our data sample of MCP’s articles.
From the list, we can already see that the 2nd and 4th articles cover the same topic of ‘meta tags.’ The articles in the 5th and 7th rows are pretty much the same – discussing the importance of H1 tags in SEO – and can be merged.
The article in the 3rd row doesn’t have any similarities with any of the articles in the list, but it shares common words like “Tag” or “SEO.”
The article in the 6th row is again about H1s, but not exactly the same as H1’s importance for SEO. Instead, it represents Google’s opinion on whether they should match.
The articles in the 8th and 9th rows are quite close but still different; they can be combined.
text-embedding-ada-002
Using ‘text-embedding-ada-002,’ we accurately found the 2nd and 4th articles with a cosine similarity of 0.92, and the 5th and 7th articles with a similarity of 0.91.
It also generated output with grouped URLs, using the same group number for similar articles (colors are applied manually for visualization purposes).
For the 2nd and 3rd articles, which have the common words “Tag” and “SEO” but are unrelated, the cosine similarity was 0.86. This shows why a high similarity threshold of 0.9 or greater is necessary. If we set the threshold to 0.85, the output would be full of false positives and could suggest merging unrelated articles.
text-embedding-3-small
Quite surprisingly, ‘text-embedding-3-small’ didn’t find any matches at our similarity threshold of 0.9 or higher.
For the 2nd and 4th articles, the cosine similarity was 0.76, and for the 5th and 7th articles, it was 0.77.
To better understand this model through experimentation, I added a slightly modified version of the 1st title, with ‘15’ vs. ‘14,’ to the sample:
- “14 Most Important Meta And HTML Tags You Need To Know For SEO”
- “15 Most Important Meta And HTML Tags You Need To Know For SEO”
In contrast, ‘text-embedding-ada-002’ gave a cosine similarity of 0.98 between those versions, while ‘text-embedding-3-small’ produced the results below.

| Title 1 | Title 2 | Cosine Similarity |
|---|---|---|
| 14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.92 |
| 14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.76 |
Here, we see that this model is not quite a good fit for comparing titles.
text-embedding-3-large
This model’s dimensionality is 3072, which is two times higher than that of ‘text-embedding-3-small’ and ‘text-embedding-ada-002,’ both of which have a dimensionality of 1536.
Since it has more dimensions than the other models, we could expect it to capture semantic meaning with higher precision.
However, it gave the 2nd and 4th articles a cosine similarity of 0.70, and the 5th and 7th articles a similarity of 0.75.
I tested it again with slightly modified versions of the first title, with ‘15’ vs. ‘14’ and without ‘Most Important’ in the title:
- “14 Most Important Meta And HTML Tags You Need To Know For SEO”
- “15 Most Important Meta And HTML Tags You Need To Know For SEO”
- “14 Meta And HTML Tags You Need To Know For SEO”
| Title 1 | Title 2 | Cosine Similarity |
|---|---|---|
| 14 Most Important Meta And HTML Tags You Need To Know For SEO | 15 Most Important Meta And HTML Tags You Need To Know For SEO | 0.95 |
| 14 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.93 |
| 14 Most Important Meta And HTML Tags You Need To Know For SEO | Meta Tags: What You Need To Know For SEO | 0.70 |
| 15 Most Important Meta And HTML Tags You Need To Know For SEO | 14 Meta And HTML Tags You Need To Know For SEO | 0.86 |
So, we can see that ‘text-embedding-3-large’ underperforms compared to ‘text-embedding-ada-002’ when we calculate cosine similarities between titles.
I want to note that the accuracy of ‘text-embedding-3-large’ increases with the length of the text, but ‘text-embedding-ada-002’ still performs better overall.
Another approach could be to strip stop words from the text. Removing these can sometimes help focus the embeddings on more meaningful words, potentially improving the accuracy of tasks like similarity calculations.
The best way to determine whether removing stop words improves accuracy for your specific task and dataset is to empirically test both approaches and compare the results.
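As a sketch of that experiment (assuming, for example, scikit-learn’s built-in English stop-word list rather than a dedicated NLP library):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stop_words(text):
    """Drop words that appear in scikit-learn's English stop-word list."""
    return " ".join(
        word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS
    )

# Stop words such as 'most', 'and', 'you', 'to', and 'for' are removed from the title.
title = "14 Most Important Meta And HTML Tags You Need To Know For SEO"
print(remove_stop_words(title))
```

You would then generate embeddings for both the original and the cleaned titles and compare which version groups your pages more accurately.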
Conclusion
With these examples, you have learned how to work with OpenAI’s embedding models and can already perform a wide range of tasks.
For similarity thresholds, you need to experiment with your own datasets and see which thresholds make sense for your specific task by running the script on smaller samples of data and performing a human review of the output.
Please note that the code in this article is not optimal for large datasets, since you need to recreate the text embeddings of all articles every time there is a change in your dataset in order to evaluate them against the other rows.
To make it efficient, we must use vector databases and store the embedding information there once it is generated. We will cover how to use vector databases very soon, and we will change the code sample here to use a vector database.
More resources:
- Avoiding Keyword Cannibalization Between Your Paid and Organic Search Campaigns
- How Do I Stop Keyword Cannibalization When My Products Are All Similar?
- Leveraging Generative AI Tools For SEO
Featured Image: BestForBest/Shutterstock