How I Found Internal Linking Opportunities With Vector Embeddings

Oct 03, 2024 02:00 PM

What you’ll need to get started

To carry out this process, I used the following:

  • Screaming Frog
  • OpenAI API Key
  • Google Sheets or Excel

By the end, I had a comprehensive spreadsheet that included:

  • Every important URL from my site listed in column A (target URL)
  • The URL of each page that links to the target URL (excluding navigation)
  • URLs of the top 5 most closely related pages based on cosine similarity
  • Opportunities where one or more of those 5 URLs are not linking to the target URL

Example Spreadsheet

This is the example I used in the screenshots below, and it will look something like this:

Example spreadsheet showing internal link opportunities with URLs, related URLs, and missing links highlighted in pink cells.

Pink cells indicate where the related page doesn’t link to the target page.

Step 1: Get an OpenAI API key

I started by heading over to OpenAI’s website, clicked the button to create a new secret key, and copied that API key to use in Screaming Frog.

Step 2: Set up Screaming Frog

Next, I opened Screaming Frog and followed these steps:

  • Navigated to Configuration > Custom > Custom JavaScript.

This should show something like:
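Under the hood, the custom JavaScript snippet POSTs each page’s text to OpenAI’s embeddings endpoint. Here is a rough Python sketch of the same request; the model name text-embedding-3-small is my assumption, as the snippet may use a different model:

```python
import json
import urllib.request

def build_embedding_request(text: str, api_key: str,
                            model: str = "text-embedding-3-small"):
    """Build the HTTP request the custom JS sends to /v1/embeddings."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# Sending the request returns JSON whose data[0].embedding is the vector:
# with urllib.request.urlopen(build_embedding_request(page_text, key)) as resp:
#     vector = json.load(resp)["data"][0]["embedding"]
```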

Step 3: Export vector embeddings and each inlinks

Export “All Inlinks” from Screaming Frog

I started by exporting the “All Inlinks” data from Screaming Frog. This file contains every internal link on the site and can be quite large. For example, my file, all_inlinks.csv, was about 52 MB and represented 1,428 URLs.

Step 4: Create spreadsheets

I used Google Sheets for this tutorial, but you can follow the same process in Excel. If needed, you can adjust the formulas using ChatGPT as a guide.

Import the two files you exported from Screaming Frog

  • Import the all_inlinks.csv file into one sheet and file.csv into another.
  • You can use the same workbook, but remember that CSV files only save a single tab of data when exporting.

Clean the data

This part is essential. I had to remove errors from the vector embeddings, simplify the internal link data to the bare essentials, and rename a few columns.

Clean the Custom JS (i.e., vector embeddings) and save over file.csv

  • Sort the “(ChatGPT) Extract embeddings from page content” column from Z to A
  • Delete any row where that column is not a string of numbers (e.g., cells labeled “timeout” or “error”)
  • Verify all remaining URLs have a status code of 200 and remove any rows that don’t; then delete the “Status Code” and “Status” columns.
  • Rename the remaining columns to “URL” and “Embeddings” (capitalization matters).
  • Export this tab and save it as “file.csv.”
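If you’d rather script this cleanup, the filters above can be sketched in pandas. I’m assuming the crawl export names the URL column “Address” (check yours), and the regex for “a string of numbers” is deliberately loose:

```python
import pandas as pd

EMB_COL = "(ChatGPT) Extract embeddings from page content"

def clean_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    """Keep 200-status rows whose extraction returned numbers, not errors."""
    df = df[df["Status Code"] == 200]
    # drop rows labeled "timeout", "error", etc. -- keep only numeric strings
    numeric = df[EMB_COL].str.match(r"^[\[\]\d\.,\-\se]+$", na=False)
    df = df[numeric].rename(columns={"Address": "URL", EMB_COL: "Embeddings"})
    return df[["URL", "Embeddings"]]

# clean_embeddings(pd.read_csv("custom_js_export.csv")).to_csv("file.csv", index=False)
```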

Clean up all inlinks

This step was a bit more involved but well worth the effort.

  • Sort column A (“Type”) and delete any rows that aren’t “Hyperlink.” Once verified, delete this column. This should make “Source” the first column.
  • Sort column F (“Status Code”) and delete any rows that don’t have a 200 status. Then, delete the “Status Code” and “Status” columns.
  • Delete the following columns:
    • Size (Bytes)
    • Follow
    • Target
    • Rel
    • Path Type
    • Link Path
    • Link Origin
  • Sort by Link Position
  • Delete any rows where the link is from navigation, header, or footer. This should leave you with “Content” and possibly “Aside.”
  • Sort by the “Source” column. Delete rows containing:
    • Home page URLs
    • Blog index page URLs
    • Category/tag index pages
    • Paginated URLs
    • Sitemap URLs
    • Any other non-unique content pages (e.g., internal search results, non-canonical URLs)
  • Sort by the “Destination” column and repeat the cleaning process you did for the “Source” column.
  • Sort by the “Alt Text” column (Z to A). Copy the alt text into the adjacent “Anchor” column and then delete the “Alt Text” column.

Remove self-linking URLs

  • Create a new column called “links to self” between “destination” and “anchor,” making it column C. I’ve included a screenshot for reference.
  • Copy and paste this formula into C2
    • =IF(A2=B2, "Match", "No Match")
  • Copy it down for all rows and sort column C from A-Z to bring up rows marked “Match.”
  • Delete these rows, as they represent source URLs linking to themselves.
  • Finally, delete the “links to self” column altogether.
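The whole inlinks cleanup can also be scripted. Here is a pandas sketch, assuming the column names from Screaming Frog’s All Inlinks export:

```python
import pandas as pd

def clean_inlinks(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the manual spreadsheet filters in one pass."""
    df = df[df["Type"] == "Hyperlink"]                       # links only, no images
    df = df[df["Status Code"] == 200]                        # live destinations only
    df = df[df["Link Position"].isin(["Content", "Aside"])]  # no nav/header/footer
    df = df[df["Source"] != df["Destination"]]               # no self-links
    return df[["Source", "Destination", "Anchor"]].reset_index(drop=True)

# clean_inlinks(pd.read_csv("all_inlinks.csv")).to_csv("all_inlinks_clean.csv", index=False)
```

Filtering out home, index, paginated, and sitemap URLs still needs site-specific rules, so that part stays manual (or a regex of your own).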

Screenshot demonstrating how to clean up data in Google Sheets, showing the column sorting process

Column C helps you get rid of Source URLs that link to themselves.

After this cleanup, my original all_inlinks.csv file went from over 50 MB with 136,873 rows to a much leaner 2 MB file with 11,338 rows and 4 columns.

Step 5: Turn the vector embeddings into helpful information (i.e., related URLs)

Access Google Colab

To process the vector embeddings, I used Google Colab. Here’s what I did: Visit the Google Colab notebook created by Gus Pelogia and click “File” > “Save a copy in Drive.” This notebook is essentially Python running in your browser, so you don’t need to install anything.
Next, I got a copy of Gus’s Python script, which uses Pandas, Numpy, and Scikit-learn to process the file.csv I generated with Screaming Frog and the OpenAI API.
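For reference, the core of what the notebook does can be sketched like this. This is my simplified reconstruction, not Gus’s actual script:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_related(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """For each URL, find the k most similar URLs by cosine similarity."""
    # "Embeddings" holds bracketed, comma-separated numbers from Screaming Frog
    vectors = np.vstack([
        [float(x) for x in e.strip("[]").split(",")] for e in df["Embeddings"]
    ])
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, -np.inf)     # a page shouldn't match itself
    order = np.argsort(-sim, axis=1)   # most similar first
    out = pd.DataFrame({"URL": df["URL"]})
    for rank in range(k):
        out[f"Related URL {rank + 1}"] = df["URL"].to_numpy()[order[:, rank]]
    return out
```

Running `top_related(pd.read_csv("file.csv"))` gives you one row per URL with its five nearest neighbors, which is the shape the rest of this tutorial expects.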

Running the script

If you’ve cleaned your data properly and your CSV file is named and formatted correctly, you should be able to:

  • Press the Play button in the Colab notebook.

Screenshot of the Google Colab notebook interface, where the Python script for processing vector embeddings is run

Upload your file.csv file (the one with the “URL” and “Embeddings” columns).

Wait for it to process without leaving the browser window.

Do two find-and-replace operations:
  • Remove all single quotes (')
  • Remove all right brackets (])

Screenshot showing a Find and Replace operation in Google Sheets.

This image shows a left bracket, but you’ll be looking for right brackets ].

I deleted the original “Related URLs” column (column B), leaving me with six columns: URL and Related URLs 1-5.
Here’s what it looks like:

Now we’re ready to put this information to practical use.

Step 6: Pull inlink data from the “all_inlinks” tab

Setting up columns and pulling inlinks data
Insert a new column between “URL” and “Related URL 1” and name it “Links to Target URL.” It should be in column B. Next, use this formula in cell B2 to pull inlink data:

=TEXTJOIN(", ", TRUE, FILTER(all_inlinks!A:A, all_inlinks!B:B = A2))

This formula gathers all URLs from the “all_inlinks” tab that link to the target URL in column A. Here’s what the result looks like:
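For anyone scripting instead of using Sheets, the same lookup can be done in pandas (a sketch, assuming the cleaned inlinks table keeps its “Source” and “Destination” columns):

```python
import pandas as pd

def links_to_target(inlinks: pd.DataFrame, targets: pd.Series) -> pd.Series:
    """Comma-join every Source linking to each target (TEXTJOIN + FILTER)."""
    grouped = inlinks.groupby("Destination")["Source"].apply(", ".join)
    # URLs with no inlinks at all get an empty string rather than NaN
    return targets.map(grouped).fillna("")
```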

Step 7: Find unlinked related pages

Identify missing links
It’s time to check whether the related pages are linking to my target page.
I used this formula in cell D2 and copied it down:

=IF(ISNUMBER(SEARCH(C2, B2)), "Exists", "Not Found")

It should look like this, with either “Not Found” or “Exists” in each cell in column D (URL 1 links to A?):
 

Do the same thing for each subsequent “URL # links to A?” column.
The reference to column B (“Links to Target URL”) isn’t going to change, but the reference to the related URL column will. For example:
In F2 (“URL 2 links to A?”) you will be looking for the E2 URL within the list of URLs in B2:

=IF(ISNUMBER(SEARCH(E2, B2)), "Exists", "Not Found")

Copy this formula down column F. In H2 you will be looking for the G2 URL within the list of URLs in B2:

=IF(ISNUMBER(SEARCH(G2, B2)), "Exists", "Not Found")

Copy this formula down column H. Repeat this process for all of the “URL # links to A?” columns.
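All of those repeated SEARCH formulas collapse into one loop in pandas (a sketch over the same column names used above):

```python
import pandas as pd

def flag_missing_links(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Mark whether each Related URL appears in the Links to Target URL string."""
    for i in range(1, k + 1):
        df[f"URL {i} links to A?"] = [
            "Exists" if rel and rel in links else "Not Found"
            for rel, links in zip(df[f"Related URL {i}"],
                                  df["Links to Target URL"])
        ]
    return df
```

Like SEARCH, this is a substring check, so a URL that is a prefix of another (e.g., /page vs. /page-2) can produce a false “Exists”; spot-check those cases.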

Highlight missing links for easy review

  • I selected columns D:L and went to Format -> Conditional Formatting in Google Sheets (or Excel).
  • I set a rule to format cells containing “Not Found” in pink for easy identification.

This made it easy to spot where the internal links were missing.

Validate the data

I double-checked a few entries manually to ensure everything was accurate. Now, I have a complete list that shows each target URL in column A, the top 5 related URLs, and whether those URLs are linking back to the target URL.
My final spreadsheet looked like this, with “Exists” or “Not Found” indicating whether each related URL was linking back to the target URL:

Step 8: Build internal links

Now comes the final and most actionable part: building those internal links.
Identify the opportunities: I used the pink cells as indicators of where internal links were missing. Each pink cell represented a related page that wasn’t linking to the target URL, even though it should.
Add the links: I went to each related page (from the pink cells) and edited the content to include a relevant internal link to the target URL. I made sure to use descriptive anchor text that aligns with the content on the target page.
Prioritize: I started with the highest-priority pages first, such as those with the most traffic.

Concluding thoughts: Create a cohesive internal linking structure with vector embeddings

Take the time to build, analyze, and refine your internal link structure. This step-by-step guide transformed my internal linking process into a data-driven strategy with the power of vector embeddings. The effort will pay off in improved rankings, better user experience, and ultimately, more organic traffic. It also improves SEO performance by ensuring your most valuable pages are connected in a way that search engines and your users understand.
After running this process on a client’s site, I was surprised. I thought we’d done a great job at internal linking, but there were hundreds of opportunities we’d missed. And I don’t just mean that the keyword we want to link from appears on the page. I mean opportunities to link pages that search engines would see as highly relevant to each other. In doing so, I was able to disambiguate closely related concepts and fix a few unnoticed keyword cannibalization issues as well.
