For years, SEOs have relied on Domain Authority (DA) as a benchmark for assessing a website’s authority. While Moz has consistently stated that DA is not a Google ranking factor, the metric has remained a key topic of discussion in the industry.
New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.
Why did Ziff Davis conduct this study?
Ziff Davis, a major publisher with brands like PCMag, Mashable, IGN, and Moz, faces the same challenges as other media companies. They suspect that Large Language Models (LLMs) are training on their content without licensing agreements, yet it’s difficult to determine which content is being favored.
The study set out to address this issue. Researchers analyzed datasets like Common Crawl, C4, OpenWebText, and OpenWebText2 to understand how LLMs are trained, what types of content they prefer, and how these choices influence AI behavior and output.
You can read the full study report here.
Key takeaways from the Ziff Davis LLM Study
If you want to skip the rest of the article, I’ve summarized the key findings below:
- LLMs place a high weighting on heavily curated, high-quality datasets above other raw web data
- Authoritative publishers dominate these curated datasets
- OpenWebText and OpenWebText2 feature a much higher proportion of high-DA content compared to uncurated datasets
- LLM developers prioritize commercial publisher content, reflecting a preference for quality and credibility
Which datasets were analyzed?
The Ziff Davis study examined four key datasets that are important in training large language models:
- Common Crawl: An uncurated repository of web text scraped from the entire internet with minimal quality control.
- C4: A cleaned version of Common Crawl that focuses on English pages and excludes duplicates and low-quality text. It offers a more refined dataset without strict curation.
- OpenWebText: A proxy for OpenAI’s WebText, emphasizing high-quality content linked from Reddit with a minimum upvote threshold.
- OpenWebText2: A follow-up to OpenWebText featuring an expanded and updated dataset while maintaining the same quality-focused approach.
It’s important to note that these datasets aren’t created equal. More curated datasets, like OpenWebText and OpenWebText2, contain a higher proportion of authoritative content, while unfiltered sources like Common Crawl pull from a much wider but lower-quality pool of web pages. These differences between datasets affect how LLMs learn and generate responses.
How were publishers chosen for the study?
The study used Comscore’s web traffic data to determine which publishers to analyze. Researchers focused on the top 15 portfolio publishers in the Media category as of August 2020, representing the most widely visited news and media organizations.
The selection process excluded single-property publishers, non-media tech firms, and user-generated content platforms in favor of more established commercial publishers.
Which metric was used?
The study used Moz’s Domain Authority (DA) to measure the influence and quality of web content in LLM training datasets. While DA is not a search ranking factor, it’s a recognized metric that predicts a website’s likelihood to rank in SERPs based on factors like backlinks, domain history, and site size.
To analyze LLM content preferences, the study compiled Moz DA scores for all URLs found in Common Crawl, OpenWebText, OpenWebText2, and C4. The findings revealed a strong relationship between dataset curation and DA distribution: uncurated datasets contained mostly low-DA sites, while curated datasets were heavily weighted toward high-DA publishers.
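As a rough illustration of what comparing DA distributions across datasets can look like, here is a minimal sketch. The sample rows, the 20-point bands, and the function names are assumptions made for this example; they are not figures or code from the study.

```python
from collections import Counter, defaultdict

# Hypothetical (dataset, DA score) pairs, NOT values from the study.
SAMPLE_ROWS = [
    ("Common Crawl", 12), ("Common Crawl", 8),
    ("Common Crawl", 35), ("Common Crawl", 91),
    ("OpenWebText2", 88), ("OpenWebText2", 93),
    ("OpenWebText2", 47), ("OpenWebText2", 76),
]

# Illustrative bands; Moz reports DA on a 1-100 scale.
BANDS = ["0-19", "20-39", "40-59", "60-79", "80-100"]


def da_band(score: int) -> str:
    """Map a DA score to one of the illustrative bands above."""
    return BANDS[min(score // 20, len(BANDS) - 1)]


def da_distribution(rows):
    """Return {dataset: {band: share of that dataset's URLs}}."""
    counts = defaultdict(Counter)
    for dataset, score in rows:
        counts[dataset][da_band(score)] += 1
    return {
        dataset: {band: c[band] / sum(c.values()) for band in BANDS}
        for dataset, c in counts.items()
    }


if __name__ == "__main__":
    for dataset, shares in da_distribution(SAMPLE_ROWS).items():
        print(dataset, shares)
```

With real data, each row would be one URL from a dataset paired with the DA of its domain. A curated dataset would then show most of its share in the upper bands.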
What did we learn from the Ziff Davis Study?
Most datasets are curated to improve the quality of AI output
The Ziff Davis study makes it clear that while these models may scrape everything indiscriminately, they place a higher weighting on curated datasets to prioritize quality.
Curation shapes how LLMs process and generate content. Raw datasets like Common Crawl pull from the open web with a mix of high- and low-quality sources. In contrast, curated datasets like OpenWebText and OpenWebText2 filter out low-quality content to create a higher concentration of reliable information.
This intentional, selective process improves model accuracy, response quality, and content relevance. It also explains why high-authority websites dominate AI outputs.
LLMs favor high-quality content from commercial publishers with high Domain Authority
LLMs don’t treat all web content equally. The Ziff Davis study confirms that high-DA commercial publishers dominate curated datasets.
We used a combination of Moz API and Google Colab to run a bulk DA analysis for all URLs featured in the study.
You can view the custom script here.
84.2% of analyzed publishers had an average DA of 60 or higher, showing a clear preference toward established media brands. As datasets become more curated, the proportion of high-DA content increases, with publishers like The New York Times and News Corp appearing more frequently.
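A statistic like "share of publishers with an average DA of 60 or higher" comes down to simple arithmetic once the scores are collected. The sketch below shows that calculation only. The publisher names and DA values are placeholders, and it does not reproduce the custom script or call the Moz API.

```python
from statistics import mean

# Placeholder data: each publisher maps to the DA scores of its domains.
# These names and numbers are invented for illustration only.
PUBLISHER_DA = {
    "Publisher A": [93, 88, 71],
    "Publisher B": [64, 58],
    "Publisher C": [45, 39, 52],
    "Publisher D": [81],
}


def share_at_or_above(publisher_scores, threshold=60):
    """Return (share of publishers whose average DA >= threshold, averages)."""
    averages = {name: mean(scores) for name, scores in publisher_scores.items()}
    qualifying = [name for name, avg in averages.items() if avg >= threshold]
    return len(qualifying) / len(averages), averages


if __name__ == "__main__":
    share, averages = share_at_or_above(PUBLISHER_DA)
    print(f"{share:.1%} of publishers average DA 60 or higher")
```

Averaging per publisher first, rather than pooling every domain, keeps a portfolio with many small sites from dominating the result.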
An emerging trend of AI companies partnering with major publishers
Nothing is free in life, and AI companies know it. The backlash from publishers over copyrighted content has forced AI companies to broker exclusive licensing deals with a select group of publishers like News Corp and Axel Springer. Many of these publishers have seemingly used robots.txt rules as leverage in these negotiations.
Click here to download the graphic as a PDF and explore the source links.
Does this mean that publishers with licensing agreements feature more?
No. While publishers with AI partnerships appear more often in OpenWebText2 than in the WebText top 1000, the relationship isn’t absolute.
Three of the top five publishers in OpenWebText2 (NYT, Advance, and Gannett) do not have licensing agreements with OpenAI. Also, the WebText top 1000 contains a higher percentage of these publishers than OpenWebText2 (13.47% vs. 12.04%). Suffice it to say that AI partnerships do not guarantee higher dataset representation. It’s also worth noting that the NYTimes blanket blocks almost all AI crawlers in its robots.txt, so its presence in this dataset is an indication that the makers of these datasets wanted to use NYTimes content, but not that they were able to do so.
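To see how a robots.txt file can block AI crawlers while leaving other bots alone, here is a small sketch using Python’s standard-library urllib.robotparser. The robots.txt content below is a made-up sample, not the NYTimes file. GPTBot and CCBot are the crawler tokens used by OpenAI and Common Crawl, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; not any real publisher's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_crawlers(robots_txt, user_agents, url="https://example.com/article"):
    """Return the user agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in user_agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot", "Googlebot"]
    print(blocked_crawlers(SAMPLE_ROBOTS_TXT, agents))
```

Keep in mind that robots.txt is a request rather than an enforcement mechanism, and rules added today say nothing about content collected before they existed.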
What does the Ziff Davis study mean for SEO?
Content is still king
Every major publisher thrives on high-quality content, from breaking news and investigative reporting to data-led reports and expert analysis. Looking at the top publishers featured in the Ziff Davis study, we see household names like:
- The New York Times (nytimes.com)
- Buzzfeed, Inc. (buzzfeed.com, huffpost.com)
- Condé Nast (wired.com, newyorker.com, vogue.com)
- News Corp (wsj.com, thesun.co.uk, nypost.com)
These publishers dominate search, earn backlinks naturally, and are often used in LLM training datasets, reinforcing their credibility.
Despite volatile SERPs and the rise of AI-generated answers, content remains the foundation of a website’s authority.
Moz's DA metric is directionally accurate for gauging a website's authority
While Moz’s Domain Authority (DA) isn’t a ranking factor, the Ziff Davis study confirms it’s a strong directional indicator of site authority, which aligns with the high-quality sources favored in LLM training.
In a Moz roundup on the Google Leaks, Rand Fishkin pointed out, “Google has been misleading marketers for years when saying they don’t use any form of website authority.” Supporting this statement, a study by Tom Pool on Google's Helpful Content Update (HCU) found that websites with higher DA scores were more likely to be HCU winners.
While building authority is a combination of different elements, the key tenets remain the same:
- Helpful content from thought leaders that demonstrates personal experience with the problem
- Topically relevant backlinks from authoritative websites
- Strong UX and engagement signals that show content is helpful to users
- Positive off-page signals that reinforce brand trust and authority
AI models face the same challenges with identifying authoritative sources as Google and may well solve them in the same way.
Building backlinks from authoritative sources strengthens site authority
If LLMs favor high-authority websites, then backlinks from these sites carry weight, not just in Google search rankings but potentially in generative AI visibility.
But the reality is that link building is getting harder. Spammy outreach and low-value links don’t move the needle. Instead, focus on creating content that naturally attracts media attention and citations.
High-value assets include:
- Industry reports with exclusive research and data
- Original surveys and case studies that provide unique insights
- Thought leadership content from recognized experts in your niche
- Interactive tools that offer a ton of value for users
While not mentioned in the study, most of these publishers have a high Brand Authority
Brand Authority is shaping up to be just as important as Domain Authority. The numbers don’t lie: 57.9% of the publishers in the Ziff Davis study had a Brand Authority score of 40 or higher. Moz’s Jonathan Berthold used a combination of Moz API and a custom Google Colab script to do a bulk URL analysis for Brand Authority score.
The numbers align with Tom Capper’s study findings, which showed that sites with strong brand signals were more likely to benefit from Google’s algorithm changes, while weaker brands struggled to compete.
According to Amanda Milligan, a few strategies that work for Brand Authority include:
- Creating newsworthy reports and studies
- Leveraging in-house experts to create content
- Highlighting proof of expertise on your website and content
- Co-marketing with authoritative brands in your vertical
- Giving value worth its weight in gold
Conclusion: High-quality content and Domain Authority are important elements to optimize for generative search
I’m not sure anyone is surprised by the outcome of the Ziff Davis study, as it confirms what we’ve long suspected. However, it’s important to note that these websites and publishers didn’t become giants overnight. They spent years investing in high-quality content, earning backlinks, and building reliable brands. To optimize for generative AI search, SEOs should follow the same playbook: publish unique content that naturally attracts relevant backlinks and establishes topical authority.