Google Content Warehouse API Leak


This article covers some key takeaways from the recent Google Content Warehouse API Leak.


Search Engine Optimization (SEO) & User Experience (UX) need to work more closely together

  • You need to drive more successful clicks using a broader set of queries and earn more link diversity if you want to continue to rank.
    • A focus on driving more qualified traffic to a better user experience will send signals to Google that your page deserves to rank.
  • With NavBoost, Google treats clicks as one of its most important features, so we need to understand what session success means.
    • A search that yields a click on a result, after which the visitor does not perform another search, can be a success even if they did not spend a lot of time on the site. This may indicate that the visitor found what they were looking for.
    • A search that yields a click where the visitor spends 5 minutes on a page before coming back to Google is also a success. The takeaway: create more successful sessions.
  • SEO is about driving visitors to the page; UX is about getting visitors to what they want on the page. Pay closer attention to how components are structured to get visitors to the content they are explicitly looking for, and give visitors a reason to stay on the site.
    • In essence, provide the visitor with the exact information they came for, then entice them to remain on the page with something additionally compelling.
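The two success cases above can be sketched as a simple rule. This is only an illustration of the idea, not how NavBoost works internally: the `SearchSession` fields and the `LONG_DWELL_SECONDS` cutoff are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    clicked: bool            # did the search yield a click on a result?
    dwell_seconds: float     # time spent on the page after the click
    searched_again: bool     # did the visitor go back and search again?


# Hypothetical cutoff, taken from the "5 minutes" example above.
LONG_DWELL_SECONDS = 300


def is_successful(session: SearchSession) -> bool:
    """Classify a session using the two success cases described above."""
    if not session.clicked:
        return False
    if not session.searched_again:
        # The visitor stopped searching, so they likely found the answer,
        # even if the visit was short.
        return True
    # The visitor came back to search, but only after a long visit.
    return session.dwell_seconds >= LONG_DWELL_SECONDS
```

A short visit followed by another search is the only clicked case this sketch treats as unsuccessful.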

NavBoost, which is also called “Glue,” is a Google ranking system that was uncovered during Google’s antitrust trial with the U.S. Department of Justice. The algorithm is focused on improving search results for navigational queries. It uses various signals, including user clicks, to determine the most relevant results. NavBoost remembers past clicks for queries up to 13 months old and segments results based on characteristics like localization and device type (mobile or desktop).

Pay more attention to click metrics

We tend to treat Search Analytics data as outcomes, but Google’s ranking systems treat them as diagnostic features.

If you rank highly and have a ton of impressions but no clicks, you likely have a problem. What we have learned is that there is a threshold of expected performance for each position. When you fall below that threshold, you can lose that ranking position.
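One way to act on this is to flag queries whose click-through rate falls well below what you would expect for their position. The sketch below assumes rows shaped like a Search Console export (`query`, `position`, `clicks`, `impressions`). The `EXPECTED_CTR` values and the `tolerance` are placeholders invented for illustration, not Google's thresholds, so swap in benchmarks from your own data.

```python
# Hypothetical expected CTR by rounded average position.
# Replace with benchmarks derived from your own data.
EXPECTED_CTR = {1: 0.25, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def underperforming(rows, tolerance=0.5, min_impressions=100):
    """Return queries whose CTR is below tolerance * expected CTR for their position."""
    flagged = []
    for row in rows:
        expected = EXPECTED_CTR.get(round(row["position"]))
        # Skip positions without a benchmark and queries with too little data.
        if expected is None or row["impressions"] < min_impressions:
            continue
        ctr = row["clicks"] / row["impressions"]
        if ctr < expected * tolerance:
            flagged.append(row["query"])
    return flagged
```

Queries returned by `underperforming` are candidates for title and description testing before the position slips.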

Content needs to be more focused

We have learned definitively that Google uses vector embeddings to determine how far off a given page is from the rest of what you talk about.

This indicates that it will be challenging to go far into upper funnel content successfully without a structured expansion or without authors who have demonstrated expertise in that subject area.

Encourage your authors to cultivate expertise in what they publish across the web, and treat their bylines like the gold standard that they are.

SEO should always be experiment-driven

Due to the variability of the ranking systems, you cannot take best practices at face value for every space. You need to test, learn, and build experimentation into every SEO program.

Large sites leveraging products like SEO split testing tools are already on the right track, but even small sites should test how they structure and position their content and metadata to encourage stronger click metrics.

In other words, we need to actively test the Search Engine Results Pages (SERPs), not just the site.

Pay attention to what happens after visitors leave

We now have verification that Google is using data from Chrome as part of the search experience. There is value in reviewing the clickstream data from SimilarWeb and SEMRush.

Trends show you where people are going next and how you can give them that information without them leaving your site.

Build keyword & content strategy around SERP format diversity

Google limits the number of pages of certain content types ranking in the SERP, so checking the SERPs should become part of your keyword research.

Don’t align formats with keywords if there’s no reasonable possibility of ranking. Then again, it depends on whether there are other benefits to doing so.

Google can specify a limit of results per content type.

In other words, they can specify only X number of blog posts or Y number of news articles that can appear for a given SERP.

Having a sense of these diversity limits could help us decide which content formats to create when selecting keywords to target.

For instance, if we know that the limit for blog posts is three and we don’t think we can outrank any of them, then maybe a video is a more viable format for that keyword.
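That reasoning can be written down as a small check. The per-type limits are not published anywhere, so the `limits` mapping below is something you would have to estimate from the SERPs you review, and `can_outrank` is your own judgment call. All names are hypothetical.

```python
from collections import Counter


def viable_formats(limits, serp_types, can_outrank):
    """
    limits:      assumed maximum number of results per content type
    serp_types:  content type of each result currently ranking
    can_outrank: content types where we believe we can beat a ranking result
    """
    counts = Counter(serp_types)
    return [
        fmt
        for fmt, limit in limits.items()
        # Viable if a slot is still open, or if we can displace a current result.
        if counts[fmt] < limit or can_outrank.get(fmt, False)
    ]
```

With an assumed limit of three blog posts that are all taken and judged unbeatable, and one of two video slots open, only the video format comes back as viable.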

Page Titles still matter

Google has a feature called titlematchScore that is believed to measure how well a page title matches a query.

Page Title length doesn’t matter

We now have further evidence that the 60-70 character limit is a myth.

In my own experience we have experimented with appending more keyword-driven elements to the title and it has yielded more clicks because Google has more to choose from when it rewrites the title.

Use fewer authors on more content

Rather than using an array of freelance authors, you should work with fewer authors who are more focused on subject matter expertise and who also write for other publications.

Focus on link relevance from sites with traffic

Link value is higher from pages that are prioritized higher in the index. Pages that get more clicks are pages that are likely to appear in Google’s flash memory.

We also learned that Google highly values relevance. We need to stop going after link volume and solely focus on relevance.

Default to originality instead of long form

We now know originality is measured in multiple ways and can yield a boost in performance.

Some queries don’t require a 5,000-word blog post. Focus on originality and layer more information in your updates as competitors begin to copy you.

Make sure all dates associated with a page are consistent

It’s common for dates in schema to be out of sync with dates on the page and dates in the XML sitemap. All of these need to be synced to ensure Google has the best understanding of how old the content is.

As you refresh your decaying content, make sure every date is aligned so Google gets a consistent signal.
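A quick audit can catch these mismatches. The sketch below assumes you have already extracted each date yourself (for example from schema markup, the visible page, and the sitemap). The source labels are illustrative, and treating the most recent date as the reference is a choice made for this example.

```python
from datetime import date


def date_mismatches(dates: dict[str, date]) -> dict[str, date]:
    """Return the sources whose date differs from the most recent one found."""
    if not dates:
        return {}
    latest = max(dates.values())
    return {source: d for source, d in dates.items() if d != latest}


page_dates = {
    "schema_dateModified": date(2024, 6, 1),
    "on_page": date(2024, 6, 1),
    "sitemap_lastmod": date(2023, 1, 15),
}
stale = date_mismatches(page_dates)  # only the sitemap entry is out of sync
```

An empty result means every source agrees.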

Use old domains with extreme care

If you’re looking to use an old domain, it’s not enough to buy it and slap your new content on its old URLs. You need to take a structured approach to updating the content to phase out what Google has in its long-term memory.

You may even want to avoid there being a transfer of ownership in registrars until you’ve systematically established the new content.

Make gold-standard documents

We now have evidence that quality raters are doing feature engineering for Google engineers to train their classifiers. You want to create content that quality raters would score as high quality so your content has a small influence over the next core update.

Brand matters!

If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: Build a notable, popular, well-recognized brand in your space, outside of Google search.

Freshness matters!

Google looks at dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate).
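To see what a URL-derived date might look like on your own pages, here is a minimal sketch that pulls a `/YYYY/MM/` or `/YYYY/MM/DD/` pattern out of a URL. The pattern and the `date_from_url` helper are assumptions for illustration. The leak does not document how Google actually parses dates from URLs.

```python
import re
from datetime import date
from typing import Optional

# Matches /YYYY/MM or /YYYY/MM/DD path segments (years 1900-2099).
_URL_DATE = re.compile(r"/(19\d{2}|20\d{2})/(\d{1,2})(?:/(\d{1,2}))?(?=/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    # Default to the first of the month when the URL has no day segment.
    day = int(match.group(3)) if match.group(3) else 1
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that do not form a real calendar date, e.g. month 13.
        return None
```

Comparing this value with the byline date and the dates in the content shows whether a page sends mixed freshness signals.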

Core topics

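To determine whether a document is or isn’t a core topic of the website, Google vectorizes pages and sites, then compares the page embeddings to the site embeddings. Two features appear to capture this: siteFocusScore, which reflects how closely the site sticks to a single topic, and siteRadius, which reflects how far a page strays from the site’s core topic.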
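You can approximate this kind of comparison on your own content. The sketch below assumes you already have an embedding vector per page from any embedding model, uses the average of those vectors as a stand-in for the site embedding, and measures each page's cosine distance from it. This is an illustration of the concept, not Google's actual computation.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def site_centroid(page_vectors):
    """Average the page vectors as a stand-in for a site-level embedding."""
    count = len(page_vectors)
    return [sum(column) / count for column in zip(*page_vectors)]


def off_topic_distances(pages):
    """Map each URL to its cosine distance from the site centroid."""
    centroid = site_centroid(list(pages.values()))
    return {url: 1 - cosine_similarity(vec, centroid) for url, vec in pages.items()}


# Toy two-dimensional vectors; real embeddings have hundreds of dimensions.
pages = {"/seo": [1.0, 0.0], "/links": [0.9, 0.1], "/recipes": [0.0, 1.0]}
distances = off_topic_distances(pages)
furthest = max(distances, key=distances.get)  # the page furthest from the core topic
```

Pages with the largest distances are the ones to reconsider, consolidate, or support with a more structured expansion.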


Google stores domain registration information (RegistrationInfo).

Font sizes

Google measures the average weighted font size of terms in documents (avgTermWeight) and anchor text.

The significance of page updates is measured

The significance of a page update impacts how often a page is crawled and potentially indexed.

Previously, you could simply change the dates on your page and it signaled freshness to Google, but this feature suggests that Google expects more significant updates to the page.
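To get a feel for the difference between a date swap and a real update, you can measure how much of the text actually changed. This sketch uses Python's standard `difflib` at the word level. The `update_significance` name and the 0-to-1 scale are assumptions for illustration and say nothing about how Google scores updates.

```python
import difflib


def update_significance(old_text: str, new_text: str) -> float:
    """Return 0.0 for unchanged text and 1.0 for text with no words in common."""
    matcher = difflib.SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


# Changing only the year barely moves the score.
date_only = update_significance(
    "Updated 2023 guide to links", "Updated 2024 guide to links"
)
```

A refresh that scores close to zero is effectively a date change and is unlikely to count as a significant update.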

Toxic backlinks are indeed a thing

We’ve heard that “toxic backlinks” are a concept that is simply used to sell SEO software. Yet there is a badbacklinksPenalized feature associated with documents.

There’s a blog copycat score

In the BlogPerDocData module there is a copycat score that has no definition but is tied to the docQualityScore. My assumption is that it is a measure of duplication specifically for blog posts.

Google detects how commercial a page is

We know that intent is a heavy component of Search, but we only have measures of this on the keyword side of the equation.

Google scores documents this way as well, and this can be used to stop a page from being considered for a query with informational intent.



Bottom line, keep doing what works and dump what’s not. While there is no text-to-code ratio in these recent Google Content Warehouse API Leak documents, several of your SEO tools will tell you your site is falling apart because of it.
