Google API Content Warehouse Leak – An Ongoing Analysis

This is an ongoing analysis of the Google API Content Warehouse leak. The docs can be found here.

Let’s start with the most important module in the API: the CompositeDoc model.

The CompositeDoc model collects all information about a document. This is what SEOs would call on-page optimization: it defines the attributes of on-page SEO and indicates what is and is not important for a URL (document).

The CompressedQualitySignals model is another model that feeds into Google’s systems, several of which come up repeatedly in the docs:

  • Mustang is a Google system that scores, ranks, and returns search results. It works from the input of a number of models discussed below and relies on an underlying ranking algorithm called AScorer.
  • TeraGoogle is an indexing system that stores the corpus of recently processed documents in storage for optimal retrieval.
  • Amarna is referenced in SafesearchVideoContentSignals.
  • Raffia – some references indicate it is an indexing system.

Here is a screenshot. I recommend opening it in a new tab to continue reading, or simply opening the link on the hexdocs.pm site.

PerDocData model

The PerDocData model is the most interesting and probably deserves a page of its own. The module is essentially a protocol buffer used for indexing and serving.

Some notes from a cursory analysis:

  • scienceDoctype classifies whether a document is a science document. A value below 0 means it is not a science document, 0 means a fully visible science document, and a value above 0 indicates a science document with limited visibility, where the number represents the number of visible terms.
  • uacSpamScore is a value between 0 and 127, where anything above 64 indicates the page is likely spam.
  • spamtokensContentScore is a number that measures user-generated-content (UGC) spam on a page. It is used in the SiteBoostTwiddler.
  • OriginalContentScore – pages with little content carry this score, encoded from 0 to 127; the actual original content score ranges from 0 to 512.
 Pro Tip
There is no data that clarifies what counts as “little content”. However, a number of other attributes, such as bodyWordsToTokensRatioTotal and bodyWordsToTokensRatioBegin, indicate that the measure of “little content” may be relative.
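To make the documented ranges concrete, here is a minimal sketch of how these per-document scores could be represented and interpreted. The class, the decoding assumption, and the helper names are my own; only the attribute names, the ranges, and the 64 threshold come from the docs.

```python
from dataclasses import dataclass

# Hypothetical container for a few PerDocData quality fields.
# The field names mirror the leaked attributes; the interpretation
# logic is an assumption based on the documented ranges.
@dataclass
class PerDocQuality:
    uac_spam_score: int            # 0..127, above 64 treated as spam per the docs
    spamtokens_content_score: int  # UGC spam measure used by the SiteBoostTwiddler
    original_content_score: int    # 0..127 encoding of a 0..512 score

    def looks_like_spam(self) -> bool:
        # Documented threshold: anything above 64 is indicative of spam.
        return self.uac_spam_score > 64

    def decoded_original_content(self) -> float:
        # Assumption: the 0..127 value linearly encodes the 0..512 range.
        return self.original_content_score * (512 / 127)


doc = PerDocQuality(uac_spam_score=70, spamtokens_content_score=12,
                    original_content_score=40)
print(doc.looks_like_spam())                   # True
print(round(doc.decoded_original_content()))   # roughly 161
```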

Originality may be rewarded and appears to be closely related to topical authority.

Another concept with a lot of attributes is content freshness.

  • freshboxArticleScores – Stores scores of freshness-related classifiers: freshbox article score, live blog score and host-level article score.
  • semanticDateInfo – of type integer, it contains confidence scores for the day/month/year components, as well as various metadata required by the freshness twiddlers.
  • datesInfo – stores date information and ascertains whether the page is old based on the date annotations.
  • freshnessEncodedSignals – this field is deprecated; it points to a single string that stores freshness- and aging-related data.
  • lastSignificantUpdate – records the time of the last significant update to the page.
 Pro Tip
Content freshness looks like a key factor, based on a number of parameters inside the PerDocData model. It may be a good idea to update your content regularly to take advantage of the content-freshness boost. A sketch of how a few of these fields might be combined follows below.
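As a thought experiment, here is a rough sketch of how two of these freshness fields might be combined into a simple staleness check. The combination logic and the 180-day threshold are entirely hypothetical; only the field names are taken from the model.

```python
from datetime import datetime, timezone

# Hypothetical staleness check built on two of the documented fields:
# last_significant_update stands in for lastSignificantUpdate, and
# date_confidence stands in for the day/month/year confidences packed
# into semanticDateInfo. The 180-day window is an assumption.
def looks_stale(last_significant_update: datetime,
                date_confidence: float,
                max_age_days: int = 180) -> bool:
    age = datetime.now(timezone.utc) - last_significant_update
    # Only trust the age if the extracted date is reasonably confident.
    return date_confidence > 0.5 and age.days > max_age_days


updated = datetime(2023, 1, 15, tzinfo=timezone.utc)
print(looks_stale(updated, date_confidence=0.9))  # True for any date well past mid-2023
```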
  • smartphoneData indicates that additional metadata is stored for smartphone-optimized documents.

This field points to another module, SmartphonePerDocData, which measures a number of other important factors:

  • isSmartphoneOptimized checks for mobile responsiveness.
  • isWebErrorMobileContent checks whether the page serves an error to the desktop crawler but renders fine on mobile.
  • maximumFlashRatio checks how much Flash appears in the rendered area.
  • violatesMobileInterstitialPolicy checks for a mobile-interstitial policy violation and demotes the page if one is found. It is of type boolean, so it only stores 1/0.
 Pro Tip
There is an obvious boost for mobile responsiveness. If you have been using mobile interstitials, you are likely being demoted, and there is no way to know if and when this score refreshes. This is particularly bad for sites running ads such as Google AdSense.
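Here is a minimal sketch of how these mobile signals could be modelled and rolled up into a simple demotion check. The class and the demotion rule are assumptions for illustration; only the field names come from SmartphonePerDocData.

```python
from dataclasses import dataclass

# Hypothetical view of a few SmartphonePerDocData fields.
@dataclass
class SmartphoneSignals:
    is_smartphone_optimized: bool              # isSmartphoneOptimized
    is_web_error_mobile_content: bool          # isWebErrorMobileContent
    maximum_flash_ratio: float                 # maximumFlashRatio, 0.0..1.0
    violates_mobile_interstitial_policy: bool  # violatesMobileInterstitialPolicy

    def likely_demoted(self) -> bool:
        # Assumption: any one of these conditions hurts the page on mobile.
        return (not self.is_smartphone_optimized
                or self.is_web_error_mobile_content
                or self.maximum_flash_ratio > 0.0
                or self.violates_mobile_interstitial_policy)


page = SmartphoneSignals(True, False, 0.0, True)
print(page.likely_demoted())  # True, because of the interstitial violation
```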

There are at least four PageRank attributes in the docs with little or no description. These include:

  • Pagerank – deprecated
  • Pagerank0 – no description
  • Pagerank1 – no description
  • Pagerank2 – no description
 Pro Tip
Some comments are as recent as late 2022, so it can be concluded that PageRank is alive and well.

There is at least one signal that relates to a hacked site, alternatively titled a spam Muppet.

  • spamMuppetSignals leads to GoogleApi.ContentWarehouse.V1.Model.SpamMuppetjoinsMuppetSignals, which stores hackedDateNautilus, hackedDateRaiden, raidenScore, and site.
 Pro Tip
The score is used during query-time joins and likely demotes pages when scoring the PerDocData. Keep your software updated and avoid becoming a spam Muppet.
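For illustration, here is a small sketch of what this hacked-site signal might look like at query time. Only the field names come from SpamMuppetjoinsMuppetSignals; the demotion condition and threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical representation of the SpamMuppetjoinsMuppetSignals fields.
@dataclass
class SpamMuppetSignals:
    hacked_date_nautilus: Optional[str]  # hackedDateNautilus
    hacked_date_raiden: Optional[str]    # hackedDateRaiden
    raiden_score: float                  # raidenScore
    site: str

    def demote_at_query_time(self, threshold: float = 0.5) -> bool:
        # Assumption: a recorded hack date or a high raidenScore
        # triggers a demotion during the query-time join.
        hacked = self.hacked_date_nautilus or self.hacked_date_raiden
        return bool(hacked) or self.raiden_score > threshold
```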

Site Signals

An important attribute, nsrDataProto, leads to a QualityNSR model. These are stripped site-level signals. They include:

titlematchScore signals how well the titles match the user query.

 Pro Tip
Titles are very important; matching the title to the user query matters a great deal.
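For intuition, here is a toy version of a title-to-query match score based on simple token overlap. The real titlematchScore is certainly more sophisticated; this is only a sketch of the idea.

```python
import re

def title_match_score(title: str, query: str) -> float:
    """Toy stand-in for titlematchScore: fraction of query terms found in the title."""
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(title)) / len(query_terms)


print(title_match_score("Google API Content Warehouse Leak Analysis",
                        "content warehouse leak"))  # 1.0
```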

site2vecEmbeddingEncoded represents a compressed site embedding. These are likely used to characterize the site’s content and determine its theme. Such attributes are presumably mapped into an embedding space and used to measure topical similarity and for clustering.

siteFocusScore & siteRadius further help determine the authoritative nature of the site. A higher siteFocusScore signifies a greater concentration on a topic. A siteRadius score helps determine how far a page deviates from the overall topic of the site.

 Pro Tip
There are a number of tips to take away from these signals.

  • Use topical keyword research so you do not stray from your site’s main area of focus.
  • Higher concentration leads to higher authority.
  • Prune content that deviates significantly from your website’s core focus area. It may not be that hard to build a tool to measure topic deviation; see the sketch after this list.
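Under the assumption that these signals are embedding-based, here is a rough sketch of how one might approximate them with per-page embeddings: focus as the average similarity of pages to the site centroid, and radius as a page’s distance from that centroid. The formulas are illustrative guesses, not the leaked implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def site_focus_score(page_embeddings: np.ndarray) -> float:
    """Guess: average cosine similarity of each page to the site centroid."""
    centroid = page_embeddings.mean(axis=0)
    return float(np.mean([cosine(p, centroid) for p in page_embeddings]))

def site_radius(page_embedding: np.ndarray, page_embeddings: np.ndarray) -> float:
    """Guess: how far a single page deviates from the overall topic of the site."""
    centroid = page_embeddings.mean(axis=0)
    return 1.0 - cosine(page_embedding, centroid)


pages = np.random.default_rng(0).normal(size=(20, 64))  # stand-in page embeddings
print(site_focus_score(pages), site_radius(pages[0], pages))
```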

smallPersonalSite is of type number and stores a score for small personal site promotion. It likely refers to a promotion for small personal sites and blogs.

Domain Sandbox

hostAge – the earliest firstseen date of all pages on this host/domain. This data is used in a twiddler to “sandbox fresh spam in serving time”. A number of people have argued that domain age plays a role in ranking, but this parameter indicates that hostAge is simply used to identify fresh spam and sandbox it. Since this attribute is part of the PerDocData model, the score is likely calculated per URL rather than for the entire domain. This does not imply that a domain goes through a sandbox.

 Pro Tip
This attribute has been the poster child of all arguments for the existence of a sandbox. Judging by the attribute comment here, it is my personal opinion that it lacks the substance to validate the idea that a sandbox for newer domains exists.

The comment “If this url’s host_age == domain_age, then omit domain_age” implies that if you do change hosts, you start again from scratch. Be wary of changing hosts unless absolutely necessary.
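To illustrate this reading, here is a toy version of a serving-time twiddler that sandboxes fresh spam using hostAge. Both the age window and the spam condition are invented for illustration; the docs only say the data is used to “sandbox fresh spam in serving time”.

```python
from datetime import datetime, timezone

def sandbox_fresh_spam(host_age_firstseen: datetime,
                       spam_score: int,
                       window_days: int = 30,
                       spam_threshold: int = 64) -> bool:
    """Toy twiddler: only sandbox documents that are BOTH new and spammy."""
    age_days = (datetime.now(timezone.utc) - host_age_firstseen).days
    return age_days <= window_days and spam_score > spam_threshold
```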

Bad SSL Certificates

badSslCertificate checks whether the website has a bad SSL certificate or sits in a redirect chain. It further links to a model that checks the score on a per-URL basis.

This field feeds back into the CompositeDoc.

 Pro Tip
Keep on top of your SSL certificates and check the HTTPS report in Search Console. Use tools like Screaming Frog or other crawlers to check for redirect chains.
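Here is a small, self-contained sketch of the kind of check you can run yourself: how many days the certificate has left, and how many redirect hops a URL goes through. It uses the Python standard library plus requests, and it is a diagnostic aid, not a reproduction of badSslCertificate.

```python
import socket
import ssl
from datetime import datetime, timezone

import requests  # third-party: pip install requests

def cert_days_remaining(host: str, port: int = 443) -> int:
    """Days until the host's TLS certificate expires (raises on an invalid cert)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

def redirect_chain_length(url: str) -> int:
    """Number of redirect hops before the final response."""
    resp = requests.get(url, allow_redirects=True, timeout=10)
    return len(resp.history)


print(cert_days_remaining("example.com"))
print(redirect_chain_length("http://example.com"))
```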

TOFU Score

tofu, of type number, measures the site-level TOFU score. This is a site-quality predictor based on the content of the website.

 Pro Tip
Keep top-of-funnel users happy?

Impressions

impressions, of type number, likely stores the total impressions your site gets. This is fed directly back into PerDocData, which feeds back into the CompositeDoc, indicating that impressions play a role in rankings.

 Pro Tip
While this tip may contradict the authority and concentration tips above, greater volume and breadth of content will likely yield more impressions. In other words, to increase total impressions, focus on producing more content.

Clutter Score

clutterScore is a delta, site-level signal that looks for clutter on the site. The definition of distracting/annoying resources is missing, but it presumably covers interstitials, video players, scripts, and so on.

 Pro Tip
Declutter your site. There is an explicit penalty for clutter. See: “penalizing sites with a large number of distracting/annoying resources”.
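Since the definition of “distracting/annoying resources” is missing, here is a rough, assumption-laden sketch of how you might count candidate clutter on a page with the standard library’s HTML parser. The tag list is a guess, not the leaked definition.

```python
from html.parser import HTMLParser

# Tags assumed (not confirmed by the docs) to signal clutter.
CLUTTER_TAGS = {"script", "iframe", "video", "dialog", "embed", "object"}

class ClutterCounter(HTMLParser):
    """Counts occurrences of assumed 'distracting' tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in CLUTTER_TAGS:
            self.count += 1


counter = ClutterCounter()
counter.feed("<html><body><script></script><iframe></iframe><p>text</p></body></html>")
print(counter.count)  # 2
```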

LLM Content

From the QualityNsrPQData model, we have some interesting attributes:

  • contentEffort – LLM-based effort estimation for article pages
  • rhubarb – Site-URL delta signals based quality score computed in Goldmine via the Rhubarb model
  • unversionedRhubarb – of type number, the delta score of the URL-level quality predictor.

This is particularly quirky because it is hard to tell whether contentEffort is a score that checks if the page was generated by an LLM or, more likely, an LLM-based processor that scores how much effort was required to create the page.

LocalWWWInfo model

The CompositeDoc includes a model, LocalWWWInfo. This includes attributes like address, brickAndMortarStrength, geotopicality, isLargeChain, isLargeLocalwwwinfo, and siteSiblings.

LocalWWWInfo Considerations

This implies that Google considers the following:

  • address – do you have a local business address and phone number?
  • brickAndMortarStrength is a float, so some level of scoring is involved.
  • geotopicality measures information about the geo-location and is independent of the business or URL in question.
  • isLargeChain is a boolean (1/0) and checks whether the business is part of a large chain.
  • isLargeLocalwwwinfo, also a boolean, checks whether the website is a large website.
  • siteSiblings is of type integer and very likely counts the site’s sibling sites. The notes say these are per-document signals, independent of any particular address.

Finally, a WrapptorItem keeps the address footprint, which includes an address, phone number, and business name. This essentially validates the NAP (name, address, phone).
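To make the NAP idea concrete, here is a small sketch of a consistency check along the lines of what WrapptorItem appears to validate. The normalization rules are simplifications of my own; only the name/address/phone fields come from the description above.

```python
import re
from dataclasses import dataclass

@dataclass
class NapRecord:
    """Name, address, phone footprint, roughly what WrapptorItem holds."""
    name: str
    address: str
    phone: str

def _normalize(value: str) -> str:
    # Lowercase and strip punctuation/whitespace so trivial formatting
    # differences ("Suite 5" vs "suite 5,") do not count as mismatches.
    return re.sub(r"[^a-z0-9]", "", value.lower())

def nap_consistent(a: NapRecord, b: NapRecord) -> bool:
    return (_normalize(a.name) == _normalize(b.name)
            and _normalize(a.address) == _normalize(b.address)
            and _normalize(a.phone) == _normalize(b.phone))


on_site = NapRecord("Acme Bakery", "12 Main St, Suite 5", "(555) 010-0199")
listing = NapRecord("Acme Bakery", "12 Main St Suite 5", "555-010-0199")
print(nap_consistent(on_site, listing))  # True
```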

 Pro Tip
One takeaway from this module is the isLargeLocalwwwinfo attribute, which likely provides a boost for larger sites. Improve and/or fix your NAP.