This is an ongoing analysis of the leak of the Google API content warehouse leak. The docs can be found here.
Let’s start with the most important Module in the API. The CompositeDoc model.
The CompositeDoc collects all information about a document. This is what SEOs would call the On-page optimisation for pages. This clearly defines all attributes of On-page SEO and what is and is not important for a URL (document).
The CompressedQualitySignals model is another model that feeds into Google systems.
- Mustang is a Google system that scores, ranks and returns search results. This system works based on input of a number of models discussed below. This system has an underlying algorithm that helps with ranking called AScorer
- TeraGoogle is an indexing system that stores the recent corpus of processed document on storage for optimal retreival mechanisms.
- Amarna referenced in SafesearhVideoContentSignals
- Raffia – some references indicate it is a indexing system.
Here is a screenshot. I recommend opening this in a new tab to continue reading or simply opening the link from hexdocs.pm site.
PerDocData model
The PerDocData model is the most interesting and probably deserve a page of its own. The module is basically a protocol buffer used for indexing and serving.
Some notes from a cursory analysis
- An attribute that classifies if a document is scienceDoctype or not. <0 means not a Science document. 0 means it is Science doc fully visible and >0 implies Science doc with limited visibility where the number represents the number of visible terms.
- uacSpamScore is a value between 0 to 127 where anything above 64 represents page indicative of spam.
- spamtokensContentScore basically a number that measures a page with User Generated Content spam. This is used in the SiteBoostTwiddler.
- OriginalContentScore – Pages that have little content have this score from 0 to 127. The actual original content score ranges from 0 to 512.
Originality may be rewarded and closely related to topical authority.
Another concept with a lot of attributes is content freshness.
- freshboxArticleScores – Stores scores of freshness-related classifiers: freshbox article score, live blog score and host-level article score.
- semanticDateInfo – Its of the type integer and contains confidence scores for day/month/year components as well as various meta data required by the freshness twiddlers.
- datesInfo – Stores dates info and asceraatins if the page is old based on the date annotation.
- freshnessEncodedSignals – this field is deprecated but points to a single type string that stores freshness and aging related data.
- lastSignificantUpdate – records the last significant update on the page.
- smartphoneData indicates additional metadata is stored for smartphones optimised documents.
This field pointed to another module called the SmartphonePerDocData which had a number of other important factors being measured.
- isSmartphoneOptimised checks for mobile responsiveness
- isWebErrorMobileContent checks if the page serves an error to desktop crawler but is fine on the mobile ?
- maximumFlashRatio checks if there is flash on the render area
- violatesMobileInterstitialPolicy checks for violation for mobile interstitials and demotes the page if found. Its of the type boolean so only stores 1/0
There are atleast 4 pag rank attributes in the docs with no or low descriptions. These include
- Pagerank – deprecated
- Pagerank0 – no description
- Pagerank1 – no description
- Pagerank2- no description
There is at least one signal that relates to a hacked site alternatively titled a spam Muppet.
- spamMuppetSignals which leads to GoogleApi.ContentWarehouse.V1.Model.SpamMuppetjoinsMuppetSignals stores hackedDateNautilus, hackedDateRaiden, raidenScore, site
Site Signals
An important attribute nsrDataProto leads to a QualityNSR model. These are stripped site-level signals. These include
titlematchScore signals how well the titles match the user query.
site2vecEmbeddingEncoded represents compressed site embedding. These are liekly used for site’s content characteristic and determining the theme of the site. These attributes are likely mapped into an embedding space and are used to measure topical similarity and clustering abilities.
siteFocusScore & siteRadius further help determine the authoritative nature of the site. A higher siteFocusScore signifies a greater concentration on a topic. A siteRadius score helps determine how far a page deviates from the overall topic of the site.
- Use topical keyword research to not stray away from the main area of focus of your site.
- Higher concentration leads to higher authority.
- Prune content that deviates largely from your websites core focus area. It maybe not that hard to come up with a tool to measure topic deviations.
smallPersonalSite is a number type which stores score of small personal site promotion. It llikely refers to promotion of small site blogs.
Domain Sandbox
hostAge – The earliest firstseen date of all pages on this host/domain. These data are used in twiddler to “sandbox fresh spam in serving time“. A number of people have argued that domain age plays a role in ranking but this parameter indicates that the hostAge is simply used to determine fresh spam and sandbox the spam. Since this attribute is a part of the perDocData model , this score is likely calculated on the URL as opposed to the entire domain. This does not imply that a domain goes through a sandbox.
If this url’s host_age == domain_age, then omit domain_age implies that if you do change the host, you start again from scratch. Be wary of changing hosts unless absolutely necessary.
Bad SSL Certificates
badSslCertificate checks if the website has a bad SSL certificate or is in a redirect chain. It further links to a Model which checks the score on a per URL basis.
This field feeds back into the CompositeDoc.
TOFU Score
tofu of type number measures site-level TOFU score. This is a site quality predictor based on the content of the website.
Impressions
impressions of type number likely stores total impression your site gets. This is directly fed back into the perDocData which feeds back into the CompositeDoc. This indicates that impressions play a role in rankings.
Clutter Score
clutterScore is a Delta site-level signal which looks for clutter on the site. The definition of distracting/annoying resources is missing but assuming this is interstitials, video players, scripts and so on.
LLM Content
From the QualityNsrPQData , we have an intresting attribute
- contentEffort – LLM-based effort estimation for article pages
- rhubarb – Site-URL delta signals based quality score computed in Goldmine via the Rhubarb model
- unversionedRhubarb – of the type number is the delta score of the URL-level quality predictor.
This is particularly quirky because it is hard to predict whether the contentEffort is a score to check if the page is generated via LLM ? or more likley a LLM-based processor that scores how much effort was required to create the page.
LocalWWWInfo model
The CompositeDoc includes a model LocalWWWInfo. This includes attributes like address, brickAndMortarStrength, geotopicality, isLargeChain, isLargeLocalwwwinfo, siteSiblings.
LocalWWWInfo Considerations
This implies that Google considers the following
- Do you have a local business address and Phone number
- brickAndMortarStrength is a float so some level of scoring is involved
- geotopicality measures information of the geo location and is independent of the business or URL in question
- isLargeChain is a boolean ( 1 / 0 ) and checks whether the business is a part of a large chain
- isLargeLocalwwwinfo also a boolean checks whether the website is a large website or not
- siteSiblings is of the type integer and very likely checks whether the site has other sibling sites. The notes say that these signals are per-document signals independent of any particular address.
Finally a WrapptorItem keeps the address footprint. This includes an address, phone number and the business name. This essentially validates the NAP.