Penguin Predictor at 94% via Open Penguin Data Project


It has been a lot of work since I first began this project immediately following Penguin 2.0. I am sure many of you have missed my labor of love, the Open Penguin Data project. The project aimed to pull together industry data sources and build a model for making a "reasonable assessment of a URL's risk of being penalized by Penguin 2.0". 

The process was straightforward. We used rankings data sets to build a list of URLs caught by Penguin 2.0, aggregated a series of features that we believed might be useful in spotting link manipulation, and finally built a classification model.

The list of features used to predict Penguin 2.0 victims can be found here, but they essentially boiled down to the following independent variables:

  • Trust and Citation Flow and Link Velocity from MajesticSEO
  • MozRank, MozTrust and Authority metrics from Moz
  • Link type analysis by Virante
  • and commercial anchor text metrics from GrepWords
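As a rough illustration of the aggregation step, the sketch below joins per-provider metrics by URL and attaches a 0/1 label from the list of URLs caught by Penguin 2.0. The function name, the provider keys, the metric names and every number are invented for the example; they are not the project's actual schema or data.

```python
def build_feature_rows(provider_metrics, penalized_urls):
    """Join per-provider metrics by URL and attach a 0/1 Penguin label.

    provider_metrics: {provider: {url: {metric_name: value}}} (hypothetical layout)
    penalized_urls:   set of URLs identified as caught by Penguin 2.0
    """
    # Keep only URLs that every provider has data for, so no row has gaps.
    common_urls = set.intersection(*(set(m) for m in provider_metrics.values()))

    rows = []
    for url in sorted(common_urls):
        row = {"url": url}
        for provider, metrics in provider_metrics.items():
            for name, value in metrics[url].items():
                # Prefix each metric with its provider to avoid name clashes.
                row[f"{provider}_{name}"] = value
        row["penalized"] = int(url in penalized_urls)
        rows.append(row)
    return rows


# Invented sample data, for illustration only.
provider_metrics = {
    "majestic": {
        "a.example/": {"trust_flow": 12, "citation_flow": 40},
        "b.example/": {"trust_flow": 35, "citation_flow": 38},
        "c.example/": {"trust_flow": 20, "citation_flow": 22},
    },
    "moz": {
        "a.example/": {"mozrank": 3.1, "moztrust": 2.4},
        "b.example/": {"mozrank": 4.7, "moztrust": 5.0},
        "c.example/": {"mozrank": 3.9, "moztrust": 4.1},
    },
    "grepwords": {
        # c.example/ is missing here, so it is dropped from the joined rows.
        "a.example/": {"commercial_anchor_ratio": 0.8},
        "b.example/": {"commercial_anchor_ratio": 0.1},
    },
}
penalized_urls = {"a.example/"}

rows = build_feature_rows(provider_metrics, penalized_urls)
for row in rows:
    print(row)
```

Dropping URLs that any provider lacks is only one possible choice; imputing missing values would keep more rows at the cost of noisier features.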

The end result has been quite stunning. Using a gradient boosting model with scikit-learn, we are now able to classify Penguin-affected sites correctly 94% of the time. We built the initial data set with AuthorityLabs and have cross-validated against the STAT Search Analytics codex. Our final step will be to aggregate penalty data via SerpMetrics over the period of a week and see if the model is effective there. If it is not effective, we will have strong evidence that the model is specific to Penguin penalties. If it is effective, the model more broadly represents the likelihood of penalization in general and is not specific to Penguin. This is not necessarily a bad thing, but it will require more testing if that is the case.
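For readers who want to try something similar, here is a minimal sketch of training a gradient boosting classifier with scikit-learn and checking it against a second, independently drawn data set. It is not the project's actual code: the feature names are hypothetical stand-ins for the provider metrics, the data is synthetic, and the rule that generates the labels is invented, so the accuracy it prints says nothing about the 94% figure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature names standing in for the provider metrics.
FEATURES = [
    "trust_flow", "citation_flow", "link_velocity",
    "mozrank", "moztrust", "commercial_anchor_ratio",
]


def make_synthetic_urls(n_urls, seed):
    """Generate fake feature rows and 0/1 'penalized' labels (invented rule)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_urls, len(FEATURES)))
    # Invented rule: heavy commercial anchor text plus low trust means penalized.
    score = X[:, 5] - X[:, 0] + rng.normal(0.0, 0.15, n_urls)
    y = (score > 0).astype(int)
    return X, y


# One set to build the model, a second set to play the role of an
# independent rankings source used for validation.
X_train, y_train = make_synthetic_urls(600, seed=0)
X_holdout, y_holdout = make_synthetic_urls(300, seed=1)

model = GradientBoostingClassifier(random_state=0)

# 5-fold cross-validation on the training set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit on all training rows, then score on the independent set.
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_holdout, y_holdout)

print("cross-validated accuracy:", cv_scores.mean())
print("holdout accuracy:", holdout_accuracy)
```

Scoring on data from a separate source is what guards against a model that only memorizes quirks of the set it was built on.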

The best thing about the Open Penguin Data project is that anyone can contribute, and anyone can download the data and try to build a better model. The data is completely open, and that reflects very well on the providers at Majestic, Moz, AuthorityLabs, GrepWords, STAT and SerpMetrics, who helped this project come together.


We haven't missed the labor

We haven't missed the labor of your love btw ;)

Need more visibility?

Small businesses desperately need guidance in what to do. Ask ten people and you're likely to get ten strategies for dealing with penalties - or knowing what to do if you think you may end up penalized. This needs more visibility outside the SEO world where those who need you can find out about it. Case studies, guest posts - I'm volunteering to assist in that.

Exactly GrowMap! It is not so

Exactly, GrowMap! It is not so much identifying if you are hit by Penguin, but what needs to happen next to get your website out of it.
