It has been a lot of work since I first began this project immediately following Penguin 2.0. I am sure many of you have missed my labor of love, the Open Penguin Data project. The project aimed to pull together industry data sources and build a model for making a "reasonable assessment of a URL's risk of being penalized by Penguin 2.0".
The process was straightforward. We used rankings data sets to build a list of URLs caught by Penguin 2.0, aggregated a series of features that we believed might be useful in spotting link manipulation, and finally built a classification model.
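To make the first step concrete, here is a minimal sketch of how a list of affected URLs might be pulled from rankings data. It is only an illustration: the snapshot format, the `label_penguin_candidates` function, and the 10-position drop threshold are all assumptions, not the rules the project actually used.

```python
# Hypothetical rankings snapshots taken before and after the update.
# Keys are (keyword, url) pairs; values are ranking positions (1 = top).
before = {
    ("blue widgets", "example.com/widgets"): 3,
    ("cheap widgets", "example.org/shop"): 5,
    ("widget reviews", "example.net/reviews"): 8,
}
after = {
    ("blue widgets", "example.com/widgets"): 41,   # large drop
    ("cheap widgets", "example.org/shop"): 6,      # normal fluctuation
    # example.net/reviews no longer appears in the tracked results
}


def label_penguin_candidates(before, after, min_drop=10):
    """Return URLs that fell at least `min_drop` positions for any keyword.

    `min_drop` is an assumed, illustrative threshold.
    """
    affected = set()
    for (keyword, url), old_pos in before.items():
        new_pos = after.get((keyword, url))
        # A URL missing from the later snapshot is treated as a drop too.
        if new_pos is None or new_pos - old_pos >= min_drop:
            affected.add(url)
    return affected


candidates = label_penguin_candidates(before, after)
print(sorted(candidates))
```

In practice you would want to compare several snapshots on either side of the update date so that ordinary ranking churn is not mistaken for a penalty.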
The list of features used to predict Penguin 2.0 victims can be found here, but they essentially boiled down to the following independent variables:
- Trust and Citation Flow and Link Velocity from MajesticSEO
- MozRank, MozTrust and Authority metrics from Moz
- Link type analysis by Virante
- and commercial anchor text metrics from GrepWords
The end result has been quite stunning. Using a gradient boosting model with scikit-learn, we are now able to classify Penguin-affected sites correctly 94% of the time. We built the initial data set with AuthorityLabs and have cross-validated against the STAT Search Analytics codex. Our final step will be to aggregate penalty data via SerpMetrics over the period of a week and see if the model is effective there. If it is not effective, we will have strong evidence that the model is specific to Penguin penalties. If it is effective, the model more generally represents the likelihood of penalization, not Penguin in particular. This is not necessarily a bad thing, but it will require more testing if that is the case.
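For anyone who wants to try a similar model, here is a rough sketch of training and checking a gradient boosting classifier with scikit-learn. It runs on synthetic data from `make_classification` as a stand-in for the real feature table, and the default hyperparameters are an assumption, so it will not reproduce our 94% figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the feature table: each row is a URL, each column a
# link metric, and y marks whether the URL was hit (1) or not (0).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)

# Hold back a quarter of the rows, loosely mirroring validation against a
# second, independent data source.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Default settings are assumed here; the project's real parameters may differ.
model = GradientBoostingClassifier(random_state=0)

# 5-fold cross-validated accuracy on the training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit on all training rows and score on the held-out rows.
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)

print("CV accuracy per fold:", cv_scores.round(3))
print("Hold-out accuracy:", round(holdout_accuracy, 3))
```

Swapping the synthetic `X` and `y` for the downloaded project data is the obvious next step, and `model.feature_importances_` is a quick way to see which metrics the model leans on.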
The best thing about the Open Penguin Data project is that anyone can contribute, and anyone can download the data and try to build a better model. The data is completely open, and that is a great credit to the providers at Majestic, Moz, AuthorityLabs, GrepWords, STAT and SerpMetrics, who helped this project come together.