This comes from the academic community, but I'm sure the engines are already up to speed (maybe not MSN ;)).
"The volunteers were given these guidelines and asked to classify a set of hosts as normal, spam or borderline..."
Web Spam Collection
http://aeserver.dis.uniroma1.it/webspam/info/
Here is the meat...
http://aeserver.dis.uniroma1.it/webspam/info/