SERP Scraping for Fun and Profit Case Study: Facebook

by rishil on March 30, 2011

Learn About Scraping

So a little while ago I pointed out that Facebook is running a massive Grayhat Strategy to Rank for Longtail in the SERPs, essentially carrying out SERP Sniffing on an insanely large scale, with a view to potentially building a community-driven Content Farm. Some really interesting questions popped into my mind:

  • How many keywords are they ranking for?
  • How well are they ranking?
  • What sort of traffic volumes would they get?
  • What would the commercial value of that traffic be?

Not all of these are easy to answer, and may or may not be of interest. So I expanded my questions with:

  • How can I check their Rankings?

So I had a rough answer. Scrape and Rank. In reality, this would not be possible without the Dark Art of Scraping Google Search Results.

Which kind of led me to think that this is a pretty simple, yet good, method of referencing rankings. However, scraping need not be bad. Like any SEO technique, it can be used for both good and evil.

  • How can I use that technique for blackhat purposes?
  • How can I use that technique for whitehat purposes?

So I have figured a way to answer SOME of my questions. I then asked myself:

Can I create a simple and rough version of my technique so that ANYONE can use and verify?

The answer was, yes, to an extent. The recipe would include:

  1. A Scraping Script,
  2. A Rank Checker (see the sketch after this list),
  3. A Spreadsheet,
  4. A little time.
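
For the curious, here is a rough sketch of what the rank checker ingredient does (this is not any of the tools linked later in the post; the //h3/a markup lookup, the user agent and the sample keyword are assumptions, and Google throttles automated queries quickly):

<?php
// Rough sketch of a "rank checker": for one keyword, find the first
// position at which a domain appears in Google's top 100 results.
// Not a production tool; the //h3/a markup lookup is an assumption.
function check_rank($keyword, $domain) {
    $url = 'http://www.google.com/search?q=' . urlencode($keyword) . '&num=100';
    $context = stream_context_create(array(
        'http' => array('header' => "User-Agent: Mozilla/5.0 (compatible; rank-sketch)\r\n"),
    ));
    // Google's HTML is not valid XML, so suppress the parser warnings.
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url, false, $context));
    $xpath = new DOMXPath($doc);

    $position = 0;
    foreach ($xpath->query('//h3/a') as $link) {   // markup assumption
        $position++;
        if (strpos($link->getAttribute('href'), $domain) !== false) {
            return $position;
        }
    }
    return 0; // not found in the first 100 results
}

echo check_rank('singing', 'facebook.com');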

Summary: In the post below I demonstrate

  • The potential extent to which Facebook is gaming the SERPs with crappy content
  • The ability to use Scraping for both white and black SEO techniques

The Technique:

Scraping Assbook

Stage 1: Scrape the Google results for the site: query that surfaces Facebook's Community Pages (the query and scripts are covered in The Scraper section below).

Stage 2:

  • Grab the keywords and rank check them on google.com
  • Compile and sort everything by position (see the sketch after this list):
    • Pos 1
    • Pos 2-5
    • Pos 6-10
    • Pos 11-20 (page 2)
    • Pos 21-50
    • Pos 51-100
    • Pos 101-200
    • Not ranked
  • Plot as a pie chart, and analyse.
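
If you want to reproduce that compile-and-sort step yourself, here is a rough PHP sketch (this is not the script linked later in the post; the keywords and positions in it are made-up sample data):

<?php
// Rough sketch of the compile-and-sort step: bucket rank-checker output
// into the position bands above and print a percentage breakdown.
// $positions maps keyword => position, with 0 meaning "not in the top 200".
// The sample data is made up for illustration.
$positions = array(
    'singing'            => 1,
    'example keyphrase'  => 14,
    'another keyphrase'  => 73,
    'unranked keyphrase' => 0,
);

function band($pos) {
    if ($pos <= 0)    return 'Not ranked';
    if ($pos === 1)   return 'Pos 1';
    if ($pos <= 5)    return 'Pos 2-5';
    if ($pos <= 10)   return 'Pos 6-10';
    if ($pos <= 20)   return 'Pos 11-20 (page 2)';
    if ($pos <= 50)   return 'Pos 21-50';
    if ($pos <= 100)  return 'Pos 51-100';
    if ($pos <= 200)  return 'Pos 101-200';
    return 'Not ranked';
}

$counts = array();
foreach ($positions as $keyword => $pos) {
    $b = band($pos);
    $counts[$b] = isset($counts[$b]) ? $counts[$b] + 1 : 1;
}

// Percentages to feed into the pie chart.
$total = count($positions);
foreach ($counts as $b => $n) {
    printf("%-20s %d (%.1f%%)\n", $b, $n, 100 * $n / $total);
}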

The Results

I scraped the first 600 results of the query, with the help of the awesome Richard Shove. I had to strip 8 keywords for non-eligibility / repetition. Also, when ranking, I only considered positions up to the first 200 results, and I defined a good-to-average rank as anything in the top 50.

What did I find? The ranking data makes good reading, even for such a small dataset:

Google SERP Rankings for Facebook

The data shows that of the keywords I selected, only 25% weren’t ranked in the top 200 positions. Of the same data set, 17% ranked on page 1 of the SERPs.

Theoretical Extrapolation

Now let’s look at that in theoretical terms. The query I used to generate this list indicates 119,000,000 results found. (That is one hundred and nineteen MILLION results!)

Site Query Equals 119 000 000 Results

This is nowhere near the real number indexed. Why do I say that? Well, you will get a different indexed count depending on how quickly Google serves up the data to you… so refreshing the search gives me a different number indexed.

Site Query Equals 219 000 000 Results

Let’s just say, for simplicity’s sake, we have 100,000,000 pages indexed. If the data set above forms a working sample, then you would expect 75% of these pages to rank somewhere in the top 200 for their keywords, and 17% of the same dataset to rank on page one.

BIG picture? If (and note I say IF) my hypothesis holds, then the community pages that rank on page one number in the region of 17,000,000. That is seventeen MILLION, folks. What traffic would you hope to generate with 17 million results on page 1 in Google?

Not finished yet. The three most common variations on these pages’ title tags include:

  • Community
  • Interest
  • Topic

Now cross-reference any keywords that have these extensions added in order to make a mid-range keyphrase. I can only guess at the volume of page one rankings for those keyphrase variations.

At this juncture, I must point out that my friend Branko, who is an awesome SEO Scientist, highlighted the fact that my methodology needs verification:

If I was you, I would take 2 more random sets of 600 queries and see if the distribution is similar. That way you can get an idea of how much fluctuations you have and how solid your data is. If you want to take it to the next level, there are statistical tests that can tell you whether your sample is representative of the population.

Now I was too lazy to do that, to be honest. However, I am sure those of you who are better at analysing and manipulating data may be tempted. If so, let me know!

Singing in the SERPs, I am Singing in the SERPs…

Now this is a massive, massive invasion of the SERPs. If I had the full X Million KW list to hand, I would love to have dug through it. But even my unscientific approach turned up what we British would call “corkers”. Danny Sullivan taught me a neat trick years ago: compare a high-volume keyword against another to get an indication of popularity. My list indicates that Facebook ranks well for “Singing”. Now let’s use Danny’s technique:

Singing vs Viagra

How cool is that? Singing gets more searches than Viagra, and here is a scraper page auto-built by Facebook ranking for it. I include this because it indicates the potential volume of traffic that can be had by a large-scale site that gets involved in gray SEO techniques.

SERP Scraping for Whitehat Purposes

Whitehat Scraping

When working on large-scale SEO projects, for example ecommerce SEO, competitive analysis is key to a successful long-to-mid-range strategy. Often we as SEOs tend to home in on the “money” words and forget that the long tail not only exists, but is highly profitable.

I am not saying that we DON’T target long tail keywords, but I don’t think we competitively analyse this data.

So how do you use SERP Scraping for competitive analysis? Do I really have to tell you? :P

  • First off, scrape all the pages indexed for the competitor in question.
  • Second, most common ecommerce SEO set-ups use keyword splitters in the title tag (pipes, commas, arrows, etc.), so use that knowledge to pull keywords out of your scraped data (see the sketch after this list).
  • Third, run a rank check on the full list.
  • Fourth, compare against your own data. Where are the gaps?
  • Profit.
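
As a rough illustration of the splitter step (this is just a sketch; the titles and splitter characters below are assumptions, not data from any real site):

<?php
// Sketch of pulling keyword candidates out of scraped title tags by
// breaking them on common ecommerce splitters. Titles and splitters
// here are illustrative assumptions.
$titles = array(
    'Red Widgets | Buy Widgets Online | ExampleStore',
    'Blue Widgets, Cheap Widgets - ExampleStore',
);

$splitters = array('|', ',', ' - ', '>');

$keywords = array();
foreach ($titles as $title) {
    // Normalise every splitter to a pipe, then break the title apart.
    $parts = explode('|', str_replace($splitters, '|', $title));
    foreach ($parts as $part) {
        $part = strtolower(trim($part));
        if ($part !== '') {
            $keywords[$part] = true; // keyed to de-duplicate
        }
    }
}

// This de-duplicated list is what you would feed into the rank checker.
print_r(array_keys($keywords));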

SERP Scraping for Blackhat Purposes

Blackhat Scraping

Again, this is quite a simple use of that SERP rank data.

  • Find a number of large sized sites with average or poor SEO
  • Scrape all the pages indexed.
  • You can use title tags, headings, etc to compile keywords from your scraped data.
  • Third, run a rank check on the full list.
  • Fourth, sort the data. Start with one-word and two-word phrases, and continue sequencing till you get to 4-5 word combinations (see the sketch after this list).
  • Now you have an insanely large keyword list with rankings – sorted into keyword tails.
  • I would then further sort these into “common” themes, e.g. all car- and auto-related words/phrases together…
  • Use an autoblog tool such as WPRobot to create thematic microsites which automatically pump out content based around your keyword sets. Add links where necessary.
  • Profit?!
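
A crude sketch of that sorting step, grouping a scraped keyword list into tails by word count (the keywords below are made up for illustration):

<?php
// Crude sketch of the sorting step: group keywords into tails by how
// many words each phrase contains. Sample keywords are made up.
$keywords = array(
    'cars',
    'used cars',
    'cheap used cars london',
    'how to sell a used car privately',
);

$tails = array();
foreach ($keywords as $kw) {
    $words = str_word_count($kw);
    $bucket = ($words >= 5) ? '5+ word phrases' : $words . ' word phrases';
    $tails[$bucket][] = $kw;
}

ksort($tails);
print_r($tails);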

A serious Blackhat would probably know how to use Scraped SERPs for much, much more than what I suggest above. However, I have chosen to demonstrate the lowest common denominator :) After all, this is a similar technique to the one I would say Mahalo used.

The Scraper

Free Scraper

So there are a bunch of tools you can use to scrape, but to show you how easy it can be, see the link to the Google Scraping Script. (A big thank you to Dan Harrison of WordPress Doctors and William Vicary of Semto.)

In case you think this is the only way to do this, here is another variant of the scraping script (thanks to Yousaf Sekander of Elevate Local).
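
To give a flavour of what such a script does, here is a bare-bones sketch (this is neither Dan's nor Yousaf's script; the //h3/a result-link lookup, query and user agent are assumptions, Google changes its markup regularly, and heavy automated querying is against its terms):

<?php
// Bare-bones SERP scraping sketch: fetch one page of Google results and
// write title,url pairs to a CSV. This is NOT the linked script.
$query = urlencode('site:facebook.com/ "Community Pages are not affiliated with"');
$url   = 'http://www.google.com/search?q=' . $query . '&num=100';

$context = stream_context_create(array(
    'http' => array('header' => "User-Agent: Mozilla/5.0 (compatible; serp-sketch)\r\n"),
));
$html = file_get_contents($url, false, $context);

// Google's HTML is not valid XML, so suppress the parser warnings.
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$fh = fopen('results.csv', 'w');
foreach ($xpath->query('//h3/a') as $link) {   // markup assumption
    fputcsv($fh, array(trim($link->nodeValue), $link->getAttribute('href')));
}
fclose($fh);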

This is just a limited version of the scraper and will take ages to pull out industrial strength data. If you are looking for something much more robust, try:

The Data

Free Data

I have no doubt that I have probably made a few mistakes. So here is the data – use it as you will.

http://bit.ly/FacebookScrapingData


Rishi Lakhani is an independent Online Marketing Consultant specialising in SEO, PPC, Affiliate Marketing and Social Media. Explicitly.Me is his Blog. Google Profile

{ 1 trackback }

Reference.com has millions of junk pages in Google | Luke's blog
April 1, 2011 at 2:26 pm

{ 18 comments… read them below or add one }

Anchor text- Taken Off March 30, 2011 at 5:37 pm

Some good data, and very nice scraping tips and links – however, with the Facebook keyword data, a huge amount of those keywords are the sort of terms people would never search for. This is especially true with the keywords which rank highly – for example, the first ten terms in the spreadsheet I think are terms people would never search for – as you get further towards the bottom of the list, where Facebook ranks more like Position 100+ or not at all, the much more generic keywords start to appear.

So while the data is interesting and has some purpose, it’d be nice to see the results with pure-generic keywords, unless I’m missing something obvious to do with the keyword list.


rishil March 30, 2011 at 6:12 pm

Even at 1 search a month per KW, they are delivering over 10 million organic visits…


Anchor text- Taken Off March 30, 2011 at 6:21 pm

I realise that, but traffic =/= keyword positions. I feel the graph is a bit inflated; take your “17% ranked on page 1 of the SERPs” – not having a dig, but they would do when the majority of that 17% are terms like “How do I delete a company fan page I created” and “om sri muneeswaran” – the data and article is good, but I think it’d be more accurate if terms like that were removed so we only saw a ‘true’ generic keyword list.


rishil March 30, 2011 at 8:07 pm

I am a lazy researcher to be honest. And this is a completely small snapshot… However anyone with industrial-strength rank-and-scrape tools can check to see if it’s possible…


Keydaq March 30, 2011 at 7:42 pm

Nice article Rishi.

I second that. 1 click makes it worthwhile to spin a page. You also wouldn’t believe what people search for and all the typo variations you can expect to see.

I’ve been having some good fun on my site with exactly this kind of behaviour and some even nastier referrer parsing automation. Even with really questionable thin content and few backlinks, you can still rank highly enough to generate an income stream.


Nico Roddz March 30, 2011 at 5:52 pm

Awesome post. I just installed scrappy on my server to perform some competitive analysis.

Thanks for sharing this valuable knowledge!


Kieran Flanagan March 30, 2011 at 6:57 pm

really awesome post, really awesome !!
I do agree that a lot of the keywords Facebook are ranking for wouldn’t seem to be those carrying search volume. Do you think Facebook are purposely building a Content Farm ?
I have used WPRobot for microsites/autoblogging to fill in the gaps like above, it can be a little soul destroying as you know you are filling up the net with crap, so I stopped all that a year or so ago.
The grey hat use is good. Given me a white hat idea :)


Alan Bleiweiss March 31, 2011 at 9:10 am

It really pisses me off that Facebook gets away with this crap. Their “list of cities by GDP” page (position 4 in Google) is a scrape of the Wikipedia page of the same phrase. Except it’s a crappy scrape that doesn’t actually grab the spreadsheet listing the cities. Just the plain text.

If I had my way, Facebook would be Google Slapped faster than you could say WTF. But the reality is Google doesn’t care one whit.

By the way Rishi – I REALLY have been enjoying your articles lately!


Liam - Zaddle Marketing March 31, 2011 at 9:15 am

You make my brain hurt :)


Jey Pandian March 31, 2011 at 6:14 pm

Amen. I’ve hella been enjoying your articles of late too. Definitely a value add and prioritized in my list of emails to read.


andymurd April 1, 2011 at 3:09 am

Nice one, Rishi. And special thanks for making the scripts and spreadsheet available – the data shows just how little Facebook care about the quality of content on their site.

Now can you figure out how to scrape Google Insights for Search, please?


Gary April 1, 2011 at 10:31 am

Great article and thanks for sharing. I’m trying your script and can’t seem to get it working.

I’ve changed the $query = urlencode("site%with search query here");

I get empty results

Any help much appreciated

Thanks


rishil April 1, 2011 at 10:39 am

Have you tried both? If so, hit up the devs I linked to, sure they will help…


Gary April 1, 2011 at 11:40 am

This is what I’ve changed

FROM:
$query = urlencode(ENTER QUERY HERE);

TO:
$query = urlencode("site%3Afacebook.com%2F+"Community+Pages+are+not+affiliated+with%2C+or+endorsed+by%2C+anyone+associated+with+the+topic");

Haven’t changed anything else in the script. My csv file just gets

title URL description in 1 column

Thanks for your help

Just added your RSS feed to my reader, useful blog


Gary April 4, 2011 at 6:17 am

Any thoughts on that script Rishal?

Many thanks
Gary


WordPress Doctors April 5, 2011 at 8:29 am

Hi Gary

Where you have:
$query = urlencode(ENTER QUERY HERE);

“ENTER QUERY HERE” should be a normal Google Query. e.g. site:example.com test

Having the % and +, etc will definitely break the script, since urlencode() is designed to put those characters into the string for you. By putting in the % and + yourself, urlencode() will completely break that query.

So just put a normal human-readable query into “ENTER QUERY HERE”, and try again.

Kind Regards
Dan
(one of the devs who worked on the above script).


Yousaf Sekander April 5, 2011 at 8:31 am

Hi Gary,

You are getting the syntax wrong, i.e. you have an unescaped quote -> “Community


Gary April 7, 2011 at 11:22 am

Dan, you’re a legend. I’m sure I tried that, but I probably messed about with the script too much. After downloading the script once more and simply using the above line as an example (stupid me), it worked a treat.

Thanks again for sharing this.


