Noobies Guide on How to Scrape: Part 5 – A Basic Scraper

Thursday, June 11, 2009 20:04
Posted in category Noobie Scraping Guide

Now that we are up to speed on the data we want to collect and on how cURL works, a basic scraper is really just a hop, skip, and a jump away.

Getting Data

The only point we haven't covered yet is how to pull data out of our page effectively. For example, say we want to grab the value of a link on a page. The link would look something like this:

<a href="this-is-a-link.com"></a>

How do we easily extract the link value (this-is-a-link.com)?

For exactly this job I wrote my own function long ago:


function getValue($item, $query, $end){

    $item = stristr($item, $query);
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    $val = substr($item, 0, $stop);
    return $val;
}

It works pretty simply:

  • $item : The data we want to get our value from
  • $query:  The character(s) immediately in front of the value we want
  • $end : The character(s) immediately behind the value we want

The function returns the value between the first $query and the first $end that follows it inside $item (did you catch all of that?). So the link example above would be written like this:


getValue('<a href="this-is-a-link.com"></a>', '<a href="', '">');

For the second and third arguments I used more characters than I needed, just for illustrative purposes. Either way, the value this-is-a-link.com is returned.
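Here is a minimal sketch of that call with longer and shorter delimiters, to show that any text unique to the spot just before and just after the value will do. The function body is the one from this post; the sample strings and the variable names $long, $short, and $upper are made up for illustration.

```php
<?php

// Same helper as above: returns the text between the first $query
// and the first $end that follows it (matching is case-insensitive).
function getValue($item, $query, $end){
    $item = stristr($item, $query);
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    $val = substr($item, 0, $stop);
    return $val;
}

$html = '<a href="this-is-a-link.com"></a>';

// Longer delimiters than strictly needed
$long = getValue($html, '<a href="', '">');
// The shortest delimiters that still pin down the value
$short = getValue($html, 'href="', '"');
// Upper-case markup still matches because of stristr/stripos
$upper = getValue('<A HREF="this-is-a-link.com"></A>', 'href="', '"');

echo $long . "\n";  // this-is-a-link.com
echo $short . "\n"; // this-is-a-link.com
echo $upper . "\n"; // this-is-a-link.com
```

The shorter the delimiters, the more likely they are to match somewhere earlier in a real page, so on real markup it usually pays to keep $query a little longer than the bare minimum.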

A Note About Firebug

One last note: when working with the Firebug plug-in for Firefox, make sure you double-check against the actual page source. Twice while writing this scraper I ran into values shown inside Firebug that were not right. Firebug was putting quotes around values that have no quotes in the source. For example, for line 47 of the source code below, it showed <li class=g> as <li class="g">, which in the scraping world are WAY different.
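A small sketch of why this matters. The $source string is a made-up stand-in for raw page source with an unquoted attribute (not real Google output), and the two markers are the variants mentioned above.

```php
<?php

// Hypothetical raw source: the class attribute is NOT quoted
$source = '<ol><li class=g>first result<li class=g>second result</ol>';

// Marker as a DOM inspector displays it (quotes added)
$fromFirebug = strstr($source, '<li class="g">');
// Marker as it appears in the raw source (View Source)
$fromSource = strstr($source, '<li class=g>');

var_dump($fromFirebug === false); // bool(true): the quoted marker never matches
echo $fromSource . "\n";          // everything from the first <li class=g> onward
```

Since strstr compares plain characters, a marker that differs by even one quote simply fails to match, and every step after it ends up working on empty data.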

The Scraping Process

This is a pretty basic scraper and I’m going to quickly go over how it works:

  • lines 3-13: The getValue function.
  • lines 16-25: Setting up generic and search variables.  Notice that to change your search term you change the value on line 22.
  • lines 28-44:  Setting all the cURL values, and executing the request on line 44.
  • line 47: Chop off all the code up until our first search result.
  • line 50:  Explode all the results into an array.
  • line 53:  Set up and loop through all the results.
  • line 55:  Find out if it's a search result and not a video or image result.
  • lines 56-68:  If it was a normal search result, pull our data.
  • lines 71-74:  Put our data on screen.
  • line 80: Close cURL.
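The chop, explode, and filter steps (lines 47 to 55) can be tried offline on a canned string before pointing anything at a live page. This is only a sketch on made-up markup: $sample and its contents are assumptions for illustration, not real Google output.

```php
<?php

// Made-up stand-in for a raw page: header junk, one image block, two normal results
$sample = '<html><body>header junk'
    . '<li class=g><a href="http://img.example/">Images</a>'
    . '<li class=g><a href="http://one.example/">One</a><div class="s">First<br></div>'
    . '<li class=g><a href="http://two.example/">Two</a><div class="s">Second<br></div>';

// Chop off everything before the first result marker
$rest = strstr($sample, '<li class=g>');

// Split on the marker; the first element is an empty string because
// $rest begins with the marker itself
$chunks = explode('<li class=g>', $rest);

// Keep only the chunks that look like normal search results
$normal = array();
foreach ($chunks as $chunk) {
    if (strstr($chunk, '<div class="s">')) {
        $normal[] = $chunk;
    }
}

echo count($chunks) . "\n"; // 4: one empty string plus three chunks
echo count($normal) . "\n"; // 2: the empty string and the image block are skipped
```

The same check that skips image and video blocks also throws away the empty first element, which is why the scraper needs no special case for it.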

It's only set up to scrape the first page of results from Google. It's well commented, so it would be pretty easy to add a little code to make it fetch more pages. Patience, you must walk before you run.
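As a rough sketch of where that extra code could start, the snippet below only builds the URLs for several pages and makes no requests. The helper name buildPageUrls is hypothetical, and it assumes that the start parameter at the end of the base URL is a zero-based result offset with 10 results per page; check that against the live site before relying on it.

```php
<?php

// Hypothetical helper: builds one search URL per results page.
// Assumes 'start' is a zero-based result offset and a page holds 10 results.
function buildPageUrls($term, $pages, $perPage = 10){
    $base = 'http://www.google.com/search?q=%searchterm%&hl=en&start=';
    $base = str_replace('%searchterm%', urlencode($term), $base);

    $urls = array();
    for ($page = 0; $page < $pages; $page++) {
        $urls[] = $base . ($page * $perPage);
    }
    return $urls;
}

foreach (buildPageUrls('huge fish', 3) as $pageUrl) {
    echo $pageUrl . "\n";
}
```

Each URL could then be passed to CURLOPT_URL and curl_exec on the same handle, followed by the same parsing loop, ideally with a pause between requests.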

Sourcecode

I didn't spell check any of the comments in this code, so if you find any typos just ignore them.

<?php

function getValue($item, $query, $end){
    //$item is where we want to search
    //$query is something unique just before the value we want
    //$end is something unique just after the value we want
    //function returns the first value that is in between $query and $end inside $item
    $item = stristr($item, $query);
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    $val = substr($item, 0, $stop);
    return $val;
}

//path to my cookie file
$cookie = '/tmp/cookie.txt';
//my user agent
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
//my base url
$url = 'http://www.google.com/search?q=%searchterm%&hl=en&start=';
//my search term
$searchTerm = 'huge fish';
$searchTerm = urlencode($searchTerm);
//create a search url
$searchURL = str_replace('%searchterm%', $searchTerm, $url);

//initialize new cURL resource
$ch = curl_init();
//set our general options
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

//set our URL
curl_setopt($ch, CURLOPT_URL, $searchURL);
//query the url, get our data
$rawdata = curl_exec($ch);

//find first <li class=g> (no quotes in the raw source) return all data after it
$rawdata = strstr($rawdata, '<li class=g>');

//explode the rest of data broken up by search result
$results = explode('<li class=g>', $rawdata);

//set up a loop
foreach($results as $value){
    //check to see if it's an actual search entry and not image or video
    if(strstr($value, '<div class="s">')){
        //it is a valid search entry

        //move up to first link in entry
        $data = strstr($value, '<a');

        //get url (kept in $link so the base $url above is not overwritten)
        $link = getValue($data, 'href="', '"');
        //get title
        $title = getValue($data, '>', '</a>');
        //get description
        $description = getValue($data, '<div class="s">', '<br>');
        //get cite
        $cite = getValue($data, '<cite>', '</cite>');

        //Do any other processing HERE
        echo 'URL: ' . $link . '<br/>';
        echo 'Title: ' . strip_tags($title) . '<br/>';
        echo 'Description: ' . strip_tags($description) . '<br/>';
        echo 'Cite: ' . strip_tags($cite) . '<br/><br/>';

    }
    //if it's not a valid search entry or we are just done with the entry, will move on to the next until we have them all
}

curl_close($ch);

?>

I think this is the most complex section yet, so if you have questions feel free to leave them in the comments. Now go make monies.



14 Responses to “Noobies Guide on How to Scrape: Part 5 – A Basic Scraper”

  1. Joshua says:

    June 12th, 2009 at 8:52 am

    The issue after diving into deeper pages in Google would be how to work around them temporarily banning IPs for automated programs.

  2. Trevor Nash-Keller says:

    June 12th, 2009 at 3:19 pm

    Nice ongoing tutorial. I love stuff like this!

  3. Brad says:

    June 13th, 2009 at 12:50 am

    @Joshua: My goal is to provide programmers with a solid base for writing scrapers and automated tools. Working around IP bans may get covered at some point, but not in the immediate or foreseeable future.

    I've spent too long learning and actually programming over the years to just give that stuff away. Good programmers are going to know where to go with it.

    Honestly, it would take forever to teach someone from scratch many of the more complex automation and scraping methods, because you really have to understand PHP, HTML, and sometimes JavaScript to put it all together (not to mention how to debug when you get stuck).

    The people actually taking something away from these tutorials need to get off their asses and start coding and messing around and making mistakes. It’s the only way to thoroughly learn the concepts.

  4. secret cash blueprint says:

    June 21st, 2009 at 11:35 pm

    Great post. I really enjoy it when there are screenshots and a step-by-step process. Most people just talk about how to do something and don't actually show you.

    Keep em coming!

  5. Steven says:

    July 5th, 2009 at 3:24 pm

    Hey Brad, I couldn’t find any contact info under about so was wondering if you could shoot me an email. Talk to you soon.

  6. Parody says:

    July 10th, 2009 at 2:12 am

    Awesome man, it worked like a charm. I changed it to do several other websites and it works perfectly! Seriously, you're a champ :P

    Now all I need to do is figure out how to automate this to do whole websites.. :P

    Also, it is possible to grab pictures as well, yeah?

  7. justin says:

    August 1st, 2009 at 5:17 am

    Nice tutorial, getting over the ip bans is pretty simple.

    1. Write a script using the nice tutorial here that will scrape a huge number of proxy sites and constantly update a database.
    2. Write a script that checks the proxies and removes bad ones from the db.
    3. make your G scraper pull a fresh proxy from your db each time one gets blocked.

    This whole script is pretty easy to write or you can find scripts that have most of the needed functionality already built in. ie. a proxy scraping script and proxy checking script then simply create a db and automate with cron.

  8. Email Marketing Blog says:

    September 15th, 2009 at 3:11 am

    This is quite amazing and I am happy to see that you have shared about the source code for that.

