Noobies Guide on How to Scrape: Part 5 – A Basic Scraper
Thursday, June 11, 2009 20:04
Now that we are up to speed on the data we want to collect and how cURL works, a basic scraper is really just a hop, skip, and a jump away.
Getting Data
The only point we haven't covered yet is how to effectively pull data from our page. For example, say we want to grab the value of a link on a page. It'd look something like this:
<a href="this-is-a-link.com"></a>
How do we easily extract the link value (this-is-a-link.com)?
For such a feat I wrote my own function long ago:
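function getValue($item, $query, $end){
    //jump to the first (case-insensitive) match of $query
    $item = stristr($item, $query);
    if($item === false){ return ''; }
    //drop $query itself so $item starts at the value
    $item = substr($item, strlen($query));
    //find where the value ends
    $stop = stripos($item, $end);
    if($stop === false){ return ''; }
    return substr($item, 0, $stop);
}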
It works pretty simply:
- $item : The data we want to get our value from
- $query: The character(s) immediately in front of the value we want
- $end : The character(s) immediately behind the value we want
The function returns the value in between the first $query and the first $end that comes after that $query inside $item (did you catch all of that?). The matching is case-insensitive. So for the above example (the one with the link) it would be written like this:
getValue('<a href="this-is-a-link.com"></a>', 'href="', '">');
For the second and third arguments I used more characters than I needed, just for illustrative purposes. Either way, the value this-is-a-link.com would be returned.
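To make that concrete, here is a small sketch you can run from the command line. It repeats getValue() so the snippet stands on its own. The two early returns for a missing delimiter are my own assumption about what you would want to happen, and the shorter delimiters in the second call are just one possible choice.

```php
<?php
// getValue() from above, repeated so this snippet runs on its own.
// The two guards return '' when a delimiter is missing (an assumption).
function getValue($item, $query, $end) {
    $item = stristr($item, $query);
    if ($item === false) { return ''; }
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    if ($stop === false) { return ''; }
    return substr($item, 0, $stop);
}

$html = '<a href="this-is-a-link.com"></a>';

// Longer delimiters, as in the example above
$long = getValue($html, 'href="', '">');
// Shorter delimiters: start after =" and stop at the next double quote
$short = getValue($html, '="', '"');

echo $long . "\n";   // this-is-a-link.com
echo $short . "\n";  // this-is-a-link.com
```

Longer delimiters are usually the safer bet on a real page, since a short one like =" will match the first attribute it finds, whatever that attribute is.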
A Note About Firebug
One last mention: when working with the Firebug plug-in for Firefox, make sure you double check the raw source code. Twice while writing this scraper I ran into values shown inside Firebug that were not right. It was putting quotes around some values that didn't have any quotes around them. For example, in the source code below, at the step that chops off everything before the first search result, Firebug had shown <li class=g> as <li class="g">, which in the scraping world are WAY different.
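Here is a small sketch of why that difference bites, using a made-up snippet of markup. The regular expression at the end is my own suggestion, not something the scraper below uses: it makes the quotes optional so either form matches.

```php
<?php
// Made-up raw source: the attribute value is NOT quoted here
$raw = '<ol><li class=g>first</li><li class=g>second</li></ol>';

// Searching for what Firebug displays finds nothing
var_dump(strpos($raw, '<li class="g">'));  // bool(false)
// Searching for what the raw source really contains works
var_dump(strpos($raw, '<li class=g>'));    // int(4)

// A pattern with optional quotes matches either form
$pattern = '/<li class="?g"?>/i';
$parts = preg_split($pattern, $raw);
// The first element is whatever came before the first match
$count = count($parts) - 1;
echo $count . " results found\n";          // 2 results found
```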
The Scraping Process
This is a pretty basic scraper and I’m going to quickly go over how it works:
- The getValue function comes first.
- Next, the generic and search variables are set up (cookie file, user agent, base URL). To change your search term, change the value of $searchTerm.
- Then all the cURL options are set, and the request is executed with curl_exec.
- Chop off all the code up until our first search result.
- Explode all the results into an array.
- Set up a loop and go through all the results.
- Find out if it's a search result and not a video or image result.
- If it was a normal search result, pull our data.
- Put our data on screen.
- Close cURL.
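If you want to try the parsing steps without hitting Google at all, a sketch like the following runs them against a made-up chunk of markup. The parseResults() helper and the sample HTML are my own inventions for illustration. The markers it searches for are the ones the scraper below uses, and a real results page may well look different.

```php
<?php
// getValue() as described earlier, with guards for missing delimiters
function getValue($item, $query, $end) {
    $item = stristr($item, $query);
    if ($item === false) { return ''; }
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    if ($stop === false) { return ''; }
    return substr($item, 0, $stop);
}

// Hypothetical helper: turns raw result-page HTML into an array of results
function parseResults($rawdata, $marker = '<li class=g>') {
    $found = array();
    // Chop off everything before the first result
    $rawdata = strstr($rawdata, $marker);
    if ($rawdata === false) {
        return $found; // no results on this page
    }
    // One chunk per result
    foreach (explode($marker, $rawdata) as $chunk) {
        // Image and video blocks lack the description div, so skip them
        if (strstr($chunk, '<div class="s">') === false) { continue; }
        // Move up to the first link in the entry
        $data = strstr($chunk, '<a');
        if ($data === false) { continue; }
        $found[] = array(
            'url'         => getValue($data, 'href="', '"'),
            'title'       => strip_tags(getValue($data, '>', '</a>')),
            'description' => strip_tags(getValue($data, '<div class="s">', '<br>')),
            'cite'        => strip_tags(getValue($data, '<cite>', '</cite>')),
        );
    }
    return $found;
}

// Made-up sample: one normal result and one image-only result
$sample = '<html><ol>'
    . '<li class=g><a href="http://example.com/fish">Huge <em>Fish</em></a>'
    . '<div class="s">A very big fish.<br><cite>example.com/fish</cite></div></li>'
    . '<li class=g><a href="http://example.com/img"><img src="x.png"></a></li>'
    . '</ol></html>';

$results = parseResults($sample);
foreach ($results as $r) {
    echo $r['url'] . ' | ' . $r['title'] . ' | ' . $r['description'] . "\n";
}
```

Returning an array instead of echoing inside the loop keeps the parsing separate from the output, which helps once you want to store the results somewhere.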
The scraper is only set up to scrape the first page of results from Google. It's heavily commented, so it would be pretty easy to add a little code to make it fetch more pages. Patience, you must walk before you run.
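For when you are ready to run, here is a rough sketch of how the page URLs could be built. It only builds the URLs and does not fetch anything. The buildPageUrls() helper is hypothetical, and it assumes the start parameter at the end of the base URL is a zero-based result offset with 10 results per page, which you should check against the live site.

```php
<?php
// Hypothetical helper: builds one search URL per results page.
// Assumption: "start" is a zero-based offset and a page holds 10 results.
function buildPageUrls($searchTerm, $pages, $perPage = 10) {
    $base = 'http://www.google.com/search?q=%searchterm%&hl=en&start=';
    $base = str_replace('%searchterm%', urlencode($searchTerm), $base);
    $urls = array();
    for ($page = 0; $page < $pages; $page++) {
        // Page 0 starts at result 0, page 1 at result 10, and so on
        $urls[] = $base . ($page * $perPage);
    }
    return $urls;
}

$urls = buildPageUrls('huge fish', 3);
foreach ($urls as $pageUrl) {
    echo $pageUrl . "\n";
}
```

Each URL would then be handed to CURLOPT_URL and curl_exec in turn, reusing the same cURL handle.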
Sourcecode
I didn't spell check any of the comments in this code, so if you find any typos, just ignore them.
<?php
function getValue($item, $query, $end){
    //$item is where we want to search
    //$query is something unique just before the value we want
    //$end is something unique just after the value we want
    //function returns the first value that is in between $query and $end inside $item
    //returns an empty string if $query or $end can't be found
    $item = stristr($item, $query);
    if($item === false){ return ''; }
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    if($stop === false){ return ''; }
    $val = substr($item, 0, $stop);
    return $val;
}
//path to my cookie file
$cookie = '/tmp/cookie.txt';
//my user agent (value only, cURL adds the "User-Agent:" header name itself)
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
//my base url
$url = 'http://www.google.com/search?q=%searchterm%&hl=en&start=';
//my search term
$searchTerm = 'huge fish';
$searchTerm = urlencode($searchTerm);
//create a search url
$searchURL = str_replace('%searchterm%', $searchTerm, $url);
//initialize new cURL resource
$ch = curl_init();
//set our general options
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
//set our URL
curl_setopt($ch, CURLOPT_URL, $searchURL);
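//query the url, get our data
$rawdata = curl_exec($ch);
//stop here if the request failed
if($rawdata === false){
    die('cURL error: ' . curl_error($ch));
}
//find first <li class=g> (no quotes in the raw source), return all data after it
$rawdata = strstr($rawdata, '<li class=g>');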
//explode the rest of data broken up by search result
$results = explode('<li class=g>', $rawdata);
//setup a loop
foreach($results as $value){
    //check to see if it's an actual search entry and not image or video
    if(strstr($value, '<div class="s">')){
        //it is a valid search entry
        //move up to first link in entry
        $data = strstr($value, '<a');
        //get url (kept separate from the base $url above)
        $resultUrl = getValue($data, 'href="', '"');
        //get title
        $title = getValue($data, '>', '</a>');
        //get description
        $description = getValue($data, '<div class="s">', '<br>');
        //get cite
        $cite = getValue($data, '<cite>', '</cite>');
        //Do any other processing HERE
        echo 'URL: ' . $resultUrl . '<br/>';
        echo 'Title: ' . strip_tags($title) . '<br/>';
        echo 'Description: ' . strip_tags($description) . '<br/>';
        echo 'Cite: ' . strip_tags($cite) . '<br/><br/>';
    }
    //if it's not a valid search entry or we are just done with the entry, will move on to the next until we have them all
}
curl_close($ch);
?>
I think this section is the most complex yet, so if you have questions, feel free to leave them in the comments. Now go make monies.






















