Noobies Guide on How to Scrape: Part 5 – A Basic Scraper

Thursday, June 11, 2009 20:04
Posted in category Noobie Scraping Guide

Now that we are up to speed on the data we want to collect and how cURL works, a basic scraper is really just a hop, skip, and a jump away.

Getting Data

The only point we haven’t covered is how to effectively pull data out of our page.  For example, say we want to grab the value of a link on a page.  The link would look something like this:

<a href="this-is-a-link.com"></a>

How do we easily pull out the link value (this-is-a-link.com)?

For such a feat I long ago wrote my own function:


function getValue($item, $query, $end){
    $item = stristr($item, $query);
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    $val = substr($item, 0, $stop);
    return $val;
}

It works pretty simply:

  • $item : The data we want to get our value from
  • $query:  The character(s) immediately in front of the value we want
  • $end : The character(s) immediately behind the value we want

The function returns the value in between the first $query and the first $end after that $query inside $item (did you catch all of that?).  So the above example (the one with the link) would be written like this:


getValue('<a href="this-is-a-link.com"></a>', 'href="', '">');

For the second and third arguments I used more characters than I needed, just for illustrative purposes.  Either way, the value this-is-a-link.com would be returned.
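To see it in action, here is a small self-contained sketch.  The function is the same one as above, repeated so the snippet runs on its own; the sample HTML and the variable names ($html, $link, $text, $none) are made up for the demo.

```php
<?php
// Same getValue() as above, repeated so this snippet runs on its own.
function getValue($item, $query, $end) {
    $item = stristr($item, $query);          // cut everything before $query
    $item = substr($item, strlen($query));   // drop $query itself
    $stop = stripos($item, $end);            // find the first $end after it
    return substr($item, 0, $stop);          // keep what sits in between
}

// Made-up sample markup for the demo.
$html = '<a href="this-is-a-link.com">Click me</a>';

$link = getValue($html, 'href="', '"');   // this-is-a-link.com
$text = getValue($html, '">', '</a>');    // Click me
$none = getValue($html, 'src="', '"');    // empty: 'src="' is not in $html

echo $link . "\n" . $text . "\n";
```

Notice the last call: when $query isn't found you just get an empty value back, so check for that before you trust the result.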

A Note About Firebug

One last mention: when working with the Firebug plug-in for Firefox, make sure you double check the actual source code.  Twice while writing this scraper I ran into values shown inside Firebug that were not right.  It was putting quotes around some values that didn’t have any quotes around them.  For example, in the source code below, line 47, it had shown <li class=g> as <li class="g">, which in the scraping world are WAY different.
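A quick sketch of why that matters.  The $raw string here is a made-up stand-in for what cURL hands back, and the regular expression at the end is my own optional safety net, not something the scraper below uses.

```php
<?php
// Made-up stand-in for the raw HTML (note the unquoted attribute).
$raw = '<ol><li class=g><a href="http://example.com/">Example</a></li></ol>';

// Searching for what Firebug displayed finds nothing...
$fromFirebug = strstr($raw, '<li class="g">');   // false
// ...while the string copied from the real page source matches.
$fromSource  = strstr($raw, '<li class=g>');

// Optional safety net: a pattern that accepts the attribute
// with or without quotes, so either form splits the results.
$chunks = preg_split('/<li class="?g"?>/i', $raw);

var_dump($fromFirebug);
echo count($chunks) . " chunks\n";
```

Plain string functions are faster and are what the rest of this guide sticks with; the pattern only helps if you aren't sure which form the site sends.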

The Scraping Process

This is a pretty basic scraper and I’m going to quickly go over how it works:

  • lines 3-13: The getValue function.
  • lines 16-25: Setting up generic and search variables.  Notice that to change your search term you change the value on line 22.
  • lines 28-44:  Setting all the cURL values, and executing it on line 44.
  • line 47: Chop off all the code up until our first search result.
  • line 50:  Explode all the results into an array.
  • line 53:  Setup and loop through all the results.
  • line 55:  Find out if it’s a search result and not a video or image result.
  • lines 56-68:  If it was a normal search result, pull our data.
  • lines 71-74:  Put our data on screen.
  • line 80: close cURL.

It’s only set up to scrape the first page of results from Google.  It’s plenty commented, so it would be pretty easy to add a little code to get more pages.  Patience, you must walk before you run.

Sourcecode

I didn’t spell check any of the comments in this code, so if you find any typos just ignore them.

<?php

function getValue($item, $query, $end){
    //$item is where we want to search
    //$query is something unique just before the value we want
    //$end is the something unique just after the value we want
    //function returns the first value that is inbetween $query and $end inside $item
    $item = stristr($item, $query);
    $item = substr($item, strlen($query));
    $stop = stripos($item, $end);
    $val = substr($item, 0, $stop);
    return $val;
}

//path to my cookie file
$cookie = '/tmp/cookie.txt';
//my user agent
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
//my base url
$url = 'http://www.google.com/search?q=%searchterm%&hl=en&start=';
//my search term
$searchTerm = 'huge fish';
$searchTerm = urlencode($searchTerm);
//create a search url
$searchURL = str_replace('%searchterm%', $searchTerm, $url);

//initialize new cURL resource
$ch = curl_init();
//set our general options
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

//set our URL
curl_setopt($ch, CURLOPT_URL, $searchURL);
//query the url, get our data
$rawdata = curl_exec($ch);

//find first <li class=g> (no quotes in the real source) and keep everything from it on
$rawdata = strstr($rawdata, '<li class=g>');

//explode the rest of data broken up by search result
$results = explode('<li class=g>', $rawdata);

//setup a loop
foreach($results as $value){
    //check to see if it's an actual search entry and not image or video
    if(strstr($value, '<div class="s">')){
        //it is a valid search entry

        //move up to first link in entry
        $data = strstr($value, '<a');

        //get url
        $url = getValue($data, 'href="', '"');
        //get title
        $title = getValue($data, '>', '</a>');
        //get description
        $description = getValue($data, '<div class="s">', '<br>');
        //get cite
        $cite = getValue($data, '<cite>', '</cite>');

        //Do any other processing HERE
        echo 'URL: ' . $url . '<br/>';
        echo 'Title: ' . strip_tags($title) . '<br/>';
        echo 'Description: ' . strip_tags($description) . '<br/>';
        echo 'Cite: ' . strip_tags($cite) . '<br/><br/>';

    }
    //if it's not a valid search entry or we are just done with the entry, will move on to the next until we have them all
}

curl_close($ch);

?>

I think this section is the most complex yet, so if you have questions feel free to leave them in the comments. Now go make monies.



Don’t upgrade to WordPress 2.8

Thursday, June 11, 2009 15:25
Posted in category Rant

The visual editor in WordPress 2.8 is fucked for the time being (JavaScript errors), at least if you use anything besides IE, from what I understand.

So hold off on it.  I was going to do a post today, but now my mojo is gone.  Maybe later.



Noobies Guide on How to Scrape: Part 3 - Basics of Assessing Your Target

Sunday, June 7, 2009 13:29
Posted in category Noobie Scraping Guide

This post is a re-write.  I didn’t think the last version was very good: it was too long, and it looks like it even went out of date. This version is shorter, up to date (as of right now), and I think easier to follow.

For the sake of teaching, I’m going to pick a fairly easy target.  We are going to write a pretty simple Google scraper.   A Google scraper is just complex enough to get your dick wet, while at the same time posing a few easy to solve problems.  We are not interested in scraping all the information from the page.  We are going to focus on some general information: URL, Title, Description, and Cite.

In these beginning stages I tend to write things down, usually just in something like Notepad: things like the format of URLs, where the information you’re trying to grab lives, and any other problems you may run into.

The rest of this article is just going to be my observations and notes.  I’ll be using Firebug to pull the page apart.

URL

The general Google URL looks something like this:

http://www.google.com/search?q=huge+fish&hl=en&start=10&sa=N

Variables:

  • q - Your search query
  • hl - Language
  • start - Starting result number.  Not needed for first page results.  Goes up in increments of 10.
  • sa -  Not sure, but you don’t actually need it.

I used the search term huge fish because it gives us all the types of results that are possible: video, image, and search.
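If you want those notes in code form, here is a rough sketch.  buildSearchUrl() is a hypothetical helper name of mine, and it assumes the only variables you care about are q, hl, and start as listed above (sa is left out since you don’t need it).

```php
<?php
// Hypothetical helper: build a results URL from the variables noted above.
// $page is 1-based; "start" goes up in increments of 10 and is left off page 1.
function buildSearchUrl($term, $page = 1, $lang = 'en') {
    $params = array(
        'q'  => $term,
        'hl' => $lang,
    );
    if ($page > 1) {
        $params['start'] = ($page - 1) * 10;
    }
    // http_build_query() URL-encodes the values (spaces become +).
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo buildSearchUrl('huge fish') . "\n";
// http://www.google.com/search?q=huge+fish&hl=en
echo buildSearchUrl('huge fish', 2) . "\n";
// http://www.google.com/search?q=huge+fish&hl=en&start=10
```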

Search Entry Information

[Screenshot: Page Information]

To make things easier I colored and numbered the code we are looking at.  Red boxes and numbers are the actual result as it appears on the page.  Green boxes and numbers are the corresponding code.

  1. The Title
  2. The Description
  3. The Citation

You’ll also notice two light blue boxes around the <li class="g"> and the <div class="s">.  These are both important to notice, as together they represent an actual search entry (not video or image).  Both are needed to determine if it is indeed a search entry.  We need to make note of this when building the scraper.

The only other thing we haven’t located is the URL.  It’s in the line above the green #1 box:

<a class="l" onmousedown="return clk(this.href,'','','res','3','')" href="http://fishosaur.com/" realurl="http://fishosaur.com/">

We’ll just pull it out of that href field.
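Something to jot down in the notes: that tag also contains this.href and realurl, so whatever we search for has to land on the real href attribute.  A rough sketch using a regular expression (just one way to do it, and not the function the finished scraper uses; $tag is the line above pasted into a string):

```php
<?php
// The anchor tag from above, pasted into a nowdoc so the quotes survive.
$tag = <<<'HTML'
<a class="l" onmousedown="return clk(this.href,'','','res','3','')" href="http://fishosaur.com/" realurl="http://fishosaur.com/">
HTML;

// Require whitespace before href=" so we skip "this.href," inside the
// onmousedown handler and never touch the realurl attribute.
$found = preg_match('/\shref="([^"]*)"/i', $tag, $match);
$link  = $found ? $match[1] : '';

echo $link . "\n";   // http://fishosaur.com/
```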

That ends part 3.  We know where all our data is, now we just need to get down to coding.  Next up, the backbone of a scraper, cURL.




Affiliates Being Subpoenaed

Thursday, June 4, 2009 17:04
Posted in category News

Jon broke this story yesterday on WF:

BREAKING NEWS!

This is not a rumor. This is 100% confirmed.

Today there were as many as 4-5 subpeonas served to AFFILIATES who promoted Acai Berry offers and are based in the state of Illinois.

UPDATED 06/04/09: All of the affiliates that were served with subpeonas are Illinois based, and were allegedly using Oprah in their ads, on Facebook only. All were shown information that could have only been attained via a subpeona of Facebook. The Illinois attorney general subpeona has requests for financial information, campaign information, tracking information, networks and advertisers tied to the campaigns, etc. So they are obviously following the typical AG investigating protocol. Still strange enough to see them going after affiliates first and then working their way up, as its usually the opposite. But this is shaping up to become a big mess as I’ve been told by other state AG’s that they are very interested in following this investigation and very likely will begin their own or a possible joint investigation (lets hope it doesn’t get to that). The affiliates involved have asked to remain anonymous, so please keep people’s names to yourselves if you can, and respect their privacy wishes for a change. Bottom line, this is being fueled by Oprah/her lawyers, Illinois AG, Facebook + FTC partnership.

There have been rumors from the Harpo camp close to the Facebook/Acai advertiser subpeonas that Myspace/Newscorp was to be subpeonad for advertiser account information too, but this has not been confirmed or commented on by anyone from Newscorp or the Illinois AG as of yet.

For privacy reasons I will not name any of the affiliates who were subpeonad today, but this came as a pretty big shock as the Illinois Attorney General had some pretty in-depth knowledge of the people involved, more so than usual.

UPDATED 06/04/09: I still strongly believe there is an industry insider ratting, but its my opinion and I’ll just keep it to myself now as I’m not out to destroy anyone’s reputation without 100% definitive proof. Lets see how it plays out unfortunately.

This is the first time that a state attorney general has gone after ONLY the largest affiliates, one by one, as if they were reading it off of a numbered hit list.

Before you even ask, yes, most of the guys on the list are members here. If they want to come out and speak about it, that’s up to them, so please don’t turn this into a rumor mill or witch hunt. I’ve got most of their names confirmed and confirmation from an anonymous source with the Illinois Attorney Generals office that this is in fact quite real.

The only good side of it is that they don’t seem to be seeking out any jail time, just civil action. I can absolutely assure you that this is not going to be the last time we see any of this happen, as there is an unconfirmed suit by another state AG finishing up and getting ready to go after a handful of advertisers and networks, but no affiliates that I know of.

UPDATED 06/04/09: Boutique affiliate/agency Bloosky was hit with a suit from the AG in Utah on Tuesday over their Google bizopp (whatever the name is, they all sound the same), so not Acai related, but still interesting, will make a seperate thread about this later. It doesn’t look like anything very serious either and will most likely be settled out of court quickly. I know for a fact that there is no mention of affiliates in this at all, it seems to be more about customer complaints who couldn’t figure out how to cancel/refund (typical, they know how to file with the AG but can’t figure out how to call for a refund).



Why don’t my links work right in FBSpy? Why am I still getting duplicate ads?

Wednesday, May 27, 2009 14:37
Posted in category FBSpy

There are always questions when things change.  I know lots of you aren’t coders and have no idea what changed on Facebook this week.  I always get annoyed when I get a bunch of people asking the same question.  I want everyone to be happy, or at least to understand what happened this weekend and why some of the features don’t necessarily work as they once did.

As you undoubtedly know, this last Sunday Facebook rolled out a new adboard.  It seems to contain more ads (I’m sure so they can drive up more impressions - or they’re trying to help us out), and the actual code format was altered.  With some quick analysis and MAD skills, FBSpy was once again back.  However, due to changes FACEBOOK made, things just aren’t the same with FBSpy.

Why does clicking on an ad in FBSpy take me to a 404 page on Facebook?

In this newest update to the ad wall they changed how the links look.  They used to be just the ad’s ID plugged into a GET variable, but now the actual variable is a random mix of alphanumeric characters around 1000 characters long.

Before (v1.6 or earlier), when you clicked on an ad it would take you to Facebook and one of two things would happen:

  1. If you weren’t logged in, it redirected you to a login page.
  2. If you were logged in, you ended up on the LP.

The newest version (1.7) works somewhat the same way:

  1. If you aren’t logged in, you no longer get redirected to login.  Instead you end up on a 404 page.
  2. If you are logged in, you’ll end up on the LP.

So all you have to do is log in beforehand, then click the ad.

But that’s not a total truth.

That’s right, even more things changed with the whole system.  After a certain amount of time (I’m not sure how long, but it’s not that long) you won’t be able to use the link anymore (I’m talking about clicking on an ad from inside FBSpy).  I’m pretty confident that they are randomizing ad IDs and ad URL links, which only stay valid for a small amount of time.

This could be for several reasons:

  1. They don’t want outside sources tracking individual ads.
  2. They don’t want you to be able to click an ad that may no longer be running, or may not even exist.
  3. They don’t want you to be able to tell the difference between identical ads.  You could see two identical ads with different LP’s, but now we have no idea if the ads are different or if someone’s split testing.

It’s actually a really good system, and the way they should have been doing it from the get go.  Sucks for us.

Why am I still getting duplicate ads, when I have it set to not get them?

The code that’s been used to determine if you have a duplicate ad in your database checks to see if you have a matching ad ID in your database.  An ad ID used to be a unique “fingerprint” for that ad.  We no longer have any known “fingerprint” to go off of.  This was not known when bringing FBSpy up to 1.7 on Sunday.

The function is in essence obsolete.  Ad IDs seem to be randomly generated, so the chances you’re going to get two identical ad IDs are slim, and slimmer still that they will in fact be the same ad.

That should clear up all the loose ends as of now.

Now go make monies.




FBSpy: Empty reply from server

Monday, May 25, 2009 23:45

There’s more and more of this cropping up.  With the latest update you should be able to tell if Facebook is blocking your IP.  You’ll get a cURL error along the lines of “cURL errno: curl_errno. cURL error: Empty reply from server”.

This, to the best of my knowledge, means Facebook is blocking your IP from accessing the adwall.  There is nothing I can do about this.  It’s part of the game of getting and analyzing ads.  I know most of you would like to think not, but that just means you’re not that smart.  Start using an HTTP proxy, or be ready to move FBSpy to another IP.  Be ready to create more accounts.  So on and so forth.

I do not control how/what Facebook does to try and stop you.  YOU have to be able to adapt and overcome.  I do not want any more e-mail on this subject because it will go unanswered.

There are really two sides to this.  Because we (the users of FBSpy) are driving up the impressions (and if you click on an ad, the cost) of ads, it’s logical that FB would try to block IPs from the adboard.  At the same time, this app should allow FB advertisers to get new ideas and spend more money with FB… not that FB is concerned with making money or anything like that, because they aren’t.

So this is the end of that discussion; you’re on your own.  Find hosting that isn’t blocked or use proxies, because it’s on your end and not a coding issue.
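For the coders in the crowd, here is roughly what checking for that error and plugging in a proxy looks like with plain cURL.  This is a sketch and not FBSpy’s actual code: fetchPage() is a hypothetical helper, the proxy address in the comment is a placeholder, and the demo URL points at a local port where nothing should be listening, so the request is expected to fail.

```php
<?php
// Hypothetical helper: fetch a URL, optionally through an HTTP proxy,
// and hand back the body together with any cURL error.
function fetchPage($url, $proxy = null) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    if ($proxy !== null) {
        // "host:port" of your proxy, e.g. '203.0.113.5:8080' (placeholder)
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }

    $body = curl_exec($ch);   // false when the transfer failed
    return array(
        'body'  => $body,
        'errno' => curl_errno($ch),   // 0 means no error
        'error' => curl_error($ch),   // human readable message
    );
}

// Demo against a port nothing should be listening on.
$result = fetchPage('http://127.0.0.1:1/');
if ($result['errno'] !== 0) {
    echo 'cURL errno: ' . $result['errno'] . '. cURL error: ' . $result['error'] . "\n";
}
```

“Empty reply from server” is libcurl error number 52 (CURLE_GOT_NOTHING), so that is the errno to look for if you suspect a block.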



FBSpy 1.7 [critical update!]

Sunday, May 24, 2009 19:04
Posted in category FBSpy

As some of you will have noticed, your FBSpy app stopped working today.  This post goes over the update.

What Changed?

  • The Facebook adwall changed, so parts of the scraper had to be re-coded or adjusted.
  • Seems like the adwall shows more ads now, which means you’ll get more ads in a pull.
  • The database field ‘url’ has been added to the back end because the ad URL is no longer just the ad’s ID.
  • Added a couple error checks for when grabbing ads.

How to update

  • Download the FBSpy 1.7 Update Here.
  • Unpack, and copy over your current installation, except maybe for config.php.
  • Create a new database [in case you want to port something over from the last installation].  Double check/update your config.php.
  • Run install.php to come up to date for version 1.7.
  • It’s OK to copy over your user data, nothing changed there (I’m sure you’ll love that viper crasher :P).  However, all old ad data won’t work in 1.7.

or

  • Delete everything, move your old licence.lic file into your new installation, setup like you would when you first installed.

A word on hosting

I’d recommend going with a VPS.  It seems like some people running FBSpy from shared hosting are not receiving an ad board back; I’m guessing Facebook is starting to block IPs.  I don’t have any definitive proof, but it seems like a trend.  Known hosts affected so far: HostGator (new accounts), SEO Hosting, Asmallorange.

If you use proxies you shouldn’t have a problem.

Also don’t run FBSpy from GoDaddy hosting, it just won’t work.  Reason being GoDaddy sucks a big fat cock (for hosting and domains), and if you use them you should go jump off a fucking bridge (figuratively).

How to Purchase

Current price is $49. Questions, e-mail: fbspy (at) maddpcc [dot] com.


After the transaction is complete you will receive the files via e-mail, to the e-mail you paid with, usually within 24hrs.



Profit Kings Media

Saturday, May 23, 2009 21:32
Posted in category Interview

Yesterday I had the pleasure of talking to Yousif Yalda, who is the owner of Profit Kings Media, a brand new network.  Yousif Yalda is a successful affiliate marketer himself, so you can expect that he knows what the typical affiliate goes through, and how he can make a better network.  If you hang out on WickedFire, his screen name is GrindHard.

Enough of my yapping, let’s get to it:

MADPPC: First Question, how did you get into affiliate/internet marketing?

PKMYousif: First of all we would like to thank you for taking the time to talk to us. I got started into affiliate marketing rather by surprise. I was browsing a forum one day and I stumbled upon a guy who responded to my thread — and the topic wasn’t even relevant. I hit him up one day to thank him for giving me such a valuable response and we got into talking about what we both do for a living. At the moment, I was attending school full-time and working on various web application security ventures. He explained to me how much he made, and I immediately thought to myself that he was specializing in stocks or some other industry I had no insight about. He later explained to me in brief what affiliate marketing was all about, and so I was interested and started doing my research. A few weeks later with some search and asking questions, I was ready to get started! Pretty random, it turned out really well!

MADPPC: Now I understand you have your own network now?

PKMYousif: Yes, I do. It’s called Profit Kings Media and you can check us out at http://www.ProfitKingsMedia.com ! We’re a brand new network and taking off really well. We’re changing up the industry one network at a time, and we believe that overtime we’ll be the solution that all will turn to for real affiliate marketing. We understand from experience the flaws that exist for such a long time with cash flow, greedy networks taking HUGE margins, and more — and the way we operate shows these solutions in plain sight.

MADPPC: Now from what you told me already, this network sounds like a great opportunity for anybody, whether you’re a seasoned pro or just getting started in the game. Many people when they first start out can’t do the standard $1000 a week to get wires that most networks want. Can you tell us a little bit more about how you pay out?

PKMYousif: Sure, there’s always been a struggle with cash flow since most networks can’t seem to offer a better payout. This is a business , and what business do you know pays it’s employees on a 30 day basis? Let’s get real for a second here. The entire CPA model is by default in the advertisers favor, and we know from experience evaluating leads does not take such a lengthy time. I understand the need for that cap as only ’serious affiliates’ should be allowed this sort of payout structure, but this is a disadvantage for any affiliate to get started and progress. We understand that you need cash flow to help with re-fueling and resuming investments with paid traffic or any other resources that are needed. With Net 30, you really limit the affiliates performance. This is our intake, it differs greatly with other networks. I won’t go into the business side of why most networks do this, as it has its benefits but we’re financially stable to do this and believe in it. We pay out weekly via Check, Wire Transfer, and ACH Deposit.

MADPPC: And that’s no matter how much you do in that week?

PKMYousif: Whether you’re making $30 or $300,000 - we will pay you weekly.

MADPPC: Now that’s something I have NEVER heard of a network doing. What tracking platform do you run on?

PKMYousif: Well, we plan on killing one network at a time, and this is just one of the many strategies we use to operate. I have a few friends who run their own networks and share the same visions as Profit Kings Media and the directions we are taking to move forward. We are using Linktrust as our tracking system.

MADPPC: About how many offers do you offer right now?

PKMYousif: We have over 200 offers under testing, and about 30 offers live. We only carry offers that are truly reporting high ROI and convert well. Many networks publicize hundreds of offers, but how many are truly worth promoting? We are adding new offers daily and you can expect to turn serious loot.

MADPPC: Those are the kind of networks I typically favor. Understandably you’re a new network, but how many affiliate managers do you have right now? Also, what kind of help can they give to enhance the affiliate’s experience, and most importantly, their bottom line?

PKMYousif: At the present time, I don’t feel the need of hiring anyone as I can handle the work load thus far. On a side note, I do have 2 upcoming affiliate managers who will join on board with PKM soon. PKM as a whole helps with bidding, keywords, LP’s, adcopys, images, traffic source sources, and more. We work over 16 hours a day, 7 day a week, serving you with guidance and advice that will help you make more money.  We are affiliates ourselves and we know where to run offers and how.

MADPPC: One more thing I wanted to touch on before we wrap this up: let’s say I have some sort of offer I’d like to promote on your network as the merchant. What are your requirements?

PKMYousif: We love exclusives, and work closely to provide full-blown management to help launch whatever product you have in mind. In fact, we have 3+ exclusives that have yet to hit the industry and have chosen us to be their primary partner in helping to generate the most revenue possible with accurate reporting in quality and volume. You have an idea, now let’s put it out there!

MADPPC: Lastly, for the sake of all the referral whores, do you have a referral system?

PKMYousif: Unfortunately, not at this time! With some extensive feedback, we figured that most affiliates like to keep 100% commission, and I believe that if you’re in this industry to earn an actual income and are serious, then you shouldn’t depend on referrals. That’s just one way to look at it. It’s a nice chunk of change, but at this time it’s not needed as this is the decision we’ve made based on opinions from affiliates we work with. In the future, we might consider offering a life-time referral program.

[End of Interview]

I wish Yousif Yalda and Profit Kings Media the best of luck.  It seems like some of the best networks right now were founded by affiliates (Ads4Dough, Max Bounty, Convert2Media), so hopefully we get another one.  As Yousif mentioned, they test offers to see what converts best and use those.  Out of the 20 or so offers, he hits many of the more popular verticals: Dating (several offers), Crush, Education, Acai, Colon, Anti-Wrinkle, Google Cash, Insurance, Credit Reports.

Some of you will be bummed out by the lack of a referral program, but in all honesty you can’t get the best possible payout if someone has a referral on you.  What, you thought that 2-5% came out of the network’s pocket?  Fuck no, it comes out of the affiliate’s pocket.

Another huge plus for smaller affiliates is WEEKLY payouts, no matter how much you make.

BTW I get nothing for doing this interview, just in case some of you might think that.  Check them out!



FBSpy 1.6 Release

Wednesday, May 13, 2009 13:40
Posted in category FBSpy

Update time kiddies!  Why?  Because you guys send me too many support e-mails because your hosting sucks.  That’s the truth.

What’s changed?

  • Actual error messages will now be output whenever there is an SQL error when getting ads.  There has been this pain in the ass MySQL “The server has gone away” error that is happening like crazy.  Now you’ll be able to rightfully get on the asses of those whose problem it is: your hosting.
  • At the request of some guy who has plugged over 300 accounts and 60 some odd proxies into FBSpy, if there is a cURL error it should output an error, skip that account, and move on.
  • If you had duplicate ads not allowed for an account and you clicked ‘Get Ads’, you could see this weird “False Triggered” error.  The actual error message has been removed.  That was really pissing me off, and I imagine you guys didn’t like it either.

How to update from 1.4->1.6

  • Download the FBSpy 1.6 UPDATE HERE.
  • Unpack and just copy [and overwrite] all the files into your current installation [except maybe config.php].  No SQL updates this time.
  • If you want to install 1.6 separately, you’ll have to move your license.lic file over into the new installation.

Places causing major headaches:

  • HostGator/SEO Hosting - wondering if there has been an IP ban, because it seems like at least new shared hosting accounts are having problems.  They can get FBSpy to log in correctly, but when requesting the ad wall nothing is returned.  My shared hosting account, which I got a little more than a year ago, works fine though - WEIRD.
  • GoDaddy - enough said.  If you’re actually using GoDaddy for anything you should just jump off a bridge.
  • Asmallorange -  Still investigating, looks to be the same as HostGator/SEO Hosting.

I’d recommend getting a VPS just to be on the safe side.

How to Purchase

Current price is $49. Questions, e-mail: fbspy (at) maddpcc [dot] com.

After the transaction is complete you will receive the files via e-mail, to the e-mail you paid with, usually within 24hrs.




Noobies Guide on How to Scrape: Part 4 - cURL

Monday, May 11, 2009 13:01

Now we get the idea of POST and GET.  We found our target, we know its URL structure, we know where the data is, but how do we use PHP to fetch the webpages?

Luckily we have what is called cURL (from PHP.net):

PHP supports libcurl, a library created by Daniel Stenberg, that allows you to connect and communicate to many different types of servers with many different types of protocols. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form based upload, proxies, cookies, and user+password authentication.

Basically when it comes to interacting with the net, there’s not much cURL can’t do.  I like to think about it as a completely controllable web browser.

Now chances are when you first start using cURL you’ll look at the manual on PHP.net and go bonkers.  Where do I start?  What should I know?  Well, get ready for a cURL crash course.  At the end of this will be a little cURL template that you can use to save you some time in future projects.

I like to break cURL up into a few little segments: Initialization/Closure, General Options, Cookies, GET/POST, URL Execution and Retrieval.

Initialization / Closure:

To initialize a cURL resource it’s as easy as:


$ch = curl_init();

$ch becomes our cURL resource.  Why the variable $ch?  Because that’s what the manual uses, and what I use.  And to close a cURL resource:


curl_close($ch);

General Options:

In between curl_init and curl_close there are lots of options that can be set using curl_setopt.  A few I use on nearly every scraper are:

CURLOPT_SSL_VERIFYPEER - Used to tell cURL whether or not to verify a peer’s certificate.  TRUE by default.  I always use FALSE.


curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);


CURLOPT_RETURNTRANSFER - Very important.  FALSE will have cURL just dump all the code onto the page, TRUE will allow us to capture all the info into a variable for further processing.


curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

CURLOPT_HEADER - TRUE to output the header.  Doesn’t say if FALSE is the default value, but I always set it just because.


curl_setopt($ch, CURLOPT_HEADER, FALSE);

CURLOPT_USERAGENT - Your browser’s User Agent.  You don’t need this for all scrapers, but some sites throw a shit fit if you don’t include it.  So as a good rule of thumb just include it.  You can use Live HTTP Headers to tell what your current User Agent is, or just use the one I’m going to post.  Pass only the value; cURL adds the “User-Agent:” header name itself.


$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';

curl_setopt($ch, CURLOPT_USERAGENT, $agent);

CURLOPT_FOLLOWLOCATION - TRUE to follow any “Location:” header.  Basically to follow all redirects.  Important, should always be TRUE.


curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

Cookies:

Setting cookies is pretty easy:

CURLOPT_COOKIESESSION -  From the cURL manual:

TRUE to mark this as a new cookie “session”. It will force libcurl to ignore all cookies it is about to load that are “session cookies” from the previous session. By default, libcurl always stores and loads all cookies, independent if they are session cookies or not. Session cookies are cookies without expiry date and they are meant to be alive and existing for this “session” only.

You almost always want this to be TRUE, on rare occasions depending on how the website acts, you may have to load the old session cookies.


curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);

CURLOPT_COOKIEFILE - The file cURL reads cookie data from.  Point it at the same file you use for CURLOPT_COOKIEJAR and your cookies carry over between runs.


$cookie = '/tmp/cookie.txt';

curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);

CURLOPT_COOKIEJAR - Basically a file that cURL will write all internal cookies to when the connection closes.


curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

GET/POST:

Well, I kinda lied.  GET is the same as just setting a URL and variables manually, retrieving the page with no header modification, so we’ll actually cover GET in the next section.  POST, however, is a little bit more complicated, and less common.

CURLOPT_POST - TRUE to do a POST.


curl_setopt($ch, CURLOPT_POST, TRUE);

CURLOPT_POSTFIELDS - From the cURL manual:

The full data to post in a HTTP “POST” operation. To post a file, prepend a filename with @ and use the full path. This can either be passed as a urlencoded string like ‘para1=val1&para2=val2&…‘ or as an array with the field name as key and field data as value. If value is an array, the Content-Type header will be set to multipart/form-data.

I find it easiest to just use the urlencoded string, although I’m sure some of you are fanboys for the arrays.  You fanboys can get bent.


$post = 'var1=val1&var2=val2';

curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
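If your values contain spaces or characters like &, typing the string by hand gets error prone.  One option, sketched here with made-up field names and values, is to let http_build_query() build the urlencoded string for you.  You still hand cURL a string, so the request goes out as a normal urlencoded form instead of multipart/form-data.

```php
<?php
// Made-up field names and values for the demo.
$fields = array(
    'var1' => 'val 1',
    'var2' => 'a&b',
);

// Build the urlencoded string: spaces become +, & becomes %26.
$post = http_build_query($fields, '', '&');
echo $post . "\n";   // var1=val+1&var2=a%26b

$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
```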

URL Execution and Retrieval:

CURLOPT_URL - The value of your URL.


$url = 'http://www.domain.tld/index.php?var1=val1&var2=val2';

curl_setopt($ch, CURLOPT_URL, $url);

curl_exec - Finally, we execute and capture our returning output in the $rawdata variable:


$rawdata = curl_exec($ch);

NOTE: One of the biggest errors I see other coders make is thinking they have to reset all their options every time they want to execute cURL.  In the same script, you only have to change the options you want to change between curl_exec calls.  What you have set will stay set.  So most of the time all you have to do is change the URL and run curl_exec again.
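A bare-bones sketch of that idea: set the options once, then only swap the URL inside the loop.  fetchAll() is a hypothetical helper name of mine, and the URLs in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: fetch several URLs with ONE cURL handle.
function fetchAll(array $urls, $agent) {
    $ch = curl_init();
    // Set once; these stay in effect for every curl_exec() below.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);

    $pages = array();
    foreach ($urls as $url) {
        // The only option that changes between requests.
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);   // false if that request failed
    }
    return $pages;
}

// Usage (placeholder URLs):
// $pages = fetchAll(array(
//     'http://www.domain.tld/index.php?page=1',
//     'http://www.domain.tld/index.php?page=2',
// ), 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3');
```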

Fin!

When you put it all together you’ll have a nice little cURL skeleton like this:


<?php

$cookie = '/tmp/cookie.txt';
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
$post = 'var1=val1&var2=val2&var3=val3';
$url = 'http://www.domain.tld/index.php?var1=val1&var2=val2';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

//POST - don't include for GET.
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);

//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

curl_setopt($ch, CURLOPT_URL, $url);
$rawdata = curl_exec($ch);

curl_close($ch);

?>

Out.


