Now we get the idea of POST and GET. We found our target, we know its URL structure, we know where the data is, but how do we use PHP to fetch the webpages?
Luckily we have what is called cURL (from PHP.net):
PHP supports libcurl, a library created by Daniel Stenberg, that allows you to connect and communicate to many different types of servers with many different types of protocols. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form based upload, proxies, cookies, and user+password authentication.
Basically when it comes to interacting with the net, there’s not much cURL can’t do. I like to think about it as a completely controllable web browser.
Now chances are when you first start using cURL you’ll look at the manual on PHP.net and go bonkers. Where do I start? What should I know? Well, get ready for a cURL crash course. At the end of this will be a little cURL template that you can use to save you some time in future projects.
I like to break cURL up into a few little segments: Initialization/Closure, General Options, Cookies, GET/POST, URL Execution and Retrieval.
To initialize a cURL resource it’s as easy as:
$ch = curl_init();
$ch becomes our cURL resource. Why the variable $ch? Because that’s what the manual uses, and what I use. And to close a cURL resource:
curl_close($ch);
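Here's a minimal sketch of that init/close pair wrapped so the handle always gets cleaned up. The helper name withCurlHandle is made up for illustration, and the version check is my assumption about how you'd want to treat newer PHP releases, not something from the manual excerpt above.

```php
<?php
// Hypothetical helper (not part of the cURL extension): create a handle,
// hand it to a callback, and clean up afterwards no matter what happens.
function withCurlHandle(callable $work)
{
    $ch = curl_init();
    if ($ch === false) {
        throw new RuntimeException('curl_init() failed');
    }
    try {
        return $work($ch);
    } finally {
        // Since PHP 8.0 handles are objects that are freed automatically and
        // curl_close() has no effect, so only call it on older versions.
        if (PHP_VERSION_ID < 80000) {
            curl_close($ch);
        }
    }
}

// Nothing is fetched here; setting one option just proves the handle is usable.
$handleWorks = withCurlHandle(function ($ch) {
    return curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
});
```

Everything that touches $ch lives inside the callback, so you can't forget the close.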
In between curl_init and curl_close there are lots of options that can be set using curl_setopt. A few I use on nearly every scraper are:
CURLOPT_SSL_VERIFYPEER – Used to tell cURL whether or not to verify a peer’s certificate. TRUE by default. I always use FALSE.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
CURLOPT_RETURNTRANSFER - Very important. FALSE will have cURL just dump all the code onto the page, TRUE will allow us to capture all the info into a variable for further processing.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
CURLOPT_HEADER – TRUE to include the response headers in the output. FALSE is the default, but I always set it explicitly just because.
curl_setopt($ch, CURLOPT_HEADER, FALSE);
CURLOPT_USERAGENT – Your browser’s User Agent. You don’t need this for all scrapers, but some sites throw a shit fit if you don’t include it. So as a good rule of thumb just include it. You can use Live HTTP Headers to tell what your current User Agent is, or just use the one I’m going to post. Pass only the value; cURL adds the “User-Agent:” header name itself.
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
CURLOPT_FOLLOWLOCATION – TRUE to follow any “Location:” header. Basically to follow all redirects. Important, should always be TRUE.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
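If five curl_setopt calls in a row feel noisy, the same general options can be set in one go with curl_setopt_array. This is just a sketch: the generalOptions helper is a name I made up, and the values simply mirror the ones picked above.

```php
<?php
// Hypothetical helper: the general options from the text as one array.
function generalOptions(string $agent): array
{
    return [
        CURLOPT_SSL_VERIFYPEER => false, // the text's choice; skips certificate checks
        CURLOPT_RETURNTRANSFER => true,  // capture the page instead of printing it
        CURLOPT_HEADER         => false, // body only, no response headers
        CURLOPT_USERAGENT      => $agent,
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    ];
}

$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';

$ch = curl_init();
// curl_setopt_array() returns TRUE only if every option was accepted.
$applied = curl_setopt_array($ch, generalOptions($agent));
```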
Setting cookies is pretty easy:
CURLOPT_COOKIESESSION - From the cURL manual:
TRUE to mark this as a new cookie “session”. It will force libcurl to ignore all cookies it is about to load that are “session cookies” from the previous session. By default, libcurl always stores and loads all cookies, independent if they are session cookies or not. Session cookies are cookies without expiry date and they are meant to be alive and existing for this “session” only.
You almost always want this to be TRUE. On rare occasions, depending on how the website acts, you may have to load the old session cookies.
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
CURLOPT_COOKIEFILE – The file cURL reads cookie data from. Writing is handled by CURLOPT_COOKIEJAR.
$cookie = '/tmp/cookie.txt';
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
CURLOPT_COOKIEJAR – Basically a file that cURL will write all internal cookies to when the connection closes.
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
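A quick sketch of the three cookie options pointed at one file. The useCookieFile helper and the use of tempnam for the cookie path are my own assumptions for illustration; a fixed path like /tmp/cookie.txt works the same way.

```php
<?php
// Hypothetical helper: start a fresh cookie session and use one file
// for both reading and writing cookies.
function useCookieFile($ch, string $path): bool
{
    return curl_setopt_array($ch, [
        CURLOPT_COOKIESESSION => true,  // ignore old session cookies
        CURLOPT_COOKIEFILE    => $path, // cookies are read from here
        CURLOPT_COOKIEJAR     => $path, // cookies are written here when the handle closes
    ]);
}

// tempnam() creates an empty, uniquely named file in the system temp dir.
$cookie = tempnam(sys_get_temp_dir(), 'cookie');

$ch = curl_init();
$cookiesSet = useCookieFile($ch, $cookie);
```

A unique file per run keeps two scrapers from stomping on each other's cookies.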
Well I kinda lied, GET is the same as just setting a URL and variables manually, retrieving the page with no header modification. So we’ll actually cover GET in the next section. POST, however, is a little bit more complicated, and less common.
CURLOPT_POST – TRUE to do a POST.
curl_setopt($ch, CURLOPT_POST, TRUE);
CURLOPT_POSTFIELDS – From the cURL manual:
The full data to post in a HTTP “POST” operation. To post a file, prepend a filename with @ and use the full path. This can either be passed as a urlencoded string like ‘para1=val1&para2=val2&…’ or as an array with the field name as key and field data as value. If value is an array, the Content-Type header will be set to multipart/form-data.
I find it easiest to just use the urlencoded string, although I’m sure some of you are fanboys for the arrays. You fanboys can get bent.
$post = 'var1=val1&var2=val2';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
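If your values might contain spaces or ampersands, hand-writing the string gets error-prone. One way around that, sketched here with made-up field values, is to let PHP's http_build_query do the urlencoding and still pass cURL a plain string.

```php
<?php
// Example fields; the second value contains characters that must be encoded.
$fields = ['var1' => 'val1', 'var2' => 'a b&c'];

// Build the urlencoded string; '&' is passed explicitly so the result
// does not depend on the arg_separator.output ini setting.
$post = http_build_query($fields, '', '&');

$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
```

Because $post is a string and not an array, the request is still sent as a regular urlencoded form.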
CURLOPT_URL – The value of your URL.
$url = 'http://www.domain.tld/index.php?var1=val1&var2=val2';
curl_setopt($ch, CURLOPT_URL, $url);
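Since a GET is nothing more than a URL with variables on the end, a tiny helper can build it for you. buildGetUrl is a hypothetical name, and this sketch assumes the base URL doesn't already carry a query string.

```php
<?php
// Hypothetical helper: append urlencoded parameters to a base URL.
// Assumes $base has no query string of its own.
function buildGetUrl(string $base, array $params): string
{
    if ($params === []) {
        return $base;
    }
    return $base . '?' . http_build_query($params, '', '&');
}

$url = buildGetUrl('http://www.domain.tld/index.php', [
    'var1' => 'val1',
    'var2' => 'val2',
]);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
```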
curl_exec – Finally, we execute and capture our returning output in the $rawdata variable:
$rawdata = curl_exec($ch);
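Keep in mind curl_exec returns FALSE when the request fails, so it's worth checking before you start parsing. A minimal sketch, with execOrFail being a name I made up:

```php
<?php
// Hypothetical helper: run the request and turn a failure into an
// exception that carries cURL's own error number and message.
function execOrFail($ch): string
{
    // Make sure the page comes back as a string instead of being printed.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $rawdata = curl_exec($ch);
    if ($rawdata === false) {
        throw new RuntimeException(
            'cURL error ' . curl_errno($ch) . ': ' . curl_error($ch)
        );
    }
    return $rawdata;
}
```

You'd call it as $rawdata = execOrFail($ch); in place of the bare curl_exec, inside a try/catch if you want the scraper to keep going after a bad page.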
NOTE: One of the biggest errors I see other coders make is they think they have to reset all their options every time they want to execute cURL. In the same script, you only have to change the options you want to change between curl_exec calls. What you have set will stay set. So most of the time all you have to do is just change the URL and run curl_exec again.
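To make that concrete, here's a rough sketch of one handle reused for a list of URLs. fetchAll is a hypothetical helper; the options are set once and only the URL changes inside the loop.

```php
<?php
// Hypothetical helper: fetch several URLs with one handle, changing
// only CURLOPT_URL between curl_exec() calls.
function fetchAll($ch, array $urls): array
{
    $pages = [];
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        // Each entry is the page body, or FALSE if that request failed.
        $pages[$url] = curl_exec($ch);
    }
    return $pages;
}

$ch = curl_init();
// Set once; this stays in effect for every request made with $ch.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
```

Then $pages = fetchAll($ch, $urls); gives you an array keyed by URL, ready for parsing.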
Fin!
When you put it all together you’ll have a nice little cURL skeleton like this:
<?php
$cookie = '/tmp/cookie.txt';
$agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008100922 Ubuntu/8.04 (hardy) Firefox/3.0.3';
$post = 'var1=val1&var2=val2&var3=val3';
$url = 'http://www.domain.tld/index.php?var1=val1&var2=val2';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
//POST - don't include for GET.
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_URL, $url);
$rawdata = curl_exec($ch);
curl_close($ch);
?>
Out.