Noobies Guide on How to Scrape: Part 1 – Intro & Tools
Monday, April 6, 2009 0:03
Welcome to the Noobies Guide to Scraping: Part 1. In this installment we are only going to focus on a few very basic things that we need to get started, and we won’t build the scraper itself yet.
What is scraping? Scraping is the process of gathering data from some web source, whether it’s an RSS feed, a web page, or something else. A program that automates the scraping (data collection) is called a “scraper”.
Still a little confused? Let’s use a website as an example, like yellowpages.com. Yellowpages.com has tons of useful data you may want, like addresses, phone numbers, and business names. Let’s say, for example, you wanted to build a small directory of business names, phone numbers, and addresses. Chances are it would take you ages to collect all the info by hand, so why not build a scraper that gathers it all and automates the process?
Have you ever heard the adage “Work smarter, not harder”? There’s no point wasting your valuable time when some simple software can do the job for you. Before we begin, there are a few things you’re going to need:
Firefox. That’s right, the mother of all browsers. Not just because the other tools we will use require it but because it’s a much better browser than IE.
Firebug. Every web developer should have this plugin in their arsenal. It is the Swiss Army knife of web development. You may not need it for basic scrapers, but it’s better to have it than not.
Live HTTP Headers. An integral part of seeing what kind of information is being passed back and forth while you use a website. Very useful even for basic scrapers; I find myself using it a ton.
Our scraper will be built in PHP utilizing cURL, and as much as this is a noobie tutorial, it isn’t a lesson in PHP. You’re expected to know at least basic-to-intermediate PHP.
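The real scraper comes in the later parts, but as a rough preview, a PHP cURL fetch might look something like the sketch below. The function names (build_curl_options, fetch_page), the user agent string, and the timeout value are made up for illustration and are not taken from this series. It assumes PHP 7.1 or newer with the cURL extension enabled.

```php
<?php
// Hypothetical helpers for illustration only; not the code used later in the series.

// Build the cURL options for a simple GET request.
function build_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the response as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_HEADER         => true, // include response headers in the returned string
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds (arbitrary choice)
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/0.1)', // assumed value
    ];
}

// Fetch a URL and return the raw response (headers plus body), or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url));
    $response = curl_exec($ch);

    return $response === false ? null : $response;
}
```

Calling fetch_page('http://example.com/') should hand back the response headers followed by the HTML, which is roughly the same information Live HTTP Headers shows you in the browser. Keeping the options in their own function makes it easy to tweak things like the user agent later without touching the request code.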

Georgie says:
April 6th, 2009 at 7:53 am
Live HTTP Headers is a bit lightweight for writing a scraper. I use Charles proxy; you can set filters so you ignore images and shit, and it gets all requests, even from Flash files. You should check it out.
Brad says:
April 6th, 2009 at 10:45 am
@Georgie: It’s also $50, fuck that. You don’t need it.