Noobies Guide on How to Scrape: Part 3 – Basics of Assessing Your Target

Sunday, June 7, 2009 13:29
Posted in category Noobie Scraping Guide

This post is a rewrite. The last version wasn’t very good: it was too long, and it had gone out of date. This version is shorter, up to date (as of right now), and, I think, easier to follow.

For the sake of teaching, I’m going to pick a fairly easy target: we are going to write a pretty simple Google scraper. A Google scraper is just complex enough to get your feet wet, while at the same time posing a few easy-to-solve problems. We are not interested in scraping all the information on the page; we are going to focus on some general information: URL, Title, Description, and Cite.

In these beginning stages I tend to write things down, usually just in something like Notepad: the format of the URLs, where the information you’re trying to grab lives, and any problems you may run into.

The rest of this article is just going to be my observations and notes.  I’ll be using Firebug to pull the page apart.

URL

The general Google URL looks something like this:

http://www.google.com/search?q=huge+fish&hl=en&start=10&sa=N

Variables:

  • q – Your search query
  • hl – Language
  • start – Starting result number.  Not needed for first page results.  Goes up in increments of 10.
  • sa – Not sure, but you don’t actually need it.
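
As a rough illustration of the notes above, here is a minimal Python sketch that assembles a search URL from those variables. The function name build_search_url and the zero-based page argument are my own assumptions for the example; only q, hl, and start come from the URL above, and sa is left out since it isn't needed.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/search"


def build_search_url(query, page=0, lang="en"):
    """Build a search URL for a zero-based result page (hypothetical helper)."""
    params = {"q": query, "hl": lang}
    # "start" is not needed for the first page; after that it goes up by 10.
    if page > 0:
        params["start"] = page * 10
    # urlencode turns spaces into "+", matching the huge+fish form above.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("huge fish", page=1))
# http://www.google.com/search?q=huge+fish&hl=en&start=10
```

Passing page=0 leaves start off entirely, which matches the note that it isn't needed for first page results.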

I used the search term huge fish because it returns every type of result we might run into: video, image, and regular search entries.

Search Entry Information

[Screenshot: Page Information]

To make things easier, I colored and numbered the code we are looking at. Red boxes and numbers mark the actual entry as it appears on the page. Green boxes and numbers mark the corresponding code.

  1. The Title
  2. The Description
  3. The Citation

You’ll also notice two light blue boxes around the <li class="g"> and the <div class="s">. These are both important to notice, as together they mark an actual search entry (not a video or image result). Both are needed to determine whether a block is indeed a search entry. We need to make note of this when building the scraper.
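
To make that check concrete, here is a small Python sketch using only the standard library's html.parser. It counts an <li class="g"> block as a search entry only if it also contains a <div class="s">. The class name EntryFilter and the SAMPLE markup (including the "video-box" class) are invented for illustration; the real page is much messier.

```python
from html.parser import HTMLParser

# Hypothetical markup: one regular search entry and one video block.
SAMPLE = """
<ol>
  <li class="g"><h3><a class="l" href="http://fishosaur.com/">Huge Fish</a></h3>
    <div class="s">A description. <cite>fishosaur.com/</cite></div></li>
  <li class="g"><div class="video-box">Video results for huge fish</div></li>
</ol>
"""


class EntryFilter(HTMLParser):
    """Count <li class="g"> blocks with and without a <div class="s">."""

    def __init__(self):
        super().__init__()
        self.in_entry = False   # currently inside an <li class="g">
        self.has_desc = False   # saw a <div class="s"> inside it
        self.li_depth = 0       # tracks nested <li> tags inside the entry
        self.search_entries = 0
        self.other_entries = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li":
            if self.in_entry:
                self.li_depth += 1
            elif "g" in classes:
                self.in_entry, self.has_desc, self.li_depth = True, False, 1
        elif tag == "div" and self.in_entry and "s" in classes:
            self.has_desc = True

    def handle_endtag(self, tag):
        if tag == "li" and self.in_entry:
            self.li_depth -= 1
            if self.li_depth == 0:
                # The entry is closed: classify it by what we saw inside.
                if self.has_desc:
                    self.search_entries += 1
                else:
                    self.other_entries += 1
                self.in_entry = False


entry_filter = EntryFilter()
entry_filter.feed(SAMPLE)
print(entry_filter.search_entries, entry_filter.other_entries)  # 1 1
```

The point is just that neither class alone is enough: the second block has class="g" but no class="s" inside it, so it is skipped.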

The only other thing we haven’t located is the URL. It’s in the line above the green #1 box:

<a class="l" onmousedown="return clk(this.href,'','','res','3','')" href="http://fishosaur.com/" realurl="http://fishosaur.com/">

We’ll just pull it out of that href attribute.
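
As a quick sanity check of that note, here is a hedged Python sketch that pulls the href out of an anchor with class "l", again using the standard library's html.parser. The class name ResultLinkParser and the link text "Huge Fish" are assumptions for the example; the anchor itself is the one shown above.

```python
from html.parser import HTMLParser

# The anchor from above, with made-up link text and a closing tag added.
ANCHOR = """<a class="l" onmousedown="return clk(this.href,'','','res','3','')" href="http://fishosaur.com/" realurl="http://fishosaur.com/">Huge Fish</a>"""


class ResultLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list includes "l"."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "l" in classes:
            href = attr_map.get("href")
            if href:
                self.urls.append(href)


link_parser = ResultLinkParser()
link_parser.feed(ANCHOR)
print(link_parser.urls)  # ['http://fishosaur.com/']
```

This only shows where the data sits; the actual fetching of the page is left for the coding parts of the guide.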

That ends part 3. We know where all our data is; now we just need to get down to coding. Next up: the backbone of a scraper, cURL.


One Response to “Noobies Guide on How to Scrape: Part 3 – Basics of Assessing Your Target”

  1. Internet Marketing Blog says:

    September 25th, 2009 at 1:45 am

    http://www.google.com/search?q=huge+fish&hl=en&start=10&sa=N
    * q – Your search query
    * hl – Language
    * start – Starting result number. Not needed for first page results. Goes up in increments of 10.
    * sa – Not sure, but you don’t actually need it.

    I was never aware of this info, but thanks for providing the exact general Google URL.
    That’s new learning for me.
