Noobies Guide on How to Scrape: Part 2 – URLs, URL Variables, and using Live HTTP Headers

Wednesday, April 8, 2009 21:11

Understanding the fundamentals of how a site’s pages communicate with each other, and how we communicate with them, is crucial to being able to reverse engineer a site for our scraper.  Luckily, it’s pretty easy for the most part.

Anatomy of a URL

Image 2.1

  1. The protocol you’re using.
  2. The website you’re trying to get to.  Although www is usually synonymous with the base of the domain, it doesn’t have to be; for example, www.domain.tld can be a totally different site from domain.tld (no www).
  3. The domain name and extension.
  4. The page you’re trying to access.
  5. The ‘?’ symbol, which separates the file name from the arguments (variables) being passed.
  6. Name of the first variable.
  7. Value of the first variable.
  8. Name of the second variable.  The ‘&’ symbol designates the end of one variable and the beginning of another.
  9. Value of the second variable.
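
The same pieces can be pulled apart in code.  Here is a minimal sketch using Python’s standard urllib.parse module.  The URL is made up for illustration (it is not the one from Image 2.1), so the host, page, and variable names are all assumptions.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL, made up for illustration only.
url = "http://www.domain.tld/page.php?var1=value1&var2=value2"

parts = urlsplit(url)
protocol = parts.scheme        # the protocol, e.g. "http"
host = parts.hostname          # the website, e.g. "www.domain.tld"
page = parts.path              # the page, e.g. "/page.php"
query = parts.query            # everything after the '?'

# Split the query on '&' and '=' into (name, value) pairs.
variables = parse_qsl(query)

print(protocol, host, page)
for name, value in variables:
    print(name, "=", value)
```

Nothing is requested over the network here; the code only splits the string.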

For scraping purposes, there are really only two ways a site passes information: GET and POST.

There are only minor differences between them. Image 2.1 demonstrates passing information using GET variables.  GET variables show up right in your address bar, and in practice browsers and servers limit how long a URL can be. POST variables are passed in the body of the request rather than in the URL, are written the same way as GET variables, are not directly seen by the user, have no fixed size limit, and leave the URL looking much cleaner.  POST variables are slightly “safer” but pretty much just as easy to read once you capture the request.
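
To make the difference concrete, here is a rough sketch that builds (but never sends) one GET request and one POST request carrying the same variables, using Python’s standard urllib.  The endpoint and variable names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and variables, for illustration only.
base_url = "http://www.domain.tld/page.php"
variables = {"var1": "value1", "var2": "value2"}

# Both methods write the variables the same way: name=value joined by '&'.
encoded = urlencode(variables)

# GET: the variables ride along in the URL, after the '?'.
get_request = Request(base_url + "?" + encoded)

# POST: the same string travels in the request body, so the URL stays clean.
post_request = Request(base_url, data=encoded.encode("ascii"))

# Nothing is sent over the network; we only inspect the request objects.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Supplying data is what makes urllib treat the second request as a POST.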

Live HTTP Headers Example 1: GET

In this first example, I went to yellowpages.com.  In the Find box I put: chinese.  In the Location box I put: chicago.  Pull up Live HTTP Headers, make sure “Capture” is checked, and click “Find” on the yellowpages.com page.

Image 2.2

You should get a similar response.  The bold numbers indicate:

  1. The URL we are navigating to.  Since we are passing everything in plain sight, it must be a GET request.
  2. We see in fact that it is a GET request, and we see the query.

The URL:

http://www.yellowpages.com/search?search_terms=chinese&geo_location_terms=chicago&x=21&y=11

You’ll also notice that the information we typed in is in the search_terms variable (chinese) and the geo_location_terms variable (chicago).  I have bolded them for illustration purposes.

There’s lots of other stuff there, but chances are you won’t have to use any of it.  That’s more of an advanced subject that I won’t be covering in this series.
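
Once you know which variables carry your input, you can read them out of a captured URL or build new URLs of the same shape.  The sketch below uses Python’s urllib.parse; build_search_url is a hypothetical helper, and it assumes the x and y variables (which look like click coordinates from the Find button) can be left out.  Whether the site still answers URLs in this form is also an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The URL captured in the example above.
captured = ("http://www.yellowpages.com/search"
            "?search_terms=chinese&geo_location_terms=chicago&x=21&y=11")

# Read the variables back out of the captured URL.
params = parse_qs(urlsplit(captured).query)
print(params["search_terms"], params["geo_location_terms"])


def build_search_url(terms, location):
    """Hypothetical helper: build a URL shaped like the captured one.

    Assumes x and y are optional and simply omits them.
    """
    query = urlencode({"search_terms": terms,
                       "geo_location_terms": location})
    return "http://www.yellowpages.com/search?" + query


# urlencode takes care of characters such as spaces.
print(build_search_url("pizza", "new york"))
```

parse_qs returns a list for each name because a variable can appear more than once in a query string.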

Live HTTP Headers Example 2: POST

POST is not used as commonly as GET; if you’ve written code, you’ll understand why.  You’ll usually see POST used for logins, so I headed over to Facebook and tried to log in with some fake credentials.  Email: aguy@someone.com Password: apssword.

Image 2.3

You should get back something similar.  The bold numbers indicate:

  1. The URL we are trying to navigate to.  Even though all our variables are getting sent via POST, notice the one GET variable login_attempt.
  2. We see that we are using POST.
  3. All the variables being POSTed to the page.

This particular string of variables is pretty long, and there’s some extra stuff in there that Facebook passes on its own, but the whole thing looks like this:

charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84&locale=en_US&email=aguy%40someone.com&pass=apssword&pass_placeholder=&charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84

You’ll see our email in the email variable (aguy%40someone.com); %40 is really just the encoded @ symbol.  Certain characters in a URL have to be encoded, and you’ll find a nice guide outlining all the encodings here.  You’ll also see our password in the pass variable (apssword).  I have bolded them for illustration purposes.
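
If you’d rather not decode these strings by eye, Python’s urllib.parse can do it.  This is a small sketch that decodes the captured POST string above; it assumes the values are UTF-8, which is the library’s default.

```python
from urllib.parse import parse_qsl, unquote

# The POST string captured above, split across lines for readability.
body = (
    "charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84"
    "&locale=en_US&email=aguy%40someone.com&pass=apssword&pass_placeholder="
    "&charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84"
)

# Decode a single value: %40 becomes '@'.
email = unquote("aguy%40someone.com")
print(email)

# Decode the whole string into (name, value) pairs.
# keep_blank_values=True keeps empty variables such as pass_placeholder.
fields = parse_qsl(body, keep_blank_values=True)
for name, value in fields:
    print(name, "=", value)
```

A list of pairs is used instead of a dictionary because charset_test appears twice in the string.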

Fin

That’s the end of Part 2.  You should now have a pretty good idea of how URLs are encoded, how variables are passed, and the basic usage of the Live HTTP Headers plugin for Firefox.  Stay tuned; we’re just getting started.



One Response to “Noobies Guide on How to Scrape: Part 2 – URLs, URL Variables, and using Live HTTP Headers”

  1. Justin says:

    April 8th, 2009 at 11:47 pm

    Good stuff. I’m *really* looking forward to the upcoming articles. I’ve been using Snoopy and SimplePie to scrape basic data and rss feeds, but I really have no idea how they work. This is going to help immensely…
