neta1o
Regular Member
- Sep 29, 2008
I wrote this tool a while back to quickly scrape content.
Here is how it works.
1) Put the file on one of your servers that supports PHP.
2) Open the page in your browser and enter 3 things:
- The full url of the page you want to scrape (http://www.google.com)
- The beginning search string
- The ending search string
Remember, this works off of the source code of the page. Source code changes often, especially if you are scraping Google stuff. Also, sometimes you need to experiment a bit with the search strings before you get what you want.
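For anyone curious what the tool is doing with those 3 inputs, here is a minimal Python sketch of the idea. It is not the PHP file itself: the names fetch_source and scrape_between are made up for illustration, and it assumes both search strings are matched as plain text.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Download a publicly viewable page and return its source as text."""
    with urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scrape_between(source, begin, end):
    """Return every piece of text that sits between `begin` and `end`."""
    results = []
    pos = 0
    while True:
        start = source.find(begin, pos)
        if start == -1:
            break  # no more beginning strings in the source
        start += len(begin)
        stop = source.find(end, start)
        if stop == -1:
            break  # beginning string without a matching ending string
        results.append(source[start:stop])
        pos = stop + len(end)  # keep searching after this match
    return results
```

Something like scrape_between(fetch_source(url), begin, end) would then give a list of matches for a public page.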
Here is an example using Yahoo Personals. Say I wanted to scrape the headline/title of each person.
First I go to a page that I want to scrape:
URL -
Code:
http://dating.personals.yahoo.com/results?resulttype=1&searchsource=1&searchview=1&r_gender=1&r_gender_pref=2&r_min_age=18&r_max_age=37&r_has_photo=2&r_locid=24024871&r_loc_ver=2&r_language_pref=1&use_compat=0&gender_select=2&&alt_nsi=&advanced=1&mbm_signup=
Then I view the source and see that the title line looks like this:
Code:
<em>“I NEED A SOUL MATE”</em>
If I want to extract the title lines throughout the entire page, I will need to use the following.
Beginning search string -
Code:
<em>“
End search string -
Code:
”
This would give me the following:
I NEED A SOUL MATE
seeking for marriage
Seeking for a Godfearing man
will be glad to your attention to me!
Serious Relationship
Hello! let's go!
Take my hand, lets Explore something New
a little bit of this and that...
Let's meet
Am look my love
Obviously this is a very simple example, but the tool has a lot of applications and can save you a lot of time in various projects. Keep in mind that, in my experience, it will not work on private pages, only publicly viewable pages.
Example 2: Let's say you wanted to get all of the URLs for the city listings on the following page. Loading the page and viewing the source, we see the following.
Code:
<a href="http://abilene.craigslist.org/">abilene</a>
Input the following in the scraper tool:
URL -
Code:
http://geo.craigslist.org/iso/us
Beginning search string -
Code:
<a href="
End search string -
Code:
">
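To try search strings like these without hitting the live page, a small Python sketch along these lines can help. The sample source is a stand-in (only the abilene link is quoted from the page above), and the variable names are my own, not anything from the tool.

```python
import re

# Stand-in for the page source; the second link is only there as extra sample data.
sample_source = (
    '<a href="http://abilene.craigslist.org/">abilene</a>\n'
    '<a href="http://anchorage.craigslist.org/">anchorage</a>\n'
)

begin = '<a href="'
end = '">'

# re.escape makes both strings match literally;
# (.*?) lazily captures whatever sits between them.
pattern = re.escape(begin) + "(.*?)" + re.escape(end)
city_urls = re.findall(pattern, sample_source, flags=re.DOTALL)

for url in city_urls:
    print(url)
```

Because the match is lazy, each capture stops at the first "> after the beginning string, which is why only the URL comes back and not the link text.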
Example 3: In this example we will scrape source code that has a variable part, in this case a link that changes from post to post. If I wanted to scrape the titles in a section of craigslist, I might view the source and see the following.
Code:
<a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a>
Each title has a different /bik/#### depending on the post. To get around this we can use the wildcard .+? which takes the place of whatever varies from one post to the next.
URL -
Code:
http://anchorage.craigslist.org/bik/
Beginning search string -
Code:
<a href="/bik/.+?">
End search string -
Code:
</a>
We get the following:
Gary Fisher Joshua F-1 frame and forks etc. -
Bridgestone MB3 shockless -
Great Christmas Gift: Ladies Bicycle, Like New! -
Panniers -
Moutain Bike -
Bike Trailer/Jogger -
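Here is a rough Python sketch of how a .+? wildcard in the beginning string can behave. It assumes the search strings are handed to a regular expression engine as-is, which is my guess at the mechanism rather than a description of the tool's code. The second sample link uses a made-up post ID.

```python
import re

# Stand-in for the page source; the first line is the one quoted in Example 3.
sample_source = (
    '<a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a>\n'
    '<a href="/bik/123456789.html"> Panniers -</a>\n'  # post ID made up
)

begin = r'<a href="/bik/.+?">'  # .+? lazily matches the part that varies
end = r'</a>'

# Used unescaped here so that the wildcard keeps its regex meaning.
pattern = begin + r'(.*?)' + end
titles = [title.strip() for title in re.findall(pattern, sample_source)]

for title in titles:
    print(title)
```

The ? after .+ is what keeps the wildcard from swallowing the rest of the line: it matches as little as possible before the closing ">.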
I've got a lot of time savers so I thought I'd give back. A thanks would be appreciated.
P.S. If you are having trouble finding the proper combination of beginning and ending search strings for a URL, post them here and I'll try to help. Also, if you have any suggestions to enhance this, I'd be happy to tweak it and upload a new version.