Custom Scrape Tool PHP

neta1o

Sep 29, 2008
I wrote this tool a while back to quickly scrape content.

Here is how it works.
1) Put the file on one of your servers that supports PHP.
2) Open the page in your browser and enter three things:
- The full URL of the page you want to scrape (e.g. http://www.google.com)
- The beginning search string
- The ending search string

Remember, this works off the source code of the page. Source code changes often, especially on Google pages, so search strings that work today may stop matching later. Also, you sometimes need to experiment a bit before you get what you want.
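
The thread does not show the tool's PHP source, so the following is only a rough sketch in Python of the idea described above, not the author's implementation. The function names `fetch_source` and `extract_between` are made up for illustration, and the search strings are treated as plain text.

```python
import urllib.request


def fetch_source(url):
    """Download a page and return its raw source as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_between(source, begin, end):
    """Return every piece of text found between `begin` and `end`."""
    results = []
    position = 0
    while True:
        start = source.find(begin, position)
        if start == -1:
            break  # no more beginning strings
        start += len(begin)
        stop = source.find(end, start)
        if stop == -1:
            break  # beginning string without a matching end
        results.append(source[start:stop])
        position = stop + len(end)
    return results


# Works on any string, so you can test on pasted source before fetching.
sample = "<li>apple</li><li>pear</li>"
print(extract_between(sample, "<li>", "</li>"))
```

For a live page you would pass `fetch_source(url)` as the first argument instead of a pasted string.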

Here is an example of Yahoo Personals. Say I wanted to scrape the headlines/titles of each person.

First I go to a page that I want to scrape:

URL -
Code:
http://dating.personals.yahoo.com/results?resulttype=1&searchsource=1&searchview=1&r_gender=1&r_gender_pref=2&r_min_age=18&r_max_age=37&r_has_photo=2&r_locid=24024871&r_loc_ver=2&r_language_pref=1&use_compat=0&gender_select=2&&alt_nsi=&advanced=1&mbm_signup=

Then I view the source and see that the title line looks like this:
Code:
<em>“I NEED A SOUL MATE”</em>

If I want to extract the title lines throughout the entire page I will need to use the following.

Beginning search string -
Code:
<em>“
End search string -
Code:
”

This would give me the following:
I NEED A SOUL MATE
seeking for marriage
Seeking for a Godfearing man
will be glad to your attention to me!
Serious Relationship
Hello! let's go!
Take my hand, lets Explore something New
a little bit of this and that...
Let's meet
Am look my love

Obviously this is a very simple example, but the tool has a lot of applications and can save you a lot of time in various projects. Keep in mind that, in my experience, this will not work on private pages, only on publicly viewable pages.

Example 2: Let's say you wanted to get all of the URLs for the city listings on the following page. Loading the page and viewing the source, we see the following.
Code:
<a href="http://abilene.craigslist.org/">abilene</a>

Input the following in the scraper tool

URL -
Code:
http://geo.craigslist.org/iso/us
Beginning search string -
Code:
<a href="
End search string -
Code:
">

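Example 2 can be tried on pasted source with a short Python sketch like the one below. It is an illustration only, not the tool's code. The helper name `extract_literal` is made up, both strings are escaped so they match literally, and the second link in the sample is a made-up entry.

```python
import re


def extract_literal(source, begin, end):
    """Find the text between two literal strings, taking the shortest match each time."""
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, source, flags=re.DOTALL)


sample = (
    '<a href="http://abilene.craigslist.org/">abilene</a>\n'
    '<a href="http://example.craigslist.org/">example city</a>\n'  # made-up second entry
)
for url in extract_literal(sample, '<a href="', '">'):
    print(url)
```

Because `<a href="` matches every link in the source, a real page would also return navigation links, so you may need to filter the results (for instance, keep only URLs containing craigslist.org).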

Example 3: In this example we will scrape source code where part of the surrounding markup varies, in this case a link that is different for every item. If I wanted to scrape the titles in a section of craigslist, I might view the source and see the following.
Code:
<a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a>
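Each title has a different /bik/#### value depending on the post. To get around this we can use the wildcard .+? (a regular-expression pattern that matches one or more characters, as few as possible). The wildcard stands in for the part of the link that varies from post to post.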

URL -
Code:
http://anchorage.craigslist.org/bik/
Beginning search string -
Code:
<a href="/bik/.+?">
End search string -
Code:
</a>

We get the following
Gary Fisher Joshua F-1 frame and forks etc. -
Bridgestone MB3 shockless -
Great Christmas Gift: Ladies Bicycle, Like New! -
Panniers -
Moutain Bike -
Bike Trailer/Jogger -
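
To see how the wildcard behaves, here is a small Python sketch that treats both search strings as regular expressions. This is an assumption about how such a tool could work, not the author's code. `extract_pattern` is a made-up name and the second post ID in the sample is invented.

```python
import re


def extract_pattern(source, begin_pattern, end_pattern):
    """Treat both search strings as regular expressions and capture what lies between them."""
    pattern = begin_pattern + r"(.*?)" + end_pattern
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]


sample = (
    '<a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a>\n'
    '<a href="/bik/000000000.html"> Panniers -</a>\n'  # made-up post ID
)
print(extract_pattern(sample, r'<a href="/bik/.+?">', r"</a>"))
```

The `?` after `.+` keeps the match short, so it stops at the first `">` instead of running on to the last one in the page.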

I've got a lot of time savers so I thought I'd give back. A thanks would be appreciated :)

P.S. If you are having trouble finding the proper combination of beginning and ending search strings for a URL, post them here and I'll try to help. Also, if you have any suggestions to enhance this, I'd be happy to tweak it and upload a new version.
 

Attachments

  • scrape.zip
    803 bytes · Views: 604
It will save you a lot of time. It is also a great way to get content for websites, articles, marketing, etc.

After you've tried it, please leave feedback here. I'd be happy to keep developing this file based on useful requests.

-neta1o
 
How can I scrape content from a search result, for example this one: http://www.google.com/search?hl=en&safe=off&q=backgroundcheckdatas&btnG=Search&aq=f&oq= and have it posted automatically on my page?

I'd like to include the title and description of each result.

The key to scraping content is looking at the page source code and finding the common code that wraps the content you want. Google is probably the most challenging site to scrape because its source changes all the time, so the strings below work right now but may not for long.

I used this to scrape the titles.
Beginning search string:
Code:
<h3 class=r>
End search string:
Code:
</a>

I used this to scrape the descriptions.
Beginning search string:
Code:
<div class="s">
End search string:
Code:
<cite>

As for automatically posting it to your page, I always recommend a quick look through scraped content before posting it. You never know what you could get, so I manually review and clean content before posting it.
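
Part of that cleanup can be scripted before the manual review. Text captured between `<h3 class=r>` and `</a>` will still include the opening link tag, so a sketch like the following (Python, with a made-up helper name `clean_fragment` and a made-up sample fragment) strips tags and decodes entities. It does not replace reading the content yourself.

```python
import html
import re


def clean_fragment(fragment):
    """Strip HTML tags, decode entities, and collapse whitespace in a scraped fragment."""
    without_tags = re.sub(r"<[^>]+>", "", fragment)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Made-up fragment shaped like what sits between the two search strings.
raw = '<a href="http://example.com/" class=l>Background <em>check</em> &amp; records'
print(clean_fragment(raw))
```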
 
I originally hard coded scrape files for various pages, including google. But I found that the source code changed periodically and my scraper would no longer work. I made this custom scraper to test different inputs to find my desired output. Eventually I started using this custom scraper and saving the search strings.
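
One way to keep saved search strings around is a small presets file. The sketch below is only an illustration in Python. The file layout and the names `save_preset` and `load_preset` are assumptions, not something the tool is known to do.

```python
import json
import tempfile
from pathlib import Path


def save_preset(path, name, url, begin, end):
    """Add or update one named preset in a JSON file."""
    file = Path(path)
    presets = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    presets[name] = {"url": url, "begin": begin, "end": end}
    file.write_text(json.dumps(presets, indent=2), encoding="utf-8")


def load_preset(path, name):
    """Return the saved preset as a dict, or None if it was never saved."""
    file = Path(path)
    if not file.exists():
        return None
    return json.loads(file.read_text(encoding="utf-8")).get(name)


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    presets_file = Path(folder) / "presets.json"
    save_preset(presets_file, "craigslist-bikes",
                "http://anchorage.craigslist.org/bik/", '<a href="/bik/.+?">', "</a>")
    print(load_preset(presets_file, "craigslist-bikes"))
```

When a site changes its source, you only update the one preset instead of editing a hard-coded scraper.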

I tried a lot of different scrapers that would work, then break and need updates. With this, the only update you'll ever need is a quick look through the page source away. I'd be happy to help you guys with scraping other pages and finding the right combinations. As mentioned before, I'd also be happy to take suggestions to improve this script.
 
thanks man
This is great! I guess this is good for SEO stuff?

Absolutely. If you are starting a new website on a specific topic, or just trying to build some content, you can steal lots of good stuff and save yourself some valuable time :)
 
Thanks neta1o for the scraper. It does seem like many scrapers come and go because of source code changes, unless you can write your own scripts, which I've experimented with but with limited success.
 
OK, how can I tweak your code to make the result show on the page automatically, without typing any query? For example, I want to create a directory scraped from various websites.

Sorry for the noob question.
 
In the tool's source code you can manually set the default values for the beginning and ending search strings.

Here is the original code
Code:
Beginning: <input name="beg" value="" style="width: 100px;">     End: <input name="end" value="" style="width: 100px;">

This would be an example of the modified code
Code:
Beginning: <input name="beg" value="<h3 class=r>" style="width: 100px;">     End: <input name="end" value="</a>" style="width: 100px;">

You just put the search strings in the value=""
 
I've created a new version that saves the last beginning/ending search string. This will make scraping repetitive websites/directories easier. Download version 1.1 at the link below.
 

Attachments

  • scrape1.1.zip
    803 bytes · Views: 280
I'm also working on a Visual Basic scraper. What kinds of pages and content would you like to be able to scrape? Examples please :)
 
Thanks, but I still need to hit the scrape button to do that. Any advice on making the scraped content appear instantly, without hitting the scrape button?

Thanks again
 
kojakfull, so you want it to automatically scrape with the default values you saved when the page opens?

I can set that up for you, but the website URL and the beg/end strings will have to be static. If this is what you want to do, let me know and I'll modify it for you.
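
For anyone curious what that kind of change looks like, here is a rough Python sketch of scraping with fixed settings and no form. It is not the modified PHP file. The constants reuse Example 3 from earlier in the thread, and the function names are made up.

```python
import re
import urllib.request

# Static settings, assumed here for illustration (taken from Example 3).
URL = "http://anchorage.craigslist.org/bik/"
BEGIN = r'<a href="/bik/.+?">'
END = r"</a>"


def fetch_source(url):
    """Download the page source as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_on_load(fetch=fetch_source):
    """Run the scrape with the saved settings; no form input is needed."""
    source = fetch(URL)
    matches = re.findall(BEGIN + r"(.*?)" + END, source, flags=re.DOTALL)
    return [match.strip() for match in matches]
```

Calling `scrape_on_load()` whenever the page is requested gives the results without a button press. The `fetch` argument is only there so the function can be tried with pasted source instead of a live request.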
 