Custom Scrape Tool PHP

Discussion in 'PHP & Perl' started by neta1o, Dec 1, 2008.

  1. neta1o

    neta1o Regular Member

    Joined:
    Sep 29, 2008
    Messages:
    388
    Likes Received:
    318
    Home Page:
    I wrote this tool a while back to quickly scrape content.

    Here is how this works.
    1) Put the file on one of your servers that supports PHP.
    2) Open the page in your browser and enter 3 things:
    - The full url of the page you want to scrape (http://www.google.com)
    - The beginning search string
    - The ending search string

    Remember this works off of the source code of the page. Source code changes often, especially if you are scraping Google stuff. Also, sometimes you need to experiment a bit before you get what you want.

    Here is an example of Yahoo Personals. Say I wanted to scrape the headlines/titles of each person.

    First I go to a page that I want to scrape:

    URL -
    Code:
    http://dating.personals.yahoo.com/results?resulttype=1&searchsource=1&searchview=1&r_gender=1&r_gender_pref=2&r_min_age=18&r_max_age=37&r_has_photo=2&r_locid=24024871&r_loc_ver=2&r_language_pref=1&use_compat=0&gender_select=2&&alt_nsi=&advanced=1&mbm_signup=
    Then I view the source and I see that the title line looks like this
    Code:
    <em>“I NEED A SOUL MATE”</em>
    If I want to extract the title lines throughout the entire page I will need to use the following.

    Beginning search string -
    Code:
    <em>“
    End search string -
    Code:
    ”
    This would give me the following:
    Obviously this is a very simple example, but it has a lot of applications and can save you a lot of time in various projects. Keep in mind that, in my experience, this will not work on private pages, only publicly viewable pages.
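The begin/end idea from Example 1 can be sketched in a few lines. This is a Python illustration, not the attached PHP file (whose source isn't shown in this thread). The function name `extract_between` and the second sample headline are made up for the demonstration.

```python
def extract_between(source, begin, end):
    """Return every piece of text found between `begin` and `end` in `source`."""
    results = []
    pos = 0
    while True:
        start = source.find(begin, pos)
        if start == -1:
            break  # no more beginning strings
        start += len(begin)
        stop = source.find(end, start)
        if stop == -1:
            break  # beginning string without a matching end
        results.append(source[start:stop])
        pos = stop + len(end)
    return results


# Sample source: the first headline is from the post, the second is invented.
sample = (
    "<li><em>“I NEED A SOUL MATE”</em></li>"
    "<li><em>“LOOKING FOR FUN”</em></li>"
)
titles = extract_between(sample, "<em>“", "”")
print(titles)  # ['I NEED A SOUL MATE', 'LOOKING FOR FUN']
```

Both search strings are treated as plain text here, so this sketch does not understand wildcards.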

    Example 2: Let's say you wanted to get all of the URLs for the city listings on the following page. Loading the page and viewing the source, we see the following.
    Code:
    <a href="http://abilene.craigslist.org/">abilene</a>
    Input the following in the scraper tool

    URL -
    Code:
    http://geo.craigslist.org/iso/us
    Beginning search string -
    Code:
    <a href="
    End search string -
    Code:
    ">
    Example 3: In this example we will scrape source code that contains a variable part, in this case a variable link. If I wanted to scrape the titles in a section of craigslist, I might view the code and see the following.
    Code:
    <a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a>
    
    Each title has a different /bik/#### depending on the post. To get around this we can use the wildcard .+?. This wildcard stands in for whatever text varies from one listing to the next: it matches one or more characters and stops as early as it can.

    URL -
    Code:
    http://anchorage.craigslist.org/bik/
    Beginning search string -
    Code:
    <a href="/bik/.+?">
    End search string -
    Code:
    </a>
    We get the following
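Here is a rough sketch of how the wildcard in Example 3 behaves, assuming the search strings are handled as regular expressions (the thread implies this but does not show the PHP source). It is written in Python; `extract_with_pattern` and the second listing are hypothetical.

```python
import re


def extract_with_pattern(source, begin, end):
    """Treat `begin` and `end` as regular expressions and return the text between them."""
    # (.*?) captures as little as possible, so each match stops at the nearest end string.
    pattern = re.compile(begin + r"(.*?)" + end, re.DOTALL)
    return [text.strip() for text in pattern.findall(source)]


# First link is from the post; the second one is invented for the demo.
sample = (
    '<p><a href="/bik/940033833.html"> Gary Fisher Joshua F-1 frame and forks etc. -</a></p>'
    '<p><a href="/bik/940000000.html"> Hypothetical second listing -</a></p>'
)
titles = extract_with_pattern(sample, r'<a href="/bik/.+?">', r"</a>")
for title in titles:
    print(title)
```

Because the strings are patterns in this sketch, characters such as `.`, `?`, `(` and `[` have special meaning and would need a backslash in front of them to be matched literally.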
    I've got a lot of time savers so I thought I'd give back. A thanks would be appreciated :)

    P.S. If you are having trouble finding the proper combination of beginning and ending search strings for a URL, post them here and I'll try to help. Also, if you have any suggestions to enhance this, I'd be happy to tweak it and upload a new version.
     

    Attached Files:

    • Thanks Thanks x 31
  2. ashilicious

    ashilicious Junior Member

    Joined:
    Aug 14, 2008
    Messages:
    162
    Likes Received:
    79
    Location:
    BᄂΛᄃK ΉΛƬ ЩӨЯᄂD BΛBY
    Wow this looks really useful. Thanks for sharing :)
     
  3. mrusman

    mrusman Newbie

    Joined:
    Feb 20, 2008
    Messages:
    17
    Likes Received:
    7
    Occupation:
    Nothing!
    Location:
    Behind You!
    Thx a lot great information and excellent tool! :D
     
  4. neta1o

    neta1o Regular Member

    It will save you a lot of time. It is also a great way to get content for websites, articles, marketing, etc...

    After you've tried it, please leave feedback here. I'd be happy to keep developing this file based on useful requests.

    -neta1o
     
  5. kojakfull

    kojakfull Senior Member

    Joined:
    Jan 13, 2008
    Messages:
    851
    Likes Received:
    1,050
    Location:
    CustomBotSolutions.com
    Home Page:
    Last edited: Dec 2, 2008
  6. neta1o

    neta1o Regular Member

    The key to scraping content is looking at the page source code and finding the common code encapsulating the content you want. Google is probably the most challenging to scrape. The strings below work right now (and I say right now because they change all of the time).

    I used this to scrape the titles.
    Beginning search string:
    Code:
    <h3 class=r>
    End search string:
    Code:
    </a>
    I used this to scrape the descriptions.
    Beginning search string:
    Code:
    <div class="s">
    End search string:
    Code:
    <cite>
    As for automatically posting it to your page, I always recommend a quick look through scraped content before posting it. You never know what you could get, so I manually review and clean content before posting it.
     
  7. ashilicious

    ashilicious Junior Member

    I just spent $250 on a google scraper... I think I wasted my money.

    Thanks again.
     
  8. neta1o

    neta1o Regular Member

    I originally hard-coded scrape files for various pages, including Google. But I found that the source code changed periodically and my scraper would no longer work. I made this custom scraper to test different inputs to find my desired output. Eventually I started using this custom scraper and saving the search strings.

    I tried a lot of different scrapers that would work and break and need updates. With this, the only update you'll ever need is a little code search away. I'd be happy to help you guys with scraping other pages and finding the right combinations. As mentioned before, I'd also be happy to take suggestions to improve this script.
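One possible way to keep saved search strings around is a small preset table that can be written to a file and edited when a site changes its source. This is only a sketch in Python: the preset names and the JSON layout are assumptions, while the begin/end strings are the ones posted in this thread.

```python
import json

# Preset names are hypothetical; the strings come from earlier posts in this thread.
PRESETS = {
    "google_titles": {"begin": "<h3 class=r>", "end": "</a>"},
    "google_descriptions": {"begin": '<div class="s">', "end": "<cite>"},
    "craigslist_city_urls": {"begin": '<a href="', "end": '">'},
}


def dump_presets(presets):
    """Serialize the presets to JSON text that can be saved to a file."""
    return json.dumps(presets, indent=2, sort_keys=True)


def load_presets(text):
    """Read presets back from JSON text."""
    return json.loads(text)


saved_text = dump_presets(PRESETS)
restored = load_presets(saved_text)
print(restored["google_titles"]["begin"], restored["google_titles"]["end"])
```

When a site changes its markup, only the matching entry needs a new pair of strings.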
     
  9. r-webb-k

    r-webb-k BANNED BANNED

    Joined:
    Dec 19, 2006
    Messages:
    373
    Likes Received:
    407
    can you code a wordpress ******** url harvester??? that will be really useful
     
  10. mankan

    mankan BANNED BANNED

    Joined:
    Nov 28, 2008
    Messages:
    43
    Likes Received:
    5
    thanks man
    this is great !!!! this is good for i guess SEO stuff ?!
     
  11. neta1o

    neta1o Regular Member

    If you can give me two urls (one with and one without ******** links) this could probably be done.
     
  12. neta1o

    neta1o Regular Member

    Absolutely. If you are starting a new website on a specific topic, or if you are just trying to build some content, you can steal lots of good stuff and save yourself some valuable time :)
     
  13. glew

    glew Junior Member

    Joined:
    Feb 10, 2008
    Messages:
    141
    Likes Received:
    93
    Thanks neta1o for the scraper. It does seem like many come and go because of source code changes, unless you can write your own scripts, which I've experimented with but have had limited success.
     
  14. kojakfull

    kojakfull Senior Member

    OK, how can I tweak your code to make the results show on the page automatically without typing any query? For example, I want to create a directory scraped from various websites.

    Sorry for the noob question..
     
  15. neta1o

    neta1o Regular Member

    In the source code you can manually set the default values for the beginning and ending search string.

    Here is the original code
    Code:
    Beginning: <input name="beg" value="" style="width: 100px;">     End: <input name="end" value="" style="width: 100px;">
    This would be an example of the modified code
    Code:
    Beginning: <input name="beg" value="<h3 class=r>" style="width: 100px;">     End: <input name="end" value="</a>" style="width: 100px;">
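    You just put the search strings in the value="". If a search string contains a double quote, write it as &quot; so it does not end the value early.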
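Search strings that contain double quotes (for example `<div class="s">`) need escaping before they go into `value=""`. Here is a small Python sketch of that escaping step; `input_tag` is a hypothetical helper, not part of the attached file. In PHP, `htmlspecialchars()` does the same job.

```python
from html import escape


def input_tag(name, value):
    """Build an <input> tag with `value` escaped so quotes and angle brackets are safe."""
    return '<input name="{}" value="{}" style="width: 100px;">'.format(
        name, escape(value, quote=True)
    )


tag = input_tag("beg", '<div class="s">')
print(tag)
# <input name="beg" value="&lt;div class=&quot;s&quot;&gt;" style="width: 100px;">
```

The browser turns the escaped text back into the original search string when the form is submitted.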
     
  16. neta1o

    neta1o Regular Member

    I've created a new version that saves the last beginning/ending search string. This will make scraping repetitive websites/directories easier. Download version 1.1 at the link below.
     

    Attached Files:

    • Thanks Thanks x 3
  17. neta1o

    neta1o Regular Member

    I'm also working on a Visual Basic scraper. What kinds of pages and content would you like to be able to scrape? Examples please :)
     
  18. kojakfull

    kojakfull Senior Member

    Thanks, but I still need to hit the scrape button to do that. Any advice on making the scraped contents appear instantly without hitting the scrape button?

    Thanks again
     
  19. neta1o

    neta1o Regular Member

    kojakfull, so you want it to automatically scrape with the default values you saved when the page opens?

    I can set that for you but you will have to have a website and beg/end strings that are static. If this is what you want to do let me know and I'll modify it for you.
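This is roughly what "scrape on page open" with static values could look like, sketched in Python rather than as a change to the PHP file. The constant names, `fetch_page`, `scrape`, and `fake_fetch` are all hypothetical. The URL and strings are the Example 2 values from the first post, and the search strings are treated as literal text.

```python
import re
from urllib.request import urlopen

# Static defaults (values from Example 2); edit these to fit your own target page.
DEFAULT_URL = "http://geo.craigslist.org/iso/us"
DEFAULT_BEGIN = '<a href="'
DEFAULT_END = '">'


def fetch_page(url):
    """Download a publicly viewable page and return its source as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(url=DEFAULT_URL, begin=DEFAULT_BEGIN, end=DEFAULT_END, fetch=fetch_page):
    """Fetch `url` and return everything between the literal begin and end strings."""
    source = fetch(url)
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, source, re.DOTALL)


def fake_fetch(url):
    """Offline stand-in for fetch_page so the demo does not hit the network."""
    return '<a href="http://abilene.craigslist.org/">abilene</a>'


results = scrape(fetch=fake_fetch)
print(results)  # ['http://abilene.craigslist.org/']
```

Calling `scrape()` with no arguments would run against the real default URL as soon as the script starts, with no form to submit.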
     
  20. Panique

    Panique Power Member

    Joined:
    Sep 21, 2008
    Messages:
    589
    Likes Received:
    412
    Location:
    Caribbean Islands
    Home Page:
    Thank you for sharing your tool!
     
    • Thanks Thanks x 1