
Building Online URL Scraper - Need Testers...

Discussion in 'BlackHat Lounge' started by SpamHat, Sep 4, 2009.

  1. SpamHat

    SpamHat Junior Member Premium Member

    Joined:
    Apr 27, 2009
    Messages:
    151
    Likes Received:
    67
    Location:
    UK
    Hey :)

    I'm writing an AJAX app that scrapes Google URLs. I've gone over it a few times, but I'm probably missing something.

    Anyone who can be bothered can you check this out:
    http://gScrape.com
    (Needs FireFox 3.5)

    Looking for bugs etc

    Cheers
     
    • Thanks Thanks x 3
  2. SpamHat

    SpamHat Junior Member Premium Member

    Yes, but with Google et al. you would search, copy and paste each URL into a text file, go to page 2, copy and paste again... etc.

    With this it's a few clicks and you can have thousands of URLs.

    You seem pretty new (as you say) so here are some examples:


    Someone needs content for a website.
    They enter a google footprint into gScrape like this:
    site:goarticles.com bingo
    And that will return a nice big list of article URLs on goarticles.
    You can then use this list to scrape the articles with other tools.



    If you've got some SEO software like AutoPligg (or whatever) and are looking for Pligg URLs (or any other site's URLs) to plug into the software, you can do something like this:
    "powered by pligg"


    Or you want a big list of do-follow blogs to comment on:
    "powered by wordpress" +********


    and so on.

    It's a flexible way of powering other software and squeezing some of the goodness out of Google for other purposes.
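The flow described above (footprint in, pages of result URLs out) can be sketched as a small query builder. This is purely illustrative - the function name, the results-per-page value, and the URL shape are my assumptions, not gScrape's actual code:

```javascript
// Hypothetical sketch: turn a footprint into a list of paged Google query URLs.
// The num/start parameters and 100-results-per-page choice are assumptions.
function buildQueryUrls(footprint, pages) {
  var urls = [];
  for (var page = 0; page < pages; page++) {
    urls.push(
      "http://www.google.com/search?q=" +
      encodeURIComponent(footprint) +          // footprints contain spaces, quotes, colons
      "&num=100&start=" + (page * 100)          // page through results via start=
    );
  }
  return urls;
}
```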




    btw: how did you get on with the actual functioning of the app? Are you using FF 3.5+?
     
  3. SpamHat

    SpamHat Junior Member Premium Member

    Thanks for testing it out :)

    The URLs that are returned are similar to Google's, so to get better results you just need to tune your footprint.

    For example, to get articles for AC you could use this footprint:
    site:associatedcontent.com inurl:article intitle:"YOUR KEYWORD(S) HERE"
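    A footprint like that is just string assembly from standard Google operators (site:, inurl:, intitle:). A hypothetical helper, with made-up names, assembling the same shape:

```javascript
// Hypothetical helper assembling an article-hunting footprint from its parts.
// The operators themselves (site:, inurl:, intitle:) are standard Google syntax.
function articleFootprint(domain, pathHint, keywords) {
  return 'site:' + domain +
         ' inurl:' + pathHint +
         ' intitle:"' + keywords + '"';   // quote the keywords for an exact-phrase title match
}
```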

    That gives me an idea though... a nice one.
    Thanks again :D
     
  4. iglow

    iglow Elite Member

    Joined:
    Feb 20, 2009
    Messages:
    2,081
    Likes Received:
    856
    Didn't work too well for me.
    Nothing scraped - I've got the newest FF.
     
  5. SpamHat

    SpamHat Junior Member Premium Member

    @scudder: It only needs 3.5+. I'm running 3.5.2 like you.

    @iglow: any JS errors? You did click the LoadUrls button after it finished, right?
     
  6. redsasy

    redsasy Newbie

    Joined:
    Feb 16, 2009
    Messages:
    45
    Likes Received:
    78
    Occupation:
    In a relationship with a female hacker
    Location:
    Bulgaria
    Works great with FF 3.5.1
     
  7. mrsmirf

    mrsmirf Junior Member

    Joined:
    Aug 2, 2008
    Messages:
    105
    Likes Received:
    98
    I have FF 3.5.3. Your website didn't work. It just says "Scraping: 'Keyword'" and nothing happens afterwards.
     
  8. fatboy

    fatboy Elite Member

    Joined:
    Aug 13, 2008
    Messages:
    1,618
    Likes Received:
    3,227
    Occupation:
    Retired
    Location:
    Old Peoples Home
    Working fine here mate - cool site :)
    No probs, everything worked as it should have for me.
     
  9. acp0rnstar

    acp0rnstar Junior Member

    Joined:
    Sep 29, 2008
    Messages:
    161
    Likes Received:
    153
    Occupation:
    Internet slacker
    Location:
    Los Angeles
    Same thing here. It just hangs. I tried with FF 3.5.3 and Chrome 4.0xx
     
  10. Velvet

    Velvet Junior Member

    Joined:
    Feb 18, 2009
    Messages:
    199
    Likes Received:
    146
    working fine for me with FF 3.5.3
    registered for the huge amount of urls :)
     
  11. Mesach

    Mesach Junior Member

    Joined:
    Oct 1, 2009
    Messages:
    108
    Likes Received:
    39
    I tested on 3.5.3 and it works great. However, I did 2 queries, and when I ran the 2nd one, the results from the first, completely unrelated query came up at the beginning of the results.
     
  12. SpamHat

    SpamHat Junior Member Premium Member

    Yeah it's built so it remembers the URLs from ALL queries so you can download them at the end.

    To clear the cache just refresh the page.
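    That remember-everything-until-refresh behaviour can be modelled with a simple in-memory store. A sketch under my own naming - gScrape's internals are unknown:

```javascript
// Sketch of an accumulate-across-queries URL store. Because it lives in page
// memory, a refresh drops it - which is why refreshing "clears the cache".
var urlStore = {
  urls: [],
  seen: {},                      // plain object used as a set; works in old browsers
  add: function (url) {
    if (!this.seen[url]) {       // exact duplicates are dropped...
      this.seen[url] = true;
      this.urls.push(url);       // ...but distinct URLs accumulate across queries
    }
  },
  count: function () { return this.urls.length; }
};
```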
     
  13. iPwnJ00

    iPwnJ00 Junior Member

    Joined:
    Mar 10, 2009
    Messages:
    132
    Likes Received:
    21
    Location:
    Melbourne, Australia
    I'm using FF 3.7a1pre, otherwise known as Minefield, and it says that I need FF 3.5...

    But I can just spoof it to 3.5.

    Why does it need FF3.5? What function are you taking from FF to make it FF-only?
     
  14. iPwnJ00

    iPwnJ00 Junior Member

    Looks good, works well.

    Found a bug though. If you re-search for the same keyword, it doesn't check whether the URL is already in memory and therefore will just repeat the same URLs.
     
  15. SpamHat

    SpamHat Junior Member Premium Member

    I'll take a look at the user-agent verification - it should be letting you through. It could technically work with IE8 as well but would need some changes - I'm planning to do that sometime.

    I needed a way to manage all the URLs in the user's browser without the browser crashing, so I'm using something only modern browsers have :)

    The best thing about it is that once the page has loaded in your browser, it doesn't talk to the server - all the scraping is done client-side with JS.
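    The client-side extraction step could look roughly like this - a regex sketch under the assumption that result links appear as plain href attributes (real result markup needs more filtering, and the actual parsing code isn't shown in this thread):

```javascript
// Rough sketch: pull absolute http(s) href values out of a fetched page's HTML.
// Assumes plain href="..." anchors; a real scraper must also filter out
// navigation links, ads, and the search engine's own URLs.
function extractUrls(html) {
  var urls = [];
  var re = /href="(https?:\/\/[^"]+)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}
```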


    hmmmmmmmm

    I just searched for "cat" without quotes, twice and didn't get any dupes.

    It will give you multiple URLs from the same domain if it finds them, but shouldn't give dupes.

    What keyword were you using?
     
  16. submitoke

    submitoke Newbie

    Joined:
    Dec 6, 2008
    Messages:
    16
    Likes Received:
    0
    My FF is not on 3.5, but I've used this...
    Code:
    http://goohackle.com/scripts/google_parser.php
     
  17. HenryHavoc

    HenryHavoc Jr. VIP Jr. VIP

    Joined:
    Mar 24, 2008
    Messages:
    789
    Likes Received:
    1,519
    Occupation:
    Hustler
    Location:
    Cincinnati
    After you make a search and clear the list, a re-search adds to the number of URLs scraped. Idk if that's on purpose. AMAZING overall! Bookmarking now :)
     
  18. SpamHat

    SpamHat Junior Member Premium Member

    There are tons of small scripts like that - it requires a captcha and only gives a few results. I wanted to make something that was fast, didn't get banned (ever), and gave unlimited results.


    Yeah, if I'm understanding you correctly, yes it's on purpose.

    Thanks for the feedback guys :cool2:
     
  19. shadowpwner

    shadowpwner Regular Member

    Joined:
    Apr 19, 2009
    Messages:
    300
    Likes Received:
    73
    You, sir, win a couple internets.

    Awesome script, works fine. However, signing up, I had to try twice. This is probably Aweber's fault.

    Some advice: the screen froze up when I was trying to get 2000 keywords. Could you output it as a downloadable file instead of as browser text?
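    That suggestion - serve the list as a download rather than rendering thousands of lines into the DOM - could be done entirely client-side with a data: URI. A hypothetical sketch (function names are mine, not gScrape's):

```javascript
// Sketch of offering the scraped list as a downloadable text file instead of
// inserting it into the page, which is what freezes the browser on big lists.
function urlsToText(urls) {
  return urls.join("\n");                 // one URL per line
}

// Browser-only (hypothetical): point an <a> element's href at this so the
// list downloads as plain text instead of being rendered.
function makeDownloadHref(urls) {
  return "data:text/plain;charset=utf-8," + encodeURIComponent(urlsToText(urls));
}
```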
     
    Last edited: Oct 11, 2009
  20. dizz

    dizz Elite Member

    Joined:
    May 19, 2009
    Messages:
    2,068
    Likes Received:
    1,774
    Occupation:
    This... AND MORE!! :D
    Location:
    Texas
    If you need any more testers, we would like to try it out.

    Thanks