
Best method of scraping WP sites?

Discussion in 'General Programming Chat' started by r000k, Jul 30, 2013.

  1. r000k

    r000k Registered Member

    Joined:
    Jan 10, 2013
    Messages:
    66
    Likes Received:
    30
    So guys, what's the best way to scrape a WordPress site? I'm not looking to copy a whole site, just scrape certain articles and whatnot.

    I'm sure SB (ScrapeBox, which I own) or maybe another tool could do it, but I like making my own basic tools to learn, and so I can load up 100 sites, hit one button and have it spit out my desired results in my desired format.

    Not all the sites I'm targeting have sitemaps, so scraping site.com/sitemap.xml won't work.
    Most sites have different permalink structures, so I can't really just scrape site.com/page_1 etc.

    Is there a slight trick I'm missing for easy scraping?

    I've started writing a (crappish) spider and doing it that way; I'm about halfway through. I guess that method will work for non-WP sites as well...
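
For reference, a spider of the kind described here can be sketched in a few lines of Python. This is only a minimal sketch, not the poster's code: the names `crawl`, `fetch` and `HREF_RE` are made up for illustration, it assumes pages decode as UTF-8, and it pulls links with a deliberately loose regex rather than a real HTML parser. It follows links breadth-first and stays on the starting host, so it does not depend on sitemaps or on any particular permalink structure.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Loose pattern for href values; good enough for a sketch, not a full HTML parser.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl that only follows links on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            page = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for href in HREF_RE.findall(page):
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://site.com/")` would return the list of same-host URLs it managed to load, capped at `max_pages` so a large site cannot run away with memory or bandwidth.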

    Cheers,
     
  2. Schvamp

    Schvamp Power Member

    Joined:
    Feb 13, 2012
    Messages:
    684
    Likes Received:
    549
    Location:
    Hogwarts
    Why would you create your own spider if you have ScrapeBox?
    Scrape sites with a WP footprint in SB -> save the list and scrape the list with the 'site:' operator, that way you will get a lot of URLs to inner pages -> send that list to an article grabber or whatever.
     
  3. sire243

    sire243 Regular Member

    Joined:
    Jun 23, 2010
    Messages:
    261
    Likes Received:
    113
    Best way would be to code it yourself. Use regex since it's faster. What I would do is:

    1) Use ScrapeBox to scrape all the URLs under that domain.
    2) Then just create a simple scraper that visits each of those URLs and scrapes the page.
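
Step 2 could look roughly like the sketch below. It is an assumption-heavy illustration rather than a finished scraper: `scrape_article`, `clean` and the three patterns are hypothetical names, and it simply treats every `<p>` on the page as article text, which will also pick up sidebar or comment paragraphs on many themes.

```python
import html
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# \b stops the pattern from matching <pre>, <param> and similar tags.
PARA_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inline tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub("", fragment)
    return " ".join(html.unescape(text).split())


def scrape_article(page_html):
    """Return the page title and the text of all <p> elements."""
    match = TITLE_RE.search(page_html)
    title = clean(match.group(1)) if match else ""
    paragraphs = [clean(p) for p in PARA_RE.findall(page_html)]
    return {"title": title, "body": "\n".join(p for p in paragraphs if p)}
```

Feeding it the HTML of each URL from step 1 gives one `{"title": ..., "body": ...}` dictionary per page. If a given theme wraps posts in a known container, narrowing the patterns to that container would cut down on the noise.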

    Or you could be all complicated and code up a spider, but I wouldn't advise that since it will hog resources like a bitch. (And besides, you're still learning)
     
  4. r000k

    r000k Registered Member

    Joined:
    Jan 10, 2013
    Messages:
    66
    Likes Received:
    30
    I haven't tried to code in about 10 months; I'm mainly doing it to brush up on my limited skill set. Plus, when I'm finished I'll just be able to load a list of sites and have it spit out the data as CSV.

    Yeah, the best way would be to use ScrapeBox I guess, but I think I'll attempt making a crummy spider. I've got no problems with using regexes and threading, and I'll just run it on my VPS... if I fail, I'll just resort to SB.
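
The "load a list of sites, use threads, spit out CSV" idea might be sketched like this. It is only an illustration under assumptions: `scrape_to_csv` and `fetch_status` are invented names, the columns (`url`, `ok`, `length`) are placeholders for whatever data is actually wanted, and a thread pool from the standard library stands in for hand-rolled threading.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_status(url, fetch_page):
    """Fetch one URL and summarise the result as a CSV row."""
    try:
        page = fetch_page(url)
        return {"url": url, "ok": True, "length": len(page)}
    except Exception:
        return {"url": url, "ok": False, "length": 0}


def scrape_to_csv(urls, out_file, fetch_page=fetch, workers=8):
    """Fetch all URLs on a thread pool and write one CSV row per URL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the rows in the same order as the input list.
        rows = list(pool.map(lambda u: fetch_status(u, fetch_page), urls))
    writer = csv.DictWriter(out_file, fieldnames=["url", "ok", "length"])
    writer.writeheader()
    writer.writerows(rows)
    return rows
```

A typical call would be `scrape_to_csv(urls, open("results.csv", "w", newline=""))`, with `urls` read from a text file of sites. Threads suit this job because the work is mostly waiting on the network, but a modest `workers` value is kinder to both the VPS and the target sites.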


    Thanks for the replies.
     
  5. sire243

    sire243 Regular Member

    Joined:
    Jun 23, 2010
    Messages:
    261
    Likes Received:
    113
    Maybe this will help you then.

    http://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website/
    It's in PHP, but I'm sure you'll get some ideas.
     
  6. r000k

    r000k Registered Member

    Joined:
    Jan 10, 2013
    Messages:
    66
    Likes Received:
    30
  7. soundclicktop

    soundclicktop Newbie

    Joined:
    Aug 23, 2013
    Messages:
    10
    Likes Received:
    0
    I need soundclick top / program / chart increaser