
Scraping and Crawling

Discussion in 'General Programming Chat' started by ugorrogu, Nov 11, 2008.

  1. ugorrogu

    ugorrogu Registered Member

    Joined:
    May 27, 2008
    Messages:
    66
    Likes Received:
    4
    Does anybody have any good resources on scraping and crawling websites? I am thinking of writing my own scraper/crawler in C# as a fun little project, but I also want it to be really resource-efficient.

    Thanks,
    UgorrogU
     
  2. shodan

    shodan Newbie

    Joined:
    Dec 11, 2007
    Messages:
    25
    Likes Received:
    4
    Google for "screen scraping c#" and "multithreading c#"; that should at least give you some ideas on how to start.
    IMO there are probably other languages better suited to this: Ruby, PHP, and Python have some very nice, easy-to-use scraping libraries available.
     
  3. ugorrogu

    ugorrogu Registered Member

    Shodan,

    Thanks for the reply. The only reason I want to do it in C# is that I want a solution that does not depend on a server. Also, I haven't done too many projects in C#, and I think this is a good excuse to do one.

    ugorrogu
     
  4. lamlam

    lamlam Junior Member

    Joined:
    Oct 25, 2008
    Messages:
    134
    Likes Received:
    854
    Occupation:
    What do you think?
    Location:
    In my home...
    The goal of scraping is to isolate wanted text. Find something common within all the items you wish to scrape, like the tags. Then read the source line by line and isolate the interesting line first. After you have that, it's just a matter of splitting. That's what I do in all my software.
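
    A minimal sketch of that line-by-line approach, written in Python for brevity (the same idea ports to C# with string Contains and Split). The function name scrape_lines, the sample markup, and the marker and delimiter strings are all made up for illustration; they are not from any real site or library.

```python
def scrape_lines(source, marker, start, end):
    """Return the text between `start` and `end` on each line containing `marker`.

    All four parameters are hypothetical names chosen for this sketch.
    """
    results = []
    for line in source.splitlines():
        # Step 1: isolate the interesting lines via something they all share.
        if marker not in line:
            continue
        # Step 2: split once on the opening delimiter...
        parts = line.split(start, 1)
        if len(parts) < 2:
            continue
        # ...then once on the closing delimiter, keeping what sits in between.
        value = parts[1].split(end, 1)[0]
        results.append(value.strip())
    return results


# Made-up page source standing in for a downloaded document.
SAMPLE = """<html>
<li class="item">First title</li>
<p>unrelated</p>
<li class="item"> Second title </li>
</html>"""

titles = scrape_lines(SAMPLE, 'class="item"', '">', "</li>")
print(titles)
```

    This only works when each wanted item sits on a single line. If the markup wraps across lines or the delimiters vary, a proper HTML parser is likely the safer choice.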
     
  5. ortal

    ortal Junior Member

    Joined:
    May 27, 2008
    Messages:
    106
    Likes Received:
    10
    Run Perl with ActivePerl or a Perl IDE for the PC.