help scraping Gnews

Discussion in 'Visual Basic .NET' started by specopkirbs, Mar 20, 2011.

  1. specopkirbs

    specopkirbs BANNED

    Joined:
    Nov 28, 2008
    Messages:
    920
    Likes Received:
    746
    OK, so I'm making a bot that scrapes Google News articles based on a keyword, which should be fairly straightforward. However...

    Because each site on Google News is different (BBC, CNN, etc.) and has different formatting, when I scrape the source I'm struggling to find a way to extract just the main article text without the rest of the crap that comes with it.

    It's easy enough to scrape text if you know the site you're trying to scrape, but because there are literally thousands of different sites on Google News and each one is different, I'm finding this an impossible task.

    Does anyone know of a solution?
    I'm coding in VB.NET.
     
  2. johnniew

    johnniew Jr. VIP Premium Member

    Joined:
    Jun 12, 2009
    Messages:
    182
    Likes Received:
    165
    Search for patterns in the text and use regular expressions to extract the text you need. It could help.
     
  3. andee

    andee Regular Member

    Joined:
    Jul 24, 2010
    Messages:
    218
    Likes Received:
    83
    The articles are contained between div tags. Here are four random ones. There are certain patterns to the names.

    <div id ="storybody">
    <div id ="articlecontent">
    <div id ="storycontent">
    <div id ="main-content">


    Build up a list of as many div id names as you can; it shouldn't be too hard.

    Create a function to scrape the ids of all div tags from the page, e.g. with a regex like <div id =.*?>

    If any of those ids matches one on your list, scrape the contents of that tag.

    Something like that, dunno...
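
    A rough sketch of the div id list idea, for illustration only. It is written in Python rather than VB.NET, and the names KNOWN_ARTICLE_IDS, find_div_ids, DivTextExtractor and extract_article are made up for this example. It assumes the article sits in a single div whose id is on the list, and it counts nested divs so the match does not stop at the first inner closing tag, which a plain regex would do.

```python
import re
from html.parser import HTMLParser

# Hypothetical starter list of container ids; extend it as new sites turn up.
KNOWN_ARTICLE_IDS = {"storybody", "articlecontent", "storycontent", "main-content"}

# Same idea as <div id =.*?> but capturing only the id value.
DIV_ID_RE = re.compile(r'<div\b[^>]*?\sid\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def find_div_ids(html):
    """Return every div id found in the page source, in document order."""
    return DIV_ID_RE.findall(html)


class DivTextExtractor(HTMLParser):
    """Collects the text inside the first div whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth inside the target; 0 = outside
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1          # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True     # left the target div; ignore the rest

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html):
    """Return the text of the first div whose id is on the list, else None."""
    for div_id in find_div_ids(html):
        if div_id.lower() in KNOWN_ARTICLE_IDS:
            parser = DivTextExtractor(div_id)
            parser.feed(html)
            return " ".join(parser.chunks)
    return None
```

    Anything nested inside the matched div (ads, related links, inline scripts) comes along with the text, so some cleanup would probably still be needed per site.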
     
  4. specopkirbs

    specopkirbs BANNED

    Joined:
    Nov 28, 2008
    Messages:
    920
    Likes Received:
    746
    Sounds about right. The only issue comes if it tries to scrape some random blog that's made it onto Google News where they have customised the div tags, so I'll look at using regex with that and see what I can come up with.
     
  5. gnote

    gnote Registered Member

    Joined:
    Mar 10, 2009
    Messages:
    80
    Likes Received:
    6
    Occupation:
    Programmer
    Location:
    USA
    There is no magic scrape function you will find. The only way to get perfect matches is to hand-code as many sites as you can find and store a custom regex for each one.

    I would code the most popular sites linked from Google News first, and store a domain dictionary of all the sites you have coded a regex for. This way, when it comes time to spider out, you can check the dictionary to see if there is a regex pattern, and when there is not, don't bother scraping that site.

    EDIT: I would also dump a log of all the domains that your bot comes across and doesn't have a regex pattern for. This way you can go back and code anything that's missing.
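
    A minimal sketch of the domain dictionary plus the missing-domain log, for illustration only. It is in Python rather than VB.NET, and the names DOMAIN_PATTERNS, unknown_domains, domain_of, scrape_article and dump_unknown_domains are invented for this example, as are the news.example.com entry and its pattern. Each real site would need its own hand-written regex.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain patterns; the single entry here is a placeholder.
DOMAIN_PATTERNS = {
    "news.example.com": re.compile(
        r'<div id="storybody">(.*?)</div>', re.DOTALL | re.IGNORECASE
    ),
}

# Domains the bot has seen but has no pattern for yet.
unknown_domains = set()


def domain_of(url):
    """Lower-cased host name without port or a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def scrape_article(url, html):
    """Return the article text if the domain has a pattern, else None."""
    domain = domain_of(url)
    pattern = DOMAIN_PATTERNS.get(domain)
    if pattern is None:
        unknown_domains.add(domain)   # remember it so a pattern can be added later
        return None
    match = pattern.search(html)
    return match.group(1).strip() if match else None


def dump_unknown_domains(path):
    """Write the uncovered domains to a text file, one per line, sorted."""
    with open(path, "w", encoding="utf-8") as f:
        for domain in sorted(unknown_domains):
            f.write(domain + "\n")
```

    Calling dump_unknown_domains at the end of a crawl gives the list of sites still to hand-code. Counting how often each domain turns up, rather than only storing it, would show which ones to do first.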
     
  6. specopkirbs

    specopkirbs BANNED

    Joined:
    Nov 28, 2008
    Messages:
    920
    Likes Received:
    746
    Great thinking. I've already started the process, but that's given me a few ideas.
    Thanks.
     
  7. trooper

    trooper Regular Member

    Joined:
    Jun 5, 2009
    Messages:
    207
    Likes Received:
    210
    Location:
    Front lines
    I agree with gnote.

    Also, this .NET library may help :)

    Code:
    http://htmlagilitypack.codeplex.com/
     
  8. specopkirbs

    specopkirbs BANNED

    Joined:
    Nov 28, 2008
    Messages:
    920
    Likes Received:
    746
    Thanks mate. I've used the HTML Agility Pack before, but I find I have more freedom with regex. Might use it though.
    For those who are interested in what I'm doing: basically I've created a trend grabber that grabs the top 20 current hot trends on Google, the top 10 current Twitter trends, top finance trends, movie trends, news trends, sports trends, business trends, etc.
    I'm going to get it to scrape Google News for the keyword you select to grab articles, and then add the ability to spin the article and post it to your blog.
     
  9. captchaman

    captchaman Junior Member

    Joined:
    Sep 16, 2010
    Messages:
    190
    Likes Received:
    842
    Occupation:
    Software Programmer
    Location:
    USA
    Last edited: Mar 23, 2011