
Learning how to scrape Google. New to this. C#

Discussion in 'C, C++, C#' started by seeplusplus, May 31, 2014.

  1. seeplusplus

    seeplusplus Power Member

    Joined:
    Aug 18, 2008
    Messages:
    511
    Likes Received:
    163
    So I've been wanting to learn how to scrape the web for a long time and now I have some time I've decided to give it a go.

    After some looking around I've decided to go with C#, as I know C++ the best.

    Goal: Scrape search results from Google.

    Tools: Visual Studio and HtmlAgilityPack

    Point of thread: So anyone else wanting to learn this can join in too; we can all post our progress, problems, etc. in this thread.


    Progress so far:

    I'm using Visual Studio 2010.

    Have just installed NuGet (package manager?) from HERE. I just clicked the Download link and Visual Studio recognised the link, so it installed no problem.

    Now when I open up V.S. I can go to the menu Tools > NuGet Package Manager and in the bottom window of the IDE I see the PM prompt.

    I entered:

    Which gave me an error saying I didn't have a solution open, so I created a new Windows Form project and tried again, after which it worked :)

    Next I've added a text box into the form and a button. The text field I'll use to enter the search query, the button will start the scraper.

    1.jpg


    Next task - Figure out how to navigate to Google and enter the search query.

    Should I use an instance of the WebRequest class do you think? - IF ANYONE HAS ANY ADVICE ON HOW I SHOULD DO THIS PLEASE DO TELL!!!


    Thanks!
     
    Last edited: May 31, 2014
  2. arpitagarwal82

    arpitagarwal82 Power Member

    Joined:
    Feb 20, 2008
    Messages:
    709
    Likes Received:
    456
    Location:
    Localhost
    Or you can buy scrapebox or turboranker scraper
     
  3. arganrecords

    arganrecords Elite Member

    Joined:
    Oct 12, 2013
    Messages:
    1,611
    Likes Received:
    1,904
    Occupation:
    I think I'm Marketer
    Location:
    Italy
    I think the OP's intent is more didactic: to learn the programming side of web scraping.
     
  4. arpitagarwal82

    arpitagarwal82 Power Member

    Joined:
    Feb 20, 2008
    Messages:
    709
    Likes Received:
    456
    Location:
    Localhost
    Ok. Just noticed that this thread is in the programming section.
    Actually my point was that those tools are really cheap and efficient, so there is no point in investing much time and energy in coding a custom one unless you have very specific scraping requirements.
     
  5. Tokarev

    Tokarev BANNED BANNED

    Joined:
    May 3, 2014
    Messages:
    106
    Likes Received:
    41
    OP, do you have any prior experience with these languages and packages? I'm very interested in building a specific scraper as well.
     
  6. seeplusplus

    seeplusplus Power Member

    Joined:
    Aug 18, 2008
    Messages:
    511
    Likes Received:
    163
    Thanks for the quick replies guys.

    I do have Scrapebox, I wanna learn how to scrape myself though :)

    Tokarev - No experience with C# and the HtmlAgility package, my programming knowledge was from university where we did C++ primarily.

    I just figured scraping the Google results would be a good first step and throw up some curve balls which we could learn from.

    Purely educational for now.
     
    Last edited: May 31, 2014
  7. thejesus

    thejesus Registered Member

    Joined:
    Sep 29, 2008
    Messages:
    76
    Likes Received:
    22
    The search query for Google is easy, you just need to figure out the URLs and remove all the useless stuff. For example, I got this:

    Code:
    https://www.google.com/#q=some+search
    So you can easily build your own URL for your own search.

    Code:
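    string search = "search for this thing"; // The thing you're going to search for
    // Anything after '#' is a URL fragment and is never sent to the server,
    // so when fetching from code use the /search?q= form instead of /#q=
    string url = "https://www.google.com/search?q=" + search.Replace(" ", "+"); // Replacing the spaces with a +
    
    // Code here to get that page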
    I use HttpClient with C#, but HttpWebRequest works too. HttpClient ships with .NET 4.5; on .NET 4 you need the Microsoft.Net.Http NuGet package to get it.
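
    To fill in the "get that page" step, here is a minimal sketch of one way it could look with HttpClient. The class and method names (FetchSketch, FetchResultsAsync) and the User-Agent string are made up for illustration, and it assumes a compiler with async/await support (C# 5, so Visual Studio 2012 or later). Google may also block or challenge automated requests, so treat this as a learning exercise rather than a finished scraper.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class FetchSketch
{
    // Downloads the raw HTML of a results page for the given query.
    static async Task<string> FetchResultsAsync(string query)
    {
        // Assumption: the /search?q= form of the URL, with the query percent-encoded.
        string url = "https://www.google.com/search?q=" + Uri.EscapeDataString(query);

        using (var client = new HttpClient())
        {
            // Assumption: an illustrative User-Agent value; pick your own.
            client.DefaultRequestHeaders.UserAgent.ParseAdd(
                "Mozilla/5.0 (compatible; LearningScraper/0.1)");

            // GetStringAsync sends a GET request and returns the response body as a string.
            return await client.GetStringAsync(url);
        }
    }

    static void Main()
    {
        // Blocking on the task is fine in a console test program.
        string html = FetchResultsAsync("search for this thing").GetAwaiter().GetResult();
        Console.WriteLine(html.Length + " characters downloaded");
    }
}
```

    In a Windows Forms app you would probably mark the button's click handler as async and await FetchResultsAsync there instead of blocking, so the UI stays responsive.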
     
  8. theMagicNumber

    theMagicNumber Regular Member

    Joined:
    May 13, 2010
    Messages:
    345
    Likes Received:
    195
    It's called URL encoding, and it is more than just replacing spaces with +.
    Use HttpUtility.UrlEncode(string query); it's in the System.Web namespace (you need a reference to System.Web.dll). From .NET 4.5 there is also WebUtility.UrlEncode in System.Net.
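
    To show why encoding matters, here is a small sketch. It uses Uri.EscapeDataString, a different built-in that lives in System and needs no extra assembly reference; it encodes a space as %20 rather than +, which is also valid in a query string. The names EncodeSketch and BuildSearchUrl are made up for the example.

```csharp
using System;

class EncodeSketch
{
    // Builds a search URL with the query percent-encoded.
    static string BuildSearchUrl(string query)
    {
        return "https://www.google.com/search?q=" + Uri.EscapeDataString(query);
    }

    static void Main()
    {
        // '#', '&' and '+' all have special meaning in a URL, so a plain
        // Replace(" ", "+") would leave this query broken.
        Console.WriteLine(BuildSearchUrl("c# & c++ scraping"));
        // prints https://www.google.com/search?q=c%23%20%26%20c%2B%2B%20scraping
    }
}
```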
     
  9. thejesus

    thejesus Registered Member

    Joined:
    Sep 29, 2008
    Messages:
    76
    Likes Received:
    22
    It's not really very difficult, you can easily convert a string to a Uri if you need to, like so:
    Code:
    var url = new Uri("http://google.com");
    With HttpClient you can just use strings, like so:
    Code:
    client.GetAsync("http://google.com");
     
  10. rawr00

    rawr00 Newbie

    Joined:
    Feb 24, 2014
    Messages:
    7
    Likes Received:
    3
    You do not need HAP to simply scrape the search results from Google. HAP is a wonderful lib, but you can get the result links by using Regex (I promise). Open the source of a results page with Firebug or the like and start working on your pattern. There are many tools out there to help you create and test regular expressions. Do this, and you shall succeed! Example:

    Code:
    public static async Task<List<string>> GetGoogleSearchLinksAsync(string keyword)
    {
        var linksList = new List<string>();
        using (var client = new WebClient())
        {
            var matches = Regex.Matches(await client.DownloadStringTaskAsync("GOOGLE_DOT_COM/search?q=" + keyword), REGEX_PATTERN_HERE);
            foreach (Match match in matches)
            {
                linksList.Add(ADD_YOUR_MATCHES);
            }
        }
        return linksList;
    }
    Might want to use generics. Just an example.
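
    For anyone wondering what the pattern and the match handling might look like, here is a rough sketch run against a made-up HTML string, so it works without hitting the network. The pattern, the sample markup, and the names RegexSketch and ExtractLinks are all assumptions for illustration; Google's real result markup is different and changes over time, so you would still have to work out your own pattern from the page source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RegexSketch
{
    // Hypothetical pattern: captures the href value of every <a> tag whose
    // href is a double-quoted absolute http(s) URL.
    static readonly Regex LinkPattern = new Regex(
        "<a\\s[^>]*?href=\"(https?://[^\"]+)\"", RegexOptions.IgnoreCase);

    static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in LinkPattern.Matches(html))
        {
            // Group 1 is the URL inside the quotes.
            links.Add(match.Groups[1].Value);
        }
        return links;
    }

    static void Main()
    {
        // Made-up markup for illustration; not Google's real result HTML.
        string html = "<div><a class=\"r\" href=\"http://example.com/one\">One</a>"
                    + "<a href=\"https://example.org/two\">Two</a></div>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
        // prints:
        // http://example.com/one
        // https://example.org/two
    }
}
```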
     
    Last edited: Oct 21, 2014
  11. sohom

    sohom Senior Member

    Joined:
    May 26, 2013
    Messages:
    981
    Likes Received:
    175
    Location:
    not in Past
    Nice intent; great to see so many of you BHWers helping and sharing with each other.
    On topic: if there is a Selenium for C# like there is for Python, then just use it; you will love it.