[Tut] How to scrape Google with VB.net

Discussion in 'Visual Basic .NET' started by zacatictac, Apr 11, 2014.

  1. zacatictac

    zacatictac Power Member

    Joined:
    May 2, 2010
    Messages:
    598
    Likes Received:
    755
    Occupation:
    SEO
    Location:
    Metaverse
    Hey guys, I wrote a little tutorial on how to scrape Google with VB.NET. No, it doesn't use a WebBrowser control or anything lame like that. If you want to learn web automation and scraping, check it out.

    Code:
    http://pc-tips.net/how-to-scrape-google-with-vb-net/
     
  2. TrevorB

    TrevorB Jr. VIP Jr. VIP Premium Member

    Joined:
    Dec 21, 2011
    Messages:
    1,185
    Likes Received:
    361
    Location:
    Canada
    Thanks for the tutorial. I'm sure this will come in handy for many people.
     
  3. zacatictac

    zacatictac Power Member

    You're welcome! It's quite a pain in the ass to scrape Google outside their API, so this is what I came up with. This code has worked great for me for a long time, surviving many Google updates.
     
  4. prospect7

    prospect7 Regular Member

    Joined:
    Feb 24, 2010
    Messages:
    273
    Likes Received:
    192
    Good stuff, man! Very nice tutorial.

    I was really into VB for about a month, but there's a pretty steep learning curve to make any kind of advanced bot. For example, if you wanted to add proxy support, it's an insane amount of code from what I remember. I'm sticking with UBot for the time being.
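
    For what it's worth, basic proxy support in plain .NET can be fairly short. A minimal sketch, assuming a plain unauthenticated HTTP proxy; the helper name FetchViaProxy, the host, and the port are placeholders made up for illustration:

    ```vbnet
    Imports System
    Imports System.Net

    Module ProxyFetchSketch
        ' Hypothetical helper: downloads a page through an HTTP proxy.
        Function FetchViaProxy(ByVal url As String, ByVal proxyHost As String, ByVal proxyPort As Integer) As String
            Using client As New WebClient()
                ' WebProxy(host, port) routes this client's requests through the proxy.
                client.Proxy = New WebProxy(proxyHost, proxyPort)
                Return client.DownloadString(url)
            End Using
        End Function

        Sub Main()
            ' 127.0.0.1:8888 is a placeholder (it happens to be Fiddler's default port).
            Dim html As String = FetchViaProxy("http://example.com/", "127.0.0.1", 8888)
            Console.WriteLine(html.Length)
        End Sub
    End Module
    ```

    Authenticated or rotating proxies need more work than this (credentials, retry logic, a proxy list), which is probably where the code starts to pile up.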
     
  5. r000k

    r000k Registered Member

    Joined:
    Jan 10, 2013
    Messages:
    66
    Likes Received:
    30
    Also, fire up HttpFox or Fiddler, disable JavaScript, and see the difference: the page is much easier to scrape.
     
  6. sandrine10

    sandrine10 Power Member

    Joined:
    Apr 14, 2010
    Messages:
    621
    Likes Received:
    63
    Location:
    CyberLand
    Nice share man!
     
  7. trustedfire9

    trustedfire9 Jr. VIP Jr. VIP Premium Member

    Joined:
    Jun 15, 2010
    Messages:
    2,120
    Likes Received:
    1,787
    Thanks OP, keep sharing. I think using HttpWebRequest is faster than WebClient. What do you think?
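
    One way to check rather than guess is to time both. A rough sketch, assuming a placeholder URL and made-up helper names; in .NET Framework, WebClient is itself built on top of WebRequest, so the main difference is likely how much control you get rather than raw speed:

    ```vbnet
    Imports System
    Imports System.Diagnostics
    Imports System.IO
    Imports System.Net

    Module FetchComparisonSketch
        ' WebClient: the shortest way to download a page as a string.
        Function FetchWithWebClient(ByVal url As String) As String
            Using client As New WebClient()
                Return client.DownloadString(url)
            End Using
        End Function

        ' HttpWebRequest: more code, but exposes the timeout, user agent, and so on.
        Function FetchWithHttpWebRequest(ByVal url As String) As String
            Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
            request.UserAgent = "Mozilla/5.0" ' placeholder value
            request.Timeout = 10000 ' milliseconds
            Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(response.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Function

        Sub Main()
            Dim url As String = "http://example.com/" ' placeholder URL
            Dim sw As Stopwatch = Stopwatch.StartNew()
            FetchWithWebClient(url)
            Console.WriteLine("WebClient: " & sw.ElapsedMilliseconds & " ms")
            sw.Restart()
            FetchWithHttpWebRequest(url)
            Console.WriteLine("HttpWebRequest: " & sw.ElapsedMilliseconds & " ms")
        End Sub
    End Module
    ```

    The first request usually pays for DNS and connection setup, so run each variant several times before drawing conclusions.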
     
  8. zohar

    zohar Newbie

    Joined:
    Jun 24, 2014
    Messages:
    44
    Likes Received:
    5
    I haven't tried your code, but let me say this, since I've coded two Google scrapers in VB.NET: the core HTML layout changes with different OS languages. I don't know why, but it happens. In one language the 'Next' button has the class name 'pnnext', and in another the element has no class at all.
     
  9. bocahpauk

    bocahpauk Newbie

    Joined:
    Mar 2, 2015
    Messages:
    5
    Likes Received:
    0
    zohar:
    Why don't you loop over the address for the next page? I think &start=xx will get you to the next page.
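
    A minimal sketch of that idea, assuming start is a zero-based result offset and num sets the results per page; BuildPageUrls and the sample query are made up for illustration:

    ```vbnet
    Imports System
    Imports System.Collections.Generic

    Module PagingSketch
        ' Builds one search URL per results page by stepping the start parameter,
        ' so no "Next" link has to be located in the HTML.
        Function BuildPageUrls(ByVal query As String, ByVal pages As Integer, ByVal perPage As Integer) As List(Of String)
            Dim urls As New List(Of String)
            For page As Integer = 0 To pages - 1
                urls.Add("https://www.google.com/search?q=" & Uri.EscapeDataString(query) &
                         "&num=" & perPage & "&start=" & (page * perPage))
            Next
            Return urls
        End Function

        Sub Main()
            For Each pageUrl As String In BuildPageUrls("vb.net scraper", 3, 10)
                Console.WriteLine(pageUrl)
            Next
        End Sub
    End Module
    ```

    This sidesteps the class name problem entirely, since the page layout only matters for pulling out the result links.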
     
  10. gimme4free

    gimme4free Executive VIP Jr. VIP Premium Member

    Joined:
    Oct 22, 2008
    Messages:
    1,884
    Likes Received:
    1,932
    Or for multi-page search + captcha support:
    Code:
    Dim SearchString As String = "bht toolz"
    Dim PagesToScrape As Integer = 3
    Dim CurCountry As String = "countryUK|countryGB"
    If GoogleDomain.Contains(".com") Then
    	CurCountry = "countryUS"
    End If
    
    For i As Integer = 1 To PagesToScrape
    	URL = "https://www." & GoogleDomain & "/search?q=" & URLEncode(SearchString) & "&lr=&cr=" & CurCountry & "&hl=ro&as_qdr=all&tbs=ctr:" & CurCountry & "&ei=&sa=N&biw=1920&bih=969&num=100&start=" & ((i - 1) * 100)
    ' Load URL
    	
    ' Check For Captcha
    	If Content.Contains("unusual traffic from your computer") Then
    ' Scrape Redirect URL
    		URL = GetBetween(Content, "name=""continue"" value=""", """")
    		Dim CaptchaID As String = GetBetween(Content, "name=""id"" value=""", """")
    ' Re-Format URL's
    		URL = URL.Replace("&amp;", "&")
    		Dim CaptchaImageURL As String = "http://ipv4.google.com/sorry/image?id=" & CaptchaID & "&hl=en"
    		Dim CaptchaAnswer As String = ""
    ' Process Captcha
    		
    ' Submit Captcha
    		URL = "http://ipv4.google.com/sorry/CaptchaRedirect?continue=" & URLEncode(URL) & "&id=" & CaptchaID & "&captcha=" & URLEncode(CaptchaAnswer) & "&submit=Submit"
    ' Load URL
    		
    	End If
    	
    ' Split Results Into Arrays
    	Dim OriginalLinks() As String = GetAllStringsBetween(Content, "<h3 class=r><a href=""", """")
    	Dim OtherLinks() As String = GetAllStringsBetween(Content, "<h3 class=""r""><a href=""", """")
    ' Join Link Arrays
    	' VB array bounds are inclusive, so subtract 1 to get exactly Count + Count slots
    	Dim Links(OriginalLinks.Count + OtherLinks.Count - 1) As String
    	Dim k As Integer = 0
    	For Each link In OriginalLinks
    		If link <> "" Then Links(k) = link
    		k += 1
    	Next
    	For Each link In OtherLinks
    		If link <> "" Then Links(k) = link
    		k += 1
    	Next
    ' Loop Results
    	Dim j As Integer = 0
    	For j = 0 To Links.Count - 1
    		If Links(j) = "" Then Continue For
    ' Add Link To Array
    		ScrapedURLS.Add(Links(j).ToString().Trim())
    	Next
    ' Loop Scraped URL's
    	For Each URL In ScrapedURLS
    ' Do Something With The URL
    		
    	Next
    Next
    
    Required functions:
    Code:
    ' GetBetween Function
        Private Function GetBetween(ByVal haystack As String, ByVal needle As String, ByVal needle_two As String) As String
            Dim istart As Integer = InStr(haystack, needle)
            If istart > 0 Then
                ' Dim istop As Integer = InStr(istart, haystack, needle_two)
                Dim istop As Integer = InStr(istart + Len(needle), haystack, needle_two)
                If istop > 0 Then
                    Try
                        Dim value As String = haystack.Substring(istart + Len(needle) - 1, istop - istart - Len(needle))
                        Return value
                    Catch ex As Exception
                        Return ""
                    End Try
                End If
            End If
            Return ""
        End Function
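        ' GetAllStringsBetween Function (needs Imports System.Text.RegularExpressions)
        Private Function GetAllStringsBetween(ByVal Haystack As String, ByVal StartSearch As String, ByVal EndSearch As String) As String()
            ' Escape the delimiters so they are matched literally, not as regex syntax
            Dim rx As New Regex(Regex.Escape(StartSearch) & "(.+?)" & Regex.Escape(EndSearch))
            Dim mc As MatchCollection = rx.Matches(Haystack)
            ' VB array bounds are inclusive, so size to Count - 1 to avoid a trailing empty slot
            Dim FoundStrings(mc.Count - 1) As String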
            Dim i As Integer = 0
            For Each m As Match In mc
                FoundStrings(i) = m.Groups(1).Value.ToString()
                i += 1
            Next
            Return FoundStrings
        End Function
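
    The multi-page snippet also calls URLEncode, which isn't among the functions listed. One possible stand-in, assuming plain percent-encoding is what was intended (this is a guess, not the original helper):

    ```vbnet
    Imports System

    Module UrlEncodeSketch
        ' Assumed stand-in for the undefined URLEncode helper used in the snippet.
        ' Uri.EscapeDataString percent-encodes reserved characters and spaces.
        Function URLEncode(ByVal value As String) As String
            Return Uri.EscapeDataString(value)
        End Function

        Sub Main()
            Console.WriteLine(URLEncode("bht toolz"))
        End Sub
    End Module
    ```

    GoogleDomain, URL, Content, and ScrapedURLS are likewise expected to be declared elsewhere, with Content filled in at each "Load URL" step.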
    
     
  11. Mercury_Hg

    Mercury_Hg Registered Member

    Joined:
    Aug 23, 2010
    Messages:
    88
    Likes Received:
    18
    What a hideous language. It's so verbose.