[Tut] How to scrape Google with VB.net

zacatictac

Hey guys, I wrote a little tutorial on how to scrape Google with VB.net. No, it's not using a WebBrowser control or anything lame like that. If you want to learn web automation and scraping, check it out.

Code:
http://pc-tips.net/how-to-scrape-google-with-vb-net/
 
You're welcome! It's quite the pain in the ass to scrape Google outside their API, so this is what I came up with. This code has worked great for me for a long time, surviving many Google updates.
 
Good stuff, man! Very nice tutorial.

I was really into VB for about a month but there's a pretty steep learning curve to make any kind of advanced bot. Like for example if you wanted to add in proxy support it's an insane amount of code from what I remember. I'm sticking with ubot for the time being.
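For what it's worth, if the scraper is built on HttpWebRequest, basic proxy support can be fairly short, because the request exposes a Proxy property. A minimal sketch follows; the function name, the proxy host and the port are placeholders of mine, not anything from the tutorial, and authenticated or rotating proxies would need more work than this.

```vbnet
Imports System
Imports System.Net

Module ProxySketch
    ' Builds a GET request that is routed through the given HTTP proxy.
    ' proxyHost and proxyPort are placeholders, not values from this thread.
    Function CreateProxiedRequest(ByVal url As String, ByVal proxyHost As String, ByVal proxyPort As Integer) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Proxy = New WebProxy(proxyHost, proxyPort)
        request.Timeout = 15000 ' milliseconds
        Return request
    End Function

    Sub Main()
        ' Nothing is sent here; the request is only configured and inspected.
        Dim request As HttpWebRequest = CreateProxiedRequest("https://www.google.com/search?q=test", "127.0.0.1", 8080)
        Dim proxy As WebProxy = DirectCast(request.Proxy, WebProxy)
        Console.WriteLine(proxy.Address.ToString())
    End Sub
End Module
```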
 
Also, fire up HttpFox or Fiddler, disable JavaScript, and compare the responses. The markup Google serves without JavaScript is much easier to scrape.
 
Thanks OP, keep sharing. I think using HttpWebRequest is faster than WebClient. What do you think?
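On the speed question: as far as I know, WebClient sits on top of the same WebRequest plumbing in the .NET Framework, so the practical gain from HttpWebRequest is usually more control over headers, timeouts and compression than raw speed. Here is a rough sketch of a GET with HttpWebRequest. The function names and the user agent string are my own placeholders, not taken from the tutorial.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Configures a GET request. The user agent string is an arbitrary placeholder.
    Function CreateGetRequest(ByVal url As String) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Method = "GET"
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        request.Timeout = 15000 ' milliseconds
        Return request
    End Function

    ' Sends the request and returns the response body as a string.
    Function ReadBody(ByVal request As HttpWebRequest) As String
        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' Only builds the request; call ReadBody(request) to actually fetch the page.
        Dim request As HttpWebRequest = CreateGetRequest("https://www.google.com/search?q=test")
        Console.WriteLine(request.Method & " " & request.RequestUri.Host)
    End Sub
End Module
```

In a scraper loop, the string returned by ReadBody would play the role of the page content that gets searched for links.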
 
I haven't tried your code, but let me say this, since I've coded two Google scrapers in VB.net: the core HTML layout changes with different OS languages. I don't know why, but it happens. In one language the 'Next' button has the class name 'pnnext', and in another the element is classless.
 
zohar:
Why don't you loop the address for the next page? I think &start=xx will get you to the next page.
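Looping on the start offset sidesteps the 'Next' button markup entirely, since each page URL is computed rather than scraped. A small sketch of the idea, assuming a zero-based page index and the num and start query parameters; BuildSearchUrl is a name I made up for illustration:

```vbnet
Imports System

Module PagingSketch
    ' Builds the search URL for a zero-based page index.
    ' start is the offset of the first result, so page N begins at N * pageSize.
    Function BuildSearchUrl(ByVal domain As String, ByVal query As String, ByVal pageIndex As Integer, ByVal pageSize As Integer) As String
        Dim offset As Integer = pageIndex * pageSize
        Return "https://www." & domain & "/search?q=" & Uri.EscapeDataString(query) & "&num=" & pageSize.ToString() & "&start=" & offset.ToString()
    End Function

    Sub Main()
        ' Print the URLs for the first three pages of 100 results each.
        For page As Integer = 0 To 2
            Console.WriteLine(BuildSearchUrl("google.com", "bht toolz", page, 100))
        Next
    End Sub
End Module
```

Whether Google honors a given num value can change over time, so treat the page size as something to verify rather than rely on.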
 
Or for multi-page search + captcha support:
Code:
' GoogleDomain, URL, Content and ScrapedURLS are assumed to be declared elsewhere
Dim SearchString As String = "bht toolz"
Dim PagesToScrape As Integer = 3
Dim CurCountry As String = "countryUK|countryGB"
If GoogleDomain.Contains(".com") Then
	CurCountry = "countryUS"
End If

For i As Integer = 1 To PagesToScrape
	URL = "https://www." & GoogleDomain & "/search?q=" & URLEncode(SearchString) & "&lr=&cr=" & CurCountry & "&hl=ro&as_qdr=all&tbs=ctr:" & CurCountry & "&ei=&sa=N&biw=1920&bih=969&num=100&start=" & ((i - 1) * 100)
' Load URL
	
' Check For Captcha
	If Content.Contains("unusual traffic from your computer") Then
' Scrape Redirect URL
		URL = GetBetween(Content, "name=""continue"" value=""", """")
		Dim CaptchaID As String = GetBetween(Content, "name=""id"" value=""", """")
' Decode HTML-escaped ampersands in the redirect URL
		URL = URL.Replace("&amp;", "&")
		Dim CaptchaImageURL As String = "http://ipv4.google.com/sorry/image?id=" & CaptchaID & "&hl=en"
		Dim CaptchaAnswer As String = ""
' Process Captcha
		
' Submit Captcha
		URL = "http://ipv4.google.com/sorry/CaptchaRedirect?continue=" & URLEncode(URL) & "&id=" & CaptchaID & "&captcha=" & URLEncode(CaptchaAnswer) & "&submit=Submit"
' Load URL
		
	End If
	
' Collect result links for both markup variants (unquoted and quoted class attribute)
	Dim Links As New List(Of String)
	Links.AddRange(GetAllStringsBetween(Content, "<h3 class=r><a href=""", """"))
	Links.AddRange(GetAllStringsBetween(Content, "<h3 class=""r""><a href=""", """"))
' Add Non-Empty Links To Array
	For Each Link As String In Links
		If String.IsNullOrEmpty(Link) Then Continue For
		ScrapedURLS.Add(Link.Trim())
	Next
' Loop Scraped URLs (ScrapedURLS accumulates across pages)
	For Each ScrapedURL As String In ScrapedURLS
' Do Something With The URL
		
	Next
Next
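The loop above calls URLEncode, which isn't defined anywhere in the post. One plausible stand-in, purely an assumption on my part, is a thin wrapper around Uri.EscapeDataString, which percent-encodes everything except unreserved characters:

```vbnet
Imports System

Module UrlEncodeSketch
    ' Stand-in for the URLEncode helper the snippet calls but does not define.
    ' Spaces become %20 and characters such as & and = are percent-encoded.
    Function URLEncode(ByVal value As String) As String
        Return Uri.EscapeDataString(value)
    End Function

    Sub Main()
        Console.WriteLine(URLEncode("bht toolz"))
    End Sub
End Module
```

If the original author used a different encoder (one that turns spaces into + for example), the query strings would differ slightly but should be treated the same way by the server.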

Required functions:
Code:
' GetBetween Function
    Private Function GetBetween(ByVal haystack As String, ByVal needle As String, ByVal needle_two As String) As String
        Dim istart As Integer = InStr(haystack, needle)
        If istart > 0 Then
            ' Look for the closing delimiter only after the end of the opening one
            Dim istop As Integer = InStr(istart + Len(needle), haystack, needle_two)
            If istop > 0 Then
                Try
                    Dim value As String = haystack.Substring(istart + Len(needle) - 1, istop - istart - Len(needle))
                    Return value
                Catch ex As Exception
                    Return ""
                End Try
            End If
        End If
        Return ""
    End Function
    ' GetAllStringsBetween Function (requires Imports System.Text.RegularExpressions)
    Private Function GetAllStringsBetween(ByVal Haystack As String, ByVal StartSearch As String, ByVal EndSearch As String) As String()
        ' Escape the delimiters so any regex metacharacters in them match literally
        Dim rx As New Regex(Regex.Escape(StartSearch) & "(.+?)" & Regex.Escape(EndSearch))
        Dim mc As MatchCollection = rx.Matches(Haystack)
        ' VB array bounds are inclusive, so the upper bound is Count - 1
        Dim FoundStrings(mc.Count - 1) As String
        Dim i As Integer = 0
        For Each m As Match In mc
            FoundStrings(i) = m.Groups(1).Value
            i += 1
        Next
        Return FoundStrings
    End Function
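Instead of calling GetAllStringsBetween once per markup variant, a single pattern with optional quotes could cover both <h3 class=r> and <h3 class="r">. This is only a sketch: ExtractResultLinks is a hypothetical name, the sample HTML is invented, and the h3 markup is the one quoted in this thread, which Google may well have changed since.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractSketch
    ' Matches <h3 class=r> and <h3 class="r"> alike and captures the href value.
    Private ReadOnly ResultLink As New Regex("<h3 class=""?r""?><a href=""([^""]+)""", RegexOptions.IgnoreCase)

    ' Returns every captured href, trimmed, in document order.
    Function ExtractResultLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        For Each m As Match In ResultLink.Matches(html)
            links.Add(m.Groups(1).Value.Trim())
        Next
        Return links
    End Function

    Sub Main()
        ' Invented sample containing one link in each markup variant.
        Dim sample As String = "<h3 class=r><a href=""http://a.example/"">A</a></h3>" & "<h3 class=""r""><a href=""http://b.example/"">B</a></h3>"
        For Each link As String In ExtractResultLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Because a List(Of String) grows as needed, there is no array sizing to get wrong and no empty trailing slot to filter out afterwards.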
 