WebScraping
Newbie
- Jun 19, 2012
- 31
- 14
First of all, thanks to all of you guys for keeping me motivated and for sharing your stories of how you made your first dime online. Here is how I made mine, and a little something I'm willing to share with the rest of the BH world.
Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites, and it can bring in some money. But I guess you all know what that is. A few days ago I posted a gig on Fiverr saying I will scrape any website for data. At first I thought this would never work, but quite soon, within a few hours, I had seven gig requests to fill. That was fast, and since many of them take more than a day to finish, I charge 1-2 gigs per day.
So let's start with how easy this is. First of all, thanks to the Selenium developers for giving us such a great tool to work with! You can check out the project at seleniumhq, or if you use Ubuntu you can install it right away with:
Code:
pip install selenium
and if you don't have pip installed, you can get it with
Code:
sudo apt-get install python-pip
You also have the option of going to seleniumhq and installing Selenium RC as a server on your machine, but I found that to be worthwhile only if you have some larger projects and scraping to be done. It does have some pretty neat built-in functionality for running on multiple machines, etc., but like I said, it's overkill for me for now.
For this example I will be writing the code in Python, as it is my programming language of choice, and I'll explain everything line by line.
We are importing the webdriver here, which we will use to control the browser later:
Code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import xlwt
from pyvirtualdisplay import Display
xlwt is for writing Excel sheets, and pyvirtualdisplay runs the Firefox browser in a virtual display so you can work on other stuff while scraping.
Code:
class Fiverrgig():
    index = 1
    wbk = xlwt.Workbook()             # create the xls workbook
    sheet = wbk.add_sheet('sheet 1')  # add a sheet

    def setUp(self):
        self.display = Display(visible=0, size=(800, 600))  # this creates the virtual display
        self.display.start()
        self.visit_urls = []
        self.driver = webdriver.Firefox()  # here we tell our program that the browser we will be using is Firefox
        self.driver.implicitly_wait(30)
        self.base_url = ""

    def test_fiverrgig(self):
        self.driver.get(self.base_url + "/uitgebreid-zoeken")
        self.driver.find_element_by_id("Form_Name_Lid__prefixbtnSubmit").click()
Okay, this is where it gets a little tricky. How do you find the button you need to click? Well, you can:
a) Install Firebug, then right-click -> Inspect Element with Firebug. Once you have found your element, you can copy its XPath, or whatever else you need. An XPath is like a route to your element on the web page. There are plenty of other ways to search for a desired element, like driver.find_element_by_class_name, find_element_by_id, and others.
A great way to find the function you need is Eclipse with PyDev and its IntelliSense-style autocompletion, and if you need to do some testing before you can make your code bug-free, you can do that with a program called IPython.
b) You can use the Selenium IDE, which installs as a Firefox plugin. Basically, you start the plugin, click around on web pages, and the tool generates code for you that you can export and work with. You can get the code in C#, Python, Ruby, or Java (JUnit), and you can choose what you want to use: WebDriver control or Selenium RC.
So the point is: first you test it, then you write your code. That is why I like Python for scraping.
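To make the XPath idea concrete, here is a small standalone sketch (the mini-page, the class name, and the URL in it are all made up). It uses Python's built-in ElementTree, which understands a limited XPath subset, enough to show how an expression acts as a route to one element:

```python
# A hypothetical mini-page; stdlib ElementTree supports a small XPath
# subset, which is enough to illustrate how a path selects an element.
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<div class='kolom'><a href='http://example.test'>site</a></div>"
    "<a>Contact</a>"
    "</body></html>"
)

# Same idea as driver.find_element_by_xpath(...) in Selenium:
# walk to the <a> inside the div whose class is 'kolom'.
link = page.find(".//div[@class='kolom']/a")
print(link.get("href"))  # http://example.test
```

This is the same kind of path Firebug hands you when you copy an element's XPath.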
Code:
        linkovi = self.driver.find_elements_by_link_text("Meer informatie")  # this line gets all the elements (links)
        for link in linkovi:
            self.visit_urls.append(link.get_attribute('href'))
        for link_ in self.visit_urls:
            print str(self.index) + "-->" + link_
            self.driver.get(link_)
            self.driver.find_element_by_xpath("//a[contains(text(),'Contact')]").click()  # again we are locating a web element
            self.scrape()
Okay, here is where I parse the text. First I find the element by class name, then I take the text it holds and parse it with a few simple tricks.
Code:
    def scrape(self):
        website = self.driver.find_element_by_class_name('kolom')
        url = website.find_element_by_tag_name('a').text
        html = self.driver.find_element_by_id('tab_1_2')
        lines = html.text.splitlines()
        phone = lines[1]
        address = lines[-1].split(",")
        city = address[-1]
        PN = address[1]
        self.zapisi(url, phone, city, PN)  # zapisi means "write" -- store the row
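The splitlines/split tricks are easiest to see on a standalone sample. This sketch runs on a made-up contact text; the real line layout depends on the page you scrape, so treat the indices as assumptions:

```python
# Made-up contact-tab text; line 1 is the phone number and the last
# line is "street, postal code, city", as the scrape() method assumes.
sample = ("Example Company BV\n"
          "+31 20 123 4567\n"
          "info@example.test\n"
          "Main Street 1, 1234 AB, Amsterdam")

lines = sample.splitlines()
phone = lines[1]
address = lines[-1].split(",")
city = address[-1].strip()
postal_code = address[1].strip()
print(phone, city, postal_code)  # +31 20 123 4567 Amsterdam 1234 AB
```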
After I have parsed the data, I write it to the xls file.
Code:
    def zapisi(self, url, telefon, grad="N/A", postanski_broj="N/A"):  # telefon, grad, postanski_broj = phone, city, postal code
        self.sheet.write(self.index, 0, url)
        self.sheet.write(self.index, 1, grad)
        self.sheet.write(self.index, 2, postanski_broj)
        self.sheet.write(self.index, 3, telefon)
        self.index += 1

    def is_element_present(self, how, what):
        try:
            self.driver.find_element(by=how, value=what)
        except NoSuchElementException:
            return False
        return True

    def tearDown(self):
        self.driver.quit()
        self.wbk.save("fiverr.xls")  # give the workbook an .xls extension
        self.display.stop()
a=Fiverrgig()
a.setUp()
a.test_fiverrgig()
a.tearDown()
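xlwt produces real .xls files, which is what clients usually expect; but if plain CSV is fine, the same row-per-result pattern works with the standard library alone. A sketch with made-up data, not part of the original script:

```python
import csv

# Rows in the same column order the zapisi() method uses:
# url, city, postal code, phone.
rows = [("http://example.test", "Amsterdam", "1234 AB", "+31 20 123 4567")]

with open("fiverr.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "city", "postal_code", "phone"])
    for row in rows:
        writer.writerow(row)
```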
Final summation:
With this set of tools and a little work you can probably scrape almost anything, since it is a real browser doing the browsing. You can build your own custom email-account creators, Twitter or Pinterest account bots, and so on, and once you learn the basics you pick the rest up quite fast.
Well then, good luck guys, and happy money making!