Scraping search engine results using Java...

Discussion in 'Black Hat SEO' started by sfidirectory, Dec 9, 2010.

  1. sfidirectory

    sfidirectory Senior Member

    Joined:
    Mar 29, 2010
    Messages:
    899
    Likes Received:
    483
    Occupation:
    Web developer/BTC enthusiast
    Location:
    php artisan make:migration
    Hi all,

    To keep up my Java skills for my next year of study, I thought of creating a program that saves a list of URLs for a given search term. I know lots of programs like this are listed on here, but I thought why not learn something new and create one myself. Trouble is, I have yet to learn how to read in search engine results, so I am asking the Java experts here on BHW for some tips and ideas. I know there will need to be for loops, if statements, try-catch blocks, and a button with a listener that saves the results to an HTML file in a specified folder.
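    The "save the results to an HTML file" step could be sketched roughly as follows. This is only a minimal sketch: the class name `UrlListWriter` and its methods are hypothetical, the list of URLs is assumed to have been collected already, and the button/listener part is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: turns a list of URLs into a simple HTML page of links.
class UrlListWriter {

    // Escape the characters that are special in HTML text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build the page in memory: the search term as the title, one list item per URL.
    static String toHtml(String searchTerm, List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"><title>")
          .append(escapeHtml(searchTerm))
          .append("</title></head>\n<body>\n<ul>\n");
        for (String url : urls) {
            String safe = escapeHtml(url);
            sb.append("<li><a href=\"").append(safe).append("\">")
              .append(safe).append("</a></li>\n");
        }
        sb.append("</ul>\n</body>\n</html>\n");
        return sb.toString();
    }

    // Write the page to disk; the checked IOException is rethrown as an unchecked one.
    static void save(String searchTerm, List<String> urls, Path target) {
        try {
            Files.write(target, toHtml(searchTerm, urls).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write " + target, e);
        }
    }
}
```

    A button listener would then only need to call something like `UrlListWriter.save(term, urls, Paths.get(folder, "results.html"))`, where `term`, `urls` and `folder` are whatever the GUI collected.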

    I am not sure how to implement proxies with such a program yet, but I think the program's usefulness will be limited without proxy support, as search engines could ban your IP or something like that.
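    For what it's worth, the standard library already has basic proxy support through `java.net.Proxy`. A minimal sketch, assuming a plain HTTP proxy; the class name `ProxySketch`, the timeouts and the User-Agent string are made-up placeholders, not anything specific to a search engine:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Hypothetical helper showing how a request can be routed through an HTTP proxy.
class ProxySketch {

    // Describe an HTTP proxy; host and port would come from your own proxy list.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Prepare (but do not yet send) a GET request that will go through the proxy.
    static HttpURLConnection open(String url, Proxy proxy) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
            conn.setReadTimeout(10_000);    // milliseconds, arbitrary choice
            // Placeholder value; without this, Java sends its own default User-Agent.
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleBot/0.1)");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Nothing goes over the network until you call `getResponseCode()` or `getInputStream()` on the returned connection, so those calls are where a dead proxy or a blocked request would show up as an exception.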

    Any thoughts appreciated. If I successfully create the program, it will be free for BHW; I just want to start making a decent number of contributions and show that I have the ability, problem-solving skills, etc.
     
  2. imperial109

    imperial109 Regular Member

    Joined:
    Jan 19, 2009
    Messages:
    499
    Likes Received:
    361
    I'm in the same situation. You can use their AJAX API to get results, but I'm not familiar with that. They've taken every precaution to prevent automated queries, and I get an exception every time the script (Java) tries to connect.
     
  3. madoctopus

    madoctopus Supreme Member

    Joined:
    Apr 4, 2010
    Messages:
    1,249
    Likes Received:
    3,498
    Occupation:
    Full time IM
  4. imperial109

    imperial109 Regular Member

    Yeah, here's another way to connect to the G API, but you need to install it on your site. Other automated scripts are against the TOS.
    Code:
    frankmccown.blogspot.com/2008/06/using-googles-ajax-search-api-with-java.html
    Share your code here, OP, so we can all improve upon it.
     
  5. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    980
    Likes Received:
    462
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    If you just use one socket you don't need a proxy as far as I can tell. Not on the API, just scraping the results from google.com.
     
  6. Monrox

    Monrox Power Member

    Joined:
    Apr 9, 2010
    Messages:
    615
    Likes Received:
    579
    I really am not trying to start a war here, but you two should consider an MS-backed language in parallel to what you need for school. Maybe not VB, because the syntax is a lot different, but C# or managed C++ would be a nice complementary skill.

    Here's why: first, MS is interested in selling their OS and other products because this is how they make money. An OS is interesting to the consumer if there is lots of software for it, so MS provides developers with free tools and, more importantly, comprehensive documentation (MSDN). And you can make money from their users. While Sun does have an OS, I don't think many people are using Solaris.

    Also, don't expect to learn to program in college; you can't study this the old-fashioned way. Most profs I've seen are really good at generating fractals and finding prime numbers, but they have no idea how to click a button of another program from their application.

    For scraping SE results, here's how I'd start. Search manually for something, save the source code of the page, look for a tutorial, parse all the links and fill a listbox with them. Then learn how to get only the links that are actual results, excluding the ads. Then learn how to paste text into the search field of G and click the Search button. After that you can investigate the possibility of sending raw POST requests to the server to get the source code in the first place.
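    The "parse all the links" step on a saved page could look something like this in Java. It's a rough sketch only: `LinkExtractor` is a made-up name, a regex is a quick stand-in for a real HTML parser, and skipping links on one excluded host is just an assumed way of dropping the engine's own navigation links. Telling ads apart from real results depends on the page markup and isn't attempted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls absolute links out of saved page source.
class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return each distinct absolute http(s) link, in page order,
    // skipping any link that points at the excluded host.
    static List<String> extract(String html, String excludedHost) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Undo the HTML escaping of '&' inside attribute values.
            String url = m.group(2).replace("&amp;", "&");
            boolean absolute = url.startsWith("http://") || url.startsWith("https://");
            if (absolute && !url.contains("://" + excludedHost) && !links.contains(url)) {
                links.add(url);
            }
        }
        return links;
    }
}
```

    The returned list is what you'd use to fill the listbox; relative links such as `/preferences` are dropped because they can only point back at the site the page came from.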

    You can't simply open a connection to a proxy, construct all the headers, send them on their merry way, then start listening for the server's reply.
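    Written out by hand, those steps look roughly like the sketch below. It assumes a plain HTTP proxy and a plain `http://` target; `RawRequestSketch` and its method names are invented for illustration. It deliberately ignores HTTPS (which needs a CONNECT tunnel), redirects, cookies and chunked responses, which is a good part of why it's less simple than it sounds.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds and sends a GET request to an HTTP proxy by hand.
class RawRequestSketch {

    // Build the request text. A request sent to a proxy names the full URL
    // in the request line, not just the path.
    static String buildProxyGet(String url, String userAgent) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (uri.getPort() != -1) {
            host += ":" + uri.getPort();
        }
        return "GET " + uri.toASCIIString() + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"
             + "Accept: text/html\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Open a socket to the proxy, send the request, and read until the
    // connection closes. Returns status line, headers and body as one string.
    static String send(String proxyHost, int proxyPort, String request) throws IOException {
        try (Socket socket = new Socket(proxyHost, proxyPort)) {
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return new String(socket.getInputStream().readAllBytes(),
                              StandardCharsets.ISO_8859_1);
        }
    }
}
```

    The `Connection: close` header is what lets `send` simply read to the end of the stream; without it the server may keep the connection open and the read would block.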

    Avoid proprietary APIs wherever possible. They are subject to change, and the effort you put into learning them is almost the same as doing the stuff yourself.

    -------------------------
    EDIT:
    To illustrate all this, try using this tutorial to get search results just like that. Then try it with a proxy :D
    http://www.exampledepot.com/egs/java.net/Post.html
     
    Last edited: Dec 16, 2010
  7. madoctopus

    madoctopus Supreme Member

    What's wrong with Java? It's a nice language, has good docs and works on any OS, which is great.
     
  8. dannyhw

    dannyhw Senior Member

    Just like any language, it's good if it's cost-effective.