
Java screen scraper for imdb?

Discussion in 'Other Languages' started by ChampGuy, Jan 31, 2010.

  1. ChampGuy

    ChampGuy Junior Member

    Joined:
    Aug 8, 2009
    Messages:
    163
    Likes Received:
    38
    I am trying to make a bot that takes a link to an IMDb (Internet Movie Database) page and returns information about the movie on that page. It would be pretty simple, but IMDb has specifically implemented protection against screen scrapers, so the method I usually use (create a URL, open a URLConnection, use a BufferedReader to read the HTML) doesn't work. However, I know the task can't be impossible, because I can get the source just by opening the page in my browser and selecting "View Page Source" from the toolbar menu.

    When I searched Google for an IMDb screen scraper, I found ones for other languages, but not for Java. I'm looking for someone skilled in Java to show me how to automate this process.
     
  2. paincake

    paincake Power Member

    Joined:
    Aug 18, 2010
    Messages:
    716
    Likes Received:
    3,099
    In what way does it fail? What data are you receiving? I suspect the problem is that they see your user agent, which defaults to "Java/1.4.2_xx".
    You need to change it:
    Java:
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
    where conn is your URLConnection object.

    I think you might also need to set this in the beginning
    Java:
    System.setProperty("http.agent", "");
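
    Putting the suggestion together with the URL / URLConnection / BufferedReader approach from the first post, a minimal sketch might look like the following. The class name PageFetcher and the helper names openAsBrowser and fetchHtml are made up for illustration, and the sketch assumes the page is UTF-8 and that a browser-like User-Agent is all the site checks, which may not hold.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Browser-like User-Agent string taken from the post above.
    static final String USER_AGENT =
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) "
        + "Gecko/2009073021 Firefox/3.0.13";

    // Creates the connection object and replaces the default "Java/<version>"
    // User-Agent header. No network traffic happens until the stream is opened.
    static URLConnection openAsBrowser(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Reads the whole response body line by line (assumes UTF-8 content).
    static String fetchHtml(String address) throws IOException {
        URLConnection conn = openAsBrowser(address);
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <page-url>");
            return;
        }
        System.out.println(fetchHtml(args[0]));
    }
}
```

    Run it with the page URL as the only argument. The header is set before the input stream is opened because request properties cannot be changed once the connection has been made.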
     
    Last edited: Sep 17, 2010
  3. zelma143

    zelma143 Power Member

    Joined:
    Jun 25, 2010
    Messages:
    571
    Likes Received:
    37
    Occupation:
    PHP programmer,Bot maker,iMacro script maker
    Hey, PHP would be better and easier for this.

    Try it; it's much faster than the alternatives and works well.

    You can also use cURL with PHP.