
I need to scrape the contents of a site!?

Discussion in 'General Scripting Chat' started by T3chnician, Oct 8, 2012.

  1. T3chnician

    Ok, so to keep this brief... I need to scrape a website that has a bunch of schematics. I tried "wget" but got a 301 Moved Permanently error. The site hosting the files requires a login and password (which I have), yet when I copy the image location of one of the schematics I can retrieve it without being logged in. However, when I try to pull up a directory in the web browser I get the following error: "Directory Listing Denied. This Virtual Directory does not allow contents to be listed."

    Any ideas? It would help a lot.
     
  2. cgimaster

    Have you tried setting wget to follow the redirect? A 301 is not an error but a redirection to the actual location of the resource; when you get a 301, the response also sends the new location in its headers.

    You can also tell wget to use credentials and to save and reuse cookies. Perhaps your issue is that you log in but don't save the cookie, so the site doesn't recognize your session on the next request.

    It is also possible that the page you are accessing requires a Referer header to be sent, or checks the User-Agent and blocks wget's default one, so check those too (see the sketch below).

    UPDATE:
    Regarding the directory listing: that is a well-known protection that disables listings when a directory has no index file, so people can't see which files it contains. It does not mean you cannot access a file if you know its name, unless there is additional protection on that directory, for instance IP blocks.
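
    Something along these lines, for example (untested sketch; url.com is the placeholder from your post, and the login step assumes a form at /login.php with those field names, so adjust it to whatever the site actually uses, or swap in --user/--password if it is HTTP auth):

    Code:
    # 1) log in once and save the session cookie (URL and form field names are assumptions)
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'username=YOURUSER&password=YOURPASS' \
         -O /dev/null http://url.com/login.php

    # 2) reuse the cookie, send a referer and a browser user-agent, and let wget follow the 301
    wget --load-cookies cookies.txt \
         --referer=http://url.com/ --user-agent="Mozilla/5.0" \
         -r -P /save/location -A jpeg,jpg,bmp,gif,png \
         http://url.com/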
     
    Last edited: Oct 8, 2012
  3. T3chnician

    Hm... you know, I'm not sure. I will retry, but if I remember right, right after the 301 redirect I would get a 404. BTW, I'm really new to wget.

    This is the command I used (got it from my buddy): wget -r -P /save/location -A jpeg,jpg,bmp,gif,png http://url.com

    EDIT: Do you mind if I PM you? I don't want to put too much info about the site here...
     
    Last edited: Oct 8, 2012
  4. cgimaster

    -r means recursive
    -P is the directory prefix where files are saved
    -A is the accept list (the file extensions to keep)

    So basically that command recursively downloads files matching the accept list from http://url.com and saves them to /save/location.

    There is no cookie or login handling in that command.

    If you want to see what is going on, try running just wget http://url.com. It will save a file that you can open with pico, vi, vim, or your preferred editor; checking whether it contains a login error message, or what the HTML actually is, may give you an idea of what is happening (quick sketch below).

    UPDATE:
    No, I don't mind if you PM me, feel free.
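
    For example (url.com is just the placeholder from earlier in the thread):

    Code:
    # fetch the front page only, print the server response headers, and save it as page.html
    wget --server-response -O page.html http://url.com/
    # then inspect what actually came back (login form? error page? the real content?)
    vi page.html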
     
  5. cdutchman

    You may want to try PHP and cURL; you can set the CURLOPT_FOLLOWLOCATION option and it will follow 301 and 302 redirects.
     
  6. SonicSam

    cURL can do this natively with the -L flag.
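
    For example, something along these lines (untested; url.com, the credentials, and the image path are placeholders):

    Code:
    # -L follows redirects, -u sends HTTP auth credentials, -c/-b write and read a cookie jar,
    # -A sets a browser user-agent, -e sets the referer, -O saves under the remote file name
    curl -L -u user:pass -c cookies.txt -b cookies.txt \
         -A "Mozilla/5.0" -e http://url.com/ \
         -O http://url.com/schematics/example.jpg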
     
  7. skulquake

    Have you tried httrack?
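
    Something like this might do it (untested; url.com and the save path are just the placeholders from earlier, and the "+" patterns are HTTrack include filters, so check httrack --help for the exact syntax):

    Code:
    # mirror the site into /save/location, keeping only image files
    httrack "http://url.com/" -O "/save/location" "+*.jpg" "+*.jpeg" "+*.png" "+*.gif" "+*.bmp" -v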