
Software to harvest content from a site?

Discussion in 'Black Hat SEO' started by Bostoncab, Jan 7, 2013.

  1. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    Hello,

    I found a membership site that blocks all of its content from the search engines. I want to hijack all of it. I have been trying to use wget, etc., and I cannot get anything to work. Perhaps someone can suggest a method or software that will work?

    I would consider any suggestions helpful.

    Please save the moral discussion on whether or not I should do this for DP or some other forum. I don't think any other black hearts want to discuss it.
     
  2. dog-tag

    dog-tag Senior Member

    Joined:
    Oct 19, 2010
    Messages:
    811
    Likes Received:
    912
    Occupation:
    Full-Time Internet Marketer + Business Consultant
    Location:
    Thailand
    Content theft, tut tut.

    What platform is the site using? And what exactly are you looking to take?
     
  3. satyr85

    satyr85 Power Member

    Joined:
    Aug 7, 2011
    Messages:
    579
    Likes Received:
    444
    Location:
    Poland
    I was thinking about writing something here, but then you might think I want to hijack that content myself. Anyway, if you could share the URL of that site, maybe I will find a solution.
     
  4. AboveAll

    AboveAll Junior Member

    Joined:
    Apr 27, 2010
    Messages:
    150
    Likes Received:
    58
    If I were you, I would not do such a thing. Remember, everything comes back to you, the good and the evil.
     
  5. tas26

    tas26 Power Member

    Joined:
    Apr 21, 2009
    Messages:
    548
    Likes Received:
    274
    Occupation:
    student
    Location:
    BHW
    I can code you something in uBot if the content to be scraped is laid out the same across multiple pages.
     
  6. ijof9

    ijof9 Power Member

    Joined:
    Mar 27, 2010
    Messages:
    536
    Likes Received:
    594
    Occupation:
    CTO
    Location:
    Western Europe
    Plenty of solutions.

    cURL, iMacros, a Chrome extension.

    Those can keep you logged in (by sending the proper cookie headers) while you request the pages one by one to download the member data. If it's worth $200-$300 to you (success fee, of course), hit me up via PM; I'll do the work myself and send it to you.
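    For example, in PHP that's just a matter of sending your logged-in session cookie with every cURL request. A minimal sketch (the URL and cookie name are placeholders; grab the real ones from your browser's dev tools after logging in):

    PHP:

    // Fetch one members-only page by replaying a logged-in session cookie.
    // 'example.com' and '_session_id' are placeholders, not the real site.
    $ch = curl_init('http://example.com/members/page1');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, '_session_id=PASTE_YOUR_VALUE_HERE');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    $html = curl_exec($ch);
    curl_close($ch);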
     
  7. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    If I had it, I would pay it; I have been working on this for so long that it may well be worth it, but alas... I do not have such monies to spend. I see your tagline there. That's a lot of turnover. Are you saying that is the total value of the world market?

    This is actually an adult forum with a custom-coded interface, I believe written in Ruby on Rails. The layout is not very dissimilar from Facebook, but the rest of it is entirely different.

     
    Last edited: Jan 7, 2013
  8. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    How do you mean, the same? Like the same format? No, I am sorry, it is not. Ideally, I would want to hijack all the page elements: the pictures as well as the videos and the words. If you are going to annex the Sudetenland, you might as well go after Poland too, right?
     
  9. ijof9

    ijof9 Power Member

    Joined:
    Mar 27, 2010
    Messages:
    536
    Likes Received:
    594
    Occupation:
    CTO
    Location:
    Western Europe
    Ruby is server-side; it doesn't matter, since we don't have server access.

    What I meant is that you're basically pulling out all the available member data (including media) and importing it into your own website, right? I'm not sure what uBot can do, but you could call it crude scraping. It doesn't matter what the platform is; all that matters is that the HTML doesn't change significantly between member pages, and 99.9% of the time that's the case. Your only problem is importing it afterwards. That's why I called >$10 into play, because you're going to want to end up with a usable website.
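    A crude version of that loop might look like this, assuming the member pages sit at sequential URLs like /members/1, /members/2, and so on (swap in the site's real URL pattern and your real session cookie):

    PHP:

    // Walk the member pages one by one and dump the raw HTML to disk.
    $cookie = '_session_id=PASTE_YOUR_VALUE_HERE'; // placeholder cookie
    for ($id = 1; $id <= 5000; $id++) {
        $ch = curl_init("http://example.com/members/$id"); // placeholder URL pattern
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html !== false) {
            file_put_contents("member_$id.html", $html);
        }
        sleep(1); // be gentle, or you'll get your account banned
    }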

    Don't worry about my sig; it's just a big number. The FX market's daily turnover never passed 2 trillion, and I doubt they'd spend that much on advertising worldwide :) Or maybe, who knows...

    Btw, how did you make your stars green?
     
  10. regme

    regme Newbie

    Joined:
    Jun 16, 2012
    Messages:
    46
    Likes Received:
    10
    If you can collect the links, just use cURL with a filled-in User-Agent and grab them all. This script might be helpful:
    PHP:

         
    function curl_get_contents( $url )
    {
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_HEADER, 0 );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt( $ch, CURLOPT_TIMEOUT, 20 );
        curl_setopt( $ch, CURLOPT_PORT, 80 );
        curl_setopt( $ch, CURLOPT_USERAGENT, 'Firefox' );
        $content = curl_exec( $ch );
        $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        if ( $code >= 400 )
            $content = false;
        curl_close( $ch );
        return $content;
    }

    $text = file('FILE_WITH_LINKS'); // links divided by newlines
    foreach ( $text as $link )
    {
        $link = trim($link); // strip the trailing newline file() leaves in
        $name = explode('/', $link); // works if the URL has no parameters
        $num  = count($name);
        if ( file_put_contents($name[$num - 1], curl_get_contents($link)) )
            echo 'saved to ' . $name[$num - 1] . '<br>';
    }
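    To actually use it: put one URL per line in a file called FILE_WITH_LINKS (or change that name in the script), save the code as, say, grab.php, and run php grab.php from the command line. Each page gets saved in the same folder, named after the last segment of its URL.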
     
    Last edited: Jan 7, 2013
  11. audioguy

    audioguy Power Member

    Joined:
    Jun 12, 2010
    Messages:
    609
    Likes Received:
    224
    Location:
    Anywhere in the world building WP sites.
    What have you tried with wget that failed?

    When you report that something doesn't work, the details would help. Perhaps you just overlooked something?
     
  12. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    The site is a membership site. I asked for help with wget on an Ubuntu forum and used the exact command string they gave me. It didn't work.


     
  13. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    I never considered a solution that would pull the content automatically into my site. My plan had been to harvest all of the content and build out my site manually using it, especially the text. I have had the feeling for a while that my tube site would rank shitloads higher if I had more keyword-rich content. So I wanted to harvest all the text from this site, which G has apparently never seen, and put it into the descriptions of the videos on my tube site.

    There are a few other paid adult forums that block out the G where this could be accomplished also.

    My stars? I suppose it has something to do with length of membership, posts, or rep or something? I know they were not always green and turned that color at a certain point.

    For the length of time I have been a BHW member, I still ain't made a penny, and after Panda and the rest, my sites rank lower than when I built them... sucks.

     
  14. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
    It obviously looks like you know what you are doing, and if I had the slightest clue how to use that script, I would probably be thanking you profusely.

    Thank you, but I have no idea what to do with that script, and I do not expect you to spoon-feed me.

     
  15. Asif WILSON Khan

    Asif WILSON Khan Executive VIP Premium Member

    Joined:
    Nov 10, 2012
    Messages:
    10,112
    Likes Received:
    28,526
    Gender:
    Male
    Occupation:
    Fun Lovin' Criminal
    Location:
    London
    Home Page:
  16. Bostoncab

    Bostoncab Elite Member

    Joined:
    Dec 31, 2009
    Messages:
    2,255
    Likes Received:
    514
    Occupation:
    pain in the ass cabbie
    Location:
    Boston,Ma.
    Home Page:
  17. sirgold

    sirgold Supreme Member

    Joined:
    Jun 25, 2010
    Messages:
    1,260
    Likes Received:
    645
    Occupation:
    Busy proving the Pareto principle right
    Location:
    A hot one
    Yes, sure: karma and little fairies flying all over us, and the aliens among us, and doomsday, and the Illuminati... ;) Coming back to planet Earth: cURL with cookie support might be helpful, with a simple script to tell it what to do, since it lacks the handy recursive switch wget has. If the website is particularly smart, you might need an ad hoc C#, Python, or similar script with JS support. BTW, did you try iMacros? It's easy to hack something together, and it will act as a normal browser rather than a scraper. It's an easy one to give a shot, imo ;)
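    For the cookie part, something like this in PHP does the trick: log in once with a POST, let cURL write the session cookies to a jar file, then reuse the jar for every later request. A rough sketch, with the login URL and form field names made up (pull the real ones from the site's login form):

    PHP:

    $jar = tempnam(sys_get_temp_dir(), 'cookies'); // cookie jar file

    // 1) Log in once; cURL stores the session cookies in the jar.
    $ch = curl_init('http://example.com/login'); // placeholder URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'username=me&password=secret'); // placeholder fields
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    // 2) Later requests read the jar, so the session stays alive.
    $ch = curl_init('http://example.com/members/page1'); // placeholder URL
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);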
     
  18. regme

    regme Newbie

    Joined:
    Jun 16, 2012
    Messages:
    46
    Likes Received:
    10
    Sorry, I misunderstood what you want. My simple script just downloads the page code from the gathered links. As I see it, you need a bit more, and that means using regular expressions to parse out the targeted info.
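    Something like this on each saved page would do it; the div class here is invented, so match the pattern to the site's actual markup:

    PHP:

    // Pull the text out of every <div class="post-body">...</div> block.
    // 'post-body' is a made-up class name; inspect the real HTML first.
    $html = file_get_contents('member_1.html');
    if (preg_match_all('#<div class="post-body">(.*?)</div>#s', $html, $matches)) {
        foreach ($matches[1] as $chunk) {
            echo trim(strip_tags($chunk)) . "\n\n";
        }
    }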