Software to harvest content from a site?

Bostoncab

Hello,

I found a membership site that blocks all of its content from the search engines. I want to hijack all of it. I have been trying to use wget, etc., and I cannot get anything to work. Perhaps someone can suggest a method or software that will work?

I would consider any suggestions helpful.

Please save the moral discussion on whether or not I should do this for DP or some other forum. I don't think any other black hearts want to discuss it.
 
Content theft tut tut

What platform is the site using? And what exactly are you looking to take?
 
I was thinking about writing something here like:
PM me the URL of that site. Maybe I will be able to help.
But you might think that I want to hijack that content myself. Anyway, if you could share the URL of that site, maybe I will find a solution.
 

If I were you, I would not do such a thing. Remember, everything comes back to you, the good and the evil.
 
I can code you something in uBot if the content to be scraped is the same across multiple pages.
 
Plenty of solutions.

cURL, iMacros, a Chrome extension.

Those can keep you logged in (sending the proper cookie headers) while you request pages one by one to download the member data. If it's worth $200-$300 (success fee, of course) to you, hit me up via PM; I'll do the work myself and send it to you.
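Roughly, the cURL version looks like this. The URL and the cookie name/value below are just placeholders; you'd copy the real session cookie out of your browser after logging in to the member area:
PHP:
<?php
// Sketch only: fetch one members-only page while "logged in" by replaying
// the browser's session cookie. Cookie name/value and URL are placeholders.
$url    = 'http://example.com/members/video-1';
$cookie = 'session_id=PASTE_YOUR_SESSION_COOKIE_HERE';

$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );
curl_setopt( $ch, CURLOPT_COOKIE, $cookie );          // sends the Cookie: header
curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0' ); // look like a normal browser
$html = curl_exec( $ch );
curl_close( $ch );

if ( $html !== false )
    file_put_contents( 'video-1.html', $html );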
 
If I had it I would pay it, because I have been working on this for so long it may well be worth it, but alas, I do not have such monies to spend. I see your tag line there. That's a lot of turnover. Are you saying this is the total value of the world market?

This is actually an adult forum with a custom-coded interface, I believe written in Ruby on Rails. The layout is not very dissimilar from Facebook, but the rest of it is entirely different.

I can code you something in uBot if the content to be scraped is the same across multiple pages.

How do you mean the same? Like the same format? No, I am sorry, it is not. I would ideally want to hijack all of the page elements: the pictures as well as the videos and the words. If you are going to annex the Sudetenland you might as well go after Poland too, right?
 
Ruby is server side. Doesn't matter since we don't have server access.

What I meant is that you're basically pulling out all available member data (including media) and importing it into your own website, right? I'm not sure what uBot can do, but you can call it crude scraping. It doesn't matter what the platform is; all that matters is that the HTML code doesn't change significantly between member pages, and 99.9% of the time that's the case. Your only problem is importing afterwards. That's why I called >$10 into play, because you're going to want to end up with a usable website.
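To be clear, the crude version is just a loop: if the member pages sit at predictable URLs, you walk an ID range with the same session cookie and dump each page to disk, then parse and import as a second step. The URL pattern, ID range, and cookie below are made up for illustration:
PHP:
<?php
// Crude scraping sketch: same HTML layout on every member page, so just walk
// an ID range and save the raw pages for later parsing/importing.
// URL pattern, ID range, and cookie are assumptions, not the real site's.
$cookie = 'session_id=PASTE_YOUR_SESSION_COOKIE_HERE';

for ( $id = 1; $id <= 50; $id++ ) {
    $url = "http://example.com/members/$id";

    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $ch, CURLOPT_COOKIE, $cookie );
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0' );
    $html = curl_exec( $ch );
    curl_close( $ch );

    if ( $html !== false )
        file_put_contents( "member-$id.html", $html );

    sleep( 2 ); // be gentle so the account doesn't get flagged
}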

Don't worry about my sig, it's just a big number. The FX market's daily turnover never passed 2 trillion; I doubt they'd spend that much on advertising worldwide :) Or maybe, who knows...

Btw, how did you make your stars green?
 
If you can collect links, just use cURL with a filled-in User-Agent and grab it all. This script might be helpful:
PHP:
function curl_get_contents( $url )
{
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_HEADER, 0 );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 20 );
    curl_setopt( $ch, CURLOPT_PORT, 80 );
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Firefox' );
    $content = curl_exec( $ch );
    $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
    if ( $code >= 400 )
        $content = false;
    curl_close( $ch );
    return $content;
}

// Links file: one URL per line.
$text = file( 'FILE_WITH_LINKS', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
foreach ( $text as $link ) {
    // Use the last URL segment as the file name (assuming no query parameters).
    $name = explode( '/', $link );
    $num  = count( $name );
    if ( file_put_contents( $name[$num - 1], curl_get_contents( $link ) ) )
        echo 'saved to ' . $name[$num - 1] . '<br>';
}
 
What have you tried with wget, and how did it fail?

When you report that something doesn't work, the details would help. Perhaps you just overlooked something?
 
The site is a membership site. I asked for help using wget on an Ubuntu forum and I used the command string they gave me. It didn't work.


 
I never considered a solution that would pull the content automatically into my site. My plan had been to harvest all of the content and build out my site manually using the harvested content, especially the text. I have had the feeling for a while that my tube site would rank shitloads higher if I had more keyword-rich content. So I wanted to harvest all the text from this site, which G has apparently never seen, and put it into the descriptions of the videos on my tube site.

There are a few other paid adult forums that block out the G where this could be accomplished as well.

My stars? I suppose it has something to do with length of membership, posts, or rep or something. I know they were not always so and turned that color at a certain point.

For the length of time I have been a BHW member I still ain't made a penny, and after Panda and the rest my sites rank lower than when I built them... sucks.

 
It obviously looks like you know what you are doing, and if I had the slightest clue how to use that script I would probably be thanking you profusely.

Thank you, but I have no idea what to do with that script, and I do not expect you to spoon-feed me.

 
Yes, sure: karma and little fairies flying all over us, and the aliens among us, and doomsday, and the Illuminati... ;) Coming back to planet Earth: cURL with cookie support might be helpful, with a simple script to tell it what to do, since it lacks the handy recursive switch wget has. If the website is particularly smart you might need an ad hoc C# or Python or similar script with JS support. BTW, did you try iMacros? It's easy to hack something together and it'll act as a normal browser rather than a scraper. It's an easy one to give a shot, imo ;)
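Side note: wget itself can replay browser cookies too, so you'd keep its recursive switch. Something roughly like this, where cookies.txt is an export of your logged-in browser cookies (e.g. via a cookies.txt export extension) and the URL is a placeholder:
Code:
# Sketch only: log in with the browser, export cookies.txt, then let wget
# replay those cookies while it crawls recursively.
wget --load-cookies cookies.txt \
     --keep-session-cookies \
     --user-agent="Mozilla/5.0" \
     --recursive --level=3 \
     --page-requisites --convert-links \
     --no-parent \
     --wait=2 --random-wait \
     http://example.com/members/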
 
Sorry, I misunderstood what you want. My simple script just downloads the page code from the gathered links. As I see it, you need a bit more than that, and that means using regular expressions to parse out the targeted info.
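For example, if the text you want sits in a div with a known class, something like this would pull it out of each saved page. The class name and file names here are assumptions; check the real page source and adjust:
PHP:
<?php
// Sketch: pull "targeted info" out of the saved pages with a regular expression.
// The div class below is an assumption - inspect the real HTML and adjust.
foreach ( glob( 'member-*.html' ) as $file ) {
    $html = file_get_contents( $file );
    if ( preg_match( '#<div class="description">(.*?)</div>#s', $html, $m ) ) {
        $text = trim( strip_tags( $m[1] ) );
        echo $file . ': ' . $text . "\n";
    }
}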
 