Where to start: scrape a lot of content

Discussion in 'Black Hat SEO' started by pbozhenev, Jul 27, 2017.

  1. pbozhenev

    pbozhenev Newbie

    Joined:
    Jul 25, 2017
    Messages:
    7
    Likes Received:
    0
    Gender:
    Male
    Hi,

    Sorry I am new, not sure if I am in the right section. But anyway,

    I am pretty new to internet marketing, but I know what I want and I've already explored some stuff but I can't find a structured answer to what I have in plans. Maybe someone can help me?

    Basically, I want to find a way to scrape content from lots of websites in an automated or at least semi-automated way. I want to find something where I can submit a URL and it returns all the content of the website in a structured way (no HTML tags, etc.). Then I want to use a tool to modify the content slightly by using synonyms, rephrasing and things like that. The goal is to have a database of hundreds of articles of scraped content, modified enough to not be categorized as plagiarism.

    I've checked out some guides and tools and downloaded some plugins, but I can't seem to find something that suits my needs.

    Could someone recommend the best tools out there for that? Or point me in the right direction to learn it by myself (I am comfortable with programming), or tell me if it's even possible. If not, what is the limit of what would be possible? I found a bunch of threads already on web scraping and different tools but I can't figure out where to start / what are the BEST tools (I don't mind spending a few hundred bucks or time learning about it).

    Any guidance would be VERY appreciated.

    Thanks.
     
  2. w.waterman

    w.waterman Registered Member

    Joined:
    Jun 12, 2017
    Messages:
    84
    Likes Received:
    21
    Gender:
    Male
    Occupation:
    Doing what it takes to make bread
    Sounds like you're lazy.
    I used what is apparently the best word spinner (WordAI), and it isn't good enough to completely avoid plagiarism. It can, but then the result doesn't read very well.
    You will have to put the work in to edit EVERYTHING.
    Or pay for rewrites, and again you will have to edit EVERYTHING unless you find a really good native writer.

    As for the scraping, I'm not very sure. It will take some time to rewrite everything anyway, so it shouldn't need to be automated: you might not be able to upload that fast, so why scrape that fast?

    Anyway, my point is that WordAI was OK but still needs work. Maybe the next version will be better.

    As for scraping tools, I have zero knowledge. Plausible deniability.
     
  3. pbozhenev

    pbozhenev Newbie

    Thank you for your reply. I will definitely check out WordAI ASAP and test it. I am not lazy, honestly. I am just trying to find out what the limit of what is possible is, and from what point I have to start doing the edits and work myself. I had no idea what to look for, but now with your help I at least know which tool I can start with for that part of my plan.

    My plan is to build a quality PBN without making it super time consuming so I am trying to find ways to massively populate those sites with "decent" content without having to write each article from scratch one by one and make it as automated as possible. I know that I will need to do the editing either way but I am trying to find ways to minimize my work input in order to leverage my time.

    In the same way, I am looking for how to program bots, or for the best bot tools out there, so that this part is as automated as possible too and I gain time and leverage. I've been going through threads about each particular one, but it would be great if someone could point me towards a guide, a list of the best bots out there, reviews, or how to program them. Anything would help, really.
     
  4. itz_styx

    itz_styx Power Member

    Joined:
    May 8, 2012
    Messages:
    675
    Likes Received:
    343
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    You could check out my tool; it does exactly what you want. Sorry, I'm the dev and I know people instantly think of advertising, but since you requested it I thought I could mention it.
    It's called Argo Content Generator and is mainly a toolkit made to create websites and WP blogs. It has an article scraper that can scrape from many article directories or different search engines like Google, Bing, Rambler, AOL, etc.
    The articles can be stored in an SQL database without any need for an external database server; it's all handled by the program. Additionally, you can autospin the articles in different languages without having to use external software like WordAI that requires an extra subscription. The text can also be further manipulated with different algorithms so the structure isn't the same. Here is a video so you can see it in action:
    In case you have any questions, just PM me :)
     
  5. ContentWriter

    ContentWriter Jr. VIP

    Joined:
    May 8, 2013
    Messages:
    3,030
    Likes Received:
    398
    Occupation:
    Content Writer
    Hi, @pbozhenev.

    Are you planning to build a churn-and-burn website?

    If that's not your goal, you should avoid spun content. Escaping plagiarism takes more work than just replacing words with their synonyms.
    Even if you use content spinners, hard work is still needed to tweak the spun content into something that makes sense.
     
  6. itz_styx

    itz_styx Power Member

    Who cares about things like Copyscape? Certainly not Google ;)
     
  7. pbozhenev

    pbozhenev Newbie

    Thank you for your reply @ContentWriter . I am thinking about a churn-and-burn but that's more of a second option.

    After I use spinning software, how much hard work exactly would it take to create something that would escape plagiarism? Could you define "hard work", the amount of time I would need to spend per 500-word article, and what would need to be done manually after I use a tool like the one mentioned by @itz_styx ? How likely is it, and how quickly, that a spun article would be categorized as plagiarism, and what would Google's procedure be in such a case?

    Sorry I have a lot of questions but you must have figured by now that I have no idea what I'm getting into. Just trying to configure a great step-by-step plan which could give me good leverage.
     
  8. ContentWriter

    ContentWriter Jr. VIP

    The people who own the copied content care about their work. Over the 17 years that I've been in the content writing business, a few clients have shared stories with me from when they were starting up. They copied the content of some big brands, only to receive a DMCA takedown notice later.
     
  9. pbozhenev

    pbozhenev Newbie

    I want to mention as well that I plan to scrape content mostly from expired domains and their archived websites, as much as possible. Failing that, I would scrape it from lower-ranking / unpopular sites. I have already started doing some research on that end.
     
  10. ContentWriter

    ContentWriter Jr. VIP

    Hi, @pbozhenev.

    The common problems with content spinning software are the following:
    - grammar issues
    - missing punctuation marks
    - broken train of thought or thoughts that do not make sense

    I've lost count of the number of BHW members who came to me asking me to rewrite their spun content. In most cases, I tell them that it's better for me to write the article from scratch instead of fixing what's already broken.

    I would spend more time fixing spun content than writing an article from scratch.

    Now, if you don't spend time fixing your spun content, you're at risk of earning comments, if not insults, from your readers about how nasty the article is, especially if you share your articles on social media.

    I cannot say, "Oh, it'll only take you 20 to 30 minutes to fix a spun article," because it depends on how meticulous you are as a proofreader.
     
  11. itz_styx

    itz_styx Power Member

    With Argo it would be easy: you have hundreds of articles ready in a matter of minutes (spun/synonymized; it also supports nested spintax, which most other tools lack), and it can post them directly to WordPress or create a complete HTML site out of the text with hundreds or even thousands of subpages.
    I've been developing this tool since 2010 (public since 2011) and no Google update has ever caused any issues. That's because it's not simply copy-pasting some content; it creates complete websites or populates WordPress blogs in a more sophisticated way, unlike other tools that just post random spun content and leave it at that, and then people wonder why it doesn't work. If it's done properly it does work, and you don't have to spend hours and hours preparing content.
    Of course, if you want a whitehat site, then write the content yourself, but auto-generated content works just as well; it's just a matter of what you want to do with the sites. You can get them ranked for sure, and not only for churn and burn. I've got many sites with auto-generated content that have been indexed in Google for years. Sometimes they lose ranking, but so far I could always bring them back with some link juice. Also, this doesn't only work with low-competition keywords. Just give it a try and see for yourself; Google isn't as clever as they pretend ;) Besides, most of the people who claim that spun content doesn't work never even tried it and just repeat what they've heard somewhere. That's a fact.
     
  12. ContentWriter

    ContentWriter Jr. VIP

    @pbozhenev, try the recommendation of @itz_styx, because you'll never know what you're gaining or losing until you try. But I suggest that you invest only the time and money that you can afford to put at risk.

    I tried spun content for my own websites. It didn't work.
    Many people are coming to me asking me to fix their content. That confirms and validates my experience that spun content is not the way if you want to be in the game for the long haul.

    Do you frequent social media? Haven't you stumbled upon netizens' comments that go like this, "I smell scam on this website. Just look at the grammar of this article and the rest of their articles. No legit company would be willing to publish such low-quality content"?

    I've seen so many of those on Facebook.

    At the end of the day, we all have the ability and common sense to identify what will really work on a long-term basis. In most cases, you'll find that common sense is all you need.
     
  13. itz_styx

    itz_styx Power Member

    You forget that you are in the blackhat section. Nobody is looking to put up spun content on his company's website! Sure, for whitehat sites I also write my own content, obviously. However, if you are trying to make money from affiliate and CPA offers, this is a great way to scale up, and that's why many people do it. You seem to be in completely the wrong place, or you miss the whole point of why people do this! It's surely not for company or ecommerce websites.

    Sure, you can make some bucks with your pretty content, but you can make just as much or even more by having many more sites and sending the traffic to offers rather than wanting visitors to stay on your own site. That's the whole point you are missing. The money you invest here is easily earned back; if the ROI didn't match, nobody would bother doing it. Aside from that, you can also cloak it, so only search engines see it and real users get sent to a different landing page. Yes, I know you probably think that cloaking doesn't work either, but it may shock you that it works well.
    Besides, if you post a little random spun content on your sites and it doesn't perform well, that means nothing. There are many factors; maybe you just didn't promote it well enough ;)
     
  14. ContentWriter

    ContentWriter Jr. VIP

    Maybe you're right, @itz_styx. I'm glad that works for you.

    Let's hope OP finds what works for him.
     
  15. itz_styx

    itz_styx Power Member

    Yes, good luck, and don't be afraid to try things even if people say they won't work! That's the best way to learn what really works for you and what doesn't. Every niche is different :)
     
  16. sagarmadan

    sagarmadan Newbie

    Joined:
    Sep 27, 2014
    Messages:
    46
    Likes Received:
    9
    This is very basic. You won't really need much programming knowledge for this.
    I would suggest Python with requests as the best solution; if certain websites need a login, it's easier to do it with Selenium.
    I wouldn't mind helping you with this if you have a decent idea.
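
    As a rough illustration of the "no html tags" part of the original request, here is a minimal sketch that turns an HTML string into plain text using only Python's standard library. The names TextExtractor and html_to_text are made up for this example, and the fetch step (for instance requests.get(url, timeout=10).text) is an assumption that is left out so the snippet runs without a network connection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the text chunks of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    # Hypothetical sample page; in practice the HTML would come from a fetch step.
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

    This only strips markup. It does not try to separate the article body from menus, footers and other page furniture, which is usually the harder part.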

    Edit:
    You want to do it from the archive.
    I had the same plan a few months ago. The problems I faced:
    I scraped 1000 articles as a test (obviously I wasn't the first one to do it).
    40-50% of these articles were 0% unique.
    The problem is that running Copyscape on all those thousands of articles gets very expensive.
    If you scrape 10k articles, there will surely be a few high-quality and unique ones, but how do you find them?
    I tried to manually find websites that had unique content in the archive. (Those websites hardly have more than a few articles, so they're not worth scraping [doing it manually was faster].)
    This can certainly be done, but the research required to get the results you are looking for will take time and effort. It's much better to just outsource these.
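
    To make the "0% unique" idea concrete, here is a minimal sketch of one common way to score overlap between two texts: split each into overlapping three-word shingles and take the Jaccard similarity of the two sets. This is an assumption chosen for illustration, not how Copyscape works, and it can only compare against texts you already hold locally, not against the whole web. The function names and the shingle size of 3 are arbitrary choices.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Shingle-set Jaccard similarity: 0.0 = no overlap, 1.0 = same shingle sets."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "the quick brown fox jumps"
    candidate = "the quick brown fox sleeps"
    print(jaccard(original, candidate))
```

    A score near 1.0 means the two texts share almost all of their three-word sequences; where to draw the line for "unique enough" is a judgment call that this sketch does not settle.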
    Recently I came across Article Forge. It creates pretty neat content that also gets indexed.
    If you need these articles for PBN domains, collect a lot of keywords related to the niche (sort them into main keyword, sub, sub, sub-keyword) and send them to a VA, or do it yourself.
    Extract the articles from Article Forge and run them through Grammarly once. (PM me if you need a script to automate this. I have a simple JavaScript to fix all grammar errors in Grammarly.)
    Save the finished article in a folder and it's ready to be posted anywhere.

    [Also use Copyscape with Article Forge.]
    Or you can try itz_styx's tool.

    I haven't had a chance to try it yet, but I have heard that a lot of people use it for similar tasks.
    good luck :)
     
    Last edited: Jul 27, 2017