Journey to 100k selling scraped articles from archive.org

Discussion in 'My Journey Discussions' started by mylastvacation, May 20, 2020.

  1. mylastvacation

    mylastvacation Jr. VIP Jr. VIP

    Joined:
    Apr 26, 2018
    Messages:
    777
    Likes Received:
    639
    Gender:
    Male
    Occupation:
    Teacher
    Location:
    Ecuador
    Home Page:
    This will be a no-fluff, straight-to-the-point, technical kind of journey. No speculation, no second-guessing ourselves, no planning, just taking action and following my motto: "It doesn't need to be perfect, it just needs to be done".

    For this journey, I have partnered with @Jrim_Software to sell scraped content from the Wayback Machine. He'll be doing the back end, I'll be doing the front end.

    What do I have?
    • A large number of very strong Reddit, Fiverr, Facebook, Instagram, Pinterest, YouTube, and other social media accounts to promote our service, plus software to automate them.
    • A mailing list of a few hundred bloggers and people interested in getting free or cheap articles for their blogs.
    • Unlimited web hosting accounts, VPS, proxies, domains, and the like.
    • 5 VAs in India, Indonesia, Argentina and Nigeria whom I have trained and trust. They will be taking care of customer service, order fulfillment and other non-technical aspects.
    • A large network of blogger friends to help me promote my service, in exchange for an affiliate commission.
    • Strong copywriting skills.
    What does he have?
    • Programming knowledge. He is halfway through building the scraper (which is actually the most important piece of the puzzle).
    • Native English skills to proofread the rubbish I write.

    What do we have together?
    • Trust in each other
    • Work ethic.
    • A desire to thrive.
    As most of you know, most JVs fail because one of the parties ends up doing more than the other, or they can't agree on how to do some things. Sometimes they just don't have the motivation to keep pushing forward. This is not the case here, as we are both experts in our respective fields and we're ready to combine our strengths.

    How will the profit be split?
    50/50 after expenses.

    What has been done so far?
    • Exploring the potential of introducing AI into our project with that famous GitHub code that was said to be too dangerous to release and then got released anyway.
    • Testing rounds:
      1. It searches expireddomains for the query; we are currently looking at the first 3 pages just to see how it works in a small test. It found 75 domains on the first 3 pages with ACR between 10 and 300 (that's every single result in this case, since there are 25 per page).
      2. It checked the Wayback Machine for the sitemap for every URL of each domain.
      3. It's now working to scrape the articles on the combined 20000 URLs across the 75 domains. Every article is validated for word count >800 and valid Grammarly plagiarism score.
    • Dumping these into a database where they are viewable/searchable from a web UI. This is just to figure out if it's worthwhile and to fine-tune the logic so it works well.
    • Running another test for now. I bumped up the threads to 20 to see how it behaves: 63000 URLs, again only checking the first 3 pages of expireddomains. It looks like it takes about 3-4 minutes per 1000 URLs at 20 threads.
    • Finished: almost 4 hours to crawl those 63k links at a concurrency of 20. Far fewer good articles overall (461 KB). Maybe this is a niche someone has already used this technique on, or perhaps the first few pages and the domains we crawled just didn't have appropriate content. There are 225k domains under wellness in total, and results at least as far back as the 10th page still have an ACR around 100, so there is a ton more content to scrape. I did see a few articles come up in German, which makes me wonder if this method can be improved by specifically sourcing foreign-language articles and then translating them, or vice versa: transforming plagiarized English content into a foreign language and running a foreign-language blog.
    • So far I have had success just using the Grammarly plagiarism checker demo on their website, with no need to sign in. I found the web interface very complex after logging in with the account you bought. The software has a hard time automating complex, content-rich websites with the socket-based logic we use.
    • Figured out how we can keep certain tags in there, like h1 and h2. We can allow some tags so that our customers don't need to spend hours editing the content they get from us.
    • Figured out how to export two different files, one with text only and one with HTML tags intact, so people can choose what to do with it.
    • Testing how to keep all the tags in the content:
      'h1', 'h2', 'h3', 'h4', 'h5',
      'u', 'b', 'i', 'em', 'strong',
      'div', 'span', 'p', 'article', 'blockquote', 'section',
      'pre', 'code',
      'ul', 'ol', 'li', 'dd', 'dl',
      'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
      'label',
      'fieldset', 'legend',
      'img', 'picture',
      'br', 'p', 'hr',
      'a'
    • Figured out how the front end works: users would be given credentials to our web interface, where we handle payment processing and order fulfillment. In the same UI they could see their orders and view the results of the scrape.
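
    As a rough illustration of the tag whitelist and the two export files (text only, and HTML with tags intact), here is a minimal sketch in Python using only the standard library. It is not the actual scraper code: the names `WhitelistCleaner` and `clean_article`, the choice to keep only `href`, `src` and `alt` attributes, and the decision to drop `script` and `style` contents are all assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Whitelist taken from the list in the post.
ALLOWED_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "u", "b", "i", "em", "strong",
    "div", "span", "p", "article", "blockquote", "section", "pre", "code",
    "ul", "ol", "li", "dd", "dl", "table", "th", "tr", "td", "thead",
    "tbody", "tfoot", "label", "fieldset", "legend", "img", "picture",
    "br", "hr", "a",
}
# Assumptions for this sketch, not from the post:
KEEP_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}  # all other attributes go
DROP_CONTENT = {"script", "style"}                   # contents removed entirely
VOID_TAGS = {"br", "hr", "img"}                      # never get a closing tag
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span", "code", "label"}


class WhitelistCleaner(HTMLParser):
    """Builds two outputs from one page: whitelisted HTML and plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.html_parts = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _break_text(self, tag):
        # Block-level boundaries become spaces so words do not run together.
        if tag not in INLINE_TAGS:
            self.text_parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._break_text(tag)
        if tag in DROP_CONTENT:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            wanted = KEEP_ATTRS.get(tag, ())
            attr_text = "".join(
                f' {name}="{escape(value)}"'
                for name, value in attrs
                if name in wanted and value is not None
            )
            self.html_parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self._break_text(tag)
        if tag in DROP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS and not self._skip_depth:
            self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.html_parts.append(escape(data, quote=False))
            self.text_parts.append(data)


def clean_article(raw_html):
    """Return (html_with_allowed_tags, text_only) for one scraped page."""
    cleaner = WhitelistCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    html_out = "".join(cleaner.html_parts)
    text_out = " ".join("".join(cleaner.text_parts).split())
    return html_out, text_out
```

    Calling `clean_article(page)` gives back both versions at once, so the two export files can be written from a single pass over each page. Tags outside the whitelist are removed but their text is kept.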

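    The 20-thread test run and the word count check described above could be sketched roughly like this in Python. This is only an illustration under assumptions: `fetch_text` is a hypothetical function (URL in, article text out) that the caller supplies, the plagiarism check is left out entirely, and none of these names come from the real scraper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MIN_WORDS = 800   # from the post: articles must have a word count > 800
MAX_WORKERS = 20  # thread count used in the second test run


def long_enough(text, min_words=MIN_WORDS):
    """True when the text has more than min_words whitespace-separated words."""
    return len(text.split()) > min_words


def crawl(urls, fetch_text, max_workers=MAX_WORKERS):
    """Fetch every URL on a bounded thread pool and keep the long articles.

    fetch_text is a hypothetical callable (url -> article text). A failed
    fetch is recorded in the second return value instead of stopping the run.
    """
    kept, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_text, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
            except Exception:
                failed.append(url)
                continue
            if long_enough(text):
                kept[url] = text
    return kept, failed


def estimated_hours(url_count, minutes_per_thousand):
    """Rough crawl time from an observed rate in minutes per 1000 URLs."""
    return url_count / 1000 * minutes_per_thousand / 60
```

    With the numbers from the test run, `estimated_hours(63000, 3)` and `estimated_hours(63000, 4)` come out at about 3.15 and 4.2 hours, which lines up with the "almost 4 hours" observed for 63k links.
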
    Click "watch thread" on the top right to get our weekly updates.

    Thank you for reading :)
     
    • Thanks Thanks x 6
  2. GUC

    GUC Newbie

    Joined:
    Dec 14, 2019
    Messages:
    15
    Likes Received:
    2
    Gender:
    Male
    Nice one OP, can’t follow but good luck
     
  3. Jrim_Software

    Jrim_Software Jr. VIP Jr. VIP

    Joined:
    Aug 1, 2011
    Messages:
    787
    Likes Received:
    191
    I'm happy and grateful to be working with @mylastvacation. He is a smart guy! I am very fortunate to find a partner that is not as much of a perfectionist as I am. It means things get DONE!

    You guys are lucky he is posting the journey thread - he shared way more information than I would have. Hope someone else can learn from our journey! :)
     
  4. armin64

    armin64 Registered Member

    Joined:
    Feb 13, 2009
    Messages:
    85
    Likes Received:
    29
    Gender:
    Male
    Location:
    IR
    Home Page:
    Are you going to use GPT-2 for text generation?
     
    Last edited: May 21, 2020
  5. Jrim_Software

    Jrim_Software Jr. VIP Jr. VIP

    Joined:
    Aug 1, 2011
    Messages:
    787
    Likes Received:
    191
    It's something we would like to do more experimentation on. I did some initial tests that were not very promising. The text was interesting but not really coherent or particularly on-topic. With some refinement maybe we can get to where we are trying to go, but it's just not ready yet in its current form.
     
    • Thanks Thanks x 1
  6. armin64

    armin64 Registered Member

    Joined:
    Feb 13, 2009
    Messages:
    85
    Likes Received:
    29
    Gender:
    Male
    Location:
    IR
    Home Page:
    That's an intrinsic feature of such a technology: the text will be very smooth but meaningless. Still, I think we can use it for second-tier link building. I'm working on the same kind of project as well. Good luck.
     
  7. Visual Eagle

    Visual Eagle Jr. VIP Jr. VIP

    Joined:
    Dec 11, 2008
    Messages:
    2,425
    Likes Received:
    1,967
    Gender:
    Male
    Occupation:
    Graphic Designer
    Location:
    Slovenija
    Home Page:
    Journey looks good :) Good luck, though try to avoid the Grammarly plagiarism checker as it doesn't detect much. Copyscape, while paid, is the way to go for providing a better quality service.
     
  8. Lioraky1234

    Lioraky1234 Newbie

    Joined:
    May 10, 2020
    Messages:
    9
    Likes Received:
    0
    Gender:
    Male
    Interesting. Looking forward to following your journey.
     
  9. Sristy

    Sristy Jr. VIP Jr. VIP

    Joined:
    Aug 17, 2010
    Messages:
    2,882
    Likes Received:
    818
    Gender:
    Female
    Location:
    In My Blog Network
    Home Page:
    Nice thought. All the best..
     
  10. Wendy logan

    Wendy logan Registered Member

    Joined:
    May 5, 2020
    Messages:
    76
    Likes Received:
    22
    Gender:
    Female
    This is great, I hope to learn a little bit from you guys. I believe great teamwork is a trait all JVs should have.
     
  11. YanCendra

    YanCendra Registered Member

    Joined:
    Dec 23, 2019
    Messages:
    73
    Likes Received:
    11
    Gender:
    Male
    love to watch more. good luck
     
  12. The_Surge

    The_Surge Junior Member

    Joined:
    May 16, 2020
    Messages:
    107
    Likes Received:
    52
    Gender:
    Male
    Damn! This is what I call smart thinking!
     
  13. SpawneR

    SpawneR Jr. VIP Jr. VIP

    Joined:
    Aug 15, 2014
    Messages:
    1,538
    Likes Received:
    1,880
    Gender:
    Male
    Occupation:
    Internet Marketing
    Location:
    ✅ BEST SMM PROVIDER BELOW ✅
    Home Page:
    Wow, this looks promising buddy! Keep us updated, this could be a huge money maker! :)
     
  14. Kaine

    Kaine Junior Member

    Joined:
    Apr 24, 2015
    Messages:
    101
    Likes Received:
    37
    Gender:
    Male
    Location:
    France
    Home Page:
    Interesting. On the other hand, archive.org is really very slow, and it is quite rare for the links to work anywhere other than on the home page, so you will struggle to scrape at scale.
    That said, I had just suggested something along these lines in a Jimbobo thread, so I also think the idea is largely doable.
    Good luck on your business guys, I'm going to follow all of that closely.
     
  15. HelloInsomnia

    HelloInsomnia Jr. Executive VIP Jr. VIP

    Joined:
    Mar 1, 2009
    Messages:
    1,875
    Likes Received:
    3,029
    Interesting, I'm in the process of doing something very similar, best of luck!

    They have an api..
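
    Assuming "they" here means archive.org, a minimal Python sketch of querying the Wayback Machine CDX endpoint could look like the following. It only builds the URLs and makes no network calls. The function names are hypothetical, and the particular filters chosen (HTTP 200 responses, HTML pages, one row per unique URL) are assumptions for the example rather than anything stated in this thread.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, limit=None):
    """Build a CDX query listing archived HTML pages for a whole domain."""
    params = [
        ("url", domain),
        ("matchType", "domain"),           # include subdomains and all paths
        ("output", "json"),
        ("fl", "timestamp,original"),      # only the fields needed below
        ("filter", "statuscode:200"),      # skip redirects and errors
        ("filter", "mimetype:text/html"),  # skip images, CSS, scripts
        ("collapse", "urlkey"),            # one row per unique URL
    ]
    if limit is not None:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp, original, raw=True):
    """Build the URL of one capture; raw=True adds the id_ flag, which asks
    for the page as originally captured, without the Wayback toolbar."""
    flag = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{flag}/{original}"
```

    Each row returned by the query gives a timestamp and an original URL, which `snapshot_url` turns into the address of the capture to download.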
     
  16. Kaine

    Kaine Junior Member

    Joined:
    Apr 24, 2015
    Messages:
    101
    Likes Received:
    37
    Gender:
    Male
    Location:
    France
    Home Page:
    In that case they should do a complete mirror first, then test for duplicates afterwards. Even taking only the best backup of each site, it would require exceptional storage... I don't know if SQL is the best choice in this case. Flat files should be considered.
     
  17. damewood

    damewood Jr. VIP Jr. VIP

    Joined:
    Dec 1, 2008
    Messages:
    274
    Likes Received:
    110
    Awesome journey, keep it up.

    Following.
     
  18. mylastvacation

    mylastvacation Jr. VIP Jr. VIP

    Joined:
    Apr 26, 2018
    Messages:
    777
    Likes Received:
    639
    Gender:
    Male
    Occupation:
    Teacher
    Location:
    Ecuador
    Home Page:
    Update: The scraper is now ready and our Reddit promotion started. You can imagine my surprise when I went to open the Reddit upvote panels I used to use a few months ago and they were all down, and with the money I had loaded on them!

    Anyway, I checked the BHW marketplace and it seems the automated upvote panels are gone. Now you need to contact the sellers on Skype directly to ask them for upvotes, so that took a while, as you need to find them online and do everything manually.

    Got a few leads from Reddit but no sales yet. Half of my posts got taken down by the mods, so 50% are still there, and on Reddit that means success.

    Today I did posts on Reddit and tomorrow I'll do comments, searching manually for the keywords I'm marketing and replying to the recent comments or reaching out to the commenters.

    On Monday I'll start creating the site to take orders automatically and will send the copy to my VA for her to create the accounts on Fiverr, PeoplePerHour, Upwork, SEOClerks and selly.gg

    On Tuesday I'll start ranking those gigs created on Monday with "fake" orders from different sources.

    Spent: $20 in upvotes, $5 on VPS, $5 on Grammarly account, total $35
    Earned: $0. No orders on the first day, but that was expected. I know it's possible to make no mistakes and still fail; that is no weakness, that is life.

    IMPORTANT: I got a warning from a mod that I'm not allowed to JV with members from other countries who would like to resell our articles in their languages, so please don't PM me about that.
     
  19. Meddie

    Meddie Jr. VIP Jr. VIP Premium Member

    Joined:
    Jan 6, 2015
    Messages:
    5,458
    Likes Received:
    5,824
    Occupation:
    DigitalGeckos.com
    Location:
    DigitalGeckos.com
    Home Page:
    I've been mass-scraping millions of domains for years and nothing has happened. All you need to do is avoid heavy multi-threading, otherwise it will collapse, not because of security but because of their back end.

    ^This, or a Chromium extension will do the job.


    Anyway, good luck @mylastvacation and to your partner with your project!
     
    • Thanks Thanks x 1
  20. aristocratic

    aristocratic Jr. VIP Jr. VIP

    Joined:
    Dec 31, 2017
    Messages:
    1,035
    Likes Received:
    501
    Gender:
    Male