
scraping large lists

Discussion in 'Black Hat SEO Tools' started by utuxia, Aug 10, 2012.

  1. utuxia

    utuxia BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    I am scraping large pligg bookmarking lists using footprints with scrapebox.

    I get a list of about 60k PR1+ pligg bookmarking sites, and then import them into Ultimate Demon's site tester. Out of 60k, I might get about 500 that are recognized as a supported platform (pligg) within Ultimate Demon.

    Why such a large discrepancy?
     
  2. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    Footprints are highly variable ways of finding sites. Search engine results are also very dubious for some footprints (check them yourself).
    They often get completely random after a page or two; sometimes the first site on page one for a certain footprint doesn't seem to have the footprint anywhere at all. Google doesn't like to make life easy - searches aren't an exact science, and the SEs like to "interpret" what they think you are looking for. Same reason some pages rank for entirely spurious stuff. So with Scrapebox set to 1,000 URLs per search, I doubt more than 1 percent of the average search over 1,000 results actually contains the exact footprint you asked it to find. Like I say, just open Google and see for yourself.

    You can visit the URL yourself and not see the search phrase - the "footprint" - anywhere, either on the page or in the HTML code behind it, for maybe 99% of results. It's a million miles away from being an exact science, so you need to scale up MASSIVELY, like panning for gold.

    So - at the end of the day - that's how many are actually Pligg sites with the sign-up page that UD recognises. Another source of failure is that you will pick up pages that merely have the platform footprint mentioned in text... like this...

    inurl:story.php?title=

    There - now this very page will be recognised as a Pligg site by many Scrapebox footprints... of course it isn't.
    It's the same with any scrape for any site type for any software.
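    If you want to check this in bulk rather than by hand, here is a rough Python sketch (purely illustrative - not part of Scrapebox or UD; "urls.txt" and the footprint string are placeholder assumptions) that fetches each scraped URL and tests whether the footprint actually appears in the URL or the page HTML:

```python
# Rough sketch: verify which scraped URLs actually contain a footprint.
# "urls.txt" and FOOTPRINT are placeholder assumptions, not UD settings.
import urllib.request

FOOTPRINT = "story.php?title="   # the Pligg footprint quoted above

def has_footprint(url, footprint=FOOTPRINT, timeout=10):
    """Fetch the page and report whether the footprint appears in the URL or HTML."""
    if footprint in url:
        return True
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            html = resp.read(200_000).decode("utf-8", errors="ignore")
        return footprint in html
    except Exception:
        return False  # dead or slow hosts count as misses

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    hits = [u for u in urls if has_footprint(u)]
    print(f"{len(hits)} of {len(urls)} scraped URLs actually contain the footprint")
```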

    I scrape 2 million URLs a day. I remove dupes and sometimes trim to root (for some footprints you need to trim to root to prevent massive duplication, as every page of some sites has the sign-up link that UD is looking for), then export and split into 40,000-URL blocks.
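    (For anyone who wants to script that step outside Scrapebox, here is a minimal Python sketch of the dedupe, trim-to-root and 40,000-line split. The file names are my own placeholders; Scrapebox can do the same thing internally.)

```python
# Minimal sketch: dedupe, trim to root, and split into 40,000-URL blocks.
# "scraped.txt" and the output file names are placeholder assumptions.
from urllib.parse import urlparse

BLOCK_SIZE = 40_000

with open("scraped.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Trim each URL to its root (scheme + host) and drop duplicates,
# keeping the original order.
roots, seen = [], set()
for u in urls:
    p = urlparse(u)
    root = f"{p.scheme}://{p.netloc}/"
    if root not in seen:
        seen.add(root)
        roots.append(root)

# Split the deduped list into 40,000-line files for UD's site tester.
for i in range(0, len(roots), BLOCK_SIZE):
    with open(f"block_{i // BLOCK_SIZE + 1:02d}.txt", "w") as out:
        out.write("\n".join(roots[i:i + BLOCK_SIZE]) + "\n")
```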

    So I might have 600,000 left in 15 small 40,000-URL files
    Each block of 40,000 takes about 45 minutes (a day for the lot)
    UD accepts maybe 1,500 as new
    I create accounts on every one and post content
    Of which 150 accept the content

    Here's me doing just that with a 40,000 list...



    Same with every tool (scraping blogs for scrapebox/jet/nhseo/GSA), same when using Hrefer with Xrumer. If 0.025% of the URLs you scrape end up accepting links, that is a brilliant return. It's all about scale. Keyword lists of 10,000-plus, 100 proxies, multiple servers, 2 million scraped URLs a day, and away you go.
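    (Those figures hang together - a quick sanity check on the funnel numbers quoted above, with variable names of my own:)

```python
# Sanity check on the funnel figures quoted in the post above.
scraped  = 2_000_000   # raw URLs scraped per day
deduped  =   600_000   # left after dupes removed / trimmed to root
new_ud   =     1_500   # recognised as new by Ultimate Demon
posted   =       150   # actually accept the account and the content

print(f"UD recognition rate: {new_ud / deduped:.3%}")   # ~0.250%
print(f"End-to-end yield:    {posted / deduped:.3%}")   # 0.025%, the figure quoted
print(f"Yield vs raw scrape: {posted / scraped:.4%}")   # ~0.0075%
```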

    That's why I run 2 scrapebox servers 24/7 for my members.

    Scritty
     
    Last edited by a moderator: May 18, 2016
  3. utuxia

    utuxia BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Thanks. You're still getting three times the number of sites that UD recognizes compared to me. Out of 40k you get 1,500 good UD-capable sites; out of 60k I get about 500. Maybe my footprints are too generic. I have been trimming to root and removing duplicates, which I realize means I'll lose out on the sites where /pligg/ is the installation path.
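    (If trimming to the domain root is throwing away subdirectory installs, one workaround is to trim to the install directory rather than the root. A rough Python sketch follows - the marker string and example URLs are my own assumptions, not anything built into Scrapebox or UD.)

```python
# Rough sketch: trim to the Pligg install directory rather than the domain
# root, so an install at example.com/pligg/ isn't collapsed to example.com/.
# MARKER and the sample URLs are placeholder assumptions.
from urllib.parse import urlparse

MARKER = "story.php"   # path segment that identifies a Pligg page

def trim_to_install(url, marker=MARKER):
    """Return the URL trimmed to the directory above the marker segment."""
    p = urlparse(url)
    parts = [seg for seg in p.path.split("/") if seg]
    keep = []
    for seg in parts:
        if marker in seg:
            # keep everything before the marker segment
            path = "/" + "/".join(keep) + ("/" if keep else "")
            return f"{p.scheme}://{p.netloc}{path}"
        keep.append(seg)
    return f"{p.scheme}://{p.netloc}/"   # marker not found: fall back to root

urls = [
    "http://example.com/pligg/story.php?title=some-story",
    "http://bookmarks.example.org/story.php?title=another-story",
]
print(sorted({trim_to_install(u) for u in urls}))
# ['http://bookmarks.example.org/', 'http://example.com/pligg/']
```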
     
  4. Scritty

    Scritty Elite Member Premium Member

    Joined:
    May 1, 2010
    Messages:
    2,807
    Likes Received:
    4,496
    Occupation:
    Affiliate Marketer
    Location:
    UK
    Home Page:
    No - you misunderstood. I get 1,500 out of the lot (though many of the fails are now duplicates, as I've been scraping and sorting non-stop for Ultimate Demon for the past 7 months). I get 1,500 out of 600,000 - of which UD actually allows sign-up and submission to 150.
    150 is a LOT of links for one platform and represents a great couple of days' work. It's all hands-off.

    The exception was between November 2011 and May 2012, when you could scrape 30,000 WIKI sites at the click of your fingers.
    They were all spammed to death, and most of the crap ones are either closed, de-indexed (due to all the spam they got) or refusing any more accounts.

    Wiki spamming... a 6-month fad as predictably self-cannibalising as the article cull a year earlier.

    [Hint - only GOOD article and WIKI sites are left. Few in number but higher in authority. Better still, the little boys and girls with more oxygen in their heads than brain cells have now abandoned article and WIKI marketing altogether - just when the wheat has been sorted from the chaff. Unless you are only impressed by big numbers... the "I want a 30,000 WIKI blast" nonsense... there has never been a better time to use these tools on these platforms. All the crap gone - only the good left. What's not to like about that?]

    Scritty
     
  5. utuxia

    utuxia BANNED

    Joined:
    Feb 14, 2011
    Messages:
    673
    Likes Received:
    111
    Ok... I see. My results are on par with what you're getting, and about the same as the video too. I'll just keep scraping, filtering and scanning with UD.