
How do I clean a scraped list of sites effectively?

Discussion in 'Black Hat SEO Tools' started by RBseatown, Feb 19, 2012.

  1. RBseatown

    RBseatown Junior Member

    Joined:
    Apr 22, 2009
    Messages:
    185
    Likes Received:
    130
    My usual process is to:

    1. use ScrapeBox to scrape a list of sites for one of my SEO tools
    2. trim to root
    3. remove duplicate domains
    4. remove domains with bad keywords in them
    5. I'd say run alive check here, but lately I haven't even done that. I've noticed alive check only works with high-quality proxies in large numbers, which I don't have. When I run it without proxies or with only a few, I get thousands of "DEAD" results for sites that are really "ALIVE", and it's killing my lists, so I stopped doing alive checks.
    6. At this step I plug my list into my software, do a run, and then use the successful submissions for my list.

    The problem is, I'm scraping lists of 100k+ and it's taking my bots and tools forever to run through the whole list. Are there better ways to trim down the results before doing the test run? Am I using alive check wrong, and does it actually work when used properly?
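    For reference, steps 2-4 boil down to something like the sketch below when done outside of ScrapeBox (a minimal sketch only; the file names and the bad-keyword list are placeholders, and it assumes the scraped URLs include a scheme):

```python
# Minimal sketch of steps 2-4: trim to root, remove duplicate domains,
# drop domains containing unwanted keywords. Placeholder file names.
from urllib.parse import urlparse

BAD_KEYWORDS = {"casino", "pharma", "poker"}  # example filter terms only


def trim_to_root(url: str) -> str:
    """Reduce a URL to scheme://host/ (what 'trim to root' amounts to)."""
    parsed = urlparse(url.strip())
    if not parsed.netloc:          # assumes URLs start with http:// or https://
        return ""
    return f"{parsed.scheme}://{parsed.netloc.lower()}/"


def clean(urls):
    seen = set()
    for url in urls:
        root = trim_to_root(url)
        if not root or root in seen:
            continue                                   # skip blanks and duplicate domains
        if any(word in root for word in BAD_KEYWORDS):
            continue                                   # drop domains with bad keywords
        seen.add(root)
        yield root


if __name__ == "__main__":
    with open("scraped_urls.txt") as infile:           # placeholder input file
        cleaned = list(clean(infile))
    with open("cleaned_roots.txt", "w") as outfile:    # placeholder output file
        outfile.write("\n".join(cleaned) + "\n")
```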

    Any help is greatly appreciated, thanks guys.
     
  2. ThreadKiller

    ThreadKiller Power Member

    Joined:
    Jan 31, 2012
    Messages:
    614
    Likes Received:
    303
    Location:
    London
    Try UltraEdit and learn its regex functions in combination with search and replace.
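    For example, find-and-replace patterns along these lines (shown here with Python's re module; UltraEdit's Perl-style regex option takes essentially the same syntax, and the patterns themselves are only illustrations, not a complete cleanup):

```python
# Example regex find/replace passes you might run over a scraped list.
# The sample lines and the bad-word filter are illustrative only.
import re

lines = [
    "http://example.com/blog/2012/02/some-post/#comment-5",
    "https://www.example.com/blog/2012/02/some-post/",
    "http://spam-casino.net/index.php?id=42",
]

cleaned = []
for line in lines:
    line = re.sub(r"[#?].*$", "", line)                    # strip anchors and query strings
    line = re.sub(r"^(https?://[^/]+).*$", r"\1/", line)   # trim to root
    line = re.sub(r"://www\.", "://", line)                # normalise leading www.
    cleaned.append(line)

# Drop lines whose domain contains unwanted words (placeholder word list)
cleaned = [l for l in cleaned if not re.search(r"casino|pharma", l)]
print("\n".join(sorted(set(cleaned))))                     # dedupe and sort
```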
     
  3. Deeve

    Deeve Newbie

    Joined:
    Jul 16, 2011
    Messages:
    23
    Likes Received:
    69
    why would alive checker need proxies at all?
     
  4. Tenshisendo

    Tenshisendo Registered Member

    Joined:
    Nov 20, 2010
    Messages:
    64
    Likes Received:
    24
    You have to be getting terrible results trimming to root. Most home pages can't accept a comment.

    Also, I would not waste my time with alive checker.

    If you are going after auto approve, I usually just do this:

    1-scrape URLs
    2-fast post to all without any cleanup (use either a throwaway URL or just put all your tier 1s here)
    3-run the link checker
    4-if you want to check PageRank and filter out the bad ones, do it here
    5-use the link extractor tool to extract all internal links from the links found in step 3 (rough sketch of this step below)
    6-run the fast poster again
    7-check links and add the found ones to your auto approve list
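    A rough stand-in for step 5, if you want to see what the link extraction amounts to (this is only an illustration using requests and BeautifulSoup, not the ScrapeBox tool itself; the example URL is a placeholder):

```python
# Pull internal links from pages where a comment stuck, so you can post to
# more pages on the same sites. Illustration only.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def internal_links(page_url: str) -> set:
    """Return links on page_url that point to the same host."""
    host = urlparse(page_url).netloc
    try:
        html = requests.get(page_url, timeout=10).text
    except requests.RequestException:
        return set()
    links = set()
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])
        if urlparse(absolute).netloc == host:
            links.add(absolute.split("#")[0])   # drop fragment anchors
    return links


# Usage: feed it the URLs your link checker marked as found
found = ["http://example.com/blog/some-post/"]
targets = set()
for url in found:
    targets |= internal_links(url)
```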
     
  5. kokoloko75

    kokoloko75 Elite Member

    Joined:
    Jan 1, 2011
    Messages:
    1,628
    Likes Received:
    1,935
    Occupation:
    Design director
    Location:
    Paris (France)
    Like ThreadKiller, I use UltraEdit to clean a list via tons of regular expressions.

    Beny
     
  6. metalice

    metalice Junior Member

    Joined:
    Apr 12, 2010
    Messages:
    125
    Likes Received:
    4
    Beny, like what?
    can you give an example please?

    thanks
     
  7. sixalarm

    sixalarm Regular Member

    Joined:
    Aug 21, 2011
    Messages:
    244
    Likes Received:
    101
    Location:
    The Enterprise
    I use the Sick Submitter platform sorter to help out with this. I need large numbers of bookmarking sites for what I do, and I don't have a year to wait for my bot to go through and determine which sites are actually useful. I don't even use Sick Submitter anymore, but the platform sorter is extremely useful.

    It is a free download, and it will go through a large list of sites and organize them by the platform they are using. For example, I can load 100k+ sites into it, and it will determine which ones are PHPDug, Pligg, etc. You might want to check it out. It covers all the major platforms, and it isn't limited to just bookmarking platforms either.
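    In essence a platform sorter does something like the sketch below: fetch each site and bucket it by telltale footprints in the HTML. The footprint strings here are rough guesses for illustration, not the ones Sick Submitter actually uses, and the input file name is a placeholder:

```python
# Bucket sites by platform based on simple HTML footprints. Illustration only.
import requests

FOOTPRINTS = {
    "pligg":   ["powered by pligg", "pligg cms"],
    "phpdug":  ["powered by phpdug", "phpdug"],
    "scuttle": ["scuttle"],
}


def detect_platform(url: str) -> str:
    try:
        html = requests.get(url, timeout=10).text.lower()
    except requests.RequestException:
        return "unreachable"
    for platform, needles in FOOTPRINTS.items():
        if any(needle in html for needle in needles):
            return platform
    return "unknown"


buckets = {}
for site in open("cleaned_roots.txt"):       # placeholder file name
    site = site.strip()
    if site:
        buckets.setdefault(detect_platform(site), []).append(site)
```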
     
  8. ThreadKiller

    ThreadKiller Power Member

    Joined:
    Jan 31, 2012
    Messages:
    614
    Likes Received:
    303
    Location:
    London
    If OP (or anyone) shows a couple of lines from a typical proxy list, I will post an example of how it can be cleaned in UltraEdit using regular expressions.


     
  9. metalice

    metalice Junior Member

    Joined:
    Apr 12, 2010
    Messages:
    125
    Likes Received:
    4
    what do you mean a couple of lines from a typical proxy list?

    a few proxy servers!?
     
  10. ThreadKiller

    ThreadKiller Power Member

    Joined:
    Jan 31, 2012
    Messages:
    614
    Likes Received:
    303
    Location:
    London
    Just an example of the type of raw list that the OP wants to trim down.

     
  11. ljt3759

    ljt3759 Regular Member

    Joined:
    Nov 18, 2009
    Messages:
    228
    Likes Received:
    100
    Occupation:
    Entrepreneur
    Location:
    Lewis-Tisdall.com
    Home Page:
    I personally find trimming to root and then removing duplicate domains takes longer than removing duplicate domains then trimming to root, but it might just be me.
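    If you want to test that ordering on your own list rather than take anyone's word for it, a quick timing sketch (file name is a placeholder, and this is only a rough stand-in for what the tools do internally):

```python
# Time both orders: trim to root then dedupe, vs. dedupe by domain then trim.
import timeit
from urllib.parse import urlparse

urls = [u.strip() for u in open("scraped_urls.txt") if u.strip()]


def trim(url):
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}/"


def trim_then_dedupe():
    return set(trim(u) for u in urls)


def dedupe_then_trim():
    # keep one URL per domain first, then trim only the survivors
    by_domain = {urlparse(u).netloc: u for u in urls}
    return set(trim(u) for u in by_domain.values())


print("trim then dedupe:", timeit.timeit(trim_then_dedupe, number=10))
print("dedupe then trim:", timeit.timeit(dedupe_then_trim, number=10))
```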
     
  12. RBseatown

    RBseatown Junior Member

    Joined:
    Apr 22, 2009
    Messages:
    185
    Likes Received:
    130
    I don't trim to root when going for blog comments :p
     
  13. RBseatown

    RBseatown Junior Member

    Joined:
    Apr 22, 2009
    Messages:
    185
    Likes Received:
    130
    BINGO! Thanks man.