
i have a list of 20 million blog urls - how on earth can i filter it?

Discussion in 'Black Hat SEO Tools' started by links, Dec 19, 2009.

  1. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
I want to dedupe the list down to unique hosts and unique URLs, but it's so big it seems impossible without cutting it into about 200 files! Is there any software I can do this in?

    Or maybe someone could do it for me in exchange for the huge list?

    Thanks.
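    The cut-it-into-files approach can actually be automated in one pass: hash each URL into a fixed number of bucket files so that every duplicate lands in the same (small) bucket, then dedupe each bucket in memory. This is a sketch only — filenames and the bucket count are made up, and it assumes one URL per line:

    ```python
    import hashlib
    import os

    def dedupe_huge_file(src, dest, n_buckets=200, workdir="buckets"):
        """Split src into n_buckets files by hash so each fits in memory,
        then dedupe each bucket and concatenate the results into dest."""
        os.makedirs(workdir, exist_ok=True)
        buckets = [open(os.path.join(workdir, "b%d.txt" % i), "w")
                   for i in range(n_buckets)]
        with open(src) as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                # The same URL always hashes to the same bucket, so
                # duplicates can never end up in different files.
                i = int(hashlib.md5(url.encode()).hexdigest(), 16) % n_buckets
                buckets[i].write(url + "\n")
        for b in buckets:
            b.close()
        with open(dest, "w") as out:
            for i in range(n_buckets):
                path = os.path.join(workdir, "b%d.txt" % i)
                with open(path) as b:
                    seen = set(b)  # each bucket is small enough for memory
                out.writelines(sorted(seen))
                os.remove(path)
    ```

    With 20 million URLs and 200 buckets, each bucket holds roughly 100k lines, which dedupes comfortably in memory on any machine.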
     
  2. dzoniij

    dzoniij Regular Member

    Joined:
    May 24, 2009
    Messages:
    469
    Likes Received:
    111
Do you want your DB to contain only unique domains? If so, I can do that. PM me.
     
  3. Kaimi

    Kaimi Newbie

    Joined:
    Dec 6, 2009
    Messages:
    35
    Likes Received:
    230
    In *nix you can extract unique lines like this (note that uniq only removes adjacent duplicates, so the input has to be sorted first):
    sort -u source.txt > unique.txt
     
  4. miltiades

    miltiades Newbie

    Joined:
    Jun 23, 2009
    Messages:
    37
    Likes Received:
    3
You may use the free version of WebCEO and the backlink tool it offers. The software gives a list of unique filtered domains for your site.
     
  5. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
OK, thanks for the help guys. In the end I had to split it, and ended up with 1.7 million blogs after unique-URL'ing it. Just gotta unique-host it now.
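    For the unique-host step, a minimal sketch using the standard library's urlparse to pull the host out of each URL (function name is made up; it keeps the first URL seen per host):

    ```python
    from urllib.parse import urlparse

    def unique_hosts(urls):
        """Return one representative URL per host, keeping the first seen."""
        seen = set()
        out = []
        for url in urls:
            # netloc is the host part; lowercase it so A.com == a.com
            host = urlparse(url.strip()).netloc.lower()
            if host and host not in seen:
                seen.add(host)
                out.append(url.strip())
        return out
    ```

    Since there are only 1.7 million lines at this point, the whole set of hosts fits in memory and no file-splitting is needed.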
     
  6. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    979
    Likes Received:
    465
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
Pretty easy with PHP; I just have to downgrade Ubuntu and reinstall LAMP if you want me to do it. I'd be interested in trying to filter these by PR and ******** too. I'm not sure how fast I can query PR for this, though, but that would be a nice database to have around.
     
  7. links

    links Regular Member

    Joined:
    Mar 4, 2009
    Messages:
    210
    Likes Received:
    195
Well, I have tried doing it by PR. I mean, there must be some gems in the list, but it's just gonna take so, so long to do the PR, even for the 2 mil list I got after unique-URL'ing.
     
  8. dannyhw

    dannyhw Senior Member

    Joined:
    Jul 16, 2008
    Messages:
    979
    Likes Received:
    465
    Occupation:
    Software Engineer
    Location:
    New York City Burbs
    I would just do PR for the domain not every link. I'd just unique URL it first. I could write a script to generate a unique list by domain in about 10 minutes. Then I'd probably load into a SQL database and start mining the PR and nofollow information and maybe whether there's a captcha and stuff like that which would be a script running over a weekend in the background.