
Harvesting more than 1 million URLs in Scrapebox

Discussion in 'Black Hat SEO Tools' started by dunk15us, Dec 6, 2010.

  1. dunk15us

    dunk15us Regular Member

    Joined:
    Sep 24, 2010
    Messages:
    227
    Likes Received:
    37
    I've recently run into the problem of not being able to import or harvest more than 1 million URLs in Scrapebox. I have a list of about 10 million URLs and I want to remove the duplicates, but I can't load more than a tenth of them into Scrapebox at once. Does anyone have experience dealing with this problem, such as a script or something I could use outside of Scrapebox to remove duplicate URLs/domains? Thanks for any and all input!
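    One way to handle this outside Scrapebox is a small script that streams the file and keeps a set of the lines it has already seen. Below is a rough Python sketch, assuming one URL per line in a plain text file; the file names are just placeholders. Holding roughly 10 million URLs in a set takes a few GB of RAM, so it wants a reasonably beefy machine.

    Code:
    # dedupe_urls.py - drop duplicate lines from a large URL list.
    # Assumes one URL per line; the file names are placeholders.

    seen = set()

    with open("urls_raw.txt", "r", encoding="utf-8", errors="ignore") as src, \
         open("urls_unique.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url or url in seen:
                continue
            seen.add(url)
            dst.write(url + "\n")

    print("unique urls kept:", len(seen))

    If you want unique domains instead of unique URLs, key the set on urllib.parse.urlparse(url).netloc rather than the full line.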
     
  2. CyrusVirus

    CyrusVirus BANNED Premium Member

    Joined:
    Aug 20, 2009
    Messages:
    1,110
    Likes Received:
    686
    You know, I run into this problem at least 3-5 times a day. It gets annoying, but I haven't found a solution yet. If I do, I'll let you know.
     
  3. ericsson

    ericsson Elite Member Premium Member

    Joined:
    Apr 25, 2009
    Messages:
    2,642
    Likes Received:
    8,132
    Occupation:
    www
    Location:
    Swe
    Home Page:
    Use "filesplitter"

    Split those into 10-12 files (So it won´t reach the 1.000.000) limit.
    Load into scrapebox, remove duplicate urls.
    Do it on all files. How many you have left now?
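    If you'd rather script the split than install another tool, here is a rough Python sketch of the same idea; the chunk size and file names are placeholders.

    Code:
    # split_urls.py - split a big URL list into chunks Scrapebox can open.
    # Chunk size and file names are placeholders; adjust to taste.

    CHUNK_SIZE = 900000  # stay safely under the 1,000,000 limit

    chunk = 0
    count = 0
    out = open("urls_part_0.txt", "w", encoding="utf-8")

    with open("urls_raw.txt", "r", encoding="utf-8", errors="ignore") as src:
        for line in src:
            if count == CHUNK_SIZE:
                out.close()
                chunk += 1
                count = 0
                out = open("urls_part_%d.txt" % chunk, "w", encoding="utf-8")
            out.write(line)
            count += 1

    out.close()
    print("wrote %d files" % (chunk + 1))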
     
  4. dunk15us

    dunk15us Regular Member

    Joined:
    Sep 24, 2010
    Messages:
    227
    Likes Received:
    37
    Yeah, that's a good idea. The only problem I see with that method is that if, hypothetically, I wanted to sell a list of say 5 million URLs, then even after splitting them I would still have to combine everything back into one file to make sure the finalized list had no duplicates. Wouldn't I run into the same problem again with a big enough list?
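    One way around that is to do the final pass without ever loading the whole list at once: hash every URL into a small number of bucket files, so identical URLs always land in the same bucket, then dedupe each bucket on its own and concatenate the results. A rough Python sketch of that idea; the bucket count and file names are placeholders.

    Code:
    # bucket_dedupe.py - dedupe a huge URL list with bounded memory.
    # Identical URLs always hash into the same bucket file, so each
    # bucket can be deduped on its own. File names are placeholders.
    import hashlib

    BUCKETS = 16
    outs = [open("bucket_%d.txt" % i, "w", encoding="utf-8") for i in range(BUCKETS)]

    with open("urls_raw.txt", "r", encoding="utf-8", errors="ignore") as src:
        for line in src:
            url = line.strip()
            if not url:
                continue
            i = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % BUCKETS
            outs[i].write(url + "\n")
    for f in outs:
        f.close()

    with open("urls_unique.txt", "w", encoding="utf-8") as dst:
        for i in range(BUCKETS):
            seen = set()
            with open("bucket_%d.txt" % i, "r", encoding="utf-8") as f:
                for line in f:
                    if line not in seen:
                        seen.add(line)
                        dst.write(line)

    With 16 buckets, only about a sixteenth of the unique URLs has to sit in memory at any one time, so the same approach scales to lists well past 10 million.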
     
    Last edited: Dec 6, 2010
  5. tharako

    tharako Junior Member

    Joined:
    Nov 13, 2010
    Messages:
    142
    Likes Received:
    121
    Location:
    Spain
    Yes, that's something Scrapebox doesn't handle very well. For me it just gets stuck on the busy cursor when I load a list of more than about 1 million, and I have to restart Scrapebox. I hope they fix this soon, or maybe it's a problem with our computers, no idea really.
     
  6. dunk15us

    dunk15us Regular Member

    Joined:
    Sep 24, 2010
    Messages:
    227
    Likes Received:
    37
    tharako, it's actually a limitation of the program; it does not support loading more than 1 million URLs.
     
  7. CyrusVirus

    CyrusVirus BANNED Premium Member

    Joined:
    Aug 20, 2009
    Messages:
    1,110
    Likes Received:
    686
    Lifesaver over there, aren't you? Anyway, I kind of needed that. Now if only I could harvest more than 1 million in one session.

    It's not your PC, it's Scrapebox. They may fix it one day.
     
  8. dumbodrop

    dumbodrop Regular Member

    Joined:
    Oct 18, 2010
    Messages:
    336
    Likes Received:
    51
    Location:
    TeXas
    Home Page:
    I wish there was a built-in solution too; we had to use some Linux commands to remove dupes from a list larger than 1 million.
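    For reference, the usual command-line route is sort -u, which does an external merge sort on disk and so copes with files far larger than RAM. Below is a minimal Python wrapper around it, assuming the coreutils sort command is on the PATH; the file names are placeholders.

    Code:
    # shell_dedupe.py - lean on GNU sort's external merge sort for huge lists.
    # Assumes the coreutils "sort" command is on PATH; file names are placeholders.
    import subprocess

    # Equivalent to running: sort -u urls_raw.txt -o urls_unique.txt
    subprocess.run(["sort", "-u", "urls_raw.txt", "-o", "urls_unique.txt"], check=True)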
     
  9. accelerator_dd

    accelerator_dd Jr. VIP Premium Member

    Joined:
    May 14, 2010
    Messages:
    2,441
    Likes Received:
    1,005
    Occupation:
    SEO
    Location:
    IM Wonderland
    My solution: scrape well over a million (3-4 million), go to the sessions folder, find the most recent folder, load the files separately, remove dupes in each one, then merge them all and you end up with a million unique URLs. :)
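    If Scrapebox itself can't open the merged result, the merge and final dedupe can also be done outside it. A rough Python sketch, assuming the session folder holds plain .txt lists of harvested URLs; the folder path and output file name are placeholders.

    Code:
    # merge_session.py - merge harvested session files, keeping only unique URLs.
    # The session folder path is a placeholder; point it at your own.
    import glob
    import os

    SESSION_DIR = r"C:\Scrapebox\Harvester_Sessions"  # placeholder path

    seen = set()
    with open("harvested_unique.txt", "w", encoding="utf-8") as dst:
        for path in sorted(glob.glob(os.path.join(SESSION_DIR, "*.txt"))):
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                for line in f:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        dst.write(url + "\n")

    print("unique urls:", len(seen))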
     
  10. onething1

    onething1 Junior Member

    Joined:
    Sep 10, 2010
    Messages:
    116
    Likes Received:
    2
    That's no solution, since SB wouldn't be able to process that session! And you'd have to divide it up again to import it into SB...
     
  11. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    Hi,

    I have created a tool which removes duplicate URLs from lists of up to 5 million URLs:

    Here is the screenshot:

    http://www.blackhatworld.com/blackhat-seo/attachment.php?attachmentid=6171&stc=1&d=1293886027

    If you are interested in buying it, let me know.

    Thanks,
    Kalrudra
     


  12. kaidoristm

    kaidoristm Power Member

    Joined:
    Feb 13, 2009
    Messages:
    561
    Likes Received:
    726
    Occupation:
    Freelancer
    Location:
    Estonia
    Home Page:
    To avoid crashes I split my lists into 200,000 lines per file. There's a nice program called GSplit which is free and damn good.
     
  13. softtouch2009

    softtouch2009 Senior Member

    Joined:
    Dec 2, 2009
    Messages:
    1,001
    Likes Received:
    225
    Occupation:
    Programming
    Location:
    ssdnet.biz
    Home Page:
    What do you want to remove, duplicate URLs or duplicate domains?
     
  14. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    you have both options
     
  15. softtouch2009

    softtouch2009 Senior Member

    Joined:
    Dec 2, 2009
    Messages:
    1,001
    Likes Received:
    225
    Occupation:
    Programming
    Location:
    ssdnet.biz
    Home Page:
    http://www.scrapebox.com/free-dupe-remove
     
    • Thanks x 3
  16. Anon752

    Anon752 Regular Member

    Joined:
    Jul 3, 2010
    Messages:
    244
    Likes Received:
    179
    Great find. Nice to see the Scrapebox team dealing with this.

    Thanks for the link.
     
  17. kalrudra

    kalrudra BANNED

    Joined:
    Oct 29, 2010
    Messages:
    271
    Likes Received:
    300
    If you want to find duplicates in an unlimited number of URLs, or search for particular URLs (for example .edu or .gov), you can use this tool:

    [tool screenshots attached]
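    For anyone who would rather script it, here is a minimal Python sketch that pulls only unique .edu/.gov URLs out of a big list; the file names are placeholders and it assumes the harvested URLs include the scheme (http://...).

    Code:
    # filter_edu_gov.py - keep only unique .edu/.gov URLs from a big list.
    # File names are placeholders; assumes URLs include a scheme like http://
    from urllib.parse import urlparse

    WANTED = (".edu", ".gov")
    seen = set()

    with open("urls_raw.txt", "r", encoding="utf-8", errors="ignore") as src, \
         open("urls_edu_gov.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url or url in seen:
                continue
            host = urlparse(url).netloc.split(":")[0].lower()
            if host.endswith(WANTED):
                seen.add(url)
                dst.write(url + "\n")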
     
  18. Debian

    Debian Jr. VIP Premium Member

    Joined:
    Feb 17, 2009
    Messages:
    711
    Likes Received:
    282
    Occupation:
    Residential Proxies & VPN's
    Home Page:
    I think Link Whore has a file splitter/merger as well. I've got to dig that program off my backup drive and check, but the last time I looked, I'm pretty sure it was there.
     
  19. softtouch2009

    softtouch2009 Senior Member

    Joined:
    Dec 2, 2009
    Messages:
    1,001
    Likes Received:
    225
    Occupation:
    Programming
    Location:
    ssdnet.biz
    Home Page:
    You're welcome. The SB team is surely always willing to help out :)
     
  20. Promotion

    Promotion Newbie

    Joined:
    Jan 2, 2011
    Messages:
    14
    Likes Received:
    0
    Definitely necessary to delete duplicates.