
How to work with large files in Scrapebox?

Discussion in 'Black Hat SEO' started by observer, Mar 7, 2011.

  1. observer

    observer Power Member

    Joined:
    Apr 7, 2010
    Messages:
    731
    Likes Received:
    22
    Guys, my SB freezes when I try to delete duplicate URLs from a 500K scraped URL list.

    Now I've restarted and am trying to open it at least to split it, but no such luck. It won't open - it freezes.

    I also tried to split the file with GSplit, but it created some .gns files - no idea how to convert them back into txt or csv.

    Can you please advise what the easiest way is to work with large scraped lists? (Good thing I was able to save it, but after that I cannot perform any operation on it.)
     
  2. HoNeYBiRD

    HoNeYBiRD Jr. VIP Jr. VIP

    Joined:
    May 1, 2009
    Messages:
    7,160
    Likes Received:
    8,147
    Gender:
    Male
    Occupation:
    Geographer, Tourism Manager
    Location:
    Ghosted
    There are already quite a few threads about this:
    - you can follow crazyflx's guide on how to deal with large .txt files; he posted it here as well, but I can't find it now, so here it is from his blog:
    Code:
    http://crazyflx.com/scrapebox-tips/remove-duplicate-domains-urls-from-huge-txt-files-with-ease/
    - you can use HJSplit, then rename the results back to .txt (see the sketch after this list for a scripted way to do the same split)
    - there was a thread not too long ago in the download section with a tool made by a BHW member for this task (splitting large txt files into smaller files with ease)
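    If you'd rather script the split than install HJSplit, here's a minimal Python sketch that writes plain .txt parts directly, so there's nothing to rename. The file names and the 50,000 lines per part are just placeholders, not anything from crazyflx's guide:
    Code:
    # Split a big URL list into smaller .txt parts (roughly what HJSplit does).
    def split_file(path, lines_per_part=50000):
        part, out = 1, None
        with open(path, "r", encoding="utf-8", errors="ignore") as src:
            for i, line in enumerate(src):
                if i % lines_per_part == 0:
                    if out:
                        out.close()
                    out = open(path + ".part" + str(part) + ".txt", "w", encoding="utf-8")
                    part += 1
                out.write(line)
        if out:
            out.close()

    split_file("scraped_urls.txt")  # writes scraped_urls.txt.part1.txt, .part2.txt, ...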
     
  3. observer

    observer Power Member

    Joined:
    Apr 7, 2010
    Messages:
    731
    Likes Received:
    22
    Yes, that's exactly the post I found :) and his instructions don't seem to work.

    Well, in my case they don't - it just edits the text instead.

    Do you use the same program?
     
  4. Monrox

    Monrox Power Member

    Joined:
    Apr 9, 2010
    Messages:
    615
    Likes Received:
    580
    I can clean up your list, but it will also remove URLs that differ only by subdomain, meaning it will leave only one subdomain.yahoo.com/etcetc from a list that looks like this:

    sport.yahoo.com/etcetc
    news.yahoo.com/etcetc
    politics.yahoo.com/etcetc

    If that is unacceptable, load your file in WordPad and copy and paste chunks of lines into new text files.
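    A minimal Python sketch of that kind of domain-level clean-up (an assumption about the approach, not any particular tool): it keeps only the first URL seen for each root domain, so sport.yahoo.com and news.yahoo.com collapse into one entry. The last-two-labels trick is naive (it mishandles suffixes like .co.uk) and the file names are placeholders.
    Code:
    from urllib.parse import urlparse

    seen = set()
    with open("scraped_urls.txt", encoding="utf-8", errors="ignore") as src, \
         open("deduped_by_domain.txt", "w", encoding="utf-8") as dst:
        for line in src:
            raw = line.strip()
            if not raw:
                continue
            url = raw if "://" in raw else "http://" + raw  # urlparse needs a scheme to see the host
            host = urlparse(url).netloc.lower()
            root = ".".join(host.split(".")[-2:])  # sport.yahoo.com -> yahoo.com
            if root not in seen:
                seen.add(root)
                dst.write(raw + "\n")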
     
  5. HoNeYBiRD

    HoNeYBiRD Jr. VIP Jr. VIP

    Joined:
    May 1, 2009
    Messages:
    7,160
    Likes Received:
    8,147
    Gender:
    Male
    Occupation:
    Geographer, Tourism Manager
    Location:
    Ghosted
    Yeah, I'm using the same program, following crazyflx's step-by-step guide word for word.

    Make sure you type the colon character first, and only then copy-paste what crazyflx wrote:

    type the ":" character and then
    Code:
    sort u
    type the ":" character and then
    Code:
    let g:gotDomains={}
    type the ":" character and then
    Code:
    %g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif
    It should work!

    But as I already mentioned above, you can use HJSplit to cut the bigger file into smaller pieces and only then use SB to remove dupes.
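    If gVim keeps misbehaving, the same idea can also be scripted - the snippet below is just a rough Python equivalent of those three commands (sort/unique the list, then keep the first URL per domain while ignoring a leading www.), not crazyflx's actual method, and the file names are placeholders:
    Code:
    import re

    got_domains = set()
    with open("scraped_urls.txt", encoding="utf-8", errors="ignore") as src, \
         open("unique_domains.txt", "w", encoding="utf-8") as dst:
        # sorted(set(...)) plays the role of :sort u
        for url in sorted(set(line.strip() for line in src if line.strip())):
            m = re.match(r"https?://(?:www\.)?([^/]+)", url, re.IGNORECASE)
            domain = m.group(1).lower() if m else url
            if domain not in got_domains:  # same check the g:gotDomains dict does
                got_domains.add(domain)
                dst.write(url + "\n")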
     
  6. GoogleAlchemist

    GoogleAlchemist Regular Member

    Joined:
    Nov 25, 2009
    Messages:
    249
    Likes Received:
    28
    Occupation:
    Bad Ass SEO Consultant
    Location:
    Wherever I want
    Home Page:
    sweetfunny just released a tool to bulk merge as well as bulk dup-check 1 million+ URLs, though it doesn't seem to be as accurate unless I am missing something.