How to work with large files in Scrapebox?

Discussion in 'Black Hat SEO' started by observer, Mar 7, 2011.

  1. observer

    observer Power Member

    Joined:
    Apr 7, 2010
    Messages:
    731
    Likes Received:
    22
    Guys, my SB freezes when I try to delete duplicate URLs from a 500K scraped URL list.

    Now I've restarted and I'm trying to open the file at least to split it, but no such luck. It won't open - it just freezes.

    I also tried to split the file with GSplit, but it created some .gns files - no idea how to convert those back into .txt or .csv.

    Can you please advise the easiest way to work with large scraped lists? (Good thing I was able to save it, but after that I can't perform any operation on it.)
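    (For reference, a rough Python sketch of doing that dedupe outside SB, streaming the file line by line so nothing has to load into the GUI - the file names are just placeholders, not anything from the thread:)
    Code:
    # rough sketch, not from the thread: stream the list and drop exact duplicate URLs
    # "urls.txt" and "urls_deduped.txt" are placeholder file names
    seen = set()
    with open("urls.txt", "r", encoding="utf-8", errors="ignore") as src, \
         open("urls_deduped.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                dst.write(url + "\n")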
     
  2. HoNeYBiRD

    HoNeYBiRD Jr. VIP Jr. VIP

    Joined:
    May 1, 2009
    Messages:
    8,104
    Likes Received:
    8,954
    Gender:
    Male
    Occupation:
    Geographer, Tourism Manager
    Location:
    Ghosted
    there are already quite a few threads about this:
    - you can follow crazyflx's guide on how to deal with large .txt files; it was posted here by him as well, but i can't find it now, so here it is from his blog:
    Code:
    http://crazyflx.com/scrapebox-tips/remove-duplicate-domains-urls-from-huge-txt-files-with-ease/
    - you can use hjsplit, then rename the results back to .txt
    - there was a thread not too long ago in the download section with a tool made by a bhw member for this task (splitting large txt files into smaller files with ease) - a rough scripted take on the same idea is sketched below
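    A minimal Python sketch of that splitting step, assuming you just want fixed-size chunks (the 50,000-line chunk size and the "urls_part_N.txt" names are arbitrary placeholders, not what hjsplit or the bhw tool actually use):
    Code:
    # rough sketch: split a huge .txt into 50,000-line chunks
    CHUNK_LINES = 50000
    part, written, out = 0, CHUNK_LINES, None
    with open("urls.txt", "r", encoding="utf-8", errors="ignore") as src:
        for line in src:
            if written >= CHUNK_LINES:          # start a new chunk file
                if out:
                    out.close()
                part += 1
                out = open("urls_part_%d.txt" % part, "w", encoding="utf-8")
                written = 0
            out.write(line)
            written += 1
    if out:
        out.close()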
     
  3. observer

    observer Power Member

    Joined:
    Apr 7, 2010
    Messages:
    731
    Likes Received:
    22
    Yes, that's exactly the post i found :) but his instructions don't seem to work.

    Well, in my case they don't - it just edits the text instead.

    Do you use the same program?
     
  4. Monrox

    Monrox Power Member

    Joined:
    Apr 9, 2010
    Messages:
    615
    Likes Received:
    580
    I can clean up your list, but it will also remove URLs that differ only by subdomain, meaning it will leave just one subdomain.yahoo.com/etcetc from a list that looks like this:

    sport.yahoo.com/etcetc
    news.yahoo.com/etcetc
    politics.yahoo.com/etcetc

    If that is unacceptable, load your file in WordPad and copy and paste chunks of lines into new text files.
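    A rough sketch of that kind of cleanup - keeping only the first URL per root domain so the subdomain variants collapse into one entry. The "last two labels" rule used here is naive (it breaks on co.uk-style domains), and the file names are placeholders:
    Code:
    # rough sketch: keep only the first URL per root domain, so
    # sport.yahoo.com/... and news.yahoo.com/... collapse into one entry
    from urllib.parse import urlparse

    seen = set()
    with open("urls.txt", "r", encoding="utf-8", errors="ignore") as src, \
         open("one_per_domain.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).netloc.lower()
            root = ".".join(host.split(".")[-2:])   # e.g. news.yahoo.com -> yahoo.com
            if root and root not in seen:
                seen.add(root)
                dst.write(url + "\n")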
     
  5. HoNeYBiRD

    HoNeYBiRD Jr. VIP Jr. VIP

    Joined:
    May 1, 2009
    Messages:
    8,104
    Likes Received:
    8,954
    Gender:
    Male
    Occupation:
    Geographer, Tourism Manager
    Location:
    Ghosted
    yea, i'm using the same program, following crazyflx's step-by-step guide word for word

    make sure to type the colon character first and only then copy-paste what crazyflx wrote

    type the ":" character and then
    Code:
    sort u
    type the ":" character and then
    Code:
    let g:gotDomains={}
    type the ":" character and then
    Code:
    %g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif
    it should work!

    but as i already mentioned above, you can use hjsplit to cut the bigger files into smaller pieces and only then use SB to remove dupes
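    For anyone who would rather not do this in a text editor, here is a rough Python equivalent of what those three vim commands do - unique-sort the lines, then keep only the first URL per host with any leading www. stripped (the file names are placeholders):
    Code:
    # rough Python equivalent of the three vim commands in crazyflx's guide
    import re

    with open("urls.txt", "r", encoding="utf-8", errors="ignore") as src:
        lines = sorted(set(line.strip() for line in src if line.strip()))

    got_domains = {}                               # same role as g:gotDomains in vim
    with open("urls_one_per_host.txt", "w", encoding="utf-8") as dst:
        for url in lines:
            m = re.match(r"http://(?:www\.)?([^/]+)", url)
            host = m.group(1) if m else url
            if host not in got_domains:
                got_domains[host] = 1
                dst.write(url + "\n")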
     
  6. GoogleAlchemist

    GoogleAlchemist Regular Member

    Joined:
    Nov 25, 2009
    Messages:
    249
    Likes Received:
    28
    Occupation:
    Bad Ass SEO Consultant
    Location:
    Wherever I want
    sweetfunny just released a tool to bulk merge as well as bulk dupe check 1mil+ urls, though it doesn't seem to be as accurate unless i'm missing something