
Cleaning large text files with TextPipe instead of Scrapebox

Discussion in 'Black Hat SEO Tools' started by zerackam, May 25, 2013.

  1. zerackam

    zerackam Newbie

    Joined:
    Oct 9, 2011
    Messages:
    15
    Likes Received:
    0
    Hi all, so I have several very large text files generated from Scrapebox, some ranging from 2 GB up to 9 GB, and I cannot work with files this big in Scrapebox as it will crash. One option has been gVim, which is great, but once files get to 5 GB and higher it can have issues, so I have gotten TextPipe on the recommendation of some of the Black Hat posts, but now I'm at a loss as to how exactly to use this program. What I'm mainly looking to do is remove duplicate URLs, on some occasions remove duplicate domains, and also strip URLs down to the domain level. Does anyone have any insight as to how I could accomplish this with TextPipe?
     
  2. RedLable

    RedLable Regular Member

    Joined:
    Feb 16, 2011
    Messages:
    244
    Likes Received:
    32
    If you have a file which is 9 GB, why not break it into smaller lots and just use Scrapebox? Wouldn't that be quicker than using regex and TextPipe, especially if you are unsure of regex? If not, Google "regular-expressions-cheat-sheet-v2.pdf"; there should be a free version around to help.
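
    If it helps, here is a minimal C++ sketch of the "break it into smaller lots" idea: stream the big list line by line and start a new output file every so often. The chunk size and output file names below are just example values, not anything Scrapebox itself requires.

    // split_lines.cpp - hedged sketch: split a huge URL list into smaller files
    // that Scrapebox can open. Chunk size and file names are arbitrary examples.
    #include <fstream>
    #include <string>

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;                    // usage: split_lines big_list.txt
        std::ifstream in(argv[1]);
        const std::size_t linesPerFile = 1000000;  // ~1M lines per chunk (assumption)
        std::ofstream out;
        std::string line;
        std::size_t count = 0, part = 0;

        while (std::getline(in, line)) {
            if (count % linesPerFile == 0) {       // time to start a new chunk file
                if (out.is_open()) out.close();
                out.open("part_" + std::to_string(++part) + ".txt");
            }
            out << line << '\n';
            ++count;
        }
        return 0;
    }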
     
  3. HatIsBlacked

    HatIsBlacked Regular Member

    Joined:
    Dec 30, 2010
    Messages:
    224
    Likes Received:
    57
    You are better off programming it yourself with C/C++ and saving yourself $400. What's so hard about opening up a text file, reading it in chunks into a buffer, processing, writing out in chunks, and closing the file?
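
    As a rough illustration of that approach (just the skeleton, not the OP's actual dedupe logic), a chunked read/process/write loop in C++ could look something like this; the 1 MB buffer size and the lowercasing "processing" step are placeholders:

    // Stream the file through a fixed-size buffer instead of loading 9 GB at once.
    #include <cctype>
    #include <fstream>
    #include <vector>

    int main(int argc, char* argv[]) {
        if (argc < 3) return 1;                    // usage: tool input.txt output.txt
        std::ifstream in(argv[1], std::ios::binary);
        std::ofstream out(argv[2], std::ios::binary);
        std::vector<char> buf(1 << 20);            // 1 MB buffer

        while (in) {
            in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            std::streamsize got = in.gcount();     // bytes actually read
            for (std::streamsize i = 0; i < got; ++i)
                buf[i] = static_cast<char>(std::tolower(
                    static_cast<unsigned char>(buf[i])));  // placeholder processing
            out.write(buf.data(), got);            // write the processed chunk
        }
        return 0;                                  // streams close on destruction
    }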
     
  4. DarkPixel

    DarkPixel Jr. VIP Premium Member

    Joined:
    Oct 4, 2011
    Messages:
    1,328
    Likes Received:
    1,239
    Location:
    ↓↓↓↓
    Home Page:
    Yeah, but the OP may not have any coding skills, and everyone knows that starting out with coding, even for small apps that read text files, is very hard.
     
  5. satyr85

    satyr85 Power Member

    Joined:
    Aug 7, 2011
    Messages:
    579
    Likes Received:
    444
    Location:
    Poland
    Gscraper can do it, but you need a lot of RAM.
     
  6. zerackam

    zerackam Newbie

    Joined:
    Oct 9, 2011
    Messages:
    15
    Likes Received:
    0
    Well, I tried breaking the file up into smaller ones so I could use Scrapebox, but that was extremely time consuming, especially since I have many, many GB of text files to go through. I already own TextPipe, so the cost is not an issue, and I know it can handle multiple commands, so it's an ideal piece of software to work with; I just don't know how to get it to work with URLs. As for coding my own thing, I honestly don't have any experience writing code. I know how useful it would be, but at the moment and in the near future I'm not going to learn, so I'm still looking for anyone who knows how to achieve these goals using TextPipe.
     
  7. Russian-Czar

    Russian-Czar Regular Member

    Joined:
    Feb 10, 2012
    Messages:
    218
    Likes Received:
    64
  8. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,371
    Likes Received:
    1,799
    Gender:
    Male
    Home Page:
    Are you using the dupe remove addon in Scrapebox? It can work with up to 180 million lines, and files the size you're quoting should have fewer than 150 million lines. Worst case you could chop a file in half, run the dupe remove addon on the roughly 3 GB halves, and then once those 2 or 3 are done, merge them together and dedupe the result.

    You can try the "hi speed" tool I have here, it will do the job too.

    http://scrapeboxmarketplace.com/free-tools/scrapebox-helper-tools

    Although I would consider Scrapebox's dupe remove addon more stable.
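
    For what it's worth, the "dedupe the halves, then merge and dedupe again" workflow can also be sketched in C++ by hashing each line into a set of 64-bit hashes, which needs far less RAM than storing the URLs themselves, at the cost of a tiny collision risk. The command-line arguments here are made up for illustration:

    // Merge several already-split files, dropping duplicate lines as we go.
    #include <fstream>
    #include <functional>
    #include <string>
    #include <unordered_set>

    int main(int argc, char* argv[]) {
        if (argc < 3) return 1;                 // usage: merge out.txt in1.txt [in2.txt ...]
        std::ofstream out(argv[1]);
        std::unordered_set<std::size_t> seen;   // hashes of lines already written
        std::hash<std::string> hasher;

        for (int i = 2; i < argc; ++i) {
            std::ifstream in(argv[i]);
            std::string line;
            while (std::getline(in, line))
                if (seen.insert(hasher(line)).second)  // first time this line appears
                    out << line << '\n';
        }
        return 0;
    }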
     
  9. datamystic

    datamystic Newbie

    Joined:
    Feb 7, 2011
    Messages:
    1
    Likes Received:
    2
    For removing duplicate domains, use Filters\Remove\Duplicate lines.

    Here is how the filter list should look to strip URLs down to the domain level:

    |
    |--Extract URLs
    |
    |--Comment...
    | | strip domains to root level
    | |
    | +--Perl pattern [https?://[^/]*?] with [$0\r\n]
    | [ ] Match case
    | [ ] Whole words only
    | [ ] Case sensitive replace
    | [ ] Prompt on replace
    | [ ] Skip prompt if identical
    | [ ] First only
    | [X] Extract matches
    | Maximum text buffer size 4096
    | [ ] Maximum match (greedy)
    | [ ] Allow comments
    | [X] '.' matches newline
    | [ ] UTF-8 Support
    |
    |--Ascending ANSI sort (case insensitive), remove duplicates, length 4096
    |
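
    For anyone who wants to double-check the result outside TextPipe, here is a rough C++ equivalent of the filter chain above, using a greedy [^/]+ variant of the same pattern. Note that it keeps the extracted hosts in memory, which is fine once everything is reduced to scheme + host, but not for the raw multi-GB input:

    // Extract scheme://host from each line, then sort ascending and remove duplicates.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;                        // usage: roots urls.txt
        std::ifstream in(argv[1]);
        std::regex rootUrl(R"(https?://[^/]+)", std::regex::icase);
        std::vector<std::string> roots;
        std::string line;
        std::smatch m;

        while (std::getline(in, line))
            if (std::regex_search(line, m, rootUrl))   // keep scheme + host only
                roots.push_back(m.str(0));

        std::sort(roots.begin(), roots.end());         // ascending sort
        roots.erase(std::unique(roots.begin(), roots.end()),
                    roots.end());                      // remove duplicates
        for (const auto& r : roots) std::cout << r << '\n';
        return 0;
    }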
     
    • Thanks x 1