
Scrapebox Dupremove Question

Discussion in 'Black Hat SEO' started by cottonwolf, May 23, 2015.

  1. cottonwolf

    cottonwolf Regular Member

    Joined:
    Jan 20, 2015
    Messages:
    469
    Likes Received:
    239
    I've scraped about 100GB of URLs over the last week or two, and I've been too lazy to clean them up and work with them in GSA, so now it's a real pain to do so.

    What I've done is take the files straight from the Scrapebox harvester session folder (I didn't bother waiting for the harvester to load gigabytes of URLs after a scrape) and feed these large files into the DupRemove addon.

    However, as far as I can tell, the DupRemove addon often messes up the files' encoding somehow. It's probably just me being inexperienced.

    edit: I used the addon to split an 8GB file into parts of 5 million URLs each, and I got 16 parts back.
    I don't think the duplicate URL and duplicate domain remover addon would work on these messed-up part files (the sketch below is how I'd script the split instead). /edit
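    For reference, the split itself is simple enough to script. Here's a rough Python sketch of what I mean: the file name and part size are only examples, and I'm assuming the harvest files are saved as UTF-16 ("Unicode"), which may not be right:

    def split_file(path, lines_per_part=5_000_000):
        """Split a big harvest file into parts of lines_per_part lines each."""
        part, out, count = 1, None, 0
        # encoding="utf-16" is an assumption about how Scrapebox saved the file
        with open(path, encoding="utf-16") as src:
            for line in src:
                if out is None:
                    out = open(f"{path}.part{part}.txt", "w",
                               encoding="cp1252", errors="ignore")
                out.write(line)
                count += 1
                if count >= lines_per_part:
                    out.close()
                    out, count, part = None, 0, part + 1
        if out:
            out.close()

    split_file("harvest_session.txt")  # placeholder file name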

    A file containing these lines:

    http://website.com
    http://domain.com

    often becomes
    h t t p : / / w e b s i t e . c o m
    h t t p : / / d o m a i n . c o m

    What can I do to fix these kinds of files? GSA SER obviously can't process a URL like that.

    I've got no idea what file encoding I should use. I think Scrapebox saves harvests as Unicode, and those files take up a huge amount of space. When I'm done cleaning, I export my files either as UTF-8 or lately as ANSI and load them into SER. These encodings don't mean much to me.
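    If it turns out the files really are UTF-16, then the spaced-out URLs are just what UTF-16 bytes look like when something reads them as ANSI (every other byte is a null), and re-saving them in ANSI should fix it. A minimal sketch, assuming UTF-16 input and placeholder file names:

    def to_ansi(src_path, dst_path):
        # Decode as UTF-16 (assumed), write back as ANSI (cp1252), dropping
        # any characters that don't fit, since URLs should be ASCII anyway.
        with open(src_path, encoding="utf-16") as src, \
             open(dst_path, "w", encoding="cp1252", errors="ignore") as dst:
            for line in src:
                dst.write(line)

    to_ansi("part01.txt", "part01_ansi.txt")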

    Thanks!

    Edit: I don't even SEO
     
    Last edited: May 23, 2015
  2. zimaseo

    zimaseo Registered Member

    Joined:
    May 12, 2015
    Messages:
    98
    Likes Received:
    3
    I have no idea :( but hopefully some good person will reply here after a while, and then we'll both get an idea.
     
  3. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,807
    Likes Received:
    2,028
    Gender:
    Male
    Home Page:
    GSA SER only uses ANSI. Older versions of the dupe remove tool that's built into Scrapebox V2 did sometimes mess up the encoding when splitting files.

    But that's fixed in the latest version, and the DupRemove addon didn't do this. So if you go to options >> default export file format and change it to ANSI, then everything is already GSA-compatible. I've never seen Scrapebox alter the URLs like that, but if you were importing non-ANSI files into GSA (it can only read ANSI), then GSA could have read the files wrong and given results like that.

    If the dupe remove addon was trying to read a UTF-8 file without a BOM, then it could produce wonky results, because it will see it as an ANSI file and it won't work out right.

    So just do all your exports in ANSI, work with everything in ANSI, and then when you import into GSA you won't see funky URLs.
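    If you want to batch-convert what you already scraped, a sketch like this should do it: sniff the BOM to guess the encoding (anything without a BOM gets treated as UTF-8, which is the risky case above), then rewrite everything as ANSI before it goes anywhere near GSA. The file names are placeholders:

    # Standard BOM signatures; Python's "utf-16"/"utf-8-sig" codecs consume them.
    BOMS = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16"),
        (b"\xfe\xff", "utf-16"),
    ]

    def sniff_encoding(path):
        with open(path, "rb") as f:
            head = f.read(3)
        for bom, name in BOMS:
            if head.startswith(bom):
                return name
        return "utf-8"  # no BOM: assume UTF-8 and hope for the best

    def normalize_to_ansi(path, out_path):
        with open(path, encoding=sniff_encoding(path), errors="replace") as src, \
             open(out_path, "w", encoding="cp1252", errors="ignore") as dst:
            for line in src:
                dst.write(line)

    normalize_to_ansi("harvest.txt", "harvest_ansi.txt")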
     
    • Thanks x 1