
Friends Never Let Friends Scrape 120GB list!

Discussion in 'BlackHat Lounge' started by Doffy86, May 25, 2015.

  1. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    Here's a short story of mine. So I made a post a few weeks ago about buying GSA. I was so happy to start my own SEO projects. Anyway, I've been practicing with GSA, blasting an old YouTube video. Not to rank, just to build a list. So I decided to scrape a 140,000 keyword + footprint list over the weekend and come back to it on Monday. (Also, before anyone asks, I used Scrapebox because GSA is too slow.)

    I just came back from a 3 day weekend and noticed that I have this huge list! It's so massive that it crashed my computer when I tried to open it. So I checked the file size and O.M.G, the list is 120GB ;) Anyway, I will be splitting and sorting the list in GSA this whole week! I just thought it was a funny story to share with my Blackhat mates.
     
    Last edited: May 25, 2015
  2. wizard04

    wizard04 Elite Member

    Joined:
    Apr 1, 2014
    Messages:
    2,700
    Likes Received:
    2,538
    Location:
    Outside your house
    The title is misleading... anyway, good for you OP.
     
  3. qrazy

    qrazy Senior Member

    Joined:
    Mar 19, 2012
    Messages:
    1,115
    Likes Received:
    1,725
    Location:
    Banana Republic
    Funny story????
     
  4. cottonwolf

    cottonwolf Regular Member

    Joined:
    Jan 20, 2015
    Messages:
    469
    Likes Received:
    239
    I had the same shit happen to me and I ended up ditching about 100-150GB of raw scrapes as I just couldn't process them.
    Scrapebox corrupted some files. File separation took forever. GSA SER freezes up if you import a text file of more than 5 million URLs, or it doesn't freeze and ends up eating 2GB+ of RAM and stays like that.
     
  5. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    Well if it's not a funny story I could always do a song and dance for you!
     
    • Thanks x 1
  6. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    Thanks for the heads up, I will try to limit mine down to a million URLs per split.
     
  7. Repulsor

    Repulsor Power Member

    Joined:
    Jun 11, 2013
    Messages:
    768
    Likes Received:
    276
    Location:
    PHP Scripting ;)
    What are you going to use to split the huge file?
     
  8. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    Scrapebox is doing a great job so far, but I guess it's because I broke my raw list into a million URLs per split.
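    (For anyone doing the same kind of split without Scrapebox, a rough Python sketch of the idea follows; the filenames and the 1 million line chunk size are just placeholders, not anything the tools require.)

    # Stream a huge URL list into 1,000,000-line chunks without loading
    # the whole file into memory. Filenames and chunk size are placeholders.
    CHUNK_SIZE = 1_000_000

    def split_file(src_path, out_prefix, chunk_size=CHUNK_SIZE):
        part, count, out = 0, 0, None
        with open(src_path, "r", encoding="utf-8", errors="ignore") as src:
            for line in src:
                if out is None or count >= chunk_size:
                    if out:
                        out.close()
                    part += 1
                    count = 0
                    out = open("%s_%04d.txt" % (out_prefix, part), "w",
                               encoding="ascii", errors="ignore")
                out.write(line)
                count += 1
        if out:
            out.close()

    split_file("raw_scrape.txt", "urls_part")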
     
  9. cottonwolf

    cottonwolf Regular Member

    Joined:
    Jan 20, 2015
    Messages:
    469
    Likes Received:
    239
    Loopline gave me a good tip yesterday: set the Scrapebox default export file type to ANSI from Options/Default File Type Format/Ansi.
    That should save you some space and have the files ready for GSA SER. I found that I had the Unicode default enabled for the past few months, and separating large files with the Scrapebox dedupe remover sometimes corrupted URLs from http://xyz.com to h t t p : / / x y z . c o m .
    A URL with spaces in it is useless and needs fixing with file encoding. I don't understand this encoding shit entirely.
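    If anyone already has one of those spaced-out exports, a rough Python sketch for re-saving it as ANSI/ASCII might look like this; the BOM check and the filenames are assumptions on my part, not something Scrapebox dictates.

    # Re-save a Scrapebox export as plain ASCII. The spaced-out
    # "h t t p" URLs are usually UTF-16 text that got read as single-byte
    # characters, so decoding as UTF-16 and writing back out recovers them.
    # Filenames are placeholders.
    def to_ansi(src_path, dst_path):
        with open(src_path, "rb") as f:
            raw = f.read()
        if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
            text = raw.decode("utf-16")               # BOM present: real UTF-16
        else:
            text = raw.decode("utf-8", errors="ignore")
        with open(dst_path, "w", encoding="ascii", errors="ignore") as out:
            out.write(text)

    to_ansi("scrape_unicode.txt", "scrape_ansi.txt")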
     
  10. sysco32

    sysco32 Jr. VIP

    Joined:
    Feb 5, 2014
    Messages:
    607
    Likes Received:
    226
    Location:
    Skopje/Pecs
    The thing you have to do is get the Automator for SB and slice up your keywords and footprints into chunks of around 120K keywords. That is around 27-40 million URLs. If you dedupe that by domain, it is only 1.3-1.7 million domains. That you can easily plug into GSA.
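    Deduping by domain outside of SB is simple enough too; here is a rough Python sketch that keeps the first URL seen per domain (filenames are placeholders).

    # Keep only the first URL per domain from a big scraped list,
    # streaming line by line so memory only grows with unique domains.
    from urllib.parse import urlparse

    def dedupe_by_domain(src_path, dst_path):
        seen = set()
        with open(src_path, encoding="utf-8", errors="ignore") as src, \
             open(dst_path, "w", encoding="ascii", errors="ignore") as dst:
            for line in src:
                url = line.strip()
                if not url:
                    continue
                domain = urlparse(url).netloc.lower()
                if domain and domain not in seen:
                    seen.add(domain)
                    dst.write(url + "\n")

    dedupe_by_domain("urls_raw.txt", "urls_one_per_domain.txt")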
     
  11. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    Thanks, I didn't know that, I just set it now! Hopefully your tip will help with the next scrape I do tonight.
     
  12. Doffy86

    Doffy86 Newbie

    Joined:
    Jan 12, 2014
    Messages:
    11
    Likes Received:
    8
    I have always thought about picking up the Automator, but you've given me a good reason to now!
     
  13. loopline

    loopline Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,770
    Likes Received:
    2,010
    Gender:
    Male
    Home Page:
    Yeah, encoding can be tricky, but I think your issue was coming from the encoding and not the actual addon itself messing up the files.

    But go for the dupe remove addon in V2; although it will take a while, it can split and process large files fine. I just got done with a 129GB scrape, and while that amount of data does take a good while to process, it can do it.

    I break things down into small chunks and dump them in a folder, and I have a script that can just take any files in a folder and randomly write them to a project to post to. It's easy enough, I did it in Python, so if you can code you can do something like that.
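    Rough outline of that idea (not my exact script, and every path here is a placeholder; point PROJECT_FILE at whatever file your GSA SER project actually imports from):

    # Pick a random exported file from a folder, append its URLs to the
    # file a GSA SER project reads from, then move the source file aside.
    # All paths are placeholders.
    import os
    import random
    import shutil

    EXPORT_DIR = "scrapebox_exports"       # where Scrapebox drops results
    DONE_DIR = "scrapebox_exports_done"    # processed files get moved here
    PROJECT_FILE = "gsa_project_urls.txt"  # whatever file your project imports

    def feed_one_file():
        files = [f for f in os.listdir(EXPORT_DIR) if f.endswith(".txt")]
        if not files:
            return False
        name = random.choice(files)
        src = os.path.join(EXPORT_DIR, name)
        with open(src, encoding="utf-8", errors="ignore") as f, \
             open(PROJECT_FILE, "a", encoding="ascii", errors="ignore") as out:
            out.write(f.read())
        os.makedirs(DONE_DIR, exist_ok=True)
        shutil.move(src, os.path.join(DONE_DIR, name))
        return True

    feed_one_file()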

    Make a Scrapebox Automator process that gets a file with 1000 keywords in it, scrapes, exports results to another folder, and another script writes those exports to a GSA project. Then the Scrapebox Automator calls a program that grabs another 1000 keywords and loads them into the file the Automator will scrape from, and then it loops. Complete automation, always working with small chunks, so no issues with GSA freezing or having to split etc... It's entirely hands off.
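    The keyword feeder side is just as bare-bones; a sketch like this is all it needs (paths and the 1000 keyword chunk size are placeholders):

    # Take the next 1000 keywords off a master list and overwrite the
    # file the Scrapebox Automator job scrapes from. Paths and chunk
    # size are placeholders.
    MASTER = "keywords_master.txt"    # full keyword + footprint list
    CURRENT = "keywords_current.txt"  # the file the Automator reads
    CHUNK = 1000

    def load_next_chunk():
        with open(MASTER, encoding="utf-8", errors="ignore") as f:
            keywords = [k for k in (line.strip() for line in f) if k]
        if not keywords:
            return False                 # master list exhausted
        chunk, rest = keywords[:CHUNK], keywords[CHUNK:]
        with open(CURRENT, "w", encoding="utf-8") as f:
            f.write("\n".join(chunk) + "\n")
        with open(MASTER, "w", encoding="utf-8") as f:
            f.write("\n".join(rest) + "\n" if rest else "")
        return True

    load_next_chunk()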

    Anyway, that's the kind of stuff I do, it makes life easier, although sometimes I do manual stuff, which is why I just finished a big scrape. Manual testing is easier, and then when I fine-tune a method I can fully automate it. Anyway, if you can code or want to learn to code, you can make your life easier.