
Best way to generate unique keywords

Discussion in 'Black Hat SEO Tools' started by rodyaj, Sep 5, 2012.

  1. rodyaj

    rodyaj Newbie

    Joined:
    May 16, 2012
    Messages:
    23
    Likes Received:
    5
    What is the secret to getting a low number of dupe results from Scrapebox harvests? When I scrape for URLs I'm getting 60%+ dupes (often more), which means wasted time and burned proxies. I've tried using bigger keyword lists and dropping the pages-to-scrape setting from 1000 to 100, but I'm still getting a high percentage of dupe URLs. I've also made an awk script that prunes my keyword list so that every word and phrase in the file is unique, with no word repeated across lines. Sadly, this drastically cuts down my list and doesn't really seem to reduce the dupe URLs much.
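
    For anyone who wants to try the same thing, this is roughly the sort of awk filter I mean (not my exact script; keywords.txt and unique.txt are just placeholder names):

     # a sketch of the kind of filter I mean: drop any line that contains a word
     # we've already kept, so no word is repeated anywhere in the output
     awk '{
         for (i = 1; i <= NF; i++) if ($i in seen) next
         for (i = 1; i <= NF; i++) seen[$i] = 1
         print
     }' keywords.txt > unique.txt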

    Furthermore, I've made a script that keeps only long-tail keywords (3+ words). Why? I wondered if less generic phrases would stop the same old popular and authoritative sites popping up in the harvests and creating so many dupes. Alas, even this doesn't reduce the dupe count. I don't know... perhaps I've just been unlucky and need to try again.
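
    The long-tail filter is basically a one-liner, something like this (again just a sketch with placeholder filenames):

     # keep only phrases that are three or more words long
     awk 'NF >= 3' keywords.txt > longtail.txt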

    Or is this just an inherent problem with scraping search engines that can't really be remedied to any noticeable extent? If that's the case, I'd rather just concentrate on building massive keyword lists. One method I've developed for getting massive lists is creating a seed file and scraping Google Suggest from it. I then use an awk script to find the seed keywords in the resulting file and replace them with blanks, so all that's left is the unique suggest words that were generated. This gives me pretty unique lists every time, but I wonder whether such fragmented, unnatural phrases will produce more spammy results in the harvests (I'm looking for quality, high-PR domains). Should I just stick to natural-language phrases (whole phrases people are actually searching for)?
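
    In case it helps, the seed-stripping step looks roughly like this (a sketch, not my exact script; it assumes seeds.txt holds one plain-word seed per line, suggests.txt is the scraped output, and note it's crude enough to hit partial matches too):

     # load the seeds, blank them out of every suggest line, then print
     # whatever non-empty fragments are left over
     awk 'NR == FNR { seeds[$0] = 1; next }
          {
              for (s in seeds) gsub(s, " ")
              $1 = $1                     # squeeze the leftover whitespace
              if ($0 != "") print
          }' seeds.txt suggests.txt > leftovers.txt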

    Thanks for reading this long and confusing post.
     
    Last edited: Sep 5, 2012
  2. nipester

    nipester Regular Member

    Joined:
    Feb 1, 2009
    Messages:
    256
    Likes Received:
    28
    I don't know how many URLs you're harvesting, but when I use dictionaries as my "keywords", staggered with platform footprints, I easily hit between 95% and 99% dupes after scoring a couple hundred thousand unique domains. I think this happens because most big sites are simply very well indexed: while you're hunting for the ones on the outer fringes, you keep running into those massive hubs because they *are* so densely indexed.

    You can fight it and improve your yields somewhat, but you can never vanquish it. It's the nature of the beast.
     
    • Thanks x 1
  3. rodyaj

    rodyaj Newbie

    Joined:
    May 16, 2012
    Messages:
    23
    Likes Received:
    5
    That makes sense, nipester. It does seem as though the same big authority sites keep coming up simply because of the sheer volume of keywords they rank for. I guess the only way to avoid it is to generate keyword lists with more niche, long-tail words, but that's actually very hard to do. I'll have a go at developing a way, and if I can't beat it I'll just give in and get a faster computer and network connection (or a decent VPS) so I can go for pure bulk searching.
     
    Last edited: Sep 5, 2012
  4. nipester

    nipester Regular Member

    Joined:
    Feb 1, 2009
    Messages:
    256
    Likes Received:
    28
    I'm wondering if minuses might work. Say you made a list of the most commonly indexed words and phrases: you could then append -word1 -word2 -"phrase one" to each query. That way a higher proportion of hits would come from domains that aren't as densely indexed as Google's usual servings.
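
    Something like this is what I'm picturing: build the exclusion string once from a blacklist file and tack it onto every keyword before it goes to the harvester (common.txt and keywords.txt are just placeholder names):

     # read the blacklist first and build one string of -"term" exclusions,
     # then append that string to every keyword in the list
     awk 'NR == FNR { minus = minus " -\"" $0 "\""; next }
          { print $0 minus }' common.txt keywords.txt > keywords_minus.txt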
     
  5. rodyaj

    rodyaj Newbie

    Joined:
    May 16, 2012
    Messages:
    23
    Likes Received:
    5
    Good idea. At some point I'll make a blacklist of sorts and post it here. For now, though, I'm going to concentrate on building up a more long-tail keyword list for merging. In theory, that should get me to those outer fringes you speak of. I'm going to try to scrape the footprint of those WordPress plugins that show visitors' "incoming search terms". There are already a few paid tools that do this, such as IKS (Incoming Keywords Scraper). This should let me build up a big list of natural long-tail phrases and give a bit more diversity than the standard dictionary scrape.
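
    Once I've got the pages downloaded, I'm thinking of pulling the terms out with something like this (a very rough sketch; it assumes the plugin prints its terms as <li> items under an "Incoming search terms" heading, which will vary from site to site):

     # grab the <li> lines that follow an "Incoming search terms" heading,
     # strip the HTML tags and keep whatever text is left
     awk 'tolower($0) ~ /incoming search terms/ { grab = 1; next }
          grab && /<li/ {
              line = $0
              gsub(/<[^>]*>/, "", line)          # strip the tags
              gsub(/^[ \t]+|[ \t]+$/, "", line)  # trim whitespace
              if (line != "") print line
          }
          grab && /<\/ul>/ { grab = 0 }' page1.html page2.html > incoming_terms.txt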
     
    Last edited: Sep 5, 2012
  6. ShabbySquire

    ShabbySquire Power Member

    Joined:
    Nov 30, 2011
    Messages:
    574
    Likes Received:
    122
    Location:
    UK
    I'm also interested in eliminating sb dupes.
     
  7. tony-raymondo

    tony-raymondo Junior Member

    Joined:
    Jun 19, 2009
    Messages:
    181
    Likes Received:
    459
    You're scraping from the websites themselves...? Does this work better than just using one of the million autocomplete scrapers out there -- including mine?
     
  8. rodyaj

    rodyaj Newbie

    Joined:
    May 16, 2012
    Messages:
    23
    Likes Received:
    5
    I'm just doing the same as most other people: scraping SERP results. Right now I'm using the suggest scraper crazyflx posted on these forums (search "get Google suggest scraper huge lists"). I also have a bash script that scrapes Google, Amazon, Bing, Blekko and a few other sources, but I don't use it just yet because I haven't adapted it to be multi-threaded. The only problem with these suggest scrapers is that they generate a lot of similar keywords, e.g.:

    dog
    dog training
    dog training centres
    dog training resources
    dog on a lead
    ...
    dog biscuits premium

    I wondered if these similar keywords were creating all the similar results I was getting, so I tried bulk replacing the seed keywords so that my lists look more like:

    training
    training centres
    training resources
    on a lead
    ...
    biscuits premium

    ... but it doesn't really reduce the similarity of the keywords much unless you cut them right down to single words. I've tried scraping from single keywords, but I think certain authority sites rank for most dictionary words, so you always get similar duplicated results. And it creates some strange-looking keywords (as you can see with 'on a lead'). I'm not sure whether these fragmented phrases would affect the quality of the sites that get scraped (e.g., bring up spam sites that just stuff dictionary words).
     
    Last edited: Sep 5, 2012