
additive words for hrefer

Discussion in 'Black Hat SEO Tools' started by GeXus, Jan 11, 2011.

  1. GeXus

    GeXus Newbie

    Joined:
    Dec 7, 2009
    Messages:
    12
    Likes Received:
    0
    I'm just starting out with hrefer and have searched around for a bunch of additive words to use. I now have a list of about 700. Is that way overboard? I've seen posts here where one guy only uses '/forum/'...

    I'll be using Xrumer 7, so I also want to try and pick up some sites that use reCAPTCHA... any suggestions? I've never used any of this!

    Thank you :)
     
  2. bezopravin

    bezopravin BANNED

    Joined:
    May 11, 2010
    Messages:
    461
    Likes Received:
    3,471
    A bunch of additive words just results in another bunch of crappy lists. It's better to keep it small and targeted.

    For example, scrape using the following additive words and run a test blast with that list. I'm sure you'll get 50-70% profiles in Xrumer 7 with the stock Xas_Ai.


    Code:
    "powered by smf" inurl:"topic*"
    "powered by phpbb" inurl:"topic*"
    "powered by punbb" inurl:"topic*"
    To test reCAPTCHAs, scrape vBulletin forums:

    Code:
    "powered by vbulletin" inurl:"topic*"
    Note: make sure to select only the "Google Classic" & "Google Mobile" SEs while scraping with these additive words.
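
    A minimal sketch of what these additive words do once hrefer combines them with your keyword list: every keyword gets paired with every additive word to form the actual search queries. The file names below are placeholders for illustration, not hrefer's own files.

    Code:
    # Cross-product of keywords and additive words, roughly what gets sent
    # to the search engines. keywords.txt / additive.txt are placeholder names.
    def load_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    keywords = load_lines("keywords.txt")    # e.g. weight loss
    additives = load_lines("additive.txt")   # e.g. "powered by smf" inurl:"topic*"

    queries = [f"{kw} {add}" for kw in keywords for add in additives]
    print(f"{len(keywords)} keywords x {len(additives)} additives = {len(queries)} queries")
    This cross-product is also why a huge additive list blows up the query count so quickly.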

    Good Day! :)
     
    • Thanks x 3
  3. GeXus

    GeXus Newbie

    Joined:
    Dec 7, 2009
    Messages:
    12
    Likes Received:
    0
    Nice, Thanks! Will try these.

    I also noticed that with multi-word keywords like 'some keyword', it splits the phrase, so it searches for 'some' + additive and 'keyword' + additive, but never 'some keyword' + additive.

    Any suggestions?
     
  4. bezopravin

    bezopravin BANNED

    Joined:
    May 11, 2010
    Messages:
    461
    Likes Received:
    3,471
    Go to the Parsing Options menu and uncheck "Disable Filtering Harvested links by Sieve-Filter" under "Query Options".

    Let me know how it goes..
     
  5. theweakerman

    theweakerman Newbie

    Joined:
    Sep 15, 2011
    Messages:
    1
    Likes Received:
    0
    Thank you kindly for this. So I tried setting the additive words to:

    Code:
    myKeyWord anotherKeyWord andAnotherKeyword "powered by smf" inurl:"topic*"
    and followed the rest of the great instructions in this post, and still got no results. I just get zeros in the parsing report. I suspect it has to do with the sieve-filter options or list.

    Am I right or is there more to it?
     
  6. rekagear20

    rekagear20 Newbie

    Joined:
    Jul 15, 2012
    Messages:
    14
    Likes Received:
    0
    I'm having the same problem. The proxy list says it's updated, but the button is stuck on "starting..." I can't grab any pages or links, even with a very small word list and additive-word list.
     
  7. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    1. You need to check your additive words (footprints) in Google first. If a footprint is accurate, i.e. yields mostly the type of URLs you are targeting, it's a good footprint. If it gives you a lot of non-viable BS pages, discard it.

    2. 700 additives is too many. The max I've ever used is 100 or so really good footprints. If you have good footprints targeting one platform, you don't need all that many. I suppose if you were targeting ALL forums (i.e. all forum platforms) on the net you might have 200 or so.

    3. For larger sets of additives you may need to use the sieve filter, BUT be very careful that you ONLY filter out URLs which are definitely NOT from the platforms you require. I actually prefer to filter later on using Xrumer's links database analysis, not only on the basis of the URL but on the basis of which forms/source code show up in each harvested URL. You WILL need to filter your very large raw hrefer scraped lists this way. (There's a rough sketch of this kind of filtering after this list.)

    4. Remember, 700 additives is huge: let's say you are using a 50,000-keyword list. That's 35,000,000 queries. Also, I get the maximum by doing one run of additive + keyword, then another of keyword + additive, so that makes 70,000,000. I scrape Bing too, which makes 140,000,000 queries. Be very careful that you are not overdoing the additives (or even the keywords), as IP banning is becoming stricter by the day and your proxies won't last anywhere near as long as they did 6 months ago.

    5. On the subject of proxies: for large scrapes you need MULTIPLE sources of proxies. HMA paid daily lists, proxygo, the proxy grabbing/filtering program in BST called Proxy Multiplier, my own scraped and filtered lists, and other daily updated PUBLIC proxy lists are some of the sources I use. Yes, combined.

    6. I know this will sound odd, but in SB do not bother filtering for Google-passed proxies. They change so often, and you only get 10% of the proxies you would get by just filtering for anonymous. hrefer can rotate proxies if one is failing to deliver, so what I do is filter for anonymous; I get a max of around 3000 working anonymous proxies and scrape on 500-1000 threads. (I used to do 1000 all the time, but shit's got stricter recently, so I cut it down.) You can check your bandwidth to see how well your proxies are doing, as well as the speed at which the number of harvested URLs goes up in hrefer. You will observe that if you use a small list of 300 G-passed proxies, even if you cut the thread count to just 50 or so, you will get them all banned and the scraping will grind to a halt. If you use a shit ton of anonymous proxies, however, some of them recover from their G softbans from time to time, and hrefer can rotate them 6-7 times, so eventually your query gets done. And on 500 threads, pretty quickly still, despite the retries.

    7. Proxies are a vital, integral part of hrefer. Turn auto-checking and auto-replacing of proxies off. Yes, hrefer can gather proxies automatically, but it can't get ANYWHERE near the numbers or quality you can get by proactively searching for proxies: scraping and testing yourself, using Proxy Multiplier (I think that's what it is called) to scrape and test as well, and paying for daily public proxy lists from HMA and proxygo, amongst others. Then, when you've got your filtered lists, gather them all together in one file and test them all for anonymity only. Also, keep your old lists, e.g. from HMA, proxygo, Proxy Multiplier etc., put them together in one big ass file, remove duplicate lines and test them from time to time. Old lists from good quality sources are the best quick, bulk sources of good quality proxies; you just have to test them right before you put them into hrefer. Then, when you're in the middle of a big harvest, every 1-2 days add new proxy lists from your daily updating sources to your master file, test all that again, and replace xproxy.txt with the freshly checked proxies. Close and open hrefer again. Yes, you will be manually testing proxies and it is time-consuming, but it's totally worth it. I did this for many months straight; tedious but necessary. (There's a rough merge/dedupe/test sketch after this list.)


    8. Finally, ffs, don't use special operators. Regular queries get banned quick enough on Google. These days, using inurl: for automatic queries will get you banned faster than you can say "randomize datacenters". Find creative ways to get accurate, high yielding footprints, without having to resort to the crude and destructive special operators. If you use special operators, you will put all your work gathering proxies to waste. Special operators are overrated and you can get much better results without them if you spend some time researching and using your brain.
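
    As a companion to points 1 and 3, here is a minimal sketch of that kind of post-scrape filtering: keep only harvested URLs that look like the target platforms, first by URL pattern and optionally by a "powered by" signature in the page source. The patterns, signatures and file names are illustrative assumptions; jb2008 does this step inside Xrumer's links database analysis rather than with a script.

    Code:
    # Keep only harvested URLs that look like the target forum platforms.
    # Patterns, signatures and file names are illustrative assumptions.
    import re
    import urllib.request

    URL_PATTERNS = [
        re.compile(r"index\.php\?topic=\d+", re.I),   # SMF-style topic URLs
        re.compile(r"viewtopic\.php\?", re.I),        # phpBB/PunBB-style topic URLs
    ]
    SIGNATURES = ["powered by smf", "powered by phpbb", "powered by punbb"]

    def url_looks_right(url):
        return any(p.search(url) for p in URL_PATTERNS)

    def source_looks_right(url, timeout=10):
        try:
            html = urllib.request.urlopen(url, timeout=timeout).read(50000)
        except Exception:
            return False
        return any(sig in html.decode("utf-8", "ignore").lower() for sig in SIGNATURES)

    with open("raw_harvest.txt", encoding="utf-8") as f:
        raw = [line.strip() for line in f if line.strip()]

    kept = [u for u in raw if url_looks_right(u)]          # cheap URL-only pass
    # kept = [u for u in kept if source_looks_right(u)]    # slower source-code pass
    print(f"kept {len(kept)} of {len(raw)} URLs")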
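
    And for the proxy housekeeping in points 6-7, a rough sketch of the merge / dedupe / re-test cycle before the list goes into xproxy.txt. The judge URL, folder layout and timeout are assumptions, and this only checks that a proxy answers at all, not full anonymity.

    Code:
    # Merge several proxy source lists, drop duplicates, re-test, write xproxy.txt.
    # JUDGE_URL and the proxylists/ folder are placeholders for illustration.
    import glob
    import urllib.request

    JUDGE_URL = "http://example.com/"   # swap in a real proxy judge

    def merge_lists(pattern="proxylists/*.txt"):
        seen = set()
        for path in glob.glob(pattern):
            with open(path, encoding="utf-8") as f:
                seen.update(line.strip() for line in f if line.strip())
        return sorted(seen)

    def proxy_answers(proxy, timeout=8):
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": f"http://{proxy}"}))
        try:
            opener.open(JUDGE_URL, timeout=timeout)
            return True
        except Exception:
            return False

    proxies = merge_lists()
    alive = [p for p in proxies if proxy_answers(p)]
    with open("xproxy.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(alive) + "\n")
    print(f"{len(alive)} of {len(proxies)} proxies responded")
    In practice you would thread the checking out; testing a few thousand proxies one at a time would take hours.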
     
    • Thanks x 3
  8. Nakama

    Nakama Newbie

    Joined:
    May 30, 2011
    Messages:
    28
    Likes Received:
    8
    Occupation:
    Full time IMer
    Location:
    Planet Earth
    Hrefer newbies should thank you for this great post. I especially liked point 6 above, thank you for reminding me of that :)