
hrefer duplicate results

Discussion in 'Black Hat SEO Tools' started by stealthstorm, Apr 9, 2011.

  1. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    So, duplicate filtering is on (by hostname), yet 4/5 of the harvested URLs are duplicates. What gives?
     
  2. lifco

    lifco Regular Member

    Joined:
    Apr 5, 2010
    Messages:
    206
    Likes Received:
    283
    Strange, because I always filter by hostname and PR2+ and have never seen a dup.

    Are we talking about the odd one, or a lot?
     
  3. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    For example, I collected 570,000 URLs, and after filtering them with ScrapeBox I have about 110,000 left.

    I'm talking about harvested forums btw.
     
    Last edited: Apr 9, 2011
  4. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    Did you close hrefer and then open it again? When you reopen it, duplicate filtering only covers what is harvested from that launch onward, not the whole harvest.
     
  5. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    The option to filter on duplicates when loading the links database is turned off by default.

    Right now it filters while harvesting, and it does filter duplicate hosts. But apparently it doesn't filter well enough.
     
  6. jb2008

    jb2008 Senior Member

    Joined:
    Jul 15, 2010
    Messages:
    1,158
    Likes Received:
    972
    Occupation:
    Scraping, Harvesting in the Corn Fields
    Location:
    On my VPS servers
    Yes, jesus christ, even if you DO have hostname filtering checked, that's what hrefer does.

    If you close hrefer, even with duplicates filtering ON, it will only filter duplicate hostnames from that particular session. Try it vs not closing hrefer at all. If you don't close hrefer at all, there will be no duplicates. If you close / reopen many times, more duplicates. But hey, don't take my word for it... :rolleyes:
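
For anyone who wants to see the effect described above, here is a rough Python sketch. It is not hrefer's actual code, just a model of "filter by hostname within one session", and the names `hostname`, `harvest_session` and the example URLs are all made up for illustration. It also assumes hostnames are compared as plain lowercased strings, so `www.example.com` and `example.com` would count as different hosts.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if none can be parsed."""
    # urlsplit only recognises the host when a scheme or a leading '//' is present
    if "//" not in url:
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def harvest_session(urls, seen=None):
    """Keep the first URL per hostname.

    `seen` holds hostnames already kept; passing None models a fresh
    start (the assumed behaviour after closing and reopening the tool).
    """
    seen = set() if seen is None else seen
    kept = []
    for url in urls:
        host = hostname(url)
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept


session_1 = [
    "http://forum-a.example/t/1",
    "http://forum-a.example/t/2",
    "http://forum-b.example/",
]
session_2 = [
    "http://forum-a.example/t/9",
    "http://forum-c.example/",
]

# Restarted between sessions: each run forgets earlier hostnames,
# so forum-a.example is kept twice.
restarted = harvest_session(session_1) + harvest_session(session_2)

# Never closed: one shared set of hostnames across both runs.
shared = set()
continuous = harvest_session(session_1, shared) + harvest_session(session_2, shared)

# Cleaning up afterwards: one more pass over the combined list.
cleaned = harvest_session(restarted)
```

In this toy run `restarted` still contains two `forum-a.example` URLs, while `continuous` and `cleaned` each keep one URL per host. That is the same idea as running the whole harvest through a separate dedup pass afterwards, which is what the ScrapeBox filtering mentioned earlier in the thread amounts to.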