Scraping forums with scrapebox, how to filter out the good ones

Discussion in 'Black Hat SEO Tools' started by stealthstorm, Feb 15, 2011.

  1. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    So after scraping for forums I have a list of over 100,000 URLs, but not all of them are actual forums.

    Plenty of these URLs are worth nothing to me. What is a good way of filtering the non-forum links out?

    Doing it manually is going to cost a lot of time, or is that the way to go? I can't really imagine a method that does this accurately.

    Hope you can help.
     
  2. chris456

    chris456 Regular Member

    Joined:
    May 17, 2010
    Messages:
    281
    Likes Received:
    567
    I would like to know this too.
    You can upload them to http://profilemachine.com and try to convert (sort) them, or use SICK Platform Reader to sort them by platform. But if you feed it 100K URLs it will crash for sure (it crashes even when sorting only 10K URLs), so I would also like to know how other users filter out the good ones.
     
  3. kappa84

    kappa84 Power Member

    Joined:
    May 19, 2010
    Messages:
    736
    Likes Received:
    334
    Location:
    Bath, UK
    Use more targeted footprints to avoid this problem, instead of the usual "powered by smf" etc.
     
  4. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    I have, things like "index.php?a=forum" etc., but not targeted enough I guess...
     
  5. chris456

    chris456 Regular Member

    Joined:
    May 17, 2010
    Messages:
    281
    Likes Received:
    567
    More targeted footprints require better proxies, because almost all of ScrapeBox's proxies are dead for operator footprint searches. When I look for new ones, they are also bad for Google operator footprint searches, and buying private proxies just to search for more targeted forums is a very expensive task for me.
    But you are right, more targeted (accurate) extraction would resolve it.
     
  6. oinky222

    oinky222 Regular Member

    Joined:
    Oct 2, 2010
    Messages:
    389
    Likes Received:
    175
    This program works extremely well:

    Code:
    http://sickmarketing.com/forum/showthread.php?t=1351
     
  7. ericsson

    ericsson Elite Member Premium Member

    Joined:
    Apr 25, 2009
    Messages:
    2,642
    Likes Received:
    8,132
    Occupation:
    www
    Location:
    Swe
    Use the register URL as the vBulletin footprint in ScrapeBox.

    Code:
    "Powered By vBulletin" + "In Order to proceed, you must agree with the following rules"
    And if you have a list of 100k, split it into 10 parts and run each part through SICK Platform Reader.

    Bam!
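
    For the splitting step, here is a minimal Python sketch, assuming the list is a plain text file with one URL per line. The function name `split_url_list` and the `_partNN.txt` output naming are made up for illustration; they are not part of ScrapeBox or SICK Platform Reader.

```python
from pathlib import Path


def split_url_list(src, parts=10, out_dir="."):
    """Split a one-URL-per-line text file into up to `parts` smaller files.

    Returns the list of files written. Blank lines are skipped.
    """
    src = Path(src)
    urls = [
        line.strip()
        for line in src.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    # Ceiling division so every URL lands in some part.
    chunk = -(-len(urls) // parts)
    written = []
    for i in range(parts):
        batch = urls[i * chunk:(i + 1) * chunk]
        if not batch:
            break  # fewer URLs than parts, nothing left to write
        # Hypothetical naming scheme: <original name>_part01.txt, _part02.txt, ...
        out = Path(out_dir) / f"{src.stem}_part{i + 1:02d}.txt"
        out.write_text("\n".join(batch) + "\n", encoding="utf-8")
        written.append(out)
    return written
```

    Calling `split_url_list("forums.txt", parts=10)` on a 100k list would give ten files of about 10k URLs each; raise `parts` if the reader still chokes at that size.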
     
  8. chris456

    chris456 Regular Member

    Joined:
    May 17, 2010
    Messages:
    281
    Likes Received:
    567
    Very nice tool, but it keeps crashing. I hate splitting the files before running it; 10K is the maximum it can handle.
     
  9. kappa84

    kappa84 Power Member

    Joined:
    May 19, 2010
    Messages:
    736
    Likes Received:
    334
    Location:
    Bath, UK
    I use a method I read about here:

    - get around 15-20k proxies from different places (not the internal source)
    - test them, remove the failed ones, and filter out anything over 2,000 ms
    - rinse and repeat until you have a nice list of a few hundred great public proxies for harvesting
    - while it works through a batch of keywords and queries, I am already finding more proxies with another SB license
    - etc.
     
  10. Anubis1980

    Anubis1980 Regular Member

    Joined:
    Mar 20, 2010
    Messages:
    276
    Likes Received:
    81
    Occupation:
    webmaster and father
    Not only that... it fails on good ones too.
     
  11. walker

    walker Junior Member

    Joined:
    Feb 19, 2009
    Messages:
    146
    Likes Received:
    49
  12. chris456

    chris456 Regular Member

    Joined:
    May 17, 2010
    Messages:
    281
    Likes Received:
    567
    Kappa, thank you very much, but I do the same. I bet you will never be able to get many results with an operator footprint like inurl:..... You can get many results for "powered by ....", but never for inurl:, which is much more accurate.
     
  13. Flurbuff

    Flurbuff Regular Member

    Joined:
    Jun 17, 2010
    Messages:
    227
    Likes Received:
    94
    I use "powered by" plus inurl: with something like profile, or whatever that platform uses for its member profile pages. Probably over 90% of the URLs it grabs are actual forum member profile pages.

    This way it's likely they will be public and not private. Also try using marketing keywords to find forums that aren't heavily moderated. If you see a profile with an old viagra link in it, then it's probably smooth sailing.
     
  14. chris456

    chris456 Regular Member

    Joined:
    May 17, 2010
    Messages:
    281
    Likes Received:
    567
    Thanks for the suggestion, but I am trying to target a narrower market (nightclub and strip club blogs, forums, etc.). I run an agency for lap dancers, which is where my income comes from, and I use the BHW forum to work more greyhat and earn some money from my websites, where I describe those clubs. (Thanks, BHW: I learn better SEO from the tips, hacks, and little cheats I see here, and I can use better tools than I had until now.) But I can't post generic comments on those blogs and forums, because my websites are meant to be serious websites and I can't afford to spam.

    I only want to do better than my whitehat-minded competitors by using tools like ScrapeBox to find high-PR, relevant blogs and forums, and then comment on them with real (unfortunately manual) comments that link to my website. That's why I try to find exactly what I need with operator footprints: not every open public blog on this planet, but high-PR, niche-relevant blogs and forums for manual commenting. It sounds terrible, but I don't know any other way to gain valuable backlinks without spamming.

    If I put inurl:.......... in my footprint, ScrapeBox immediately becomes an enemy to Google, which refuses every proxy I have. So I can only use "quoted" footprints to get any results. inurl: footprints are more targeted and accurate, but I don't want to buy private proxies just for that.
     
    Last edited: Feb 16, 2011
  15. stealthstorm

    stealthstorm Newbie

    Joined:
    Nov 2, 2010
    Messages:
    21
    Likes Received:
    1
    You know all those footprints? It seems to work just as well when you put all the footprints in the keyword area, so you get a lot more connections instead of just one.

    But I guess this is old news?

    The only thing about ScrapeBox is that it doesn't seem to keep digging and digging; I have to start harvesting over and over again.
     
    Last edited: Feb 17, 2011
  16. nanotechno

    nanotechno Junior Member

    Joined:
    Feb 4, 2011
    Messages:
    161
    Likes Received:
    24
    I noticed this too. Public proxies do not deliver accurate results for Google footprint searches.