[Urgent] Does Scrapebox have this capability?


Sep 14, 2010

I'm scraping lists, but I would like to compare my new list with the old one.

For instance, say I scraped 1 million URLs and then a bit later I scraped 2 million.

I want to compare the new list with the old one and keep ONLY the new targets.

I was told to do this:
"Load your newly scraped list into SB, then use "select the url list to compare" if you want to compare lists by URL, or "select the url list to compare on domain level" if you want to compare by domain."

I'm not sure where this option is....

You can load both lists into Scrapebox (or GVIM, since your lists are huge) and remove the duplicate URLs.
This is a multi-step process, but I think it is what you want.

1. Load each file into Notepad++ separately, sort the list and remove duplicates (the TextFX Tools plugin is required), then save each file separately.
2. Load the first file in Code Compare from Devart, then select the second file to compare it with.

Code Compare will show the differences and similarities between the two files in blocks.
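
If you would rather script the URL-level comparison than click through an editor, here is a minimal sketch in Python. It is not how Scrapebox or Code Compare work internally, just one way to get "only the new targets". The function name `only_new_urls` is made up for illustration, and it assumes one URL per line and an old list that fits in memory as a set.

```python
def only_new_urls(old_urls, new_urls):
    """Return URLs from new_urls that do not appear in old_urls.

    Both arguments are iterables of lines (lists or open text files).
    Blank lines are skipped, duplicates inside the new list are dropped,
    and the original order of the new list is preserved.
    """
    # Everything already known: start with the old list.
    seen = {line.strip() for line in old_urls if line.strip()}
    fresh = []
    for line in new_urls:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)      # also removes repeats within the new list
            fresh.append(url)
    return fresh
```

You could call it with two open files, for example `only_new_urls(open("old.txt"), open("new.txt"))` (both file names are placeholders), and write the result out one URL per line. The match is an exact string comparison, so `http://site.com/` and `http://site.com` count as different URLs.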
I created a tool for this purpose some time ago. Here you go.

Enter your first list and the list you want to compare it with, then click "Start Filter". It's a fast tool and you will get your results in a few seconds. For example, on my personal computer (Core 2 Duo T6600, Windows 7) I tested a list of over 3 million URLs (173 MB) compared on domain level with a list of almost 450k URLs (12 MB).

The output text document was generated in about 9 seconds. So it is very fast!

Just make sure the domains from the "old list" are trimmed to root. You can do this with Scrapebox. ;)
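
For anyone curious what a domain-level comparison looks like in code, here is a rough Python sketch. It is not the tool posted above, only an illustration of the idea. The names `host_of` and `new_by_domain` are hypothetical, "root" is taken to mean the hostname with a leading `www.` removed (an assumption), and lines without a scheme are treated as bare hosts.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL without a leading 'www.'.

    Returns an empty string when no hostname can be extracted.
    """
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # assumption: treat bare hosts as http
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:         # malformed entries in scraped data
        return ""
    return host[4:] if host.startswith("www.") else host


def new_by_domain(old_urls, new_urls):
    """Keep URLs from new_urls whose host is absent from old_urls."""
    old_hosts = {host_of(u) for u in old_urls}
    old_hosts.discard("")
    kept = []
    for line in new_urls:
        host = host_of(line)
        if host and host not in old_hosts:
            kept.append(line.strip())
    return kept
```

Because the sketch reduces every old entry to its host before comparing, it works whether or not the old list has been trimmed to root. It compares full hostnames, so `blog.example.com` and `example.com` are treated as different domains, and every URL from a new domain is kept rather than one per domain.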
