Source code scraper

clark123

Regular Member
Joined
Feb 27, 2012
Messages
399
Reaction score
329
Hi there!

I was wondering, is there any software like "Scrapebox" with which I can scrape websites that contain certain code in their page source?

I have Scrapebox and I have been trying to do the above task, but it seems to have its limitations.

Thanks for your input in advance.
 
Xrumer can do that, but you could also easily write a bash/PHP script using curl to do the same.
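To sketch what such a script would do, here is a minimal stdlib-only Python version (the URL list and snippet below are placeholders, not anything from the thread): fetch each page's raw source and keep the URLs whose source contains the snippet.

```python
import urllib.request

def source_has_snippet(html, snippet):
    """Pure check: does the raw page source contain the code snippet?"""
    return snippet in html

def fetch_source(url, timeout=10):
    """Download a page's raw HTML; returns '' if the page is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def filter_urls(urls, snippet):
    """Keep only the URLs whose source contains the snippet."""
    return [u for u in urls if source_has_snippet(fetch_source(u), snippet)]
```

Note that you still have to supply the URL list yourself, exactly the same limitation as Scrapebox's page scanner.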
 
Scrapebox is only limited by 2 things.

1.) The search engines.

2.) The person using it.


Any scraper, including Hrefer (which is Xrumer's scraping product, and which is inferior to Scrapebox IMHO - and yes, I own Xrumer and have worked with it a lot in the past), is subject to what the search engines will give it. You can't search for HTML in any major search engine, only content; in fact that's why products like ... meh, I'm drawing a blank, someone else can chime in. But there are multiple search engines that index the HTML of the page and allow you to search it; they are all paid and private.

At any rate, the Page Scanner in Scrapebox can scan the HTML of a page and tell you if it contains certain markers, but you have to have a list of URLs to feed it.

The Custom Data Grabber can scrape HTML from a page, but you have to have the URLs you want to scrape data from, and you need to build the module telling it what to scrape.

The harvester can scrape search engines for URLs based on the content they contain.

No free service that I have ever heard of allows you to input HTML in a search box and get back results. The point being, if that's what you're after, then yes, Scrapebox is limited there, because no search engine can do it. Scrapebox doesn't make content, it just scrapes it. For that matter, no script you build will be able to search HTML with a search engine either; Hrefer, GScraper, and every other tool out there is limited in this way as well. All of them, including Scrapebox, scrape data, they don't make up data. You can't scrape what doesn't exist, and no search engine will provide such data.

So if you can give more clarity on what you're doing, I can tell you if Scrapebox can do it.
 
Haven't used Scrapebox for a number of years, but the footprints feature should be able to do this? I could be very wrong...
 
Thanks for the reply. I guess I can't do this task with Scrapebox. I just want to scrape URLs whose pages contain a certain script code. As you mentioned, I need to feed it the URLs first.

Don't get me wrong though, I love Scrapebox! I've been using it for two years now.




Xrumer can do that, but you could also easily write a bash/PHP script using curl to do the same.
Thanks for pointing me in the right direction.
 

FYI, Xrumer cannot do that, and neither can a custom script, unless you build a script that crawls the web and then build a search engine to search it. It's probably easier to find a better footprint you can use, or to pay a service that already does this. Reinventing Google doesn't make much sense.
 
Get 'web data miner' from "Theskysoft"; you can extract text, images, contacts, etc. from any website.
 

That's not what he is asking; Scrapebox can do all that. He is asking for a program where he can enter some HTML and it will query a search engine and return pages that contain that HTML. No program can do this, because no search engine will return this data; they do not return results for HTML.

There are, however, paid services that do this. For the life of me I can't remember their name though. Something about being "the search engine for geeks". I'll post it if I think of it.
 
You can use Scrapy for Python; it's very nice.
There are also many examples of crawling websites in different languages: PHP, Python, Perl, Node.js, Ruby, etc.
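If you go the crawling route (Scrapy handles this far more robustly), the core loop is just: fetch a page, collect its links, check its source, repeat. A bare-bones stdlib sketch of the link-collection step, just to show the idea:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Gather href values from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return every link found in the HTML, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real crawler would add a visited-set, politeness delays, and robots.txt handling, which is exactly what Scrapy gives you for free.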
 
You can use Scrapebox to harvest as many keywords as you want, then use them to harvest URLs. Then you can use the Scrapebox Page Scanner for your purpose.
 
There are however paid services that do this. For the life of me I can't remember the name of them though. Something about being "the search engine for geeks".
You're looking for this: nerdydata.com
But the index is really small; search for "min.js" and you get "607,915 pages within the top million websites were found".
You probably don't want to limit your searches to the top million websites.

I know of another one, but cannot find it in my bookmarks.
It has the same problem: a super small index.
I think they even mention on their website that they use the top Alexa sites or so.
 
I was looking to do something similar to this recently - I was just trying to scrape all of the "related searches" at the bottom of the Google SERP pages. But more specifically I wanted to isolate the words and phrases that Google bolded in those related searches. These are words/phrases that are not part of the user's original search query. I was hoping I could just grab the raw HTML for each of these related searches and pull them into Excel, then just do a MID formula to isolate anything within the B tags.
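Assuming the related searches really do wrap the new words in plain <b>...</b> tags (the live markup may differ), a short Python regex does the same job as the MID formula:

```python
import re

def bolded_phrases(html):
    """Return the text inside each <b>...</b> pair, trimmed of whitespace."""
    return [m.strip() for m in re.findall(r"<b>(.*?)</b>", html, flags=re.DOTALL)]
```

For example, feeding it `'cheap <b>seo tools</b> for <b>beginners</b>'` gives back just the two bolded phrases, already reassembled.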

I know I could just grab the related searches as a whole and compare which words were and weren't part of the original search query, word by word, but it was a pain in the butt trying to come up with a way to reassemble the new words back into their respective phrases, in instances where Google added a phrase(s) to the query instead of just a single word or two.

I had forgotten about the custom data grabber in Scrapebox, I definitely need to go check out that tutorial again - do you think it would be able to do something like this? I can easily supply it with a page of Google serp page URLs.

FWIW to the original poster, perhaps if there is some sort of footprint in the rendered pages which contain a specific script you are looking for, you can harvest those pages first by that footprint, then grab the HTML with the custom data grabber in Scrapebox? For example, if the pages you are looking for that have a certain script also contain a message like "powered by..." in the actual page.

UPDATE: I tried the Custom Data Grabber for the situation described above and it worked like a charm. I just need to lower the number of simultaneous connections, perhaps let it do one at a time, since a couple of the URLs in my list came back with 0 results. I used the simple method rather than regex. Do you know of any easy-to-use regular expression generators that can create a basic regular expression (or at least a starting point) based on some highlighted text within a web page? In Chrome it's easy to get the XPath of something by highlighting it, right clicking, choosing "Inspect", and then copying the XPath. For those of us who aren't up to speed on writing custom regular expressions, I'm wondering if there is something similar?
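I don't know of a point-and-click generator like the XPath copier, but for the simple before/after case you can generate the pattern yourself: escape the two literal markers and capture whatever sits between them. A small Python helper (function names are mine, just for illustration):

```python
import re

def pattern_between(before, after):
    """Build a regex capturing whatever sits between two literal markers."""
    return re.escape(before) + r"(.*?)" + re.escape(after)

def grab_between(html, before, after):
    """Return every capture of the generated pattern found in the HTML."""
    return re.findall(pattern_between(before, after), html, flags=re.DOTALL)
```

This is essentially what the before_after mask does; `re.escape` is the part a hand-written pattern usually gets wrong, since markers like `<div class="x">` are full of regex metacharacters.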
 
It's not exactly what you were asking for, but another route could be to use the SEOTools For Excel plugin. You would still need to harvest a list of URLs to check, but once you have those URLs, the plugin adds a new function called "IsFoundOnPage", which is pretty simple to use - a bit simpler than the Custom Data Grabber in Scrapebox. You just import your list of URLs into a column in Excel, then write a basic formula in the following format:

=IsFoundOnPage("http://nielsbosma.se","google-analytics.com/ga.js")

There are just two parts to the formula: the URL you want to analyze (which you could swap out for a cell reference in your spreadsheet), then a comma, and then the unique snippet of code you are looking for, enclosed in quotes. The only caveat I've noticed so far is that if your snippet already contains quote marks, it will break the function. So you will need to find some unique snippet from the script you are trying to identify that doesn't have any quote marks in it.
 
1. The Custom Grabber has two mask types; from what I've observed, regex is not as fast as before_after. What code do you want to find? I'll try to compile it.
Scrapebox's harvesting speed is very fast now.
Very likely, it can locate many of the sites that have the code you want to find.

2. Otherwise you need a distributed crawl system to find your code. There are about 200,000,000 domains, and if the code is not on the front page, that is a heavy job. A task like: find out which sites use a particular font.
 
Yes, that's it. I know there are one or two others, but you are also correct, the small index is the issue. Trying to find some other marker in the content, or focusing on a niche and then scanning the results for the code with the Page Scanner, or some other tactic, is where I landed; it just depends on the situation.
 
I'd code something up in PHP with curl + XPath. It's a skill worth having :-)
 
Would I need a scraper or crawler to be able to get hidden data attributes from sites? Say a site releases a product in 2 days but none of the product variants are visible; how am I able to get this data?
 
You can try PublicWWW - it has indexed the source code of 158 million websites.
 
I copy full sites using only Scrapebox. It is magic. I crawl URLs, I download images, I scrape HTML. It's the only tool you will ever need.
 