Source code scraper

clark123

Regular Member
Joined
Feb 27, 2012
Messages
399
Reaction score
329
Hi there!

I was wondering, is there any software like "Scrapebox" with which I can scrape websites that contain certain code in their page source?

I have Scrapebox and I have been trying to do the above task, but it seems to have its limitations.

Thanks for your input in advance.
 
Xrumer can do that, but you could also easily write a bash/PHP script using curl to do the same.
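To sketch what such a script would do, here is a minimal stdlib-only Python version (the URL list and snippet below are placeholders, not anything from the thread): fetch each page's raw source and keep the URLs whose source contains the snippet.

```python
import urllib.request

def source_has_snippet(html, snippet):
    """Pure check: does the raw page source contain the code snippet?"""
    return snippet in html

def fetch_source(url, timeout=10):
    """Download a page's raw HTML; returns '' if the page is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def filter_urls(urls, snippet):
    """Keep only the URLs whose source contains the snippet."""
    return [u for u in urls if source_has_snippet(fetch_source(u), snippet)]
```

Note that you still have to supply the URL list yourself, exactly the same limitation as Scrapebox's page scanner.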
 
Scrapebox is only limited by 2 things.

1.) The search engines.

2.) The person using it.


Any scraper, including Hrefer (which is Xrumer's scraping product, and which is inferior to Scrapebox IMHO - and yes, I own Xrumer and have worked with it a lot in the past), is subject to what the search engines will give it. You can't search for HTML in any major search engine, only content; in fact that's why products like ... meh, I'm drawing a blank, someone else can chime in. But there are multiple search engines that index the HTML of the page and allow you to search it; they are all paid and private.

At any rate, the Page Scanner in Scrapebox can scan the HTML of a page and tell you if it contains certain markers, but you have to have a list of URLs to feed it.

The Custom Data Grabber can scrape HTML from a page, but you have to have the URLs you want to scrape data from, and you need to build the module telling it what to scrape.

The harvester can scrape search engines for URLs based on the content they contain.

No free service that I have ever heard of allows you to input HTML in a search box and get back results. The point being, if that's what you're after, then yes, Scrapebox is limited there, because no search engine can do it. Scrapebox doesn't make content, it just scrapes it. For that matter, no script you build will be able to search HTML with a search engine either; Hrefer, GScraper, and every other tool out there is limited in this way as well. All of them, including Scrapebox, scrape data, they don't make up data. You can't scrape what doesn't exist, and no search engine will provide such data.

So if you can give more clarity on what you're doing, I can tell you if Scrapebox can do it.
 
Haven't used Scrapebox for a number of years, but the footprints feature should be able to do this? I could be very wrong...
 
Thanks for the reply. I guess I can't do this task with Scrapebox. I just want to scrape URLs whose pages contain a certain script code. As you mentioned, I need to feed it the URLs first.

Don't get me wrong though, I love Scrapebox! I've been using it for two years now.




Xrumer can do that, but you could also easily write a bash/PHP script using curl to do the same.
Thanks for pointing me in the right direction.
 

FYI, Xrumer cannot do that, and neither can a custom script, unless you build a script that crawls the web and then build a search engine to search it. It's probably easier to find a better footprint you can use, or to pay a service that already does this. Reinventing Google doesn't make much sense.
 
Get 'web data miner' from "Theskysoft"; you can extract text, images, contacts, etc. from any website.
 

That's not what he is asking; Scrapebox can do all that. He is asking for a program where he can enter some HTML and it will query a search engine and return pages that contain that HTML. No program can do this, because no search engine will return this data; they do not return results for HTML.

There are, however, paid services that do this. For the life of me I can't remember their name though. Something about being "the search engine for geeks". I'll post it if I think of it.
 
You can use Scrapy for Python; it's very nice.
There are also many examples of crawling websites in different languages: PHP, Python, Perl, Node.js, Ruby, etc.
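If you go the crawling route (Scrapy handles this far more robustly), the core loop is just: fetch a page, collect its links, check its source, repeat. A bare-bones stdlib sketch of the link-collection step, just to show the idea:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Gather href values from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return every link found in the HTML, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real crawler would add a visited-set, politeness delays, and robots.txt handling, which is exactly what Scrapy gives you for free.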
 
You can use Scrapebox to harvest as many keywords as you want, then use them to harvest URLs. Then you can use the Scrapebox Page Scanner for your purpose.
 
There are however paid services that do this. For the life of me I can't remember the name of them though. Something about being "the search engine for geeks".
You're looking for this: nerdydata.com
But the index is really small; search for "min.js" and you get "607,915 pages within the top million websites were found".
You probably don't want to limit your searches to the top million websites.

I know of another one, but cannot find it in my bookmarks.
It has the same problem: a super small index.
I think they even mention on their website that they use the top Alexa sites or so.
 
I was looking to do something similar to this recently - I was just trying to scrape all of the "related searches" at the bottom of the Google SERP pages. But more specifically I wanted to isolate the words and phrases that Google bolded in those related searches. These are words/phrases that are not part of the user's original search query. I was hoping I could just grab the raw HTML for each of these related searches and pull them into Excel, then just do a MID formula to isolate anything within the B tags.
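Assuming the related searches really do wrap the new words in plain <b>...</b> tags (the live markup may differ), a short Python regex does the same job as the MID formula:

```python
import re

def bolded_phrases(html):
    """Return the text inside each <b>...</b> pair, trimmed of whitespace."""
    return [m.strip() for m in re.findall(r"<b>(.*?)</b>", html, flags=re.DOTALL)]
```

For example, feeding it `'cheap <b>seo tools</b> for <b>beginners</b>'` gives back just the two bolded phrases, already reassembled.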

I know I could just grab the related searches as a whole and compare which words were and weren't part of the original search query, word by word, but it was a pain in the butt trying to come up with a way to reassemble the new words back into their respective phrases, in instances where Google added a phrase(s) to the query instead of just a single word or two.

I had forgotten about the custom data grabber in Scrapebox, I definitely need to go check out that tutorial again - do you think it would be able to do something like this? I can easily supply it with a page of Google serp page URLs.

FWIW to the original poster, perhaps if there is some sort of footprint in the rendered pages which contain a specific script you are looking for, you can harvest those pages first by that footprint, then grab the HTML with the custom data grabber in Scrapebox? For example, if the pages you are looking for that have a certain script also contain a message like "powered by..." in the actual page.

UPDATE: I tried the Custom Data Grabber for the situation described above and it worked like a charm. I just need to lower the number of simultaneous connections, perhaps let it do one at a time, since a couple of the URLs in my list came back with 0 results. I used the simple method rather than regex. Do you know of any easy-to-use regular expression generators that can create a basic regular expression (or at least a starting point) based on some highlighted text within a web page? In Chrome it's easy to get the XPath of something by highlighting it, right clicking, choosing "Inspect", and then copying the XPath. For those of us who aren't up to speed on writing custom regular expressions, I'm wondering if there is something similar?
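I don't know of a point-and-click generator like the XPath copier, but for the simple before/after case you can generate the pattern yourself: escape the two literal markers and capture whatever sits between them. A small Python helper (function names are mine, just for illustration):

```python
import re

def pattern_between(before, after):
    """Build a regex capturing whatever sits between two literal markers."""
    return re.escape(before) + r"(.*?)" + re.escape(after)

def grab_between(html, before, after):
    """Return every capture of the generated pattern found in the HTML."""
    return re.findall(pattern_between(before, after), html, flags=re.DOTALL)
```

This is essentially what the before_after mask does; `re.escape` is the part a hand-written pattern usually gets wrong, since markers like `<div class="x">` are full of regex metacharacters.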
 
It's not exactly what you were asking for, but another route could be to use the SEOTools For Excel plugin. You would still need to harvest a list of URLs to check, but once you have those URLs, the plugin adds a new function called "IsFoundOnPage", which is pretty simple to use - a bit simpler than the Custom Data Grabber in Scrapebox. You just import your list of URLs into a column in Excel, then write a basic formula in the following format:

=IsFoundOnPage("http://nielsbosma.se","google-analytics.com/ga.js")

There are just two parts to the formula: the URL you want to analyze (which you could swap out for a cell reference in your spreadsheet), then a comma, and then the unique snippet of code you are looking for, enclosed in quotes. The only caveat I've noticed so far is that if your snippet already contains quote marks, it will break the function. So you will need to find some unique snippet from the script you are trying to identify that doesn't have any quote marks in it.
 
1. The Custom Grabber has two mask types; from what I've observed, regex is not as fast as before_after. What code do you want to find? I'll try to compile it.
Scrapebox's harvesting speed is very fast now.
Very likely, it can locate many of the sites that have the code you want to find.

2. Otherwise you need a distributed crawl system to find your code. There are about 200,000,000 domains, and if the code is not on the front page, that is a heavy job. A task like: find out which sites use a particular font.
 
Yes, that's it. I know there are one or two others, but you are also correct, the small index is the issue. Trying to find some other marker in the content, or focusing on a niche and then scanning the results for the code with the Page Scanner, or some other tactic, is where I landed; it just depends on the situation.
 
I'd code something up in PHP with curl + XPath. It's a skill worth having :-)
 
Would I need a scraper or crawler to be able to get hidden data attributes from sites? Say a site releases a product in 2 days but none of the product variants are visible; how am I able to get this data?
 
You can try PublicWWW - it has indexed the source code of 158 million websites.
 
I copy full sites using only Scrapebox. It is magic. I crawl URLs, I download images, I scrape HTML. It's the only tool you will ever need.
 