
Source code scraper

Discussion in 'Black Hat SEO Tools' started by clark123, Dec 18, 2015.

  1. clark123

    clark123 Junior Member

    Joined:
    Feb 27, 2012
    Messages:
    135
    Likes Received:
    103
    Hi there!

    I was wondering, is there any software like "Scrapebox" with which I can scrape websites that contain certain code in their page source?

    I have Scrapebox and have been trying to do the above task, but it seems to have its limitations.

    Thanks in advance for your input.
     
  2. itz_styx

    itz_styx Jr. VIP Jr. VIP

    Joined:
    May 8, 2012
    Messages:
    379
    Likes Received:
    140
    Occupation:
    CEO / Admin / Developer
    Location:
    /dev/mem
    Home Page:
    Xrumer can do that, but you could also easily write a bash/PHP script using curl to do the same.
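
    The curl-script idea above can be sketched with nothing but the Python standard library (Python and PHP both come up later in the thread). The URL list and the marker string in the usage comment are placeholders, not anything tied to Scrapebox or Xrumer:

```python
import urllib.request

def contains_marker(html: str, marker: str) -> bool:
    """Check whether a page's raw source contains a given code snippet."""
    return marker in html

def scan_urls(urls, marker, timeout=10):
    """Fetch each URL and keep the ones whose source contains the marker."""
    matches = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or broken pages
        if contains_marker(html, marker):
            matches.append(url)
    return matches

# Hypothetical usage:
# hits = scan_urls(["http://example.com"], "google-analytics.com/ga.js")
```

    Note this only checks URLs you already have; as discussed below, no search engine will hand you URLs by HTML content in the first place.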
     
    • Thanks x 1
  3. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    Scrapebox is only limited by 2 things.

    1.) The search engines.

    2.) The person using it.


    Any scraper, including Hrefer (which is Xrumer's scraping product, and inferior to Scrapebox IMHO; yes, I own Xrumer and have worked with it a lot in the past), is subject to what the search engines will give it. You can't search for HTML in any major search engine, only content. In fact, that's why products like... meh, I'm drawing a blank, someone else can chime in. But there are multiple paid search engines that index the HTML of the page and allow you to search it; they are all paid and private.

    At any rate, the page scanner in Scrapebox can scan the HTML of a page and tell you if a page contains certain markers, but you have to have a list of URLs to feed it.

    The custom data grabber can scrape HTML from a page, but you have to have the URLs you want to scrape data from, and you need to build the module telling it what to scrape.

    The harvester can scrape search engines for URLs based on the content they contain.

    No free service that I have ever heard of allows you to input HTML in a search box and get back results. Point being, if that's what you're after, then yes, Scrapebox is limited there, because no search engine can do it. Scrapebox doesn't make content, it just scrapes it. For that matter, no script you build will let you search HTML with a search engine either; Hrefer, GScraper and every other tool out there is limited in this way as well. All of them, including Scrapebox, scrape data rather than make up data. You can't scrape what doesn't exist, and no search engine will provide such data.

    So if you can give more clarity on what you're doing, I can tell you whether Scrapebox can do it.
     
    • Thanks x 1
  4. Ste Fishkin

    Ste Fishkin Jr. VIP Jr. VIP Premium Member

    Joined:
    May 14, 2011
    Messages:
    2,047
    Likes Received:
    10,422
    I've not used Scrapebox for a number of years, but the footprints feature should be able to do this? I could be very wrong...
     
  5. clark123

    clark123 Junior Member

    Joined:
    Feb 27, 2012
    Messages:
    135
    Likes Received:
    103
    Thanks for the reply. I guess I can't do this task with Scrapebox. I just want to scrape URLs that contain certain script code in their pages. As you mentioned, I'd need to feed it the URLs first.

    Don't get me wrong though, I love Scrapebox! I've been using it for two years now.

    Thanks for pointing me in the right direction.
     
  6. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    FYI, Xrumer cannot do that, and neither can a custom script, unless you build a script that crawls the web and then build a search engine to search what it collects. It's probably easier to find a better footprint you can use, or to pay a service that already does this. Reinventing Google doesn't make much sense.
     
  7. theskysoft2

    theskysoft2 Newbie

    Joined:
    Dec 9, 2015
    Messages:
    2
    Likes Received:
    0
    Home Page:
    Get 'Web Data Miner' from "TheSkySoft"; you can extract text, images, contact details, etc. from any website.
     
  8. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    That's not what he is asking; Scrapebox can do all of that. He is asking for a program where he can enter some HTML, have it query a search engine, and get back pages that contain that HTML. No program can do this, because no search engine will return this data; they do not return results for HTML.

    There are, however, paid services that do this. For the life of me I can't remember their name, though. Something about being "the search engine for geeks". I'll post it if I think of it.
     
  9. rodvan

    rodvan Jr. VIP Jr. VIP

    Joined:
    Jul 27, 2010
    Messages:
    1,293
    Likes Received:
    492
    Occupation:
    developer, marketing, automation, machine learning
    Location:
    Wizard of Bots
    Home Page:
    You can use Scrapy for Python; it's very nice.
    There are also many examples of crawling websites in different languages: PHP, Python, Perl, Node.js, Ruby, etc.
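
    For reference, the crawl core that Scrapy automates (in Scrapy you would subclass scrapy.Spider and yield requests from a parse callback) can be sketched with the standard library alone. This is only an illustration of the link-extraction step; all names here are made up for the example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags, the first step of any crawler."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str):
    """Return every href found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

    A real crawler would then fetch each extracted link in turn, check its source for the target code, and keep a visited set; Scrapy handles that queueing, deduplication and politeness for you.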
     
  10. frynizy

    frynizy Registered Member

    Joined:
    Oct 3, 2013
    Messages:
    93
    Likes Received:
    28
    You can use Scrapebox to harvest as many keywords as you want, then use them to harvest URLs. Then you can use the Scrapebox page scanner for your purpose.
     
  11. sockpuppet

    sockpuppet Junior Member

    Joined:
    Nov 7, 2011
    Messages:
    155
    Likes Received:
    145
    you're looking for this: nerdydata.com
    but the index is really small; search for "min.js" and you get "607,915 pages within the top million websites were found"
    you probably don't want to limit your search to the top million websites

    i know of another one, but cannot find it in my bookmarks
    it has the same problem: a super small index
    i think they even mention on their website that they use the top Alexa sites or so
     
  12. Atomic76

    Atomic76 Registered Member

    Joined:
    May 24, 2014
    Messages:
    67
    Likes Received:
    37
    I was looking to do something similar to this recently - I was just trying to scrape all of the "related searches" at the bottom of the Google SERP pages. But more specifically I wanted to isolate the words and phrases that Google bolded in those related searches. These are words/phrases that are not part of the user's original search query. I was hoping I could just grab the raw HTML for each of these related searches and pull them into Excel, then just do a MID formula to isolate anything within the B tags.

    I know I could just grab the related searches as a whole and compare which words were and weren't part of the original search query, word by word, but it was a pain in the butt trying to come up with a way to reassemble the new words back into their respective phrases, in instances where Google added a phrase(s) to the query instead of just a single word or two.

    I had forgotten about the custom data grabber in Scrapebox, I definitely need to go check out that tutorial again. Do you think it would be able to do something like this? I can easily supply it with a page of Google SERP URLs.

    FWIW to the original poster, perhaps if there is some sort of footprint in the rendered pages that contain the specific script you are looking for, you can harvest those pages first by that footprint, then grab the HTML with the custom data grabber in Scrapebox? For example, if the pages that have the script also contain a message like "powered by..." in the visible page.

    UPDATE: I tried the Custom Data Grabber for the situation described above and it worked like a charm. I just need to lower the number of simultaneous connections, perhaps even let it run one at a time, since a couple of the URLs in my list came back with 0 results. I used the simple method rather than regex. Do you know of any easy-to-use regular expression generators that can create a basic regular expression (or at least a starting point) from some highlighted text within a web page? In Chrome it's easy to get the XPath of an element by highlighting it, right-clicking, choosing "Inspect", then copying the XPath. For those of us who aren't up to speed on writing custom regular expressions, I'm wondering if there is something similar?
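
    A generator may be overkill for the B-tag case described above; a short hand-written pattern covers it. This is only a sketch: real SERP markup varies and may use other tags, so treat the pattern as a starting point:

```python
import re

# Non-greedy match of anything between <b> and </b>, case-insensitive,
# spanning line breaks if the markup wraps.
BOLD_RE = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)

def bold_phrases(html: str):
    """Return the text inside each <b> tag, e.g. the words Google bolds
    in related-search suggestions."""
    return [m.strip() for m in BOLD_RE.findall(html)]
```

    Fed a related-search snippet, this returns each bolded phrase whole, so there is no need to reassemble multi-word additions word by word as with the Excel MID approach.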
     
    Last edited: Dec 22, 2015
  13. Atomic76

    Atomic76 Registered Member

    Joined:
    May 24, 2014
    Messages:
    67
    Likes Received:
    37
    It's not exactly what you were asking for, but another route could be to use the SEOTools For Excel plugin. You would still need to harvest a list of URLs to check, but once you have those URLs, the plugin adds a new function called "IsFoundOnPage", which is pretty simple to use - a bit simpler than the Custom Data Grabber in Scrapebox. You would just import your list of URLs into a column in Excel, then write a basic formula in the following format:

    =IsFoundOnPage("http://nielsbosma.se","google-analytics.com/ga.js")

    There are just two parts to the formula: the URL you are looking to analyze (which you could just swap out for a cell reference in your spreadsheet), then a comma, and then the unique snippet of code you are looking for, enclosed in quotes. The only caveat I've noticed thus far is that if your snippet of code already has quotes in it, it will break the function. So you will need to find some unique snippet from the script you are looking to identify which doesn't have any quote marks in it.
     
  14. Joseph Lich

    Joseph Lich BANNED BANNED

    Joined:
    Nov 25, 2015
    Messages:
    402
    Likes Received:
    79
    1. The Custom Data Grabber has two masks: from what I have observed, the regex mask is slower than the before/after mask.
    What code do you want to find? I can try to compile it.
    Scrapebox's harvesting speed is very fast now.
    Very likely it can locate many of the sites that have the code you want to find.


    2. Or you need a distributed crawl system to find your code. There are about 200,000,000 domains, and if the code is not on the front page, that is a heavy job; a task like finding out which sites use a particular font.
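
    The before/after mask mentioned above amounts to plain string scanning, which is why it can outrun a regex. A minimal sketch (the marker strings in the test are examples only, not Scrapebox's actual syntax):

```python
def between(html: str, before: str, after: str):
    """Return every substring that sits between a 'before' and an 'after'
    marker, the same idea as a before/after extraction mask."""
    results = []
    start = 0
    while True:
        i = html.find(before, start)
        if i == -1:
            break  # no more 'before' markers
        i += len(before)
        j = html.find(after, i)
        if j == -1:
            break  # unclosed match; stop scanning
        results.append(html[i:j])
        start = j + len(after)
    return results
```

    str.find is a simple substring search with no backtracking, which is the usual reason a fixed before/after pair beats an equivalent regex on large pages.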
     
  15. loopline

    loopline Jr. VIP Jr. VIP

    Joined:
    Jan 25, 2009
    Messages:
    3,726
    Likes Received:
    1,993
    Gender:
    Male
    Home Page:
    Yes, that's it. I know there are one or two others, but you are also correct: the small index is the issue. Trying to find some other marker in the content, or focusing on a niche and then scanning the results for the code with the page scanner, or some other tactic, is where I landed; it just depends on the situation.
     
  16. jamie3000

    jamie3000 Supreme Member

    Joined:
    Jun 30, 2014
    Messages:
    1,311
    Likes Received:
    587
    Occupation:
    Finance coder looking for semi-retirement
    Location:
    uk
    I'd code something up in PHP with curl + XPath. It's a skill worth having :)
     
  17. In_Training

    In_Training Newbie

    Joined:
    Jan 3, 2012
    Messages:
    18
    Likes Received:
    0
    Would I need a scraper or a crawler to get hidden data attributes from sites? Say a site releases a product in two days but none of the product variants are visible yet; how am I able to get this data?
     
  18. Ivenco

    Ivenco Newbie

    Joined:
    Jun 16, 2014
    Messages:
    9
    Likes Received:
    0
    You can try PublicWWW; it has indexed the source code of 158 million websites.
     
  19. Sheraf

    Sheraf Registered Member

    Joined:
    Jan 19, 2014
    Messages:
    61
    Likes Received:
    8