After years of web scraping and working with people who do data collection, data harvesting, data indexing, data aggregation, web crawling, screen scraping, or whatever you want to call it, I wanted to put together a very basic list of ideas on how anyone can profit from the info that’s already out there.

First: What’s scraping? My definition is basic: scraping is intelligently, automatically taking content from somewhere, generally structured content, with the intention of reproducing it or examining it for trends or valuable information.

Second: Why scraping? Because data is valuable. Knowledge is power. Yadayada. You know all this. What you might not know: scraping is often free. So, we’re talking about free value.

So here’s what this guide is going to answer, very basically and quickly, to get your mind working:

Where to get the data
What to do with it

One popular cloud-based scraping product suggests these basic scraping categories, listed as method (site example):

Machine learning (Google Images)
Price monitoring (eBay)
Lead generation (Yelp) [scraping contact info for local biz]
Market research (BrewDog) [scraping types of beer and their ratings, for example]
App development (Realtor.com) [I can only assume scraping realty data and copying it]
Academic research (TechCrunch)

Nice, but I’m going to break it down for you in terms of how to actually make money with this stuff. Here are the basic categories I could come up with:

Duplicating sites
Offering scraped data as a service
Lead gen
Offering "scraping" itself as a service
Scraping to get around APIs

Duplicating whole sites

This is an obvious one. No matter what website you want to create, there’s probably already one out there that’s similar.
Here are some site ideas that could benefit from reproducing scraped data:

Forums
Job boards
Blogs
Q&A sites
Coupon sites
Knowledgebase/wiki sites
Social networks
Review sites (think Yelp, Amazon, etc.)
Any site with data that you could reproduce and create a better interface/app/etc. for

A ton of sites you might already have could use one of these: to look active, to get more traffic, for SEO, as part of a PBN, as a place to actually get the data to begin with (for a coupon site), etc.

Offering scraped data as a service

People want the info below. If you aggregate it regularly or quickly, you’ve got yourself some value. Build a targeted search engine, for example, that pulls data from the top 10 or 20 providers of any kind of niche product and you’ve got something that probably doesn’t exist anywhere else. Consider:

Stocks (sites often charge for anything past a certain date, but you could scrape it once and then provide it for free)
Niche news aggregation (pick a niche, like celebrity news sites, scrape the top 10 sites, etc.)
Daily news (pay for a subscription to get past major site paywalls, then make the data free or discounted)
Anything with a paywall - if you’re a student, you can grab this for free - but be careful, because that’s what got Aaron Swartz in trouble
Any kind of niche content to auto-send to your mailing list, post via social media, etc. (think a newsletter just for the top trends in blackhat IM, or a bot that auto-tweets when a house gets sold in a specific zip code)
Offline, intranet, or hard-to-access data - any legacy database or collection of info can be scraped, converted into a new format, and put online. I’ve seen companies pay big bucks to have this done rather than pay to have entire legacy software systems rebuilt.
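To make the targeted search engine / niche aggregation idea above a bit more concrete, here’s a minimal sketch in Python (my pick for the example, any language works). It assumes you’ve already downloaded the pages and that each site marks its headlines with an `<a class="headline">` tag. That class name, the `HeadlineParser` and `aggregate` names, and the sample pages are all made up for illustration; real sites will need their own rules.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect (title, url) pairs from <a class="headline"> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None  # set only while inside a matching <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: headlines are links carrying class="headline".
        if tag == "a" and attrs.get("class") == "headline":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def aggregate(pages):
    """Merge headlines from several pages, skipping URLs already seen."""
    seen, merged = set(), []
    for source, html in pages.items():
        parser = HeadlineParser()
        parser.feed(html)
        for title, url in parser.items:
            if url not in seen:
                seen.add(url)
                merged.append({"source": source, "title": title, "url": url})
    return merged


# Hypothetical sample pages standing in for real downloads.
SAMPLE_PAGES = {
    "site-a": '<a class="headline" href="/a/1">First story</a>'
              '<a class="nav" href="/about">About</a>',
    "site-b": '<a class="headline" href="/b/9">Second story</a>'
              '<a class="headline" href="/a/1">First story (syndicated)</a>',
}

if __name__ == "__main__":
    for item in aggregate(SAMPLE_PAGES):
        print(item)
```

The dedupe-by-URL step is the part that matters: once you pull from 10 or 20 sources, the same item shows up more than once, and a clean merged list is most of what you’re selling.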

Lead Generation

This is a goldmine, and one that could be considered less than legal, but you wouldn’t believe the number of big companies who use this data for all sorts of things (import.io SUGGESTS you use Yelp for lead gen, despite scraping Yelp being against the TOS).

Ever get targeted by a mailer because you bought a house, had a kid, moved, went to jail, started a business, etc.? A lot of this is public info. You wouldn’t believe the number of lawyers and realtors I’ve talked to who use public databases to get clients. Those two groups, for example, usually have access to a poorly designed database that doesn’t export easily and requires scraping to get through the vast datasets.

If you have access to a unique dataset, or you’re willing to pay for it, or you can grab something that’s public and re-form it, you’re in a great position. You could collect the data and sell it, or you could use it yourself by targeting the contacts directly with offers.

Note - Learn regex. Many places are going to have contact info like email addresses scattered throughout that isn’t easily scrapable. With regex and the right software, you can grab the email addresses from pretty much any dataset and copy ONLY those.

Places to scrape:

Social networks like LinkedIn, Facebook, Twitter
Public datasets/records like insurance data, criminal records and other law databases, voting records, tax records, gov’t spending databases
Realty (home foreclosures, new homes)
Car / vehicle sales websites
Review sites like Yelp

“Scraping” as a service

This sounds like offering scraped data as a service, but it’s slightly different, essentially because it’s time-based: the value is in running the scrape continuously, not in a one-time dataset. A lot of SaaS companies out there are just scrapers or content aggregators. You can be too.
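Quick detour back to the regex note in the lead gen section: here’s a minimal sketch of pulling ONLY the email addresses out of a blob of scraped text, using Python’s built-in `re` module. The pattern is deliberately simple and is my own assumption about what counts as "good enough"; it won’t match every address the email specs allow. The `extract_emails` name and the sample text are made up for illustration.

```python
import re

# Deliberately simple pattern: it catches common addresses, not every
# form the email specs allow.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return unique addresses in order of first appearance, lowercased."""
    seen, found = set(), []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            found.append(email)
    return found


# Made-up sample text standing in for a scraped page.
SAMPLE = """
Contact Jane at Jane.Doe@example.com or sales@example.co.uk.
Press: <a href="mailto:press@example.org">press@example.org</a>
"""

if __name__ == "__main__":
    print(extract_emails(SAMPLE))
    # prints ['jane.doe@example.com', 'sales@example.co.uk', 'press@example.org']
```

Note that the address in the `mailto:` link and the one in the link text collapse into a single entry. Anyway, back to scraping as a service.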
For instance, you could:

Monitor websites for updates or changes
Proxies
Track sales data (Amazon, eBay, etc.) or any kind of item and product listings for competitive price monitoring and market research, a price comparison portal, price arbitrage (what/when can you buy from Amazon and sell on eBay for a profit?), or inventory tracking
Locate the highest-ranking keywords of your competitors on all major search engines
Automate ad buying research

Scraping to get around APIs

A lot of sites that have APIs have them because people are willing to pay for the data - if that’s true, then just ask yourself, why? APIs are awesome, but they often cost money. If you had all the money in the world and plenty of time to code, I’m sure you would use them. The great thing is that sites that have APIs usually have structured content on their site as well. If you need to get data fast and easy, and for basically free, skip the API and go straight for scraping the data directly.

In fact, one way to get scraping site ideas is to look up sites that have APIs. Example:

http://www.computersciencezone.org/50-most-useful-apis-for-developers/
http://www.programmableweb.com/news/most-popular-apis-least-one-will-surprise-you/2014/01/23

It’s all overwhelming. Where to start?

1. Start with what you know. If you’re into old cars, build a search engine / listing site for old cars for sale. See if you can automate it and monetize it. If you’re into gov’t spending or something related to legislation, here are a few fun ideas:

https://www.fcc.gov/licensing-databases/general/search-fcc-databases
https://www.data.gov/
https://www.foia.gov/search.html

2. Play around. One reason I love scraping is that it’s fun. The programming part of it is annoying, but getting the data is fun.

3. Grab some data and put it into a word cloud. This can be fun. Here’s some data I scraped earlier today from scraping job listings. Sometimes it’s useful to get a sense of what is popular.

4. Don’t freak out. Yes, there’s a lot of data.
Yes, there are almost always sites out there that already do something like what you plan to do, but usually they’re making money, and you could make some of that money, too. And if your idea is niche enough, you might actually be the first to aggregate the data or offer the service.

Anyway, I hope this helps some of you who are looking for a method. Scraping is something pretty much anyone can do, and it’s how a lot of sites get started (Facebook, for example).

I’m working on a larger list of ideas that I’ll share a link to once it’s ready. I’m sure there are plenty of things I missed; this is just a basic getting-started guide. Thanks for reading.
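
P.S. On the word cloud idea in step 3: a word cloud is really just word counts drawn big, so you can get the same sense of what’s popular with a few lines of Python before reaching for any cloud tool. This is a rough sketch under my own assumptions: the `top_words` name, the tiny stopword list, and the sample titles are all made up for illustration and are not the data I scraped.

```python
import re
from collections import Counter

# Small, hand-picked stopword list; extend it for real data.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "the", "to", "with"}


def top_words(texts, n=5):
    """Count words across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS and len(word) > 1:
                counts[word] += 1
    return counts.most_common(n)


# Made-up job titles standing in for scraped listings.
SAMPLE_TITLES = [
    "Python web scraping script for product prices",
    "Scraping expert needed to extract data from a directory",
    "Build a Python scraper and export data to CSV",
]

if __name__ == "__main__":
    for word, count in top_words(SAMPLE_TITLES, 3):
        print(word, count)
```

In this sample, "python", "scraping", and "data" each show up twice and everything else once, which is exactly the kind of signal a word cloud would give you.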