
[Easy Trick] How to Scrape All Competitor URLs, Titles, Descriptions, and Meta KWs in 5 Minutes Max

Discussion in 'Black Hat SEO' started by MatthewGraham, Jan 18, 2018.

  1. MatthewGraham

    MatthewGraham BANNED BANNED Jr. VIP

    Joined:
    Oct 6, 2015
    Messages:
    1,762
    Likes Received:
    2,656
    Gender:
    Male
    Step 1: Find a Competitor's Sitemap
    Basically any site large enough to be worth scraping will have an XML sitemap. It is usually located here:
    • [domain].com/sitemap.xml
    Step 2: Copy the Full Sitemap
    The easiest way to do this is a simple Ctrl+A and Ctrl+C.

    Step 3: Extract the URLs
    Go to this URL:
    • https://regex101.com
    Paste (Ctrl+V) the sitemap contents into the "Test String" field. Next, paste this into the "Regular Expression" field.

    Code:
    http[^<"'\n\r]*
    
    If you're wondering, this searches for all strings that start with "http" and returns everything from that point until the search runs into a "<", a single or double quote, or a linebreak. This will match every URL in the sitemap.
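    The same extraction works outside regex101 as well. A minimal Python sketch using the exact pattern above (the example.com URLs are placeholders, not from a real sitemap):

    ```python
    import re

    # In practice, paste the full contents of [domain].com/sitemap.xml here.
    sitemap = """
    <url><loc>https://example.com/buy-green-shoes</loc></url>
    <url><loc>https://example.com/how-to-buy-a-couch</loc></url>
    """

    # Same pattern as above: match from "http" until a <, quote, or linebreak.
    urls = re.findall(r"""http[^<"'\n\r]*""", sitemap)
    print(urls)
    # ['https://example.com/buy-green-shoes', 'https://example.com/how-to-buy-a-couch']
    ```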
    Step 4: Extract the Meta Tags
    This is one of various free tools that will do this:
    • http://tools.buzzstream.com/meta-tag-extractor
    The tool has no cap and requires no account. It extracted 5,000 URLs in roughly 1-3 minutes. Scroll to the bottom of the page to download the results as a CSV.

    For each URL, the CSV will contain the HTML title tag, the meta description, and the meta keywords.
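    If the online tool is ever down, the same three fields can be pulled from raw HTML locally. A rough sketch (not the BuzzStream tool itself, and the regexes are deliberately simple; the sample page is made up):

    ```python
    import re

    def extract_meta(html):
        """Pull title, meta description, and meta keywords from raw HTML."""
        def first(pattern):
            m = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            return m.group(1).strip() if m else ""
        return {
            "title": first(r"<title[^>]*>(.*?)</title>"),
            "description": first(r'<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']'),
            "keywords": first(r'<meta\s+name=["\']keywords["\']\s+content=["\'](.*?)["\']'),
        }

    page = ('<html><head><title>Buy Green Shoes | #1 Co</title>'
            '<meta name="description" content="Green shoes, cheap.">'
            '<meta name="keywords" content="green shoes, buy shoes"></head></html>')
    print(extract_meta(page))
    ```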

    Step 5: Exploit the Data for Profit
    At this point, you have the data. Use your imagination -- there's a lot that you can do with it -- check search volume, find seed keywords, etc.

    Bonus: Quickly Parse Common Data from Common Conventions/Locations
    Quick tricks to parse data in the scraped fields:
    • Meta Keywords
      • Some sites use the meta keywords field to keep track of keywords, which gives you a ton of comma-separated keywords in the CSV's "Meta Keywords" field to work with right off the bat
      • No parsing required
    • Title Tags
      • Title tags are also a great place to start and frequently contain keywords verbatim.
        • It's very common for title tags to be formatted like these:
          • Some Keyword Here | Non-SEO Comment Here
          • Buy Green Shoes | We're the #1 Company!
          • How to Buy a Couch | Top 3 Tricks
      • You can split those up into multiple cells with this spreadsheet formula:
        • =SPLIT(B2,"|")
    • URL
      • Keywords are often included in the URL split by dashes. Those can be extracted with these formulas:
        • =REGEXREPLACE(REGEXEXTRACT(A2, "\.com\/(.*)"), "[^a-zA-Z0-9]", " ")
          • Removes the base URL and replaces all non-alphanumeric characters in the path with spaces
        • =REGEXREPLACE(REGEXEXTRACT(A2, "\.com\/(.*)"), "[^a-zA-Z]", " ")
          • Same, but also removes numbers (keeps only alphabetical characters)
        • Note: If the domain you scraped is not a .com (a .net or other TLD), change the ".com" in either of those to your TLD.
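    For anyone working outside a spreadsheet, rough Python equivalents of the formulas above (the sample title and URL are invented):

    ```python
    import re

    # like =SPLIT(B2,"|")
    title = "Buy Green Shoes | We're the #1 Company!"
    parts = [p.strip() for p in title.split("|")]

    # like =REGEXREPLACE(REGEXEXTRACT(A2, "\.com\/(.*)"), "[^a-zA-Z0-9]", " ")
    url = "https://example.com/buy-green-shoes-online"
    path = re.search(r"\.com/(.*)", url).group(1)
    keywords = re.sub(r"[^a-zA-Z0-9]", " ", path)

    print(parts)     # ['Buy Green Shoes', "We're the #1 Company!"]
    print(keywords)  # buy green shoes online
    ```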
     
    • Thanks Thanks x 35
  2. godknowseverything

    godknowseverything Regular Member

    Joined:
    Oct 13, 2014
    Messages:
    470
    Likes Received:
    698
    Occupation:
    Coiner
    Location:
    Planet SEO
    Awesome share! Saved me a ton of time. Thanks so much.
     
  3. tshirtpromo

    tshirtpromo Jr. VIP Jr. VIP

    Joined:
    Aug 8, 2017
    Messages:
    115
    Likes Received:
    29
    Gender:
    Male
    Bookmarking this. Thanks a lot.
     
  4. Michael Jam

    Michael Jam Regular Member

    Joined:
    Sep 20, 2016
    Messages:
    298
    Likes Received:
    77
    Gender:
    Male
    Great tips. Thanks for sharing, bro.
     
  5. terrycody

    terrycody Elite Member

    Joined:
    Sep 29, 2012
    Messages:
    2,761
    Likes Received:
    876
    Occupation:
    marketer
    Location:
    Hell
    Hard to understand some parts, but this is damn useful, you always share good stuff.
     
  6. bobojonathan

    bobojonathan Power Member

    Joined:
    Sep 12, 2014
    Messages:
    524
    Likes Received:
    58
    Gender:
    Male
    Occupation:
    Internet Marketer
    Location:
    Everywhere
    Bookmarking this for future use. Thanks buddy.
     
  7. Samantha9

    Samantha9 Junior Member

    Joined:
    Nov 14, 2017
    Messages:
    111
    Likes Received:
    7
    Gender:
    Female
    Nice share, I will bookmark this thread.
     
  8. MatthewGraham

    MatthewGraham BANNED BANNED Jr. VIP

    Joined:
    Oct 6, 2015
    Messages:
    1,762
    Likes Received:
    2,656
    Gender:
    Male
    Thanks!

    The regex / regular expression sections are easier to understand if you mess around with them a little. Some of the syntax seems confusing at first; once you look up the basics for how regexes work, they are fairly straightforward. You can get the basics down in an hour or two.

    If you search Google for "intro to regex" (https://www.google.com/search?q=intro+to+regex), there are introductory guides that will walk you through the basics. When using those tutorials, the site linked to in the original post (regex101.com) is a good resource for experimenting with actually running regexes; it explains/breaks down what the syntax and characters do as you write them.

    Regexes can save a ton of time. Things that you would pay a VA $3 per hour to spend 50 hours doing can often be done in 5-10 minutes with a regex. Things like:
    • Extracting all URLs from text (like in the original post)
    • Extracting all emails from text
    • Extracting all IP addresses from text
    • Extracting all [anything with a standard format] from text
    • Bulk renaming thousands of files
    • Extracting the root domains from a list of URLs
    • Extracting the path from a list of URLs
    • Converting data from one format to another (ex: "some-text_is_here" to "Some text is here.")
    • Reformatting data in a spreadsheet
    • Finding data in a spreadsheet that is formatted incorrectly
    • Etc.
    Would definitely recommend that anyone working in IM experiments with regexes.
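    Two of the extraction tasks above, as quick regex sketches (the patterns are deliberately simple and will miss some edge cases; the sample text is made up):

    ```python
    import re

    text = "Contact admin@example.com or sales@example.org from 192.168.0.1."

    # Extract all emails from text
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    # Extract all IPv4 addresses from text
    ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)

    print(emails)  # ['admin@example.com', 'sales@example.org']
    print(ips)     # ['192.168.0.1']
    ```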
     
    • Thanks Thanks x 5
  9. MatthewGraham

    MatthewGraham BANNED BANNED Jr. VIP

    Joined:
    Oct 6, 2015
    Messages:
    1,762
    Likes Received:
    2,656
    Gender:
    Male
    Bonus Tip
    If the site you want to crawl has no XML sitemap, there are online tools that will crawl a website and make a sitemap, such as this website:
    • https://www.xml-sitemaps.com/
    Just used that tool for a site with ~80 pages that didn't have a sitemap. It found ~70% of all pages: 56 pages found, while site:domain.com shows 79.
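    Under the hood, these generators just crawl the site and collect internal links, which is why they miss orphan pages. A toy sketch of the idea (the fetch step is faked with an in-memory site; swap in urllib.request for a real crawl):

    ```python
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    # Fake three-page site standing in for real HTTP fetches.
    PAGES = {
        "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
        "https://example.com/a": '<a href="/b">B</a>',
        "https://example.com/b": '<a href="https://other.com/">out</a>',
    }

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start):
        """Collect every same-domain page reachable by links from `start`."""
        domain = urlparse(start).netloc
        seen, queue = set(), [start]
        while queue:
            url = queue.pop()
            if url in seen or urlparse(url).netloc != domain:
                continue
            seen.add(url)
            parser = LinkParser()
            parser.feed(PAGES.get(url, ""))
            queue.extend(urljoin(url, link) for link in parser.links)
        return sorted(seen)

    print(crawl("https://example.com/"))
    ```

    Any page nothing links to never enters the queue, which matches the ~70% coverage seen above.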
     
    • Thanks Thanks x 5
  10. topakins

    topakins Regular Member

    Joined:
    Jan 24, 2014
    Messages:
    321
    Likes Received:
    130
    Gender:
    Male
    Location:
    Somewhere in Africa
    To use Regex, must you have Python installed?
     
  11. MatthewGraham

    MatthewGraham BANNED BANNED Jr. VIP

    Joined:
    Oct 6, 2015
    Messages:
    1,762
    Likes Received:
    2,656
    Gender:
    Male
    Regex is supported in many tools; Python is not required (although Python does support regexes as well). There are many websites that will run regexes for you, such as https://regex101.com. You can also run regexes in Google Sheets (and other spreadsheet applications). These formulas from the original post:
    • =REGEXREPLACE(REGEXEXTRACT(A2, "\.com\/(.*)"), "[^a-zA-Z0-9]", " ")
    • =REGEXREPLACE(REGEXEXTRACT(A2, "\.com\/(.*)"), "[^a-zA-Z]", " ")
    are both for Google Sheets (older versions of Excel do not support these regex functions natively).

     
    • Thanks Thanks x 1
  12. The Curator

    The Curator Supreme Member

    Joined:
    Dec 27, 2013
    Messages:
    1,480
    Likes Received:
    661
    I like to use the BuzzStream tool for title and meta. I also like to use this site's tool http://www.seoreviewtools.com/html-headings-checker/ to access a page's headers. I will then take all this info and put it in a spread sheet and compare it all for content creation.
     
  13. davids355

    davids355 Moderator Staff Member Moderator Jr. VIP

    Joined:
    Apr 25, 2011
    Messages:
    14,795
    Likes Received:
    13,270
    Home Page:
    Very nice guide, this will come in handy.
     
  14. LGNDFRVR23

    LGNDFRVR23 Jr. VIP Jr. VIP

    Joined:
    Jan 23, 2018
    Messages:
    264
    Likes Received:
    109
    Gender:
    Male
    Thanks!!
     
  15. MatthewGraham

    MatthewGraham BANNED BANNED Jr. VIP

    Joined:
    Oct 6, 2015
    Messages:
    1,762
    Likes Received:
    2,656
    Gender:
    Male
    10/10 thread OP. I still come back to this thread when I need to look up where/how to scrape this shit.
     
    • Thanks Thanks x 4
  16. PureHustle

    PureHustle Power Member

    Joined:
    Feb 12, 2015
    Messages:
    515
    Likes Received:
    407
    Location:
    Location
    This is a phenomenal way to do competitor research. You can easily use Excel as well to clean URLs and figure out what keywords your competitors are targeting (from a high level).

    I've been doing this for years.
     
  17. Dribber

    Dribber Registered Member

    Joined:
    Mar 6, 2018
    Messages:
    75
    Likes Received:
    42
    Gender:
    Male
    Awesome share. Thank you...
     
  18. whoami

    whoami Regular Member

    Joined:
    Mar 4, 2010
    Messages:
    232
    Likes Received:
    124
    Gender:
    Male
    Wow, this is an awesome idea and this will be so helpful for me. I appreciate the time it took to share this. Thanks!
     
  19. riseofempire

    riseofempire BANNED BANNED

    Joined:
    May 30, 2017
    Messages:
    276
    Likes Received:
    74
    Or just use SB if you have it hehe
     
  20. juniorfast1

    juniorfast1 Regular Member

    Joined:
    Apr 23, 2013
    Messages:
    332
    Likes Received:
    113
    Gender:
    Male
    Location:
    10.0.0.1
    I did this once for a large site affiliated with Amazon, and I was able to get 4,000+ keywords containing the word "best". I still keep it ...
     
    • Thanks Thanks x 1