Regular expressions for mixed content detection
A previous post, "Moving to HTTPS? Need to crawl-check the site for mixed content?", proposes three options for scouting your website for mixed content:
- Checking inside the browser using developer tools
- Using apps or browser extensions designed to discover mixed content
- Using an SEO crawler / scraper / extractor
While the third option usually lets you run a dedicated mixed content crawl, this post is about setting up custom mixed content detection with the help of regular expressions.
Regular expressions definition
"A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation."
Source: Wikipedia - Regular expressions
Scope of mixed content detection
The regular expressions below were written and tested on 40+ sites to discover the following active and passive mixed content:
- favicons and tiles
- meta-refresh
- canonicals
- inline JS redirects, even on elements other than a href links
- flash
- audio, video tags
- RSS links
- Open Graph tags, FB widgets
- iframe src
- background and border images in inline CSS
Regular expressions for mixed content:
Detection of mixed content scripts:
Matches scripts, external or internal, that contain http: and therefore do not follow document.location.protocol.
(?i)(?:<script[^<>]*http:[^<>]*>(?:(?!<\/script>)[\s\S])*<\/script>)|(?:<script[^<>]*>(?:(?!<\/script>)[\s\S])*http:(?:(?!<\/script>)[\s\S])*<\/script>)
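As a rough illustration, the script pattern can be tried out in Python. This is only a sketch: the sample markup and the function name find_mixed_scripts are made up for the example, and the inline (?i) flags are swapped for re.IGNORECASE because recent Python versions reject an inline global flag that is not at the very start of the pattern.

```python
import re

# The post's script pattern, adapted for Python's re module:
# re.IGNORECASE stands in for the inline (?i) flags.
MIXED_SCRIPT = re.compile(
    r"<script[^<>]*http:[^<>]*>(?:(?!</script>)[\s\S])*</script>"
    r"|<script[^<>]*>(?:(?!</script>)[\s\S])*http:(?:(?!</script>)[\s\S])*</script>",
    re.IGNORECASE,
)


def find_mixed_scripts(html: str) -> list[str]:
    """Return every <script> element whose tag or body contains 'http:'."""
    return [m.group(0) for m in MIXED_SCRIPT.finditer(html)]


# Hypothetical sample markup, not taken from a real site.
SCRIPT_SAMPLE = """
<script src="http://cdn.example.com/app.js"></script>
<script src="https://cdn.example.com/ok.js"></script>
<script>window.location = "http://example.com/old";</script>
<script src="//cdn.example.com/relative.js"></script>
"""

for hit in find_mixed_scripts(SCRIPT_SAMPLE):
    print(hit)
```

In this sample only the first and third scripts are reported: the https and protocol-relative ones contain no http: and are skipped.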
Detection of a href links:
This harvests a href links leading to http://* together with their link text, or whatever else sits between the opening and closing tag, to make them easier to find in the source code. It is intended for internal links on a specific domain, so don't forget to first replace .DOMAIN with your domain name, leaving off .com, .org, .pizza, or whatever your dot+TLD is. Mind the dot.
(?i)<a [^<>]*href="http:\/\/.{0,8}.DOMAIN[^<>]*>(?:(?!<\/a>)[\s\S])*<\/a>
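A small Python sketch of how the placeholder might be filled in follows. It assumes that the whole .DOMAIN token, dot included, is replaced by the bare domain name ("example" for example.com); if you read the instruction differently, adjust the pattern accordingly. The function name internal_http_links and the sample links are hypothetical.

```python
import re


def internal_http_links(html: str, domain: str) -> list[str]:
    """Return <a>...</a> elements whose href points at http://...<domain>."""
    # Assumption: ".DOMAIN" in the post's pattern is replaced as a whole
    # by the bare domain name, e.g. "example" for example.com.
    pattern = re.compile(
        r'<a [^<>]*href="http://.{0,8}'
        + re.escape(domain)
        + r"[^<>]*>(?:(?!</a>)[\s\S])*</a>",
        re.IGNORECASE,  # stands in for the leading (?i)
    )
    return [m.group(0) for m in pattern.finditer(html)]


# Hypothetical sample markup.
LINK_SAMPLE = """
<a href="http://www.example.com/about">About us</a>
<a class="btn" href="https://www.example.com/contact">Contact</a>
<a href="http://other.org/">Elsewhere</a>
"""

for link in internal_http_links(LINK_SAMPLE, "example"):
    print(link)
```

Here .{0,8} leaves room for a short prefix such as www. in front of the domain name, so only the first link in the sample is reported.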
Detection of other mixed content
Matches elements with values containing http://* inside src, href, data-href, content, url, srcset, location, movie and value. Ignores XMLNS and DOCTYPE.
(?i)<(?!a )[^<>]*(?:src|href|content|location|url|origin)[="\(]{0,2}http:(?:(?!§)[^<>])*>
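The same kind of quick check can be done for this pattern. The sketch below copies the expression as published, again with re.IGNORECASE in place of the leading (?i); the sample markup and the name find_other_mixed_content are invented for illustration.

```python
import re

# The post's pattern for non-link elements, with re.IGNORECASE
# standing in for the leading (?i).
OTHER_MIXED = re.compile(
    r'<(?!a )[^<>]*(?:src|href|content|location|url|origin)'
    r'[="\(]{0,2}http:(?:(?!§)[^<>])*>',
    re.IGNORECASE,
)


def find_other_mixed_content(html: str) -> list[str]:
    """Return opening tags (other than <a ...>) that reference http: resources."""
    return [m.group(0) for m in OTHER_MIXED.finditer(html)]


# Hypothetical sample markup.
OTHER_SAMPLE = """
<link rel="canonical" href="http://example.com/">
<img src="http://example.com/logo.png" alt="logo">
<img src="https://example.com/ok.png">
<div style="background: url(http://example.com/bg.png)">
<a href="http://example.com/">link</a>
"""

for tag in find_other_mixed_content(OTHER_SAMPLE):
    print(tag)
```

On this sample the canonical link, the first image and the inline CSS background are reported, while the https image and the a href link (excluded by the (?!a ) lookahead) are not.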
Disclaimer: While these have proven useful on many sites, there is no guarantee they will work on every occasion. Some nested tags may cause trouble, and not all ways of referencing mixed content may be covered. To be 100% sure your site is free of mixed content, check with your browser's developer tools.
Important update: As several imperfections in the above regexes have been discovered, we are moving them to Regex101 to allow improvements and forks.