Match URL patterns with RegEx

Matching URLs is quite a common task. You may need to check whether a URL entered into a form matches the syntax for URLs, or whether an entered URL is valid, for example if your app allows user accounts. Matching URLs is also great practice for learning Regular Expressions.

Checking for URLs

When there's a form, there's a chance it contains an input field for URLs or e-mails. Checking for URLs is easier than checking e-mails, so we'll use this example.

var urls = [
  'http://kevingimbel.com',
  'timpietrusky.com',
  'http://www.google.com'
];

In the example above we've defined an Array containing three URLs, all written in a valid but different Pattern. Let's construct a RegEx to match any of them, step by step.

Defining what we need to match

First we should define the different parts we want to match and what they can be made of.

  1. Protocol
  2. Name
  3. Top Level Domain

Protocol

The Protocol can be either HTTP or HTTPS (when we're talking websites) so we need to match these two things followed by an :// that is required. BUT the protocol can also be missing so it is optional!

Name

The Domain Name is the part after the protocol and before the Top Level Domain, for example in the URL http://MyDomainName.com, MyDomainName is the Name we want to match. Thanks to internationalized domain names, this Name is not limited to ASCII letters: it can contain a wide range of Unicode characters (some TLDs even allow Emojis, btw).

Top Level Domain

The Top Level Domain is probably the easiest to match. It is a word of at least 2 letters that has a dot "in front" of it (http://MyDomainName.com => com).

Basic Pattern

So let's first match the protocol.

var pattern = /(http(s)?)/gi  

You may wonder why we need the ? at the end instead of using (http|s). The problem is that (http|s) would match either http or the single character s. So we can either write (http|https) or (http(s)?), where (s)? means "there could be an s, but it is not necessarily there". Using this method we can match the protocol of both full URLs from our list, http://kevingimbel.com and http://www.google.com.
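To make the difference between the three spellings concrete, here is a small sketch. The ^ and $ anchors and the variable names are additions for this illustration, so that each pattern has to match the whole sample string.

```javascript
// Anchored with ^ and $ so each pattern has to match the whole string.
var loose = /^(http|s)$/i;        // "http" OR the single character "s"
var explicit = /^(http|https)$/i; // "http" OR "https"
var optional = /^(http(s)?)$/i;   // "http" followed by an optional "s"

var results = ['http', 'https', 's'].map(function (sample) {
  return {
    sample: sample,
    loose: loose.test(sample),
    explicit: explicit.test(sample),
    optional: optional.test(sample)
  };
});

console.log(results);
// "https" fails only the loose pattern; "s" passes only the loose pattern.
```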

See the Pen 33bd4e5c25c1873b0e836743eec21967 by Kevin Gimbel (@kevingimbel) on CodePen.

Now comes the hard part: we need to match different types of URLs, since there are at least 5 valid combinations of protocol and www prefix. Simply put:

  1. https
  2. https + www
  3. http
  4. http + www
  5. www

The 6th scheme would be no protocol at all: mysite.com is a valid URL too, so our script needs to catch these as well!
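The six shapes can be checked with one optional protocol group followed by one optional www group. This is only a sketch of the prefix part: the mysite.com sample hosts and the variable names are made up for the illustration.

```javascript
// Hypothetical sample URLs, one per shape from the list above.
var shapes = [
  'https://mysite.com',
  'https://www.mysite.com',
  'http://mysite.com',
  'http://www.mysite.com',
  'www.mysite.com',
  'mysite.com'
];

// Optional protocol including "://", then an optional "www."
var prefix = /^(http(s)?:\/\/)?(www\.)?/i;

var prefixes = shapes.map(function (url) {
  // Both groups are optional, so the match can be an empty string.
  return url.match(prefix)[0];
});

console.log(prefixes);
// [ 'https://', 'https://www.', 'http://', 'http://www.', 'www.', '' ]
```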

Quite some work here! We have the http/https match set up, so let's add the www part. We will create a JavaScript object to store the different parts of our RegEx to keep a better overview. Later we will create the RegEx using the new RegExp() constructor. The basic idea works like this:

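// our pattern object
var patterns = {};
// each part is stored as its own regular expression
patterns.protocol = /(http(s)?)/;
patterns.domain = /(\w)+/;
// ...
// .source returns the pattern text without the surrounding slashes and flags,
// so the parts can be concatenated into one expression
var regex = new RegExp(patterns.protocol.source + patterns.domain.source, 'gi');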

This is a lot more readable with long regular expressions compared to writing them into one line. See the next Pen's code for the working example.
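One detail is worth keeping in mind when combining parts this way: joining two RegExp objects with + uses their string form, which still contains the slashes. The sketch below compares that with the source property, which holds only the pattern text. The variable names are chosen for this example.

```javascript
var protocol = /(http(s)?)/;
var domain = /(\w)+/;

// The + operator converts each RegExp to a string, slashes included.
var joinedAsStrings = protocol + domain;
console.log(joinedAsStrings); // /(http(s)?)//(\w)+/

// .source contains only the pattern text.
var joinedSources = protocol.source + domain.source;
console.log(joinedSources); // (http(s)?)(\w)+

// Only the second form is a usable pattern for the RegExp constructor.
var combined = new RegExp(joinedSources, 'i');
console.log(combined.test('httpsx')); // true
console.log(combined.test('http'));   // false, (\w)+ needs one more character
```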

See the Pen RegEx.wtf - Match URLs (Part 2) by Kevin Gimbel (@kevingimbel) on CodePen.

In this pen you can see the things that are matched (green) and the original URLs (grey). Looking good so far! Let's rebuild this regex step by step.

Domain name

The domain name can basically be everything: Words (a-zA-Z), Numbers (0-9), Underscores (_), dashes (-) and dots (.) - as well as non-Latin characters (such as Greek URLs like ουτοπία.δπθ.gr - these will not be matched by our script!).

Now that we know what to match we need to express it in regular expression.

// ...
// the dash goes last: between two characters it would define a range
patterns.name = /[a-zA-Z0-9_\.-]/gi

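This regex matches letters (a-zA-Z), numbers (0-9), underscores (_), dots (\.) and dashes (-). Inside square brackets the dot is already literal, so escaping it is optional there; outside of brackets an unescaped dot would match every character. The dash, on the other hand, builds a range when it stands between two characters, so it is safest to put it first or last inside the brackets.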
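A quick sketch of how the dot and the dash behave inside and outside of square brackets. The sample strings and the ^, $ and + additions are assumptions for this illustration, so that whole strings are tested.

```javascript
// Dash placed last so it is a literal dash, not a range operator.
var nameChars = /^[a-zA-Z0-9_.-]+$/;

var checks = {
  plain: nameChars.test('kevingimbel.com'),    // true
  dashed: nameChars.test('my-site_2.example'), // true
  spaced: nameChars.test('my site.com'),       // false, space is not allowed
  bareDot: /^a.c$/.test('abc'),                // true, "." matches any character
  escapedDot: /^a\.c$/.test('abc'),            // false, "\." only matches a dot
  classDot: /^a[.]c$/.test('abc')              // false, "." is literal in brackets
};

console.log(checks);
```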

Next up is the TLD (Top-Level Domain). Before the rise of the "new TLDs" this part was rather easy since TLDs were everything alphabetic from A to Z with a minimum of 2 characters and a maximum of 4. This rule, however, doesn't fit anymore since the new TLDs are "real" words such as .academy, .shop, .lol, .xyz and so on. A list of new TLDs can be found on gandi.net, a registrar and hosting company.

With all these new rules, one stays the same: a TLD is at least 2 characters long, so we need to match for that. In Regular Expressions this can be done by adding curly brackets containing a number followed by a comma behind the pattern.

// ...
// {2,} means "minimum 2"
// the dash goes last in the brackets so it is not read as a range
patterns.domain = /[a-zA-Z0-9_\.-]{2,}/gi
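To see the quantifier on its own, the following sketch compares the old "2 to 4 letters" rule with the open-ended {2,}. The letters-only TLD patterns and the example.* hosts are assumptions for this illustration and are not part of the patterns object.

```javascript
var oldTld = /\.[a-z]{2,4}$/i; // a dot, then 2 to 4 letters at the end
var newTld = /\.[a-z]{2,}$/i;  // a dot, then 2 or more letters at the end

var hosts = ['example.com', 'example.academy', 'example.c'];

var tldChecks = hosts.map(function (host) {
  return { host: host, oldRule: oldTld.test(host), newRule: newTld.test(host) };
});

console.log(tldChecks);
// example.com passes both rules, example.academy only the new one,
// example.c passes neither because one letter is below the minimum.
```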

Yeah! So now we can match protocols and domains! Let's see this part in action.

See the Pen RegEx.wtf - Match URLs (Part 2) by Kevin Gimbel (@kevingimbel) on CodePen.

Sweet regex we've got here! By now our RegEx Object looks like this.

var patterns = {};
// the parts are strings here, so every backslash has to be doubled
patterns.protocol = '^(http(s)?(:\\/\\/))?(www\\.)?';
patterns.domain = '[a-zA-Z0-9_\\.-]+';

var url_regex = new RegExp(patterns.protocol + patterns.domain, 'gi');

You may notice that this pattern also matches the TLD. This is kind of a side effect because we also match sub domains (sub.domain.tld). I'll leave it like that for this post.
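As a rough check, the combined pattern can be run against the three URLs from the beginning of the post. This sketch leaves out the g flag, because a global RegExp remembers its lastIndex between test() or exec() calls. The extra samples sub.domain.tld and ://nope are made up for the illustration.

```javascript
var patterns = {
  // string parts: backslashes are doubled so the RegExp sees "\/" and "\."
  protocol: '^(http(s)?(:\\/\\/))?(www\\.)?',
  domain: '[a-zA-Z0-9_\\.-]+'
};

var urlRegex = new RegExp(patterns.protocol + patterns.domain, 'i');

var urls = [
  'http://kevingimbel.com',
  'timpietrusky.com',
  'http://www.google.com',
  'sub.domain.tld', // sub domains match as a side effect
  '://nope'         // no valid start, so no match
];

var matched = urls.map(function (url) {
  var match = url.match(urlRegex);
  return match ? match[0] : null;
});

console.log(matched);
// The first four entries are matched completely, the last one is null.
```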

GET all the parameters!

Parameters! We all love them! They're essential for Web Applications to work, so they sure can be in a URL. A classic parameter looks like domain.tld?some_parameter=some_value: here the parameter is ?some_parameter=some_value, which results in a mapping of some_parameter = some_value when passed to a script. It's common these days that parameters are rewritten using .htaccess or another server "feature", so the previous URL would more likely be domain.tld/some_parameter/some_value. Either way, we need to match both.
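To illustrate the mapping, here is a small sketch of both styles. It uses the standard URLSearchParams API, which this post does not otherwise rely on, and a naive split for the rewritten form; how a real server maps path segments back to parameters depends on its rewrite rules.

```javascript
// Classic query string: the name is mapped to its value.
var query = new URLSearchParams('?some_parameter=some_value');
var value = query.get('some_parameter');
console.log(value); // some_value

// Rewritten form: the same information, carried as path segments.
var segments = '/some_parameter/some_value'.split('/').filter(Boolean);
console.log(segments); // [ 'some_parameter', 'some_value' ]
```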

For parameters we add a new value to the Patterns Object.

patterns.params = /([-a-zA-Z0-9:%_\+.~#?&\/=]*)/

Looks complex! And indeed it is. We have all kinds of parameter and URL paths here. Let's break things down:

  1. Dashes (-, e.g. domain.co/a-b-c)
  2. Everything alphabetic and numeric (a-zA-Z0-9, e.g. domain.co/abc1)
  3. Port Delimiters (:, e.g. domain.co:1337)
  4. Percent signs (%, e.g. domain.co/image%20with%20spaces.png)
  5. Underscores (_, e.g. domain.co/a_b_c)
  6. Plus signs (+, e.g. domain.co/some+path)
  7. Dots (., e.g. domain.co/my.script.php)
  8. Tildes (~, e.g. domain.co/~path)
  9. Hash Values (#, e.g. domain.co/#home)
  10. Question marks (?, e.g. domain.co/?param)
  11. Ampersands (&, e.g. domain.co/&whatever)
  12. Slashes (/, e.g. domain.co/routes/to/nowhere)
  13. Equal signs (=, e.g. domain.co/?param=value)

That is mostly everything a URL path consists of. With this pattern we can match all our paths and parameters, as the next Pen shows.
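Put together, the three parts could be combined as in the sketch below. Two things here are assumptions for the illustration and not part of the original patterns: the trailing $ that forces the whole input to match, and the pathAndParams helper that reads the fifth capture group.

```javascript
var patterns = {
  // string parts: backslashes are doubled so the RegExp sees "\/" and "\."
  protocol: '^(http(s)?(:\\/\\/))?(www\\.)?',
  domain: '[a-zA-Z0-9_\\.-]+',
  params: '([-a-zA-Z0-9:%_+.~#?&/=]*)'
};

// Assumption: "$" is appended so the whole input has to match.
var fullRegex = new RegExp(
  patterns.protocol + patterns.domain + patterns.params + '$', 'i'
);

// Hypothetical helper: returns the path/parameter part, or null without a match.
function pathAndParams(url) {
  var match = url.match(fullRegex);
  return match ? match[5] : null; // group 5 is the params group
}

console.log(pathAndParams('http://www.domain.co/routes/to/nowhere')); // /routes/to/nowhere
console.log(pathAndParams('domain.co/?param=value'));                 // /?param=value
console.log(pathAndParams('domain.co:1337'));                         // :1337
console.log(pathAndParams('domain.co/has space'));                    // null
```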

See the Pen RegEx.wtf - Match URLs (Part 3) by Kevin Gimbel (@kevingimbel) on CodePen.

You can find every pen created for this Blog in the regex.wtf collection at CodePen.