Regular Expressions You Should Know

Posted on | November 11, 2011

Regular expressions are a language of their own. When you learn a new programming language, they’re this little sub-language that makes no sense at first glance. Many times you have to read another tutorial, article, or book just to understand the “simple” pattern described. Today, we’ll review eight regular expressions that you should know for your next coding project.

Regular expressions can be scary…really scary. Fortunately, once you memorize what each symbol represents, the fear quickly subsides. If you fit the title of this article, there’s much to learn! Let’s get started.

Section 1: Learning the Basics

The key to learning how to effectively use regular expressions is to just take a day and memorize all of the symbols. This is the best advice I can possibly offer. Sit down, create some flash cards, and just memorize them! Here are the most common:

  • . – Matches any character, except for line breaks if dotall is false.
  • * – Matches 0 or more of the preceding character.
  • + – Matches 1 or more of the preceding character.
  • ? – Preceding character is optional. Matches 0 or 1 occurrence.
  • \d – Matches any single digit
  • \w – Matches any word character (alphanumeric & underscore).
  • [XYZ] – Matches any single character from the character class.
  • [XYZ]+ – Matches one or more of any of the characters in the set.
  • $ – Matches the end of the string.
  • ^ – Matches the beginning of a string.
  • [^a-z] – When inside of a character class, the the ^ means NOT; in this case, match anything that is NOT a lowercase letter.

Yep – it’s not fun, but just memorize them. You’ll be thankful if you do!

Tools

You can be certain that you’ll want to rip your hair out at one point or another when an expression doesn’t work, no matter how much it should – or you think it should! Downloading the RegExr Desktop app is essential, and is really quite fun to fool around with. In addition to real-time checking, it also offers a sidebar which details the definition and usage of every symbol. Download it!.

Section 2: Regular Expressions and JavaScript

 

In this final section, we’ll review a handful of the most important JavaScript methods for working with regular expressions.

1. Test()

This one accepts a single string parameter and returns a boolean indicating whether or not a match has been found. If you don’t necessarily need to perform an operation with the a specific matched result – for instance, when validating a username – “test” will do the job just fine.

Example

var username = 'JohnSmith'; alert(/[A-Za-z_-]+/.test(username)); // returns true

Above, we begin by declaring a regular expression which only allows upper and lower case letters, an underscore, and a dash. We wrap these accepted characters within brackets, which designates acharacter class. The “+” symbol, which proceeds it, signifies that we’re looking for one or more of any of the preceding characters. We then test that pattern against our variable, “JohnSmith.” Because there was a match, the browser will display an alert box with the value, “true.”

2. Split()

You’re most likely already familiar with the split method. It accepts a single regular expression which represents where the “split” should occur. Please note that we can also use a string if we’d prefer.

var str = 'this is my string'; alert(str.split(/\s/)); // alerts "this, is, my, string"

By passing “\s” – representing a single space – we’ve now split our string into an array. If you need to access one particular value, just append the desired index.

var str = 'this is my this string'; alert(str.split(/\s/)[3]); // alerts "string"

3. Replace()

As you might expect, the “replace” method allows you to replace a certain block of text, represented by a string or regular expression, with a different string.

Example

If we wanted to change the string “Hello, World” to “Hello, Universe,” we could do the following:

var someString = 'Hello, World'; someString = someString.replace(/World/, 'Universe'); alert(someString); // alerts "Hello, Universe"

It should be noted that, for this simple example, we could have simply used .replace(’World’, ‘Universe’). Also, using the replace method does not automatically overwrite the value the variable, we must reassign the returned value back to the variable, someString.

Example 2

For another example, let’s imagine that we wish to perform some elementary security precautions when a user signs up for our fictional site. Perhaps we want to take their username and remove any symbols, quotation marks, semi-colons, etc. Performing such a task is trivial with JavaScript and regular expressions.

var username = 'J;ohnSmith;@%'; username = username.replace(/[^A-Za-z\d_-]+/, ''); alert(username); // JohnSmith;@%

Given the produced alert value, one might assume that there was an error in our code (which we’ll review shortly). However, this is not the case. If you’ll notice, the semi-colon immediately after the “J” was removed as expected. To tell the engine to continue searching the string for more matches, we add a “g” directly after our closing forward-slash; this modifier, or flag, stands for “global.” Our revised code should now look like so:

var username = 'J;ohnSmith;@%'; username = username.replace(/[^A-Za-z\d_-]+/g, ''); alert(username); // alerts JohnSmith

Now, the regular expression searches the ENTIRE string and replaces all necessary characters. To review the actual expression – .replace(/[^A-Za-z\d_-]+/g, ”); – it’s important to notice the carot symbol inside of the brackets. When placed within a character class, this means “find anything that IS NOT…” Now, if we re-read, it says, find anything that is NOT a letter, number (represented by \d), an underscore, or a dash; if you find a match, replace it with nothing, or, in effect, delete the character entirely.

4. Match()

Unlike the “test” method, “match()” will return an array containing each match found.

Example

var name = 'JeffreyWay'; alert(name.match(/e/)); // alerts "e"

The code above will alert a single “e.” However, notice that there are actually two e’s in the string “JeffreyWay.” We, once again, must use the “g” modifier to declare a “global search.

var name = 'JeffreyWay'; alert(name.match(/e/g)); // alerts "e,e"

If we then want to alert one of those specific values with the array, we can reference the desired index after the parentheses.

var name = 'JeffreyWay'; alert(name.match(/e/g)[1]); // alerts "e"

Example 2

Let’s review another example to ensure that we understand it correctly.

var string = 'This is just a string with some 12345 and some !@#$ mixed in.'; alert(string.match(/[a-z]+/gi)); // alerts "This,is,just,a,string,with,some,and,some,mixed,in"

Within the regular expression, we created a pattern which matches one or more upper or lowercase letters – thanks to the “i” modifier. We also are appending the “g” to declare a global search. The code above will alert “This,is,just,a,string,with,some,and,some,mixed,in.” If we then wanted to trap one of these values within the array inside of a variable, we just reference the correct index.

var string = 'This is just a string with some 12345 and some !@#$ mixed in.'; var matches = string.match(/[a-z]+/gi); alert(matches[2]); // alerts "just"

Splitting an Email Address

Just for practice, let’s try to split an email address – demo@karthikeyans.co.in – into its respective username and domain name: “demo,” and “karthikeyans.”

var email = 'demo@karthikeyans.co.in'; alert(email.replace(/([a-z\d_-]+)@([a-z\d_-]+)\.[a-z]{2,4}/ig, '$1, $2')); // alerts "demo, karthikeyans"

If you’re brand new to regular expressions, the code above might look a bit daunting. Don’t worry, it did for all of us when we first started. Once you break it down into subsets though, it’s really quite simple. Let’s take it piece by piece.

.replace(/([a-z\d_-]+)

Starting from the middle, we search for any letter, number, underscore, or dash, and match one ore more of them (+). We’d like to access the value of whatever is matched here, so we wrap it within parentheses. That way, we can reference this matched set later!

@([a-z\d_-]+)

Immediately following the preceding match, find the @ symbol, and then another set of one or more letters, numbers, underscore, and dashes. Once again, we wrap that set within parentheses in order to access it later.

\.[a-z]{2,4}/ig,

Continuing on, we find a single period (we must escape it with “\” due to the fact that, in regular expressions, it matches any character (sometimes excluding a line break). The last part is to find the “.com.” We know that the majority, if not all, domains will have a suffix range of two – four characters (com, edu, net, name, etc.). If we’re aware of that specific range, we can forego using a more generic symbol like * or +, and instead wrap the two numbers within curly braces, representing the minimum and maximum, respectively.

 '$1, $2')

This last part represents the second parameter of the replace method, or what we’d like to replace the matched sets with. Here, we’re using $1 and $2 to refer to what was stored within the first and second sets of parentheses, respectively. In this particular instances, $1 refers to “nettuts,” and $2 refers to “tutsplus.”

Creating our Own Location Object

For our final project, we’ll replicate the location object. For those unfamiliar, the location object provides you with information about the current page: the href, host, port, protocol, etc. Please note that this is purely for practice’s sake. In a real world site, just use the preexisting location object!

We first begin by creating our location function, which accepts a single parameter representing the url that we wish to “decode;” we’ll call it “loc.”

function loc(url) { }

Now, we can call it like so, and pass in a gibberish url :

var l = loc('http://www.somesite.com?somekey=somevalue&anotherkey=anothervalue#theHashGoesHere');

Next, we need to return an object which contains a handful of methods.

function loc(url) {       return {        } }

Search

Though we won’t create all of them, we’ll mimic a handful or so. The first one will be “search.” Using regular expressions, we’ll need to search the url and return everything within the querystring.

return {       search : function() {             return url.match(/\?(.+)/i)[1];                // returns "somekey=somevalue&anotherkey=anothervalue#theHashGoesHere"       } }

Above, we take the passed in url, and try to match our regular expressions against it. This expression searches through the string for the question mark, representing the beginning of our querystring. At this point, we need to trap the remaining characters, which is why the (.+) is wrapped within parentheses. Finally, we need to return only that block of characters, so we use [1] to target it.

Hash

Now we’ll create another method which returns the hash of the url, or anything after the pound sign.

hash : function() {       return url.match(/#(.+)/i)[1]; // returns "theHashGoesHere" },

This time, we search for the pound sign, and, once again, trap the following characters within parentheses so that we can refer to only that specific subset – with [1].

Protocol

The protocol method should return, as you would guess, the protocol used by the page – which is generally “http” or “https.”

protocol : function() {       return url.match(/(ht|f)tps?:/i)[0]; // returns 'http:' },

This one is slightly more tricky, only because there are a few choices to compensate for: http, https, and ftp. Though we could do something like – (http|https|ftp) – it would be cleaner to do:(ht|f)tps?
This designates that we should first find either an “ht” or the “f” character. Next, we match the “tp” characters. The final “s” should be optional, so we append a question mark, which signifies that there may be zero or one instance of the preceding character. Much nicer.

Href

For the sake of brevity, this will be our last one. It will simply return the url of the page.

href : function() {       return url.match(/(.+\.[a-z]{2,4})/ig); // returns "http://www.somesite.com" }

Here we’re matching all characters up to the point where we find a period followed by two-four characters (representing com, au, edu, name, etc.). It’s important to realize that we can make these expressions as complicated or as simple as we’d like. It all depends on how strict we must be.

Our Final Simple Function:

function loc(url) {       return {             search : function() {                   return url.match(/\?(.+)/i)[1];             },              hash : function() {                   return url.match(/#(.+)/i)[1];             },              protocol : function() {                   return url.match(/(ht|f)tps?:/)[0];             },              href : function() {                   return url.match(/(.+\.[a-z]{2,4})/ig);             }       } }

With that function created, we can easily alert each subsection by doing:

var l = loc('http://www.karthikeyans.co.in/?key=value#hash');  alert(l.href()); //
http://www.karthikeyans.co.in
 alert(l.protocol()); // http:  ...etc.

Background Info on Regular Expressions

This is what Wikipedia has to say about them:

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Now, that doesn’t really tell me much about the actual patterns. The regexes I’ll be going over today contains characters such as \w, \s, \1, and many others that represent something totally different from what they look like.

If you’d like to learn a little about regular expressions before you continue reading this article, I’d suggest watching the Regular Expressions for Dummies screencast series.

The eight regular expressions we’ll be going over today will allow you to match a(n): username, password, email, hex value (like #fff or #000), slug, URL, IP address, and an HTML tag. As the list goes down, the regular expressions get more and more confusing. The pictures for each regex in the beginning are easy to follow, but the last four are more easily understood by reading the explanation.

The key thing to remember about regular expressions is that they are almost read forwards and backwards at the same time. This sentence will make more sense when we talk about matching HTML tags.

Note: The delimiters used in the regular expressions are forward slashes, “/”. Each pattern begins and ends with a delimiter. If a forward slash appears in a regex, we must escape it with a backslash: “\/”.

Matching a Username

Pattern:

/^[a-z0-9_-]{3,16}$/

Description:

We begin by telling the parser to find the beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), an underscore, or a hyphen. Next, {3,16} makes sure that are at least 3 of those characters, but no more than 16. Finally, we want the end of the string ($).

String that matches:

my-us3r_n4m3

String that doesn’t match:

th1s1s-wayt00_l0ngt0beausername (too long)

Matching a Password

Pattern:

/^[a-z0-9_-]{6,18}$/

Description:

Matching a password is very similar to matching a username. The only difference is that instead of 3 to 16 letters, numbers, underscores, or hyphens, we want 6 to 18 of them ({6,18}).

String that matches:

myp4ssw0rd

String that doesn’t match:

mypa$$w0rd (contains a dollar sign)

Matching a Hex Value

Pattern:

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Next, a number sign is optional because it is followed a question mark. The question mark tells the parser that the preceding character — in this case a number sign — is optional, but to be “greedy” and capture it if it’s there. Next, inside the first group (first group of parentheses), we can have two different situations. The first is any lowercase letter between a and f or a number six times. The vertical bar tells us that we can also have three lowercase letters between a and f or numbers instead. Finally, we want the end of the string ($).

The reason that I put the six character before is that parser will capture a hex value like #ffffff. If I had reversed it so that the three characters came first, the parser would only pick up #fff and not the other three f’s.

String that matches:

#a3c113

String that doesn’t match:

#4d82h4 (contains the letter h)

Matching a Slug

Pattern:

/^[a-z0-9-]+$/

Description:

You will be using this regex if you ever have to work with mod_rewrite and pretty URL’s. We begin by telling the parser to find the beginning of the string (^), followed by one or more (the plus sign) letters, numbers, or hyphens. Finally, we want the end of the string ($).

String that matches:

my-title-here

String that doesn’t match:

my_title_here (contains underscores)

Matching an Email

Pattern:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Description:

We begin by telling the parser to find the beginning of the string (^). Inside the first group, we match one or more lowercase letters, numbers, underscores, dots, or hyphens. I have escaped the dot because a non-escaped dot means any character. Directly after that, there must be an at sign. Next is the domain name which must be: one or more lowercase letters, numbers, underscores, dots, or hyphens. Then another (escaped) dot, with the extension being two to six letters or dots. I have 2 to 6 because of the country specific TLD’s (.ny.us or .co.uk). Finally, we want the end of the string ($).

String that matches:

john@doe.com

String that doesn’t match:

john@doe.something (TLD is too long)

Matching a URL

Pattern:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Description:

This regex is almost like taking the ending part of the above regex, slapping it between “http://” and some file structure at the end. It sounds a lot simpler than it really is. To start off, we search for the beginning of the line with the caret.

The first capturing group is all option. It allows the URL to begin with “http://”, “https://”, or neither of them. I have a question mark after the s to allow URL’s that have http or https. In order to make this entire group optional, I just added a question mark to the end of it.

Next is the domain name: one or more numbers, letters, dots, or hypens followed by another dot then two to six letters or dots. The following section is the optional files and directories. Inside the group, we want to match any number of forward slashes, letters, numbers, underscores, spaces, dots, or hyphens. Then we say that this group can be matched as many times as we want. Pretty much this allows multiple directories to be matched along with a file at the end. I have used the star instead of the question mark because the star says zero or more, not zero or one. If a question mark was to be used there, only one file/directory would be able to be matched.

Then a trailing slash is matched, but it can be optional. Finally we end with the end of the line.

String that matches:

 

http://www.karthikeyans.co.in/

String that doesn’t match:

http://google.com/some/file!.html (contains an exclamation point)

Matching an IP Address

Pattern:

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

Description:

Now, I’m not going to lie, I didn’t write this regex; I got it from here. Now, that doesn’t mean that I can’t rip it apart character for character.

The first capture group really isn’t a captured group because

?:

was placed inside which tells the parser to not capture this group (more on this in the last regex). We also want this non-captured group to be repeated three times — the {3} at the end of the group. This group contains another group, a subgroup, and a literal dot. The parser looks for a match in the subgroup then a dot to move on.

The subgroup is also another non-capture group. It’s just a bunch of character sets (things inside brackets): the string “25″ followed by a number between 0 and 5; or the string “2″ and a number between 0 and 4 and any number; or an optional zero or one followed by two numbers, with the second being optional.

After we match three of those, it’s onto the next non-capturing group. This one wants: the string “25″ followed by a number between 0 and 5; or the string “2″ with a number between 0 and 4 and another number at the end; or an optional zero or one followed by two numbers, with the second being optional.

We end this confusing regex with the end of the string.

String that matches:

73.60.124.136 (no, that is not my IP address )

String that doesn’t match:

256.60.124.136 (the first group must be “25″ and a number between zero and five)

Matching an HTML Tag

Pattern:

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Description:

One of the more useful regexes on the list. It matches any HTML tag with the content inside. As usually, we begin with the start of the line.

First comes the tag’s name. It must be one or more letters long. This is the first capture group, it comes in handy when we have to grab the closing tag. The next thing are the tag’s attributes. This is any character but a greater than sign (>). Since this is optional, but I want to match more than one character, the star is used. The plus sign makes up the attribute and value, and the star says as many attributes as you want.

Next comes the third non-capture group. Inside, it will contain either a greater than sign, some content, and a closing tag; or some spaces, a forward slash, and a greater than sign. The first option looks for a greater than sign followed by any number of characters, and the closing tag. \1 is used which represents the content that was captured in the first capturing group. In this case it was the tag’s name. Now, if that couldn’t be matched we want to look for a self closing tag (like an img, br, or hr tag). This needs to have one or more spaces followed by “/>”.

The regex is ended with the end of the line.

String that matches:

<a href=”http://images.google.com/”>Nettuts+</a>

String that doesn’t match:

<img src=”img.jpg” alt=”My image>” /> (attributes can’t contain greater than signs)

  • Google Ads


  • Ads by Google