Prerequisite: Find Extended characters in Notepad++

Notepad++ Secrets: Find Regular Expressions

© 2012, Martin Rinehart

One of the features of the great old programming editors (with legendary Unix names like Vi and Emacs) was their ability to use regular expressions (aka regex) in search and replace operations. One of the great features of programmer's editor Notepad++ is that it matches these old veterans' regex strengths without hiding them in a forest of cryptic commands.

If you've never used regex before, we won't teach you too much in this one short page, but we'll get you started with a basic use. Regular expressions are used for matching string patterns. The "Find" part of a Find/Replace dialog is one simple example.

Suppose your HTML included width=120px. You want that upgraded to an XHTML-compatible attribute, width='120px'. (In HTML, the quotes are seldom needed; in XHTML, they are mandatory.) A simple Find/Replace would do the trick, but you've also got width=60px and lots of other widths. You need regex.

This is the most common regex case. You have some text to find, part of which is fixed and part of which is variable. You need to change the fixed part but preserve the variable part. So here's a tiny regex summary (there are whole books written on the subject!) to get you started.

Regex Pattern Matchers

Punctuation marks sometimes have a special meaning. Whether or not they are special, if you want the mark itself, precede it with a backslash. (If you are not sure, which is often the case, "escape" the punctuation mark—precede it with a backslash.)
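To see why escaping matters, here is a small sketch in Python, whose re module follows the same backslash rule. The sample strings are invented for illustration; Notepad++ itself is not involved.

```python
import re

samples = ["1.5", "125", "1x5"]

# An unescaped dot is special: it matches any single character.
loose = re.compile(r"1.5")
# An escaped dot has no special meaning: it matches only a period.
strict = re.compile(r"1\.5")

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # ['1.5', '125', '1x5']
print(strict_hits)  # ['1.5']

# Escaping a mark that is not special is harmless:
# both patterns find the same text.
plain = re.search(r"width=", "width=120px").group()
escaped = re.search(r"width\=", "width=120px").group()
print(plain == escaped)  # True
```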

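The single lowercase letters that identify special character classes (such as \d for a digit, \s for a whitespace character and \w for a word character) may be reversed in meaning by using the uppercase letter: \D, \S and \W match any character EXCEPT a digit, a whitespace character or a word character, respectively.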
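A quick way to see the lowercase/uppercase reversal is to try a pair of classes on a sample string. The sketch below uses Python's re module and assumes the usual Perl-style shorthand, where \d is a digit and \s is whitespace. The sample text is made up.

```python
import re

sample = "width=120px"

digits = re.findall(r"\d", sample)       # lowercase d: digits only
non_digits = re.findall(r"\D", sample)   # uppercase D: everything else

print("".join(digits))      # 120
print("".join(non_digits))  # width=px

# \s matches whitespace; \S matches any non-whitespace character.
chunks = re.findall(r"\S+", "tag  other=thing1\twidth=200px")
print(chunks)  # ['tag', 'other=thing1', 'width=200px']
```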

Multipliers

Some regex characters provide a repetition factor, called a "multiplier". The most common are * (zero or more of the preceding item), + (one or more) and ? (zero or one).
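As a rough illustration, assuming the three usual multipliers (*, + and ?), this Python sketch tests a few deliberately misspelled words against each one. The word list and the helper name matching are hypothetical.

```python
import re

candidates = ["wdth", "width", "wiidth", "wiiidth"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matching(r"wi*dth")      # * : zero or more "i"
plus = matching(r"wi+dth")      # + : one or more "i"
question = matching(r"wi?dth")  # ? : zero or one "i"

print(star)      # ['wdth', 'width', 'wiidth', 'wiiidth']
print(plus)      # ['width', 'wiidth', 'wiiidth']
print(question)  # ['wdth', 'width']
```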

Alternatives and Grouping

The | operator means "or". Parentheses group things. (a|b) means an "a" or a "b".
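Here is a minimal Python sketch of alternation inside a group. It assumes a made-up list of width values and accepts only three units. The \d+ in front means "one or more digits".

```python
import re

# One or more digits followed by exactly one of three units.
unit = re.compile(r"\d+(px|pt|em)")

samples = ["120px", "12pt", "2em", "50%"]
matched = [s for s in samples if unit.fullmatch(s)]

print(matched)  # ['120px', '12pt', '2em']
```

"50%" is rejected because "%" is not one of the listed alternatives.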

Character Classes

Enclosing a list of characters in brackets means, "match exactly one of these characters." Example: [iou]. This is a shorthand for (i|o|u).

The pattern d[iou]g matches "dig" or "dog" or "dug". It does not match "drag" or "dragon".

You may use hyphens to indicate ranges of characters. [A-Z] matches any uppercase letter.

A caret ("^") in the first position of a character class negates the class. It means "match any character EXCEPT one of these." d[^iou]g matches any d.g string ("d", then any character, then "g") except "dig", "dog" or "dug". [^A-Z] matches any character EXCEPT an uppercase letter.
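The character-class examples above can be checked with a short Python sketch. The word list is invented, and fullmatch is used so that each word must match the pattern from start to end.

```python
import re

words = ["dig", "dog", "dug", "dag", "d-g", "drag", "dragon"]

# Exactly one of i, o, u between "d" and "g".
one_of = [w for w in words if re.fullmatch(r"d[iou]g", w)]
# Exactly one character that is NOT i, o or u between "d" and "g".
none_of = [w for w in words if re.fullmatch(r"d[^iou]g", w)]

print(one_of)   # ['dig', 'dog', 'dug']
print(none_of)  # ['dag', 'd-g']

# A range, and the same range negated.
upper = re.findall(r"[A-Z]", "XHTML and html")
not_upper = "".join(re.findall(r"[^A-Z]", "XHTML and html"))

print(upper)      # ['X', 'H', 'T', 'M', 'L']
print(not_upper)  # prints " and html" (with a leading space)
```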

Backreferences

When a regex contains parentheses, the characters matched within the parentheses can be used later. This is called a "backreference". The characters matched by the group that starts with the first "(" are called "$1" (in Perl and languages that copy Perl closely), "RegExp.$1" in JavaScript, or "\1" in the Replace string in an editor such as Notepad++.
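Python happens to use the same \1 notation in its replacement strings, so it makes a convenient place to experiment. This is only a sketch with an invented sample string. The square brackets in the replacement are there just to make the two groups visible.

```python
import re

text = "width=120px"

# Two groups: the digits, then the unit.
match = re.search(r"width=(\d+)(px)", text)
first = match.group(1)
second = match.group(2)

print(first)   # 120
print(second)  # px

# In the replacement string, \1 and \2 stand for the captured text.
marked = re.sub(r"(\d+)(px)", r"[\1][\2]", text)
print(marked)  # width=[120][px]
```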

Using Regex with Backreferences

Let's return to our problem: we want to find all occurrences of "width=" followed by a width specification. We want to replace each one with the same specification, but with the width value in single quotes. To keep it simple, assume that your width specs are all on a single line and don't include embedded quotes.

The easy part is matching "width=". The regex for that is just width=. If you are not sure if the equal sign has a special regex meaning (it doesn't), a safer version is width\=.

But what next? You'll have some digits and the letters "px". That would be \d+px (assuming all your widths were in pixels). Time to create a test file and try this out.

Test Data

Test one: flush left
<div width=100px>

Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>

Test three: other width dimension
	<div width=50%>

Regex in Notepad++

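Selecting the regex Search Mode in Notepad++: open the Replace dialog (Search > Replace, or Ctrl+H) and choose "Regular expression" under Search Mode.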

Fixing the px

This specification will quote the pixel-delimited widths:

Quoting the pixel widths: Find what is width\=(\d+px) and Replace with is width='\1'.

Try that yourself. You should get the first two test cases correctly converted to quoted widths. (Yes, this is the hard way to fix two minor problems. But picture applying this to a large HTML file with dozens of widths.)
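The same substitution can be tried outside the editor. This sketch uses Python's re.sub and assumes the search pattern width\=(\d+px) with the replacement width='\1', which is what the description above implies. The variable names are hypothetical.

```python
import re

# The three test cases from the Test Data section.
test_data = """Test one: flush left
<div width=100px>

Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>

Test three: other width dimension
\t<div width=50%>
"""

# Group 1 captures the digits plus "px"; \1 puts them back inside quotes.
px_result = re.sub(r"width\=(\d+px)", r"width='\1'", test_data)
print(px_result)
```

Under these assumptions the first two widths come back quoted, while width=50% is left alone because "%" is not "px".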

Fixing All the Widths

What happens if some widths are in pixels and some in em or percent or points (and yes, there are other possibilities, too)? We could try to make an exhaustive list: (px|pt|%|...). But that is the hard way. A width specification must be terminated by the tag's closing > or by whitespace preceding another attribute. width\=([^\s>]+) should work. That's "width=", followed by one or more characters EXCEPT those specified in the class. The class specifies any whitespace character or a ">".

We told you at the start that regex was both powerful and cryptic. Here we have just a simple example and you're already seeing both. So let's fix all three test cases at once:

Quoting all the widths: Find what is width\=([^\s>]+) and Replace with is width='\1'.

That's better!

Test one: flush left
<div width='100px'>

Test two: not left, other attributes
other HTML here <tag other=thing1 width='200px' that=thing2>

Test three: other width dimension
	<div width='50%'>
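For readers who want to reproduce this result in a script, here is a minimal Python sketch. It assumes the pattern width\=([^\s>]+) from the text and the replacement width='\1'. The list of lines is the test data with the labels left out.

```python
import re

lines = [
    "<div width=100px>",
    "other HTML here <tag other=thing1 width=200px that=thing2>",
    "\t<div width=50%>",
]

# Group 1 is everything after "width=" up to whitespace or ">".
quoted = [re.sub(r"width\=([^\s>]+)", r"width='\1'", line) for line in lines]

for line in quoted:
    print(line)
```

One caution that follows from the pattern itself: a quote character is neither whitespace nor ">", so running the same replacement a second time would wrap the already quoted values again. Run it once, on a copy.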

Reiterate

Our search pattern is width.... That finds the string "width" (followed by "...", which is part of this explanation, not part of regex—it means we'll get there next).

We followed with an escaped punctuation mark: width\=.... That simply says: "width" followed by "an equal sign, no special meaning, please."

Then, we used parentheses to group a subexpression: width\=(...). That makes the subexpression available in the Replace string as \1. (If we had more groups, the second one would be \2 and so on. We could even have nested them, if we needed to. If you're not sure what goes in which group, count the open parentheses.)

Now let's look inside that subexpression: (...). We used a character class, followed by a plus sign: ([...]+). A character class matches a single character, one of the characters in the class. The plus sign is a multiplier that says "use one or more of the preceding."

Now we dive into that character class, shown above as [...]. It starts with a caret, [^...], meaning "use any characters EXCEPT those specified in this class." It continues with an escaped "s", \s, which is any whitespace character, and then with a ">" which is itself, a "greater than" sign. So the negated class will match any character EXCEPT whitespace or a ">".
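The walk-through above can also be written as one annotated pattern. This sketch relies on Python's re.VERBOSE flag, which lets a pattern be spread over several lines with comments. That is a Python convenience assumed here for illustration, not something the Notepad++ dialog is claimed to offer. The name WIDTH_ATTR is hypothetical.

```python
import re

WIDTH_ATTR = re.compile(
    r"""
    width\=     # the literal text "width=" (the escape is harmless)
    (           # open group 1, available as \1 in the replacement
      [^\s>]    # one character that is NOT whitespace and NOT ">"
      +         # multiplier: one or more of the preceding class
    )           # close group 1
    """,
    re.VERBOSE,
)

result = WIDTH_ATTR.sub(r"width='\1'", "<div width=50%>")
print(result)  # <div width='50%'>
```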

Summary

A lot of work? Yes. But if you pictured this as converting all the widths in a large HTML file, as we suggested above, you can easily see that it is a lot less work than making all these changes one at a time.

Now we want you to picture this regex-based find/replace used in combination with the Find in Files feature (see the menu above if you haven't been there yet). You test on a fragment, then a small file. It works. You back up your files, then click "Find in Files", check the box "In all sub-folders" and bingo! You've converted an entire website.
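For comparison, the same "test, back up, then convert every file" routine could be sketched as a script. This is not what Notepad++ does internally; it is only a hypothetical Python equivalent. The function name quote_widths_in_tree, the .html suffix, the .bak backup naming and the UTF-8 encoding are all assumptions.

```python
import re
import shutil
from pathlib import Path

WIDTH_ATTR = re.compile(r"width\=([^\s>]+)")

def quote_widths_in_tree(root, suffix=".html"):
    """Quote width values in every matching file under root.

    Each file that needs a change is first copied to <name>.bak,
    then rewritten. Returns the list of changed files.
    """
    changed = []
    # sorted() collects the file list before any backups are written.
    for path in sorted(Path(root).rglob("*" + suffix)):
        # newline="" keeps the file's original line endings intact.
        with path.open(encoding="utf-8", newline="") as f:
            original = f.read()
        updated = WIDTH_ATTR.sub(r"width='\1'", original)
        if updated != original:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            with path.open("w", encoding="utf-8", newline="") as f:
                f.write(updated)
            changed.append(path)
    return changed
```

A call such as quote_widths_in_tree("site_copy") would then process a copy of the site, where "site_copy" is a placeholder folder name.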

Now we ask again, a lot of work? Well, maybe it's a way to do in minutes what would otherwise take hours. Time to grind some of those high-mountain Arabica beans you roasted last night and make that perfect cup of Joe.


Feedback: MartinRinehart at gmail dot com.

# # #