Prerequisite: Find Extended characters in Notepad++
Notepad++ Secrets: Find Regular Expressions
© 2012, Martin Rinehart
One of the features of the great old programming editors (with legendary Unix names like Vi and Emacs) was their ability to use regular expressions (aka regex) in search and replace operations. One of the great features of programmer's editor Notepad++ is that it matches these old veterans' regex strengths without hiding them in a forest of cryptic commands.
If you've never used regex before, we won't teach you too much in this one short page, but we'll get you started with a basic use. Regular expressions are used for matching string patterns. The "Find" part of a Find/Replace dialog is one simple example.
Suppose your HTML included width=120px. You want that upgraded to an XHTML-compatible attribute, width='120px'. (In HTML, the quotes are seldom needed. In XHTML, they are mandatory.) A simple Find/Replace would do the trick, but you've also got width=60px and lots of other widths. You need regex.
This is the most common regex case. You have some text to find, part of which is fixed and part of which is variable. You need to change the fixed part but preserve the variable part. So here's a tiny regex summary (there are whole books written on the subject!) to get you started.
Regex Pattern Matchers
a     the letter "a" matches itself; most characters match just themselves
.     the period matches any single character (normally not a line break)
\s    (two characters, read as "escape s") matches any whitespace (space, tab, return or newline)
\d    matches any digit, 0 through 9
\w    (w stands for "word") matches any alphabetic character (upper or lowercase), digit (0 through 9) or underscore
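If you'd like to try these matchers outside the editor, here is a minimal sketch. It assumes Python's re module as a stand-in for Notepad++ (an assumption on our part, not something this page depends on); for these basic matchers the syntax is the same. The sample string is made up for illustration.

```python
import re

# ASCII-only sample, so Python and the editor agree on what \d and \w accept.
sample = "id_7 = 42"

literal_d = re.findall(r"d", sample)      # a plain letter matches itself
any_char = re.findall(r".", "ab c")       # every character, including the space
whitespace = re.findall(r"\s", sample)    # the two spaces
digits = re.findall(r"\d", sample)        # 7, 4 and 2
word_chars = re.findall(r"\w", sample)    # letters, digits and the underscore

print(literal_d, any_char, whitespace, digits, word_chars)
```

Each findall call returns the list of single characters that its matcher accepted.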
Punctuation marks sometimes have a special meaning. Whether or not they are special, if you want the mark itself, precede it with a backslash. (If you are not sure, which is often the case, "escape" the punctuation mark—precede it with a backslash.)
\+    a plus sign
The single, lowercase letters that identify special character classes may be reversed in meaning by using the uppercase letter:
\s    any whitespace character
\S    any non-whitespace character
\d    any decimal digit
\D    any character EXCEPT a decimal digit
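Here is a small sketch of escaping and of the uppercase classes, again assuming Python's re module rather than Notepad++ itself. The sample text is invented. The re.escape helper is a Python convenience with no counterpart in the Find dialog; it is shown only because it automates the "when in doubt, escape it" advice.

```python
import re

text = "2 + 2 = 4"

plus_signs = re.findall(r"\+", text)   # escaped: a literal plus sign
non_space = re.findall(r"\S", text)    # everything except the spaces
non_digits = re.findall(r"\D", text)   # everything except 2, 2 and 4

# When unsure which marks are special, re.escape adds the backslashes for you.
escaped = re.escape("1+1=2")
matches_literally = re.fullmatch(escaped, "1+1=2") is not None

print(plus_signs, non_space, non_digits, matches_literally)
```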
Some regex characters provide a repetition factor, called a "multiplier" (most references call these "quantifiers"). A multiplier applies to whatever immediately precedes it. The most common are:
+      (one or more)
\d+    is "one or more digits", a pattern that matches any unsigned decimal integer, such as 0, 60 or 120
*      (zero or more)
.*     is "zero or more of any character"
?      (zero or one) the preceding character is optional
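The three multipliers can be tried out with a sketch like this one. It assumes Python's re module in place of the editor, and the sample strings (including the "colour" spelling test) are our own invention.

```python
import re

text = "x=7, y=120, z="

one_or_more = re.findall(r"\d+", text)      # runs of digits: 7 and 120
zero_or_more = re.findall(r"=\d*", text)    # "=" with or without digits after it
any_run = re.search(r"y.*", text).group()   # "y" plus everything after it on the line
optional_u = re.findall(r"colou?r", "color colour colr")  # the "u" may be absent

print(one_or_more, zero_or_more, any_run, optional_u)
```

Note how the last "=" in the sample still matches =\d* because "zero digits" is allowed, while \d+ finds nothing there.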
Alternatives and Grouping
The | operator means "or". Parentheses group things. (a|b) means an "a" or a "b".
Enclosing a list of characters in brackets means, "match exactly one of these characters." Example: [iou]. This is a shorthand for (i|o|u). d[iou]g matches "dig" or "dog" or "dug". It does not match "drag" or "dragon".
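A quick way to convince yourself that a bracketed list and an alternation of single characters select the same strings is a sketch like the following. It assumes Python's re module as a stand-in for the editor; the word list is made up.

```python
import re

words = ["dig", "dog", "dug", "drag", "dragon", "dag"]

# A character class: exactly one of i, o or u between "d" and "g".
by_class = [w for w in words if re.search(r"d[iou]g", w)]

# The same thing spelled out with the "or" operator and parentheses.
by_alternation = [w for w in words if re.search(r"d(i|o|u)g", w)]

print(by_class)
print(by_alternation)
```

Both lists come out the same; "drag", "dragon" and "dag" are rejected by either spelling.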
You may use hyphens to indicate ranges of characters. [A-Z] matches any uppercase letter.
A caret ("^") in the first position of a character class negates the class. It means "match any character EXCEPT one of these." d[^iou]g matches any d.g string ("d", then any character, then "g") except "dig", "dog" or "dug".
[^A-Z] matches any character EXCEPT an uppercase letter.
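Ranges and negated classes can be checked the same way. This is a minimal sketch assuming Python's re module instead of Notepad++, with sample strings of our own choosing.

```python
import re

uppercase = re.findall(r"[A-Z]", "Notepad++ on Windows")   # only the capitals
not_upper = re.findall(r"[^A-Z]", "AbC")                    # everything but capitals

candidates = ["dig", "dag", "d g", "dog", "dug", "deg"]
# fullmatch insists that the whole word fits the pattern.
negated = [w for w in candidates if re.fullmatch(r"d[^iou]g", w)]

print(uppercase, not_upper, negated)
```

The negated class accepts even a space in the middle position, because a space is not one of "i", "o" or "u".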
Using Regex with Backreferences
Let's return to our problem: we want to find all occurrences of "width=" followed by a width specification. We want to replace each one with the same specification, but with the width value in single quotes. To keep it simple, assume that your width specs are all on a single line and don't include embedded quotes.
The easy part is matching "width=". The regex for that is just width=. If you are not sure if the equal sign has a special regex meaning (it doesn't), a safer version is width\=.
But what next? You'll have some digits and the letters "px". That would be \d+px (assuming all your widths were in pixels). Time to create a test file and try this out.
Test one: flush left
<div width=100px>
Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>
Test three: other width dimension
<div width=50%>
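Before replacing anything, it can help to see what the pixel pattern actually finds in the test file. This sketch assumes Python's re module in place of the editor's Find dialog, and the line breaks in the test text are our guess at how the file is laid out.

```python
import re

# The test file from above; the exact line breaks are an assumption.
test_file = """Test one: flush left
<div width=100px>
Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>
Test three: other width dimension
<div width=50%>
"""

# "width", an escaped equal sign, one or more digits, then "px".
pixel_widths = re.findall(r"width\=\d+px", test_file)

print(pixel_widths)
```

Only the first two test cases are found. The percent width in test three does not end in "px", so this pattern skips it.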
Regex in Notepad++
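This specification will quote the widths given in pixels. In the Replace dialog, choose the "Regular expression" search mode, then enter:
Find what: width\=(\d+px)
Replace with: width='\1'
The parentheses capture the width value so that \1 in the replacement can put it back.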
Try that yourself. You should get the first two test cases correctly converted to quoted widths. (Yes, this is the hard way to fix two minor problems. But picture applying this to a large HTML file with dozens of widths.)
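If you want to rehearse the pixel-only replacement outside the editor, a rough equivalent looks like this. It assumes Python's re module, and it assumes the search pattern is width\=(\d+px) with the replacement width='\1', which is what the pieces described so far add up to. The test text's line breaks are also our assumption.

```python
import re

test_file = """Test one: flush left
<div width=100px>
Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>
Test three: other width dimension
<div width=50%>
"""

# The parentheses capture the value; \1 in the replacement puts it back.
quoted_px = re.sub(r"width\=(\d+px)", r"width='\1'", test_file)

print(quoted_px)
```

The two pixel widths come back quoted, and the percent width in test three is left exactly as it was.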
Fixing All the Widths
What happens if some widths are in pixels and others in em, percent or points (and yes, there are other possibilities, too)? We could try to make an exhaustive list: (px|pt|%|...). But that is the hard way. A width specification must be terminated by the tag's closing > or by whitespace preceding another attribute. width\=([^\s>]+) should work. That's "width=", followed by one or more characters EXCEPT those specified in the class. The class specifies any whitespace character or a ">".
We told you at the start that regex was both powerful and cryptic. Here we have just a simple example and you're already seeing both. So let's fix all three test cases at once:
Test one: flush left
<div width='100px'>
Test two: not left, other attributes
other HTML here <tag other=thing1 width='200px' that=thing2>
Test three: other width dimension
<div width='50%'>
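The same fix can be rehearsed in a few lines of code. This is a sketch assuming Python's re module rather than Notepad++, with width='\1' assumed as the replacement string; the test text's line breaks are our guess.

```python
import re

test_file = """Test one: flush left
<div width=100px>
Test two: not left, other attributes
other HTML here <tag other=thing1 width=200px that=thing2>
Test three: other width dimension
<div width=50%>
"""

pattern = r"width\=([^\s>]+)"   # "width=", then anything up to whitespace or ">"
replacement = r"width='\1'"     # the captured value, wrapped in single quotes

fixed = re.sub(pattern, replacement, test_file)
print(fixed)

# Caution: the pattern also matches values that are already quoted.
twice = re.sub(pattern, replacement, fixed)
```

One thing this makes easy to see: the pattern does not know whether a value is already quoted, so a second pass over the same text wraps the values again (width=''100px''). Run the replacement once per file.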
Our search pattern starts with width... which finds the string "width". (The "..." is part of this explanation, not part of the regex; it means we'll get there next.)
We followed with an escaped punctuation mark: width\=... That simply says: "width" followed by "an equal sign, no special meaning, please."
Then, we used parentheses to group a subexpression: width\=(...). That makes the subexpression available in the Replace string as \1. (If we had more groups, the second one would be \2 and so on. We could even have nested them, if we needed to. If you're not sure what goes in which group, count the open parentheses.)
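Counting open parentheses is easier to see with nested groups in front of you. This sketch assumes Python's re module as a stand-in for the editor; the nested pattern and the sample tag are our own, chosen only to show the numbering.

```python
import re

# Open parentheses, left to right: (1) the whole value, (2) digits, (3) unit.
match = re.search(r"width\=((\d+)(px|pt|%))", "<div width=120px>")

whole_value = match.group(1)
number = match.group(2)
unit = match.group(3)

# Two side-by-side groups, both reused in the replacement string.
rebuilt = re.sub(r"width\=(\d+)(px)", r"width='\1\2'", "width=120px")

print(whole_value, number, unit, rebuilt)
```

The outer pair of parentheses opens first, so it is group 1 even though it closes last.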
Now let's look inside that subexpression: (...). We used a character class, followed by a plus sign: ([...]+). A character class matches a single character, one of the characters in the class. The plus sign is a multiplier that says "use one or more of the preceding."
Now we dive into that character class, shown above as [...]. It starts with a caret, [^...], meaning "use any characters EXCEPT those specified in this class." It continues with an escaped "s", \s, which is any whitespace character, and then with a ">" which is itself, a "greater than" sign. So the negated class will match any character EXCEPT whitespace or a ">".
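To watch the negated class work on its own, try a sketch like this one. It assumes Python's re module in place of the editor, and the two sample tags are borrowed from the test cases.

```python
import re

# The class plus its multiplier, with nothing in front: it grabs each run of
# characters that contains neither whitespace nor ">".
pieces = re.findall(r"[^\s>]+", "<div width=50%>")

# In the full pattern, the run stops at the space before the next attribute.
value = re.search(r"width\=([^\s>]+)",
                  "<tag other=thing1 width=200px that=thing2>").group(1)

print(pieces, value)
```

In the first case the run ends at the closing ">", in the second at the space before "that=thing2". Those are the two terminators the class was built around.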
A lot of work? Yes. But if you pictured this as converting all the widths in a large HTML file, as we suggested above, you can easily see that it is a lot less work than making all these changes one at a time.
Now we want you to picture this regex-based find/replace used in combination with the Find in Files feature (see the menu above if you haven't been there yet). You test on a fragment, then a small file. It works. You back up your files, then click "Find in Files", check the box "In all sub-folders" and bingo! You've converted an entire website.
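For readers who would rather script the whole-site pass, here is a rough sketch of the same idea. It is not how Notepad++ does it; it assumes Python's re, pathlib and shutil modules, and the function name quote_widths_in_tree, the ".bak" backup suffix and the throwaway demo folder are all hypothetical choices of ours.

```python
import re
import shutil
import tempfile
from pathlib import Path

WIDTH = re.compile(r"width\=([^\s>]+)")

def quote_widths_in_tree(root: Path) -> int:
    """Quote width values in every .html file under root; return files changed."""
    changed = 0
    for path in sorted(root.rglob("*.html")):        # includes all sub-folders
        original = path.read_text(encoding="utf-8")
        updated = WIDTH.sub(r"width='\1'", original)
        if updated != original:
            # Keep a copy of the untouched file next to it (hypothetical suffix).
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed

# Demo on a throwaway folder, so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "sub").mkdir()
    (site / "index.html").write_text("<div width=100px>", encoding="utf-8")
    (site / "sub" / "page.html").write_text("<div width=50%>", encoding="utf-8")
    (site / "notes.txt").write_text("width=1px", encoding="utf-8")

    files_changed = quote_widths_in_tree(site)
    index_after = (site / "index.html").read_text(encoding="utf-8")
    page_after = (site / "sub" / "page.html").read_text(encoding="utf-8")
    notes_after = (site / "notes.txt").read_text(encoding="utf-8")
    backup_exists = (site / "index.html.bak").exists()

print(files_changed, index_after, page_after, notes_after, backup_exists)
```

As with the editor, test on a fragment first and run the pass only once: the pattern would quote already-quoted values a second time.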
Now we ask again, a lot of work? Well, maybe it's a way to do in minutes what would otherwise take hours. Time to grind some of those high-mountain Arabica beans you roasted last night and make that perfect cup of Joe.
Feedback: MartinRinehart at gmail dot com.
# # #