Readex—Readable Regular Expressions

© 2012, Martin Rinehart

This article and the compiler are dedicated to Stephen Kleene, Ph.D. Professor Kleene was the American mathematician who invented, among other things, regular expressions.

Readex to Regex Compiler


Go ahead and click the compile button. Your readex has now been compiled into regex, ready to copy/paste into your program.

Example

The example readex in the compiler is a trivially simple email address, but it illustrates the important points. Our simple email address is a name, an "@" sign, another name, a type and an optional country code. Types (top-level domains if country codes are not supplied) are one of ".com" or ".gov" and optional country codes are ".us" or ".ca".

Example Regex

A regex to match our simplified example pattern would be:

/^[^@]+@[^\.]+((\.com)|(\.gov))((\.us)|(\.ca))?$/

The regex is basically unreadable. To be useful in code, regex must be extensively commented.

For those who are not expert regexers, I'll explain a bit. / is start of regex. ^ outside a character class is BOS (beginning of string). The [] enclose a character class. The ^ at the start of a character class means "any character not in this class", so [^@] means any character except the "@" sign. The plus sign means "one or more." So [^@]+ means one or more non-"@" sign characters.
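As a quick check of the explanation so far, the example regex can be tried against a few strings. The sample addresses below are made up for illustration; only the regex itself comes from the article.

```javascript
// The example regex, copied unchanged from the article.
const simpleEmail = /^[^@]+@[^\.]+((\.com)|(\.gov))((\.us)|(\.ca))?$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
    "whomever@wherever.com",   // name, "@", name, type
    "someone@agency.gov.us",   // same, plus a country code
    "nobody@nowhere.org",      // ".org" is not one of the two types
    "missing-at-sign.com"      // no "@" sign at all
];

const results = samples.map(function (s) { return simpleEmail.test(s); });
console.log(results);   // [ true, true, false, false ]
```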

Seven characters explained and 42 to go. Let's turn to the readex.

Example Readex

A readex is a regular expression, but written in a form that is more readable. The same pattern, expressed as a readex, is:


/* whomever@wherever.com (or .gov, optionally .us or .ca)
BOS
1-      ^@
1       @
1-      ^.
1       ((.com)|(.gov))
0-1     ((.us)|(.ca))
EOS
*/

Readex start with a begin-multiline-comment pair, /*, and end with an end-multiline-comment pair, */. They optionally include the beginning-of-string and end-of-string markers, BOS and EOS. All of these may be followed by comments.

The hard-working middle readex lines all have this format:


multiplier      pattern     [// optional comment]

Multipliers come in three flavors. A 1 multiplier means exactly one. 1- means one or more. 0-1 means zero or one (an optional item). You can use any numbers you like, not just the zeros and ones in the example. Now on to patterns.
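The three flavors map onto regex quantifiers in a mechanical way. The following is a minimal sketch of that mapping, not the code from the compiler; toQuantifier is a hypothetical name, and the choice to prefer the short forms *, + and ? is an assumption.

```javascript
// Hypothetical helper: turn a readex multiplier into a regex quantifier.
function toQuantifier(multiplier) {
    // Short forms that regex provides for the most common cases.
    const shortcuts = { "1": "", "0-": "*", "1-": "+", "0-1": "?" };
    if (Object.prototype.hasOwnProperty.call(shortcuts, multiplier)) {
        return shortcuts[multiplier];
    }
    // General forms: "m", "m-" and "m-n".
    const parts = /^([0-9]+)(-([0-9]*))?$/.exec(multiplier);
    if (parts === null) {
        throw new Error("not a multiplier: " + multiplier);
    }
    if (parts[2] === undefined) {
        return "{" + parts[1] + "}";                 // exactly m
    }
    return "{" + parts[1] + "," + parts[3] + "}";    // m or more, or m through n
}

console.log(toQuantifier("1-"));    // +
console.log(toQuantifier("6"));     // {6}
console.log(toQuantifier("2-5"));   // {2,5}
```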

A "@" character pattern means an "@" sign. The caret is a unary not, so ^@ means any character except the "@" sign. Our first four lines are Beginning of String; one or more non-"@" characters; one "@" character and one or more non-period characters.

If you guess that the vertical bar means "or" you can read the rest of the readex without any more help from me. Readex are much simpler than regex.

Examples from the Compiler

The readex to regex compiler is a superb example of how regex can help make short work of complex pattern matching. It took just a morning to write the compiler and much of this article. Thank you, Dr. Kleene! Unfortunately, the compiler was also a good example of how totally unreadable traditional regex are. After the compiler was working I wrote the readex for all its regex, compiled the readex and left the readex in the code for documentation. The compiler is now reasonably readable. (After you finish this section, use your browser's View Source option. The code is at the end of this file's <body>.)

Non-Multiplier Readex

The starting comment characters use this simple readex:


/* first line of every JavaScript readex
BOS
0-      ws
1       /*
*/

After the comment is BOS, beginning of string. Without it, this readex would match any string that included "/*". We want "/*" on the left, whitespace excepted. This readex does not match any characters after "/*" which leaves the rest of the line for comment. A what-this-readex-does comment on the top line is a good practice.

Whitespace, ws, is one of the predefined readex character classes. It matches any space, tab, formfeed and so on. 0- ws matches zero or more whitespace characters.

The 1 /* matches one occurrence of the character string "/*". (Note to regex writers: those are not metacharacters in readex.) The readex for "BOS", "EOS" and "*/" are the same except for the character string they match.
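For comparison, here is a hand translation of that readex into a JavaScript regex, with a few made-up lines to try it on. The translation is a sketch; the regex the compiler actually emits may be written differently.

```javascript
// Hand translation: BOS, then 0- ws, then 1 /*
// In regex the slash and the asterisk must both be escaped.
const readexStart = /^\s*\/\*/;

// Hypothetical test lines.
const startChecks = [
    readexStart.test("/* first line of every JavaScript readex"),
    readexStart.test("    /* indented start"),
    readexStart.test("x = 1; /* not on the left */")
];
console.log(startChecks);   // [ true, true, false ]
```

The third line fails because BOS anchors the match to the left edge, which is the point made above.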

Multipliers

There are three multiplier forms: "digit(s)", "digit(s)-" and "digit(s)-digit(s)". Here's the readex for the just-a-number multiplier, including optional leading whitespace:


/*
BOS
0-      ws
1-      [0-9]     // the number
1-      ws
*/

The multiplier will be matched by one of these patterns:


// a single number followed by whitespace
1-      [0-9]
1-      ws

// a single number followed by a hyphen, then whitespace
1-      [0-9]
1       "-"
1-      ws

// number hyphen number, then whitespace
1-      [0-9]
1       "-"
1-      [0-9]
1-      ws

[0-9] is a "character class" (explained below) that matches any decimal digit.
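The three patterns can be hand translated and used to decide which multiplier form starts a readex line. This is a sketch under assumptions: multiplierForm is a hypothetical name, and the optional leading whitespace from the earlier readex is included in each regex.

```javascript
// One regex per multiplier form, hand translated from the patterns above.
const multiplierForms = [
    { name: "m",   regex: /^\s*[0-9]+\s+/ },          // number, whitespace
    { name: "m-",  regex: /^\s*[0-9]+-\s+/ },         // number, hyphen, whitespace
    { name: "m-n", regex: /^\s*[0-9]+-[0-9]+\s+/ }    // number, hyphen, number, whitespace
];

// Hypothetical helper: report which form, if any, begins the line.
function multiplierForm(line) {
    for (const form of multiplierForms) {
        if (form.regex.test(line)) { return form.name; }
    }
    return null;
}

console.log(multiplierForm("1       @"));               // m
console.log(multiplierForm("1-      ^@"));              // m-
console.log(multiplierForm("0-1     ((.us)|(.ca))"));   // m-n
```

Because each pattern requires whitespace right after the multiplier, at most one of the three can match a given line.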

Patterns

Readex and regex allow "character classes". They match if the current character matches any character in the class. [0-9] is shorthand for [0123456789]. In readex, the caret is always the unary not. If you want to match any character except decimal digits you would specify ^[0-9]. That compiles to regex [^0-9].

Readex predefined character classes are substituted first. The pattern throws will become regex thro\s ("thro" followed by a whitespace character). To include the characters "w" and "s" together use two readex lines, one for throw and another for s.
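A small sketch makes both rules concrete: the caret as unary not, and the substitution of ws wherever it appears. compilePattern is a hypothetical function covering only these two rules; it ignores ac, quoting and regex escaping, and the compiler's own code may be organized quite differently.

```javascript
// Hypothetical sketch: translate negation and the ws class in one pass.
function compilePattern(pattern) {
    // Alternatives are tried left to right, so "^ws" wins over "ws".
    return pattern.replace(/\^ws|ws|\^\[|\^[^\[]/g, function (token) {
        if (token === "^ws") { return "\\S"; }     // not whitespace
        if (token === "ws")  { return "\\s"; }     // whitespace
        if (token === "^[")  { return "[^"; }      // ^[0-9] becomes [^0-9]
        return "[^" + token.charAt(1) + "]";       // ^x becomes [^x]
    });
}

console.log(compilePattern("^[0-9]"));   // [^0-9]
console.log(compilePattern("^@"));       // [^@]
console.log(compilePattern("throws"));   // thro\s
```

The last line shows the pitfall described above: the "ws" inside "throws" is treated as the whitespace class.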

Groups

Readex and regex allow you to insert parentheses to create groups. Content matched by the first parenthesized group is available after a successful match in the variable $1. For the second group it's $2 and so on. (In JavaScript these are returned as RegExp.$1, RegExp.$2 and so on.) Readex groups may span multiple lines. Groups may nest. $1 is the group started with the first left parenthesis.

Our readex with groups is this:


/*
BOS
0-      ws
(1-     ^ws)       // $1, the multiplier
1-      ws
(1-     ^ws)       // $2, the pattern
0-      ws
EOS
*/
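A hand translation of this readex shows the two groups at work. The sketch below reads the groups from the array returned by match rather than from RegExp.$1 and RegExp.$2, which current JavaScript documentation treats as legacy features; the sample line is taken from the email readex.

```javascript
// Hand translation: BOS, 0- ws, (1- ^ws), 1- ws, (1- ^ws), 0- ws, EOS
const lineParts = /^\s*(\S+)\s+(\S+)\s*$/;

const partsFound = "1-      ^@".match(lineParts);
console.log(partsFound[1]);   // 1-   (group 1, the multiplier)
console.log(partsFound[2]);   // ^@   (group 2, the pattern)
```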

If you are looking critically, you see that the above readex doesn't allow for an optional open parenthesis preceding the multiplier. (Optional whitespace between the parenthesis and the number would be nice, too.) The compiler uses a simple readex to pick off the open parenthesis:


/*
BOS
0-      ws
1       "("
(0-     ac)     // $1 = the rest of the line
*/

That introduces two new concepts. First, if a metacharacter, such as an open parenthesis, is required to be itself, not a metacharacter, it must be enclosed in double quotes. (In regex it would be "escaped", that is, preceded by a backslash character. Enclosing in double quotes is more readable, save for the odd """ when you need the double quote character as itself.) These characters are readex metacharacters: [ ] ( ) ^ - | and the double quote.

The second new item in our example is the new class ac. Readex recognizes ac as another character class meaning "any character".

Readex in Code

As of this writing, JavaScript doesn't support multi-line strings, so we can't write a function that operates directly on readex. However, the readex makes an excellent comment. For example:


        // remove pre-multiplier open paren
        /*
        BOS
        0-      ws
        1       "("
        (0-     ac)     // $1 = the rest of the line
        */
        if ( line.match( /^\s*\((.*)/  ) ) {
            regex_text += "(";
            line = RegExp.$1;
        }
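
The fragment above depends on variables from the surrounding compiler. To try it in isolation, it can be wrapped in a function. This is a sketch, with removeOpenParen as a hypothetical name; it returns the updated values instead of changing outer variables, and reads the group from the match array.

```javascript
// Hypothetical standalone wrapper around the fragment above.
// BOS, 0- ws, 1 "(", (0- ac): group 1 is the rest of the line.
function removeOpenParen(line, regexText) {
    const found = line.match(/^\s*\((.*)/);
    if (found) {
        return { regexText: regexText + "(", line: found[1] };
    }
    return { regexText: regexText, line: line };   // no leading paren
}

const afterParen = removeOpenParen("(1-     ^ws)", "^\\s*");
console.log(afterParen.regexText);   // ^\s*(
console.log(afterParen.line);        // 1-     ^ws)
```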

Readex are, compared to regex at least, readable. They are also easier to write. In a language like Python or Ruby I would write my readex as multi-line strings and pass them directly to a function that compiles them. In Python it would look like this:


match_range = readex( '''
        BOS
        0-      ws
        (1-     [0-9])  // $1
        1       "-"
        (1-     [0-9])  // $2
        1-      ws
        (0-     ac)     // $3 ''')

In the above, match_range is the regex equivalent of the readex that matches "digit(s)-digit(s)". In JavaScript we don't have multi-line strings. Next best is to write the readex as a comment, copy it into the compiler and paste the compiled version back into the code. Extra steps? Yes. But the result is well-documented regex. (With readex I've stopped trying to read the regex. I just read the readex.)
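Template literals, added to JavaScript in ES2015 after this article was written, do allow multi-line strings, so a call shaped like the Python one is now possible. The following is a much-reduced sketch and not the article's compiler: readex, quantifier and patternToRegex are hypothetical names, and only BOS/EOS, the three multiplier forms, ws, ac, bracketed classes, double-quoted literals, single-line groups and trailing comments are handled. The hyphen is quoted, as the metacharacter rules require.

```javascript
// Hypothetical, much-reduced readex() for illustration only.
function quantifier(multiplier) {
    const named = { "1": "", "0-": "*", "1-": "+", "0-1": "?" };
    if (multiplier in named) { return named[multiplier]; }
    return "{" + multiplier.replace("-", ",") + "}";   // m, m- or m-n
}

function patternToRegex(pattern) {
    if (pattern === "ws") { return "\\s"; }
    if (pattern === "ac") { return "."; }
    if (/^\[[^\]]+\]$/.test(pattern)) { return pattern; }   // e.g. [0-9]
    const quoted = /^"(.+)"$/.exec(pattern);
    if (quoted) {
        // Escape regex metacharacters so the quoted text matches itself.
        return quoted[1].replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
    throw new Error("pattern not handled by this sketch: " + pattern);
}

function readex(text) {
    let source = "";
    for (const rawLine of text.split("\n")) {
        let line = rawLine.replace(/\s+\/\/.*$/, "").trim();   // drop comment
        if (line === "") { continue; }
        if (line === "BOS") { source += "^"; continue; }
        if (line === "EOS") { source += "$"; continue; }
        // In this sketch a group is "(" before the multiplier and ")" at line end.
        const grouped = /^\(\s*[0-9]/.test(line) && line.endsWith(")");
        if (grouped) { line = line.slice(1, -1).trim(); }
        const parts = /^([0-9]+(?:-[0-9]*)?)\s+(.+)$/.exec(line);
        if (parts === null) { throw new Error("cannot parse: " + rawLine); }
        const piece = patternToRegex(parts[2]) + quantifier(parts[1]);
        source += grouped ? "(" + piece + ")" : piece;
    }
    return new RegExp(source);
}

const matchRange = readex(`
        BOS
        0-      ws
        (1-     [0-9])  // $1
        1       "-"
        (1-     [0-9])  // $2
        1-      ws
        (0-     ac)     // $3 `);

console.log(matchRange.source);   // ^\s*([0-9]+)-([0-9]+)\s+(.*)
```

Multi-line groups, nested groups, alternation and negation are left out to keep the sketch short.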

Readex and Regex in Detail

Hardcore Perl hackers will see immediately that my regex is a small subset of their regex. I write, and readex supports, a regex subset that is common to languages such as JavaScript, Python and Ruby.

Readex Compared to Regex

Category              Readex                                    Regex
--------------------  ----------------------------------------  ----------------------------------------
Format                /*                                        /[^]pattern1multiplier1pattern2multiplier2...[$]/
                      [BOS]
                      multiplier1     pattern1  [//comment1]    Some implementations support separate
                      multiplier2     pattern2  [//comment2]    lines for each pattern-multiplier pair,
                      ...                                       a big step in the right direction.
                      [EOS]                                     Some even allow embedded comments.
                      */

Multipliers           Specification  Meaning                    Specification  Meaning
                      m              exactly m                  (none)         defaults to 1
                      m-             m or more                  *              zero or more
                      m-n            m through n, inclusive     +              one or more
                                                                ?              zero or one
                                                                {m}            exactly m
                                                                {m,}           m or more
                                                                {m,n}          m through n, inclusive
                                                                {,n}           not more than n (JavaScript needs {0,n})

Multiplier            1                                         (none)
Equivalences          0-                                        *
                      1-                                        +
                      0-1                                       ?
                      m                                         {m}
                      m-                                        {m,}
                      m-n                                       {m,n}
                      0-n                                       {,n}

Predefined            ws                                        \s
Character             ^ws                                       \S
Classes               ac                                        .
                      [0-9]                                     \d
                      (regex classes may be used¹)              (many more, may be specified in readex¹)

Negated               ^[...]                                    [^...]
Character
Classes

Negated               ^x                                        [^x]
Characters

Meta                  [ ] ( ) ^ - |                             [ ] ( ) ^ - |
Characters            readex only: "                            regex only: * + ? { } $ / \ .

¹ Just because you can do something doesn't mean it's a good idea. Larding the readex patterns with numerous regex patterns does not improve readability. (Do you know what \w means? I do, but it's cryptic.) See "Unimplemented Feature", below, for a better way.

The first version of the compiler tipped the scales at about 400 lines. I created this table after the compiler. Studying this table gave me several ideas, which I used to write a second version. The later version, with readex documenting every regex, tips the scales at about 230 lines.

Unimplemented Feature

In my design it says that you can define named character classes at the top of a readex, like this:


/* HTML color spec
hexdigit = [0-9a-fA-F]
1       #
6       hexdigit
*/

That would be nice, no? I'll get after the programmer to add it.
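One way such a feature might work is as a preprocessing pass that runs before the rest of the compiler. The sketch below is an assumption about the design, not the article's implementation: expandNamedClasses is a hypothetical name, and definition lines are taken to have the form "name = class" with no trailing comment.

```javascript
// Hypothetical preprocessing pass for named character classes.
function expandNamedClasses(lines) {
    const definitions = {};
    const output = [];
    for (const line of lines) {
        // A definition line looks like: hexdigit = [0-9a-fA-F]
        const def = /^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(\S+)\s*$/.exec(line);
        if (def) {
            definitions[def[1]] = def[2];
            continue;                      // definitions are not emitted
        }
        let expanded = line;
        for (const name of Object.keys(definitions)) {
            const wholeWord = new RegExp("\\b" + name + "\\b", "g");
            // A function replacement avoids special handling of "$".
            expanded = expanded.replace(wholeWord, function () {
                return definitions[name];
            });
        }
        output.push(expanded);
    }
    return output;
}

const colorSpec = expandNamedClasses([
    "/* HTML color spec",
    "hexdigit = [0-9a-fA-F]",
    "1       #",
    "6       hexdigit",
    "*/"
]);
console.log(colorSpec[2]);   // 6       [0-9a-fA-F]
```

Compiled with the multiplier rules described earlier, the expanded readex would correspond to the regex /#[0-9a-fA-F]{6}/.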

Readex in Other Languages

These days I'm doing JavaScript so readex were born in JavaScript. I do hope that readex get implemented in lots of other languages. If you are good with X, writing a readex compiler for X would be great. It's not a big project. My compiler is about 230 lines of JavaScript, many of them readex comments. You can reuse all the readex I wrote for the compiler, so you're way ahead.

But you've got to follow these two rules:

1) The first/last lines of the readex are language-dependent. They should begin/end your language's multiline string. In languages that don't support multiline strings, they begin/end a multiline comment instead.

2) The rest of the readex matches the specs here exactly, or your creation must not be called "readex". (I will readily relax that rule if you tell me, for example, what needs to be done to fully support Perl regex.)

This article is copyrighted by me. Copyrights do not cover the ideas discussed; they cover the discussion. Ideas may be patented, but if I were in charge there would be no software patents. Readex are not patented. If you implement readex for your language, you owe me nothing. I owe you a big, "Thank you!".

MartinRinehart at gmail dot com

Chronology

Author's Regex Evolution

Period          Opinion
--------------  -------
Paleolithic     Regex are cool! You can do a lot, and do it fast!
Neolithic       Cool and fast, true. But totally unreadable. Avoid them if possible.
Early 21st c.   Cool, fast, and require detailed documentation. Detailed doc is not fast.
2010            Doc plus regex == easy to get out of sync. There's got to be a better way.
November, 2010  Readex! Always in sync!


Special thanks to reader João Carvalho for debugging help with the compiler.

Feedback: MartinRinehart at gmail dot com

# # #