There are two steps to creating a pattern. First, come up
with a pattern that will match exactly what you want. Second, translate that
abstract pattern into a specific syntax. The goal of the EasyPattern language is
to make the second step as easy as possible, to let you create patterns that are
easy to read and easy to write. The first step is more elusive. While some
patterns are extraordinarily simple to create and understand, others can be
quite challenging. Pattern matching is both an art and a skill. It takes
thought. Experience counts. This document attempts to shed a little light onto
the art of pattern matching.
Any interesting bit of text can be described by multiple patterns. For
example, each of the following patterns (and more) correctly describes
"978-692-1256":
[1+ char]
[12 chars]
[1+ not whitespace]
[1+ paragraphChar]
[1+ digit or punctuation]
[12 digit or punctuation]
[1+ digit or '-']
[12 digit or '-']
[3 digits, punctuation, 3 digits, punctuation, 4 digits]
[3 digits, '-', 3 digits, '-', 4 digits]
978[1+ digit or punctuation]
978[punctuation, 3 digits, '-', 4 digits]
978-[3 digits, '-', 4 digits]
978[punctuation]692[punctuation]1256
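To see the same idea outside of EasyPattern, here is a minimal sketch in Python. The regular expressions are rough, hand-written approximations of a few of the patterns above (an assumption, not an official translation), and the `candidates` table and `matching_patterns` helper are hypothetical names used only for illustration.

```python
import re

# Hypothetical regex approximations of a few of the EasyPatterns listed above.
candidates = {
    "[1+ char]": r".+",
    "[12 chars]": r".{12}",
    "[1+ digit or '-']": r"[0-9-]+",
    "[12 digit or '-']": r"[0-9-]{12}",
    "[3 digits, '-', 3 digits, '-', 4 digits]": r"[0-9]{3}-[0-9]{3}-[0-9]{4}",
    "978-[3 digits, '-', 4 digits]": r"978-[0-9]{3}-[0-9]{4}",
}


def matching_patterns(s):
    """Return the names of all candidate patterns that describe the whole string."""
    return [name for name, rx in candidates.items() if re.fullmatch(rx, s)]


# Every candidate describes the original text.
print(matching_patterns("978-692-1256"))
# A different number is still described by all but the most specific candidate.
print(matching_patterns("555-692-1256"))
```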
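How do you decide which one to use? There are (initially) two considerations:
1. What other text must be matched?
2. Would the pattern find too much, i.e. what similar text must not be matched?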
The above patterns are loosely arranged from least specific to most specific: the first will match almost anything, the last will match only a few variations on the original. Experienced pattern matchers probably start somewhere in the middle and then move "up" or "down" as they find other cases to match or similar text that should not be matched.
For example, start with [1+ digit or '-']. That pattern would probably suffice to match telephone numbers in plain text. However, it would also match social security numbers (###-##-####) and even single digits. Moving slightly more specific, [12 digit or '-'] would solve both problems. But it's likely that we want to match not just this telephone number, but all telephone numbers. Temporarily ignoring the possibility of a leading left parenthesis, we still know that / and . are likely, yielding [12 digit or '-' or '/' or '.']. Sometimes it's easier to be general rather than specific, e.g. [12 digit or punctuation]. However, that could match long dollar amounts or numeric IDs. For North American phone numbers, the sequence forms a clear pattern, leading to [3 digits, punctuation, 3 digits, punctuation, 4 digits].
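The trade-offs in this progression can be checked with a small Python sketch. The regular expressions are assumed approximations of the EasyPatterns, and "punctuation" is narrowed here to a handful of characters for simplicity; the variable names and the `full` helper are hypothetical.

```python
import re

loose = re.compile(r"[0-9-]+")            # roughly [1+ digit or '-']
fixed_len = re.compile(r"[0-9-]{12}")     # roughly [12 digit or '-']
general = re.compile(r"[0-9\-/.,]{12}")   # roughly [12 digit or punctuation], punctuation narrowed
structured = re.compile(                  # roughly [3 digits, punctuation, 3 digits, punctuation, 4 digits]
    r"[0-9]{3}[-/.][0-9]{3}[-/.][0-9]{4}"
)


def full(rx, s):
    """True if the pattern describes the entire string."""
    return rx.fullmatch(s) is not None


samples = ["978-692-1256", "978.692.1256", "123-45-6789", "7", "1,234,567.89"]
for s in samples:
    print(s, full(loose, s), full(fixed_len, s), full(general, s), full(structured, s))
```

In this sketch the loose pattern accepts the social security number and the single digit, the fixed-length pattern rejects both, the general pattern also accepts the 12-character dollar amount, and the structured pattern accepts only the two phone numbers.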
Remember consideration #1: what other text must be matched? In this case, perhaps worldwide telephone numbers. I'm not familiar with all the variations but here's an attempt: [6+ digit or space or '+' or '/' or '.' or '-']. Now balance with consideration #2: would this pattern find too much? Maybe. It would match 6 spaces. If the document might have this many spaces in a row, perhaps they could first be converted to a tab character. Or, perhaps it's better to make the pattern more specific, e.g. to require at least 2 digits in a row: [2 digits, 4+ digit or space or '+' or '/' or '.' or '-']. What about a set of digits that isn't a telephone number? There's no easy answer here. Perhaps the telephone number is always labelled with "tel" or "telephone", perhaps it always appears on its own line. Different documents may require different patterns -- or even a multi-step process, such as that provided by TextPipe.
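The "finds too much" problem is easy to reproduce. The following Python sketch uses assumed regex equivalents of the two patterns just described; the names `broad` and `narrower` and the sample number are illustrative only.

```python
import re

# Roughly [6+ digit or space or '+' or '/' or '.' or '-']
broad = re.compile(r"[0-9 +/.-]{6,}")
# Roughly [2 digits, 4+ digit or space or '+' or '/' or '.' or '-']
narrower = re.compile(r"[0-9]{2}[0-9 +/.-]{4,}")

six_spaces = " " * 6
number = "+44 20 7946 0958"

print(broad.search(six_spaces) is not None)     # the broad pattern matches six spaces
print(narrower.search(six_spaces) is not None)  # the narrower pattern does not
print(narrower.search(number).group(0))         # match begins at the first two digits
```

Note a side effect of requiring two leading digits: in this sketch the match on the sample number starts at "44", so the leading '+' is no longer part of the matched text. Tightening a pattern can change what is captured as well as what is rejected.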
Having matched the text, what do you want to do with it? Deleting is easy; just replace with nothing. Inserting text before or after (or both) is equally simple: use $0 in the replacement pattern, with the new text before it, after it, or both. What if you want to make changes inside the match? That's the third consideration.
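For comparison, here is how deleting and inserting around a match might look in Python, where `\g<0>` plays the role that $0 plays in the replacement pattern. This is a sketch under the assumption that the two behave alike for these simple cases; the regex and sample sentence are made up for illustration.

```python
import re

# Assumed regex stand-in for [3 digits, '-', 3 digits, '-', 4 digits]
phone = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
text = "Call 978-692-1256 today."

deleted = phone.sub("", text)                  # replace the match with nothing
prefixed = phone.sub(r"tel: \g<0>", text)      # new text before the whole match
wrapped = phone.sub(r"[\g<0>]", text)          # new text before and after

print(deleted)
print(prefixed)
print(wrapped)
```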
With telephone numbers, one common task is to match multiple formats and
convert to a single format. To do so, each part of the text to be kept must be
grouped and labelled (with a number from 1 to 20). If you have read
the complete EasyPattern docs and looked at the examples, this pattern is by now
familiar:
replace "[(3 digits)1 punctuation (3 digits)2
punctuation (4 digits)3]"
with "$1.$2.$3"
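The same normalization can be sketched in Python with numbered capture groups, where `\1`, `\2` and `\3` correspond to $1, $2 and $3. The regex is an assumed approximation (with "punctuation" narrowed to '-', '/' and '.'), and `normalize` is a hypothetical helper name.

```python
import re

# Three captured digit groups separated by a punctuation character.
phone = re.compile(r"([0-9]{3})[-/.]([0-9]{3})[-/.]([0-9]{4})")


def normalize(text):
    """Rewrite every matched phone number in the form 978.692.1256."""
    return phone.sub(r"\1.\2.\3", text)


print(normalize("Office: 978-692-1256, fax: 978/692/1257"))
```

Only the grouped digits are carried into the replacement; the original separators are discarded, which is what makes the output format uniform.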