Regex Bootcamp
Introduction
JEMH uses regex in a very simple way. The following are the key constructs that anyone using JEMH must understand.
Regex | Description |
---|---|
. | The dot character matches any single character. |
* | The asterisk matches 0 or more of the preceding character / expression. |
+ | The plus matches one or more of the preceding character / expression. |
\x2c | This matches a single literal character given by its hex code, in this case the comma. See How to use comma characters in JEMH regexp's. |
[a-z] | This is a range expression matching any character between a and z (lower case). |
[A-Z] | This is a range expression matching any character between A and Z (Upper case). |
[0-9]{10} | This matches a specific number of characters, in this case exactly 10 characters that are each 0 through 9. |
.* | This is the regex wildcard, meaning 0 or more of any character. Depending on context this could mean all content or all content on a line. |
This matches 0 or more characters in the range a to z. | |
This matches any sequence of alphanumeric text regardless of case, but does not match accented characters. | |
This matches any characters other than a new line, showing explicitly how a match can be limited to one line (\n is the shortcut expression for a new line). | |
This matches all inputs that match the pattern some@.* with the exception of some@address.com |
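
The constructs in the table above can be tried out in any regex engine. The following is a minimal sketch using Python's `re` module as a stand-in for the engine JEMH runs on, which may differ in edge cases. The sample strings, the helper names `full_match` and `first_line`, and the patterns for the last four rows (`[a-z]*`, `[a-zA-Z0-9]+`, `[^\n]*` and the negative lookahead) are illustrative assumptions chosen to fit the descriptions, not expressions taken from JEMH.

```python
import re

# Each entry: (pattern, sample text, whether the whole text should match).
# Patterns and samples are illustrative assumptions, not JEMH defaults.
CASES = [
    (r".", "x", True),                    # any single character
    (r"ab*", "a", True),                  # * allows zero of the preceding 'b'
    (r"ab+", "a", False),                 # + needs at least one 'b'
    (r"a\x2cb", "a,b", True),             # \x2c is the hex escape for a comma
    (r"[0-9]{10}", "0123456789", True),   # exactly ten digits
    (r"[0-9]{10}", "12345", False),       # too few digits
    (r"[a-z]*", "", True),                # zero or more lower-case letters
    (r"[a-zA-Z0-9]+", "Abc123", True),    # alphanumeric, either case
    (r"[a-zA-Z0-9]+", "café", False),     # accented characters are excluded
    (r"some@(?!address\.com).*", "some@other.com", True),     # lookahead passes
    (r"some@(?!address\.com).*", "some@address.com", False),  # the one exception
]


def full_match(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


def first_line(text: str) -> str:
    """Return the leading run of non-newline characters (the first line)."""
    # [^\n]* cannot cross a line break, so the match stops at the end of line one.
    return re.match(r"[^\n]*", text).group(0)
```

Looping over `CASES` and comparing `full_match(pattern, text)` with the expected flag is a quick way to check an expression before pasting it into a JEMH configuration field.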
Capture Groups
JEMH uses Regex Capture Groups to match on a subset of an overall expression, e.g. getting the value from one line where there is also a key:
Expression | Description |
---|---|
This matches lines starting with Hello World followed by one space, a colon and another space, then a sequence of 6 digits. The round brackets mark the capture group, whose value can be extracted only when the overall expression matches. |
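
A capture group of the kind described above can be sketched as follows, again using Python's `re` module as a stand-in. The expression `^Hello World : ([0-9]{6})` and the function name `extract_value` are hypothetical, written to fit the description; they are not taken from a JEMH configuration.

```python
import re
from typing import Optional

# Hypothetical expression: the literal key at the start of a line, then
# " : ", then a captured run of six digits.
KEY_VALUE = re.compile(r"^Hello World : ([0-9]{6})", re.MULTILINE)


def extract_value(body: str) -> Optional[str]:
    """Return the captured digits, or None when the overall expression fails."""
    match = KEY_VALUE.search(body)
    # group(1) is the text inside the round brackets only, not the key.
    return match.group(1) if match else None
```

For a body containing the line `Hello World : 123456`, the function returns only the six digits. A line with five digits, or with the space before the colon missing, yields no value because the overall expression does not match.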
Common applications of regex in JEMH
There are many ways you can utilise regex within JEMH. The following are some examples.
Catch email addresses
The following examples show how to, and how not to, use regular expressions for a catch email address within JEMH.
Regex | Good/Bad | Description |
---|---|---|
*@domain.com | Bad | The asterisk is a quantifier that repeats the preceding character or expression. As there is nothing before the asterisk, this is not a valid expression. |
.*@domain.com | Bad | Whilst this is a valid regular expression, it matches every recipient in domain.com. This will cause problems because JEMH filters all mailbox addresses from email processing. The catch email expression must only match mailbox addresses, as the more specific expressions in the following rows do. |
example@domain.com | Good | This is a non regular expression exact match. |
.*-support@domain.com | Good | Matches any mailbox with a suffix of -support. |
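
The difference between these expressions can be checked with a short sketch, using Python's `re` module as a stand-in. The helper names `is_valid` and `catches`, the sample addresses, and the choice to require a match on the whole address are assumptions made for illustration; JEMH's own matching rules may differ.

```python
import re


def is_valid(pattern: str) -> bool:
    """Return True when the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for "*@domain.com": the * has nothing before it to repeat.
        return False


def catches(pattern: str, address: str) -> bool:
    """Return True when the whole address matches the catch expression."""
    return re.fullmatch(pattern, address) is not None


# Hypothetical recipients: one shared mailbox and one person.
MAILBOX = "it-support@domain.com"
PERSON = "jane.doe@domain.com"
```

With these samples, `.*@domain.com` catches both `MAILBOX` and `PERSON`, which is the over-matching problem described in the table, while `.*-support@domain.com` catches only `MAILBOX`.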
Matching replied-to content
See Use Project Mapping Cleanup and Body Delimiters for more details.
Matching replied-to content requires an expression that matches the start of the line. It does not need to match the full line, but an expression that is too general will cause unexpected clipping of content.
All delimiter expressions are prefixed by JEMH with the new line (\n) character to limit the number of potential matches to the number of lines in the content.
Example content
Regex | Good/Bad | Description |
---|---|---|
On | Bad | This is way too general. |
\nOn | Bad | Combined with the new line that JEMH adds as a prefix, this requires two new line characters (an empty line) immediately before the On. It is still too general as an expression. |
This will match the majority of the lead-in to the line. It uses English short names for day and month, plus year and time formats that are specific to the sending mail client. Additional languages require new expressions, as do email clients that format dates and times differently. |
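
The effect of a delimiter that is too general can be seen in a small sketch, using Python's `re` module as a stand-in. The `clip_reply` helper, the sample `BODY`, and the `REPLY_HEADER` expression (a header of the form `On Mon, Jan 1, 2024 at 10:15 AM`) are all hypothetical; real mail clients format this line differently and JEMH's clipping logic is not reproduced here, only the new line prefix described above.

```python
import re

# Hypothetical reply header: English short day and month names, then the
# day, year and time in one particular mail client's format.
REPLY_HEADER = (
    r"On (Mon|Tue|Wed|Thu|Fri|Sat|Sun), "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{1,2}, "
    r"[0-9]{4} at [0-9]{1,2}:[0-9]{2}"
)

# Sample email body: two lines of new text, then the quoted reply.
BODY = (
    "Thanks, that fixed it.\n"
    "On reflection we can close this.\n"
    "\n"
    "On Mon, Jan 1, 2024 at 10:15 AM Support <support@domain.com> wrote:\n"
    "> Please try again.\n"
)


def clip_reply(body: str, delimiter: str) -> str:
    """Keep only the text before the first delimiter match."""
    # Prefix the delimiter with a new line so it can only match at a line start.
    match = re.search(r"\n" + delimiter, body)
    return body[: match.start()] if match else body
```

Here `clip_reply(BODY, "On")` cuts the body after the first line, because the second line of new text also starts with On. `clip_reply(BODY, REPLY_HEADER)` keeps both lines of new text and removes only the quoted reply.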