Handling UTF-8 multi-byte characters with a MySQL database

Summary

The standard UTF-8 character set used by MySQL databases (called utf8) does not truly support all UTF-8 characters - it can only use a maximum of 3 bytes per character. This leaves out the remaining 4-byte characters, including all “emojis” (😕 for example).

Both Atlassian and ourselves recommend using PostgreSQL, an SQL database that fully supports all possible UTF-8 characters. According to Atlassian, as of Jira 7.3 they also support MySQL 5.7, which should apparently work with the utf8mb4 character set. However if upgrading Jira and MySQL is not possible, there are some things that can be done using JEMH to alleviate the problems.

Cleaning email subjects using the "MySQL Subject Cleaner" pre-processing task

One of JEMH's great features is its modular pre-processing task system. Particular email processing problems can be overcome by enabling specific tasks to run before the main email processing begins.

The MySQL Subject Cleaner pre-processing task has been added to JEMH. See JEMH-5291 for the versions it was added in. This task filters out 4-byte characters from email subjects, meaning that Jira should not have a problem storing the resulting issue summary. To see what versions this was added in, check the above improvement issue.

To enable the pre-processing task:

Go to the "Auditing" tab in JEMH and expand the "Auditing Enablement" section if it is not already
Click "Enable" to enable JEMH auditing
Go to your JEMH profile and edit the "Email" configuration section
Under the "Pre-processing" section, enable "Use Reprocessed Message"
Select the "MySQL Subject Cleaner" task from the "Pre Processing Tasks" list. If for some reason you need more than one task enabled, control+click to select multiple. Note that processing tasks should only be enabled if you are sure they are needed!
Save changes by clicking "Submit" at the bottom of the page

Cleaning email body content using a Body Cleanup Regular Expression

If your Jira is running on MySQL, unsupported 4-byte characters in the email body could also be a problem. Jira will try to save the content as the description or a comment, and may fail if such characters are present and unsupported. If you suspect this to be the case, you can use the Body Cleanup Regexps setting found under Profile>Email to cut out these characters, allowing successful processing.

In Java, the UTF-16 representation is used for characters. Long story short, this means that multi-byte characters are represented with a pair of "surrogate" values. We can use a regular expression to match these values:

[\x{10000}-\x{10FFFF}]+

It has come to our attention that the previous regular expression suggested ([\uD800-\uDBFF\uDC00-\uDFFF]+) was not correct for Java's regular expression implementation. This incorrectly matched hyphens "-". Please use the above regular expression instead.

See the following blog post for more information: Using Body Cleanup Regexps to remove 4 byte characters?

Entering this as a regular expression in the above mentioned "Body Cleanup Regexps" setting will remove these pairs, leaving what should be valid characters for your database to store.