Guide to Escaping Characters in Java RegExps

Overview

The regular expressions API in Java, java.util.regex, is widely used for pattern matching.

In this article, we will focus on escaping characters within a regular expression and show how it can be done in Java.

Special RegExp Characters

According to the Java regular expressions API documentation, a regular expression can contain a set of special characters, also known as metacharacters.

When we want to match these characters as they are instead of interpreting them with their special meanings, we need to escape them. By escaping these characters, we force them to be treated as ordinary characters when matching a string with a given regular expression.

The metacharacters that we usually need to escape in this manner are: <([{\^-=$!|]})?*+.>

Let’s look at a simple code example where we match an input String with a pattern expressed in a regular expression.

This test shows that matching the input string foof against the pattern foo. (foo followed by a dot character) returns true, which indicates that the match is successful.

@Test
public void givenRegexWithDot_whenMatchingStr_thenMatches() {
    String strInput = "foof";
    String strRegex = "foo.";
      
    assertEquals(true, strInput.matches(strRegex));
}

You may wonder why the match is successful when there is no dot (.) character in the input String.

The answer is simple. The dot (.) is a metacharacter: its special meaning is that any single character can appear in its place (by default, any character except a line terminator). This is why the matcher reports a match.
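To make the behaviour of the unescaped dot concrete, here is a minimal sketch that runs the same pattern against a few inputs. The class and method names are made up for illustration and are not part of the article's test code.

```java
public class DotMetacharDemo {
    // Hypothetical helper: true when the whole input matches "foo." (dot unescaped).
    static boolean matchesFooDot(String input) {
        return input.matches("foo.");
    }

    public static void main(String[] args) {
        String[] inputs = {"foof", "foo1", "foo.", "foo", "fooab"};
        for (String s : inputs) {
            // The first three match (exactly one character follows "foo");
            // "foo" has no fourth character and "fooab" has one too many.
            System.out.println(s + " -> " + matchesFooDot(s));
        }
    }
}
```

Because String.matches() requires the entire input to match, the dot has to consume exactly one character here.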

Let’s say that we do not want the dot (.) character to carry its special meaning. Instead, we want it to be interpreted as a literal dot. This means that, in the previous example, the pattern foo. should not match the input String.

How would we handle a situation like this? The answer is: we need to escape the dot (.) character so that its special meaning gets ignored.

Let’s dig into it in more detail in the next section.

Escaping Characters

According to the Java API documentation for regular expressions, there are two ways to escape characters that have special meaning, that is, to force them to be treated as ordinary characters.

Let’s see what they are:

  1. Precede a metacharacter with a backslash (\)
  2. Enclose a metacharacter with \Q and \E

This just means that in the example we saw earlier, if we want to escape the dot character, we need to put a backslash character before the dot character. Alternatively, we can place the dot character in between \Q and \E.

Escaping Using Backslash

This is one of the techniques that we can use to escape metacharacters in a regular expression. However, we know that the backslash character is an escape character in Java String literals as well. Therefore, we need to double the backslash character when using it to precede any character (including the \ character itself).

Hence in our example, we need to change the regular expression as shown in this test:

@Test
public void givenRegexWithDotEsc_whenMatchingStr_thenNotMatching() {
    String strInput = "foof";
    String strRegex = "foo\\.";

    assertEquals(false, strInput.matches(strRegex));
}

Here, the dot character is escaped, so the matcher treats it as a literal dot and only accepts input that ends with a dot (i.e. foo.).

In this case, it returns false since there is no match in the input String for that pattern.
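As a complement to the test above, the following sketch shows the positive case and makes the two levels of escaping visible. The class name and constants are illustrative assumptions, not code from the article.

```java
public class BackslashEscapeDemo {
    // The regex engine sees foo\. ; the Java literal needs two backslashes to produce one.
    static final String ESCAPED_DOT = "foo\\.";
    // The regex engine sees \\ , which matches one literal backslash.
    static final String ESCAPED_BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(ESCAPED_DOT.length());            // 5 characters: f, o, o, \, .
        System.out.println("foo.".matches(ESCAPED_DOT));     // true: a literal dot is present
        System.out.println("foof".matches(ESCAPED_DOT));     // false: 'f' is not a dot
        System.out.println("\\".matches(ESCAPED_BACKSLASH)); // true: input is a single backslash
    }
}
```

The length check is a quick way to confirm what the compiler actually stores: the doubled backslash in the source collapses to a single backslash in the String.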

Escaping Using \Q & \E

Alternatively, we can use \Q and \E to escape special characters. \Q indicates that all characters up to \E need to be escaped, and \E ends the escaping that was started with \Q.

This just means that whatever is in between \Q and \E is escaped.

In the test shown here, the split() method of the String class does a match using the regular expression provided to it.

Our requirement is to split the input string by the pipe (|) character into words. Therefore, we use a regular expression pattern to do so.

The pipe character is a metacharacter that needs to be escaped in the regular expression.

Here, the escaping is done by placing the pipe character between \Q and \E:

@Test
public void givenRegexWithPipeEscaped_whenSplitStr_thenSplits() {
    String strInput = "foo|bar|hello|world";
    String strRegex = "\\Q|\\E";
    
    assertEquals(4, strInput.split(strRegex).length);
}
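The sketch below, with hypothetical class and method names, shows the array produced by the split and also illustrates that \Q...\E can quote a run of several metacharacters at once.

```java
import java.util.Arrays;

public class QuoteBlockDemo {
    // Hypothetical helper: splits on a literal pipe quoted with \Q...\E.
    static String[] splitOnPipe(String input) {
        return input.split("\\Q|\\E");
    }

    // Hypothetical helper: everything between \Q and \E is literal, including + and .
    static boolean isLiteralSum(String input) {
        return input.matches("\\Q1+1=2.\\E");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnPipe("foo|bar|hello|world"))); // [foo, bar, hello, world]
        System.out.println(isLiteralSum("1+1=2.")); // true
        System.out.println(isLiteralSum("11=2x"));  // false, although the unquoted regex 1+1=2. would match it
    }
}
```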

The Pattern.quote(String S) Method

The Pattern.quote(String s) method in the java.util.regex.Pattern class converts a given String into a literal pattern String. This means that all metacharacters in the input String are treated as ordinary characters.

Using this method is more convenient than writing \Q and \E by hand, as it wraps the given String with them for us.

Let’s see this method in action:

@Test
public void givenRegexWithPipeEscQuoteMeth_whenSplitStr_thenSplits() {
    String strInput = "foo|bar|hello|world";
    String strRegex = "|";

    assertEquals(4, strInput.split(Pattern.quote(strRegex)).length);
}

In this quick test, the Pattern.quote() method is used to escape the given regex pattern and transform it into a literal pattern String. In other words, it escapes all the metacharacters present in the regex pattern for us. It does a similar job to \Q & \E.

The pipe character is escaped by the Pattern.quote() method, and split() interprets it as a literal String by which it divides the input.

As we can see, this is a much cleaner approach, and developers do not have to remember all the escape sequences.
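Pattern.quote() is particularly handy when the text to match is not known in advance. The following is a minimal sketch under that assumption; containsLiteral is a hypothetical helper, not part of the JDK or of the article's code.

```java
import java.util.regex.Pattern;

public class PatternQuoteDemo {
    // Hypothetical helper: true if 'text' contains 'literal' exactly as written,
    // regardless of any metacharacters 'literal' happens to contain.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("total: $5.00", "$5.00")); // true
        System.out.println(containsLiteral("total: $5x00", "$5.00")); // false: the dot is literal
    }
}
```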

We should note that Pattern.quote encloses the whole block with a single escape sequence. If we wanted to escape characters individually, we would need to use a token replacement algorithm.
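One possible way to escape characters individually is sketched below. It assumes the metacharacter list given earlier in this article and uses a hypothetical helper named escapeEach; it is an illustration of the token replacement idea, not a JDK facility.

```java
import java.util.regex.Pattern;

public class EscapeIndividuallyDemo {
    // Character class listing the metacharacters <([{\^-=$!|]})?*+.>
    // The characters [ ] \ and - are backslash-escaped so they are read literally inside the class.
    private static final Pattern SPECIAL =
        Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

    // Hypothetical helper: puts a backslash in front of every metacharacter.
    static String escapeEach(String literal) {
        // In the replacement, \\ (written "\\\\" in Java) is a literal backslash
        // and $0 is the character that was matched.
        return SPECIAL.matcher(literal).replaceAll("\\\\$0");
    }

    public static void main(String[] args) {
        String regex = escapeEach("1+1=2");
        System.out.println(regex);                  // 1\+1\=2
        System.out.println("1+1=2".matches(regex)); // true
        System.out.println("11=2".matches(regex));  // false
    }
}
```

Java accepts a backslash before any non-alphabetic character, so escaping characters such as = or < that do not strictly need it is harmless.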

Additional Examples

Let’s look at how the replaceAll() method of java.util.regex.Matcher works.

If we need to replace all occurrences of a given character String with another, we can use this method by passing a regular expression to it.

Imagine we have an input with multiple occurrences of the $ character. The result we want to get is the same string with the $ character replaced by £.

This test demonstrates what happens when the pattern $ is passed without being escaped:

@Test
public void givenRegexWithDollar_whenReplacing_thenNotReplace() {
 
    String strInput = "I gave $50 to my brother. "
      + "He bought candy for $35. Now he has $15 left.";
    String strRegex = "$";
    String strReplacement = "£";
    String output = "I gave £50 to my brother. "
      + "He bought candy for £35. Now he has £15 left.";
    
    Pattern p = Pattern.compile(strRegex);
    Matcher m = p.matcher(strInput);
        
    assertThat(output, not(equalTo(m.replaceAll(strReplacement))));
}

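The test asserts that $ is not correctly replaced by £. Unescaped, $ is an anchor that matches the end of the input rather than a dollar sign, so none of the dollar signs in the text are replaced.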
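To see what the unescaped pattern actually does, here is a small sketch on a shortened input. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class DollarAnchorDemo {
    // Hypothetical helper: applies the unescaped pattern "$" from the test above.
    static String replaceUnescapedDollar(String input) {
        return Pattern.compile("$").matcher(input).replaceAll("£");
    }

    public static void main(String[] args) {
        // "$" matches the empty position at the end of the input,
        // so the replacement is appended and the dollar sign is left untouched.
        System.out.println(replaceUnescapedDollar("I gave $50")); // I gave $50£
    }
}
```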

Now if we escape the regex pattern, the replacing happens correctly, and the test passes as shown in this code snippet:

@Test
public void givenRegexWithDollarEsc_whenReplacing_thenReplace() {
 
    String strInput = "I gave $50 to my brother. "
      + "He bought candy for $35. Now he has $15 left.";
    String strRegex = "\\$";
    String strReplacement = "£";
    String output = "I gave £50 to my brother. "
      + "He bought candy for £35. Now he has £15 left.";
    Pattern p = Pattern.compile(strRegex);
    Matcher m = p.matcher(strInput);
    
    assertEquals(output, m.replaceAll(strReplacement));
}

Note the \\$ here, which does the trick by escaping the $ character and successfully matching the pattern.
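A related point, not covered by the tests above: the replacement argument of replaceAll() has its own special characters ($ for group references and \ for escaping). The sketch below assumes we want the reverse conversion, from £ to $, and uses Matcher.quoteReplacement() for it; the class and method names are illustrative.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Hypothetical helper: "£" is not a metacharacter, but "$" is special in the
    // replacement string, so it is quoted with Matcher.quoteReplacement().
    static String poundsToDollars(String input) {
        return input.replaceAll("£", Matcher.quoteReplacement("$"));
    }

    // Hypothetical helper: String.replace() treats both arguments literally,
    // so no escaping is needed when no regex features are required.
    static String dollarsToPounds(String input) {
        return input.replace("$", "£");
    }

    public static void main(String[] args) {
        System.out.println(poundsToDollars("£50 and £35")); // $50 and $35
        System.out.println(dollarsToPounds("$50 and $35")); // £50 and £35
    }
}
```

When both the search text and the replacement are plain strings, String.replace() is often the simpler choice, since it sidesteps regex escaping entirely.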

Conclusion

In this article, we looked at escaping characters in regular expressions in Java.

We discussed why regular expressions need to be escaped, and the different ways in which it can be achieved.

As always, the source code related to this article can be found over on GitHub.

References

Guide to Escaping Characters in Java RegExps

Licensed under CC BY-NC-SA 4.0