Overview
In this article, we will discuss the Java Regex API and how regular expressions can be used in the Java programming language.
In the world of regular expressions, there are many different flavors to choose from, such as grep, Perl, Python, PHP, awk, and many more.
This means that a regular expression that works in one programming language may not work in another. The regular expression syntax in Java is most similar to that found in Perl.
Setup
To use regular expressions in Java, we do not need any special setup. The JDK contains a package, java.util.regex, dedicated entirely to regex operations. We only need to import it into our code.
Moreover, the java.lang.String class also has built-in regex support that we commonly use in our code.
Java Regex Package
The java.util.regex package consists of three classes: Pattern, Matcher and PatternSyntaxException:
- A Pattern object is a compiled regex. The Pattern class provides no public constructors. To create a pattern, we must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument.
- A Matcher object interprets the pattern and performs match operations against an input String. It also defines no public constructors. We obtain a Matcher object by invoking the matcher method on a Pattern object.
- A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern.
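As a rough illustration of how the three classes fit together, the sketch below compiles a regex, obtains a Matcher from the resulting Pattern, and treats a PatternSyntaxException as the sign of an invalid regex. The class name RegexClassesSketch and its two helper methods are hypothetical names chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexClassesSketch {
    // Returns true if the regex compiles, false if its syntax is invalid.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Compiles the regex, then asks a Matcher whether it occurs in the input.
    static boolean occursIn(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        return matcher.find();
    }
}
```

Here an unclosed group such as (foo is rejected at compile time, before any matching takes place.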
We will explore these classes in detail; however, we must first understand how a regex is constructed in Java.
If you are already familiar with regex from a different environment, you may find certain differences, but they are minimal.
Simple Example
Let’s start with the simplest use case for a regex. When a regex is applied to a String, it may match zero or more times.
The most basic form of pattern matching supported by the java.util.regex API is the match of a String literal. For example, if the regular expression is foo and the input String is foo, the match will succeed because the Strings are identical:
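A minimal sketch of this literal match could look as follows. The class name LiteralMatchSketch and the helper containsMatch are illustrative names, not part of the Java API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchSketch {
    // Returns true if the regex occurs at least once in the text.
    static boolean containsMatch(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);   // compile the regex
        Matcher matcher = pattern.matcher(text);    // bind it to the input text
        return matcher.find();                      // look for the next match
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("foo", "foo")); // true
        System.out.println(containsMatch("foo", "bar")); // false
    }
}
```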
We first create a Pattern object by calling its static compile method and passing it a pattern we want to use.
Then we create a Matcher object by calling the Pattern object’s matcher method and passing it the text we want to check for matches.
After that, we call the find method on the Matcher object.
The find method keeps advancing through the input text and returns true for every match, so we can also use it to count the matches.
Since we will be running more tests, we can abstract the logic for finding the number of matches into a method called runTest:
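One possible shape for such a helper is sketched below. The method name runTest comes from the text, while the enclosing class name RunTestSketch is an assumption made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunTestSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {   // each successful find() is one match
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("foo", "foo"));    // 1
        System.out.println(runTest("foo", "foofoo")); // 2
    }
}
```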
When we get 0 matches, the test should fail; otherwise, it should pass.
Meta Characters
Metacharacters affect the way a pattern is matched, in a way adding logic to the search pattern. The Java API supports several metacharacters, the most straightforward being the dot “.”, which matches any single character (by default, any character except a line terminator).
Consider the previous example, where the regex foo matched the text foo once and the text foofoo twice. If we add the dot metacharacter to the end of the regex, we do not get two matches in the second case:
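The behavior of the dot can be sketched as follows, assuming a match-counting helper like the runTest method described earlier (the class name DotSketch is illustrative).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // The dot matches each single character in turn.
        System.out.println(runTest(".", "foo"));       // 3
        // "foo." consumes "foof"; the remaining "oo" cannot match.
        System.out.println(runTest("foo.", "foofoo")); // 1
    }
}
```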
Notice the dot after the foo in the regex. The regex matches foo followed by any single character, so the first match consumes foof. The remaining oo cannot match, which is why there is only a single match.
The API supports several other metacharacters, <([{\^-=$!|]})?*+.>, which we will look into further in this article.
Character Classes
Browsing through the official Pattern class specification, we will discover summaries of supported regex constructs. Under character classes, we have about 6 constructs.
OR Class
Constructed as [abc]. Any of the elements in the set is matched.
If they all appear in the text, each is matched separately with no regard to order.
They can also be alternated as part of a String. In the following example, when we create different words by alternating the first letter with each element of the set, they are all matched:
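The three cases above can be sketched like this. The input strings and the class name OrClassSketch are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("[abc]", "b"));             // 1
        System.out.println(runTest("[abc]", "cab"));           // 3, order does not matter
        System.out.println(runTest("[bcr]at", "bat cat rat")); // 3, first letter alternates
    }
}
```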
NOR Class
The above set is negated by adding a caret (^) as the first element, for example [^abc]:
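A small sketch of negated classes follows. The inputs and the class name NorClassSketch are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NorClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("[^abc]", "g"));             // 1, g is not in the set
        System.out.println(runTest("[^abc]", "abc"));           // 0, every character is excluded
        System.out.println(runTest("[^bcr]at", "sat mat eat")); // 3
    }
}
```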
Range Class
We can define a class that specifies a range within which the matched text should fall by using a hyphen (-). Likewise, we can also negate a range. Typical cases are:
- Matching uppercase letters
- Matching lowercase letters
- Matching both uppercase and lowercase letters
- Matching a given range of numbers
- Matching another range of numbers
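The range cases listed above might be exercised as in the sketch below. The specific ranges, the input sentence, and the class name RangeClassSketch are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Two Uppercase alphabets 34 overall";
        System.out.println(runTest("[A-Z]", text));    // 2  (T and U)
        System.out.println(runTest("[a-z]", text));    // 26 lowercase letters
        System.out.println(runTest("[a-zA-Z]", text)); // 28 letters in total
        System.out.println(runTest("[1-5]", text));    // 2  (3 and 4)
        System.out.println(runTest("3[0-5]", text));   // 1  (34)
    }
}
```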
Union Class
A union character class is a result of combining two or more character classes:
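As a sketch, assuming the union [1-3[7-9]] and the digits 1 through 9 as input (the class name UnionClassSketch is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnionClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Union of [1-3] and [7-9]: matches 1, 2, 3, 7, 8, 9.
        System.out.println(runTest("[1-3[7-9]]", "123456789")); // 6
    }
}
```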
The above test will only match 6 out of the 9 integers because the union set skips 4, 5, and 6.
Intersection Class
Similar to the union class, this class results from picking the common elements between two or more sets. To apply intersection, we use the && operator:
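As a sketch, assuming the sets [1-6] and [3-9], whose intersection is 3, 4, 5, and 6 (the class name IntersectionClassSketch is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntersectionClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Only digits present in both [1-6] and [3-9] are matched.
        System.out.println(runTest("[1-6&&[3-9]]", "123456789")); // 4
    }
}
```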
We get 4 matches because the intersection of the two sets has only 4 elements.
Subtraction Class
We can use subtraction to negate one or more character classes, for example matching a set of odd decimal numbers:
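One way to express this, assuming the even digits are subtracted by intersecting with a negated class (the class name SubtractionClassSketch is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubtractionClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // [1-9] minus the even digits 2, 4, 6, 8.
        System.out.println(runTest("[1-9&&[^2468]]", "123456789")); // 5
    }
}
```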
Only 1,3,5,7,9 will be matched.
Predefined Character Classes
The Java regex API also accepts predefined character classes. Some of the above character classes can be expressed in a shorter form, although this makes the code less intuitive. One special aspect of the Java version of this regex is the escape character.
As we will see, most predefined classes start with a backslash, which has a special meaning in Java. For these to be compiled by the Pattern class, the leading backslash must be escaped, i.e. \d becomes \\d.
- \d matches digits, equivalent to [0-9]
- \D matches non-digits, equivalent to [^0-9]
- \s matches white space
- \S matches non-white space
- \w matches a word character, equivalent to [a-zA-Z_0-9] (note the underscore)
- \W matches a non-word character
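The predefined classes can be tried out as in the following sketch. The inputs and the class name PredefinedClassSketch are assumptions chosen for illustration; note the doubled backslashes in the Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("\\d", "123")); // 3 digits
        System.out.println(runTest("\\D", "a6c")); // 2 non-digits
        System.out.println(runTest("\\s", "a c")); // 1 white space
        System.out.println(runTest("\\S", "a c")); // 2 non-white space
        System.out.println(runTest("\\w", "hi!")); // 2 word characters
        System.out.println(runTest("\\W", "hi!")); // 1 non-word character
    }
}
```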
Quantifiers
The Java regex API also allows us to use quantifiers. These enable us to further tweak the match’s behavior by specifying the number of occurrences to match against.
To match a text zero or one time, we use the ? quantifier.
Alternatively, we can use the brace syntax, {0,1}, also supported by the Java regex API:
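A sketch of both forms, assuming the two-character input hi, which contains no a at all (the class name ZeroOrOneSketch is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroOrOneSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Zero-length matches at positions 0, 1 and 2 (the end of the input).
        System.out.println(runTest("a?", "hi"));     // 3
        System.out.println(runTest("a{0,1}", "hi")); // 3
        // Even an empty input yields one zero-length match.
        System.out.println(runTest("a?", ""));       // 1
    }
}
```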
This example introduces the concept of zero-length matches. If a quantifier’s lower bound for matching is zero, the pattern can match the empty String, so it matches at every position in the text, including the empty String at the end of the input. This means that even if the input is empty, it will return one zero-length match.
This explains why we get 3 matches in the above example despite having a String of length two. The third match is a zero-length empty String at the end of the input.
To match a text zero or more times, we use the * quantifier, which behaves similarly to ?. The equivalent brace syntax is {0,}.
The + quantifier is different: it has a matching threshold of 1. If the required String does not occur at all, there will be no match, not even a zero-length String. The equivalent brace syntax is {1,}:
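The difference between the two quantifiers can be sketched as follows, again assuming the input hi, which contains no a (the class name StarPlusSketch is illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarPlusSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("a*", "hi"));    // 3 zero-length matches
        System.out.println(runTest("a{0,}", "hi")); // 3, same as a*
        System.out.println(runTest("a+", "hi"));    // 0, at least one a is required
        System.out.println(runTest("a{1,}", "hi")); // 0, same as a+
    }
}
```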
As in Perl and other languages, the brace syntax can be used to match a given text a number of times.
For example, a{3} yields a match only if a appears three times in a row, so we get two matches in aaaaaa but none in aa, where the text only appears two times in a row.
When we use a range in the brace, the match will be greedy, matching from the higher end of the range.
For example, with a{2,3} applied to aaaa, we specify at least two occurrences but no more than three, so we get a single match: the matcher sees a single aaa and a lone a, which can’t be matched.
However, the API allows us to specify a lazy, or reluctant, approach by appending ? to the quantifier, so that the matcher starts from the lower end of the range, in this case matching two occurrences as aa and aa:
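The exact, greedy, and reluctant brace forms can be compared in a short sketch. The inputs and the class name BraceQuantifierSketch are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceQuantifierSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("a{3}", "aaaaaa"));  // 2, two runs of three
        System.out.println(runTest("a{3}", "aa"));      // 0, only two in a row
        System.out.println(runTest("a{2,3}", "aaaa"));  // 1, greedy: aaa, then a lone a
        System.out.println(runTest("a{2,3}?", "aaaa")); // 2, reluctant: aa and aa
    }
}
```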
Capturing Groups
The API also allows us to treat multiple characters as a single unit through capturing groups.
It attaches numbers to the capturing groups and allows back referencing using these numbers.
In this section, we will see a few examples on how to use capturing groups in Java regex API.
Let’s use a capturing group, (\\d\\d), that matches only when an input text contains two digits next to each other.
The number attached to this group is 1, and we can use the back reference \\1 to tell the matcher that we want to match another occurrence of the text the group matched. This way, instead of two separate matches for an input such as 1212, we get one match that spans the entire length of the input.
Without back referencing, we would have to repeat the group in the regex to achieve the same result.
Similarly, for any other number of repetitions, back referencing can make the matcher see the input as a single match. But if we change even the last digit, the match will fail:
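These cases can be sketched as follows. The digit inputs and the class name CapturingGroupSketch are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingGroupSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("(\\d\\d)", "12"));                  // 1
        System.out.println(runTest("(\\d\\d)", "1212"));                // 2 separate matches
        System.out.println(runTest("(\\d\\d)\\1", "1212"));             // 1, back reference spans the input
        System.out.println(runTest("(\\d\\d)(\\d\\d)", "1212"));        // 1, group repeated by hand
        System.out.println(runTest("(\\d\\d)\\1\\1\\1", "12121212"));   // 1
        System.out.println(runTest("(\\d\\d)\\1\\1\\1", "12121213"));   // 0, last digit differs
    }
}
```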
It is important not to forget the escape backslashes; they are required in Java string literals.
Boundary Matchers
The Java regex API also supports boundary matching. If we care about where exactly in the input text the match should occur, then this is what we are looking for. With the previous examples, all we cared about was whether a match was found or not.
To match only when the required regex is true at the beginning of the text, we use the caret ^.
A match is found when the text dog appears at the beginning of the input, and no match is found when it appears anywhere else.
To match only when the required regex is true at the end of the text, we use the dollar character $. A match is found when dog appears at the end of the input, and no match is found otherwise:
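Both anchors can be sketched together. The sentences used as input and the class name AnchorSketch are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorSketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("^dog", "dogs are friendly"));           // 1
        System.out.println(runTest("^dog", "are dogs friendly?"));          // 0
        System.out.println(runTest("dog$", "Man's best friend is a dog"));  // 1
        System.out.println(runTest("dog$", "is a dog man's best friend?")); // 0
    }
}
```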
If we want a match only when the required text is found at a word boundary, we use \\b at the beginning and end of the regex.
A space is a word boundary, and so is the empty string at the beginning of a line. Matches succeed in these cases because the beginning of a String, as well as a space between one text and another, marks a word boundary.
However, two word characters appearing in a row do not mark a word boundary, so \\bdog\\b finds nothing inside a word such as dogg. We can make that case match by changing the end of the regex to look for a non-word boundary, \\B:
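A sketch of the word-boundary cases follows. The input sentences and the class name WordBoundarySketch are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundarySketch {
    // Counts how many times the regex matches in the text.
    static int runTest(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(runTest("\\bdog\\b", "a dog is friendly"));       // 1, spaces on both sides
        System.out.println(runTest("\\bdog\\b", "dog is man's best friend")); // 1, start of the input
        System.out.println(runTest("\\bdog\\b", "snoop dogg is a rapper"));  // 0, g follows dog
        System.out.println(runTest("\\bdog\\B", "snoop dogg is a rapper"));  // 1, non-word boundary at the end
    }
}
```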
Pattern Class Methods
Previously, we have only created Pattern objects in a basic way. However, this class has another variant of the compile method that accepts a set of flags alongside the regex argument, affecting the way the pattern is matched.
These flags are simply integer constants defined in the Pattern class. Let’s overload the runTest method in the test class so that it can take a flag as the third argument:
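The overload might look like the sketch below. The enclosing class name FlaggedRunTestSketch is an assumption, and passing 0 as the flags value is used here to mean that no flag is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlaggedRunTestSketch {
    // Counts matches using the default matching mode.
    static int runTest(String regex, String text) {
        return runTest(regex, text, 0);
    }

    // Counts matches with the given Pattern flags (a bit mask of Pattern constants).
    static int runTest(String regex, String text, int flags) {
        Pattern pattern = Pattern.compile(regex, flags);
        Matcher matcher = pattern.matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }
}
```

Several flags can be combined with the bitwise OR operator, for example Pattern.CASE_INSENSITIVE | Pattern.MULTILINE.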
In this section, we will look at the different supported flags and how they are used.
Pattern.CANON_EQ
This flag enables canonical equivalence. When specified, two characters will be considered to match if, and only if, their full canonical decompositions match.
Consider the accented Unicode character é. Its composite code point is U+00E9. However, Unicode also has separate code points for its component characters: e, U+0065, and the combining acute accent, U+0301. In this case, the composite character U+00E9 is indistinguishable from the two-character sequence U+0065 U+0301.
By default, matching does not take canonical equivalence into account, but if we add the flag, the two forms match:
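The sketch below follows the direction used in the Pattern documentation, with the decomposed sequence as the regex and the composite character as the input. The class name CanonEqSketch and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

class CanonEqSketch {
    // Returns true if the regex is found in the text under the given flags.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String decomposed = "\u0065\u0301"; // e followed by the combining acute accent
        String composite = "\u00E9";        // the single code point é

        System.out.println(found(decomposed, composite, 0));                // false
        System.out.println(found(decomposed, composite, Pattern.CANON_EQ)); // true
    }
}
```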
Pattern.CASE_INSENSITIVE
This flag enables matching regardless of case. By default, matching takes case into account, so using this flag changes the default behavior.
We can also use the equivalent embedded flag expression, (?i), to achieve the same result:
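A short sketch comparing the default mode, the flag, and the embedded expression (the input and the class name CaseInsensitiveSketch are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveSketch {
    // Counts matches with the given Pattern flags.
    static int runTest(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "This is a Dog";
        System.out.println(runTest("dog", text, 0));                        // 0, case matters
        System.out.println(runTest("dog", text, Pattern.CASE_INSENSITIVE)); // 1
        System.out.println(runTest("(?i)dog", text, 0));                    // 1, embedded flag
    }
}
```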
Pattern.COMMENTS
The Java API allows us to include comments in the regex using #. This can help in documenting a complex regex that may not be immediately obvious to another programmer.
The comments flag makes the matcher ignore any white space or comments in the regex and only consider the pattern. In the default matching mode, a regex containing extra spaces and a # comment would fail to match, because the matcher looks for the entire regex in the input text, including the spaces and the # character. But when we use the flag, it ignores the extra spaces, and all text from a # to the end of the line is treated as a comment.
There is also an alternative embedded flag expression for this, (?x):
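The effect of the flag can be sketched as follows. The commented regex, the input, and the class name CommentsSketch are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentsSketch {
    // Counts matches with the given Pattern flags.
    static int runTest(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String regex = "dog$  #check for word dog at end of text";
        String text = "This is a dog";
        System.out.println(runTest(regex, text, 0));                // 0, spaces and # are literal
        System.out.println(runTest(regex, text, Pattern.COMMENTS)); // 1, only dog$ is considered
        System.out.println(runTest("(?x)" + regex, text, 0));       // 1, embedded flag
    }
}
```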
Pattern.DOTALL
By default, the dot “.” expression in a regex matches any character in the input String except a line terminator.
Using this flag, the match will include the line terminator as well. The examples for this flag are a little different: since we are interested in asserting against the matched String, we use the matcher’s group method, which returns the previous match.
In the default mode, only the first part of the input before the line terminator is matched.
In dotall mode, the entire text including the line terminator is matched.
We can also use an embedded flag expression, (?s), to enable dotall mode:
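The three variants can be sketched as below, assuming a two-line input separated by \n. The class name DotallSketch and the helper firstGroup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallSketch {
    // Returns the text captured by group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "this is a text\n continued on another line";

        // Default mode: the dot stops at the line terminator.
        System.out.println("this is a text".equals(firstGroup("(.*)", text, 0)));    // true
        // Dotall mode: the dot also matches the line terminator.
        System.out.println(text.equals(firstGroup("(.*)", text, Pattern.DOTALL)));   // true
        // Embedded flag expression.
        System.out.println(text.equals(firstGroup("(?s)(.*)", text, 0)));            // true
    }
}
```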
Pattern.LITERAL
In this mode, the matcher gives no special meaning to any metacharacters, escape characters, or regex syntax. Without this flag, the matcher will match a regex such as (.*) against any input String.
This is the default behavior we have been seeing in all the examples. However, with this flag, no match will be found in ordinary text, since the matcher will be looking for the literal text (.*) instead of interpreting it.
If we add the required string to the input, the match will succeed:
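A sketch of literal parsing, assuming the regex (.*) and simple inputs (the class name LiteralFlagSketch and the helper found are illustrative):

```java
import java.util.regex.Pattern;

class LiteralFlagSketch {
    // Returns true if the regex is found in the text under the given flags.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("(.*)", "text", 0));                   // true, interpreted as a regex
        System.out.println(found("(.*)", "text", Pattern.LITERAL));     // false, literal (.*) is absent
        System.out.println(found("(.*)", "text(.*)", Pattern.LITERAL)); // true, literal (.*) is present
    }
}
```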
There is no embedded flag character for enabling literal parsing.
Pattern.MULTILINE
By default, the ^ and $ metacharacters match at the beginning and at the end, respectively, of the entire input String. The matcher disregards any line terminators.
For example, a search for dog$ fails when dog is present at the end of the first line of a two-line String, because the matcher searches for dog at the end of the entire String.
However, with the flag, the same search succeeds since the matcher now takes line terminators into account: the String dog is found just before the line terminates.
The embedded flag expression for this mode is (?m):
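The following sketch assumes a two-line input separated by \n. The input text and the class name MultilineSketch are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineSketch {
    // Counts matches with the given Pattern flags.
    static int runTest(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "This is a dog\nthis is a fox";
        System.out.println(runTest("dog$", text, 0));                 // 0, dog is not at the end of the input
        System.out.println(runTest("dog$", text, Pattern.MULTILINE)); // 1, dog ends the first line
        System.out.println(runTest("(?m)dog$", text, 0));             // 1, embedded flag
    }
}
```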
Matcher Class Methods
In this section, we will look at some useful methods of the Matcher class. We will group them according to functionality for clarity.
Index Methods
Index methods provide useful index values that show precisely where the match was found in the input String. In the following example, we confirm the start and end indices of the match for dog in the input String:
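A minimal sketch using the start and end methods of Matcher follows. The input sentence, the class name IndexMethodsSketch, and the helper span are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexMethodsSketch {
    // Returns {start, end} of the first match, or null if the regex is not found.
    static int[] span(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // start() is inclusive, end() is exclusive.
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        int[] span = span("dog", "This dog is mine");
        System.out.println(span[0] + " " + span[1]); // 5 8
    }
}
```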
Study Methods
Study methods go through the input String and return a boolean indicating whether or not the pattern is found. Commonly used are the matches and lookingAt methods.
The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to be matched, while lookingAt does not.
Both methods start at the beginning of the input String, and the matches method returns true only when the pattern covers the whole input:
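The contrast between the two methods can be sketched as follows (the inputs and the class name StudyMethodsSketch are illustrative):

```java
import java.util.regex.Pattern;

class StudyMethodsSketch {
    // True if the regex matches a prefix of the text.
    static boolean lookingAt(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // True only if the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(lookingAt("dog", "dogs are friendly")); // true
        System.out.println(matches("dog", "dogs are friendly"));   // false
        System.out.println(matches("dog", "dog"));                 // true
    }
}
```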
Replacement Methods
Replacement methods are useful to replace text in an input string. The common ones are replaceFirst and replaceAll.
The replaceFirst and replaceAll methods replace the text that matches a given regular expression. As their names indicate, replaceFirst replaces the first occurrence, and replaceAll replaces all occurrences:
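Both methods can be sketched in a few lines. The input sentence and the class name ReplacementSketch are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {
    // Replaces only the first match of the regex.
    static String replaceFirst(String regex, String text, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.replaceFirst(replacement);
    }

    // Replaces every match of the regex.
    static String replaceAll(String regex, String text, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "dogs are domestic animals, dogs are friendly";
        System.out.println(replaceFirst("dog", text, "cat"));
        // cats are domestic animals, dogs are friendly
        System.out.println(replaceAll("dog", text, "cat"));
        // cats are domestic animals, cats are friendly
    }
}
```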
The replaceAll method allows us to substitute all matches with the same replacement. If we want to replace matches on a case-by-case basis, we’d need a token replacement technique.
Conclusion
In this article, we have learned how to use regular expressions in Java and also explored the most important features of the java.util.regex package.
The full source code for the project including all the code samples used here can be found in the GitHub project.