Regular Expression Matching (REMATCH)
Match and extract parts of a string using regular expressions.
R-egular E-xpression MATCH-ing (the first many times I read the word "rematch", I just could not help my thoughts drifting back to Hulk Hogan taking on André the Giant at WrestleMania IV; those were the days...) is performed using commands of the form:
[[ string =~ regexp ]]
Where regexp is an extended regular expression. If the expression matches the string, the matched part of the string is stored in the BASH_REMATCH array.
Exit Status
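The exit status is 0 if the regexp matches, 1 if it doesn't, and 2 if the expression is invalid (e.g. *a, since * means "any number of occurrences of what came before", and in the example there is nothing before the *). Note that POSIX leaves the meaning of a leading * undefined, so some regex libraries accept it; an unterminated bracket expression such as [a is a more portable example of an invalid expression. The exit status can be used as the condition in an if command: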
if [[ string =~ regexp ]]; then
    echo 'match!'
else
    echo 'no match'
fi
From the source: The GNU bash manual, Conditional Constructs and Bash Variables.
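The three exit statuses can be observed with a minimal sketch like the one below. The helper name match_status is hypothetical (it is not part of bash), and the regexp is kept in a variable so that the deliberately invalid pattern, an unterminated bracket expression, does not confuse the shell parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how a match attempt ended.
match_status() {
    local string=$1 regexp=$2 rc=0
    # Record the exit status of the match without letting a failure abort.
    [[ $string =~ $regexp ]] || rc=$?
    case $rc in
        0) echo 'match' ;;
        1) echo 'no match' ;;
        2) echo 'invalid regexp' ;;
    esac
}

match_status 'abc' 'b'    # prints "match"
match_status 'abc' 'x'    # prints "no match"
match_status 'abc' '[b'   # prints "invalid regexp"
```

Bash itself prints nothing for an invalid expression here, so checking for status 2 is the only way to tell "invalid" apart from "no match".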
The BASH_REMATCH Array
If the latest [[]]-expression matched the string, the matched part of the string is stored in the BASH_REMATCH array.
If the expression did not match, the exit status was 1 and the array is empty.
The entire matched string (BASH_REMATCH[0])
The first entry in the BASH_REMATCH array contains the entire matched string:
[[ "abcde" =~ b.d ]] # BASH_REMATCH[0] is now "bcd"
As an example, given a line of code, let us say you want to extract the end-of-line comment if there is one, and output "No comment!" otherwise:
line="command argument1 argument2 # Do something"
if [[ "${line}" =~ \#.*$ ]]; then
    echo "${BASH_REMATCH[0]}"
else
    echo 'No comment!'
fi
See Extended Regular Expressions for more information about how the #.*$ syntax works.
Parts of the matched string
Enclosing parts of the regular expression in parentheses makes the matched substrings available in the subsequent entries in the BASH_REMATCH array:
[[ "abcdef" =~ (b)(.)(d)e ]]
# Now:
# BASH_REMATCH[0]=bcde
# BASH_REMATCH[1]=b
# BASH_REMATCH[2]=c
# BASH_REMATCH[3]=d
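As a slightly more practical sketch of capture groups, the hypothetical helper parse_assignment below splits a line of the form key=value into its two parts. The function name and the assumed shape of a key (a letter or underscore followed by letters, digits or underscores) are illustrative assumptions, not something bash prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split "key=value" using two capture groups.
parse_assignment() {
    local line=$1
    if [[ $line =~ ^([A-Za-z_][A-Za-z0-9_]*)=(.*)$ ]]; then
        # Group 1 is the key, group 2 is everything after the first '='.
        echo "key=${BASH_REMATCH[1]} value=${BASH_REMATCH[2]}"
    else
        echo 'not an assignment'
    fi
}

parse_assignment 'EDITOR=vim'     # prints "key=EDITOR value=vim"
parse_assignment 'hello world'    # prints "not an assignment"
```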
Extended Regular Expressions
- The string to be matched should be appropriately quoted, to prevent various shell expansions from being performed.
- The # character must be escaped when it occurs as the first character in a regular expression, to avoid it being taken for an end-of-line comment marker:
[[ string =~ #ng ]]  # Invalid: syntax error near `]]'
[[ string =~ \#ng ]] # Valid: but 'string' does not match '#ng'
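A common way to sidestep this kind of escaping is to store the regular expression in a variable and expand it unquoted on the right-hand side of =~. The sketch below assumes nothing beyond bash itself; the variable names re, line and comment are arbitrary.

```shell
#!/usr/bin/env bash
# Inside single quotes the leading '#' needs no backslash.
re='#.*$'
line='command argument1 argument2 # Do something'

# $re must stay unquoted here, otherwise it is matched as a plain string.
if [[ $line =~ $re ]]; then
    comment=${BASH_REMATCH[0]}
else
    comment='No comment!'
fi
echo "$comment"    # prints "# Do something"
```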
Matching characters that have special meaning
Some characters have special meaning in regular expressions (at least ^$?*+.|()[]{}\, see below). They must be escaped (prefixed with \) to be matched literally:
# Valid, but does not match
> [[ 'a[^b]c' =~ a[^b]c ]]; echo $?
1
# Valid and matches
> [[ 'a[^b]c' =~ a\[\^b\]c ]]; echo $?
0
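Another way to match special characters literally, assuming bash 3.2 or later with default compatibility settings, is to quote that part of the pattern: a quoted portion is matched as a plain string rather than as a regular expression. The sketch below contrasts the two behaviours; the variable names are illustrative.

```shell
#!/usr/bin/env bash
literal='a[^b]c'   # a string that happens to contain regex metacharacters

unquoted_status=0
quoted_status=0
# Unquoted: the variable's content is interpreted as a regular expression.
[[ 'a[^b]c' =~ $literal ]]   || unquoted_status=$?
# Quoted: the variable's content is matched as a literal string.
[[ 'a[^b]c' =~ "$literal" ]] || quoted_status=$?

echo "unquoted: ${unquoted_status}, quoted: ${quoted_status}"
# prints "unquoted: 1, quoted: 0"
```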
Matching the end of a string ($)
Use $ to match the end of a string:
if [[ "abc/" =~ /$ ]]; then
    echo 'Ends with a /!'
fi
Matching the beginning of a string (^)
Use ^ to match the beginning of a string:
if [[ "/abc" =~ ^/ ]]; then
    echo 'Begins with a /!'
fi
Matching the entire string (^...$)
Use ^...$ to match the entire string:
if [[ "/home/jcdusse/index.html" =~ ^/home/.*\.html$ ]]; then
    echo 'Start is /home/ and end is .html!'
fi
Use parentheses to extract useful substrings.
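For instance, anchoring and parentheses can be combined as in the following sketch, which pulls the user name and the page name out of the same path. The variable names user and page are made up for the illustration.

```shell
#!/usr/bin/env bash
path='/home/jcdusse/index.html'

# Whole-string match: /home/<user>/<page>.html
if [[ $path =~ ^/home/([^/]+)/(.*)\.html$ ]]; then
    user=${BASH_REMATCH[1]}
    page=${BASH_REMATCH[2]}
    echo "user=${user} page=${page}"    # prints "user=jcdusse page=index"
fi
```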
Matching any character but... ([^...])
Use [^...] to match any character but the listed one(s):
[[ "abc" =~ a[^c]c ]]  # matches 'abc'
[[ "abc" =~ a[^b]c ]]  # no match
[[ "abc" =~ a[^cb]c ]] # no match
[[ "abc" =~ a[^cd]c ]] # matches 'abc'
This is useful for e.g. extracting the filename and extension from a path:
# Match: anything that is not a '/', then '.', then anything which is not a '.'
[[ "/abc/index.html" =~ ([^/]*)\.([^.]+)$ ]]
The substrings index and html are matched and available as BASH_REMATCH[1] and BASH_REMATCH[2].
Matching any number of occurrences (c*)
Use * to match any number of occurrences:
# 'b*' matches 'zero or more bs'
[[ "ac" =~ ab*c ]]   # match
[[ "abc" =~ ab*c ]]  # match
[[ "abbc" =~ ab*c ]] # match
[[ "bc" =~ ab*c ]]   # no match
Match the top-level directory
Zero or more characters which are not /, that are surrounded by /:
[[ /home/jcdusse/etc =~ ^/([^/]*)/ ]] # BASH_REMATCH[1] is 'home'
Split an absolute path into directory, base filename and extension
Match "Anything slash Anything dot Anything":
[[ "/home/jcdusse/index.html" =~ ^(.*)/(.*)\.(.*)$ ]]
echo "${BASH_REMATCH[@]}"
# Now:
# BASH_REMATCH[0]=/home/jcdusse/index.html
# BASH_REMATCH[1]=/home/jcdusse
# BASH_REMATCH[2]=index
# BASH_REMATCH[3]=html
Notice that the * quantifier is greedy, with the leftmost occurrence taking precedence. Hence, the first .* matches all the directories until the last /.
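The same greediness can be seen in a smaller sketch with two dots in the input: the first group grabs as much as it can, leaving only the last component for the second group. The file name and variable names here are illustrative.

```shell
#!/usr/bin/env bash
re='^(.*)\.(.*)$'

if [[ 'archive.tar.gz' =~ $re ]]; then
    first=${BASH_REMATCH[1]}
    second=${BASH_REMATCH[2]}
    echo "first=${first} second=${second}"    # prints "first=archive.tar second=gz"
fi
```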
Matching one or more occurrences (c+)
Use + to match one or more occurrences:
# 'b+' matches 'one or more bs'
[[ "ac" =~ ab+c ]]   # no match
[[ "abc" =~ ab+c ]]  # match
[[ "abbc" =~ ab+c ]] # match
[[ "bc" =~ ab+c ]]   # no match
Extract non-empty base filename and extension from a path
Extract "anything that is non-empty and contains no slashes, then a dot, then anything that is non-empty and contains no dots":
[[ "/abc/index.html" =~ ([^/]+)\.([^.]+)$ ]]
The substrings index and html are matched and available as BASH_REMATCH[1] and BASH_REMATCH[2], but a string like ~/.emacs would not be matched.
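Wrapped in a function, the same expression might be used as in the sketch below. The helper name split_filename and its fallback for names without an extension (plain parameter expansion to strip the directory part) are assumptions made for the illustration, not part of the original recipe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the base filename and extension of a path.
split_filename() {
    local path=$1
    if [[ $path =~ ([^/]+)\.([^.]+)$ ]]; then
        echo "base=${BASH_REMATCH[1]} ext=${BASH_REMATCH[2]}"
    else
        # No "name.ext" shape found: keep the last path component as the base.
        echo "base=${path##*/} ext="
    fi
}

split_filename '/abc/index.html'   # prints "base=index ext=html"
split_filename '~/.emacs'          # prints "base=.emacs ext="
```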
Matching a specific number of occurrences (c{n,m})
I never needed this in real life, so I'll keep this section brief:
[[ "abc" =~ ab{2,3}c ]]    # no match
[[ "abbc" =~ ab{2,3}c ]]   # match
[[ "abbbc" =~ ab{2,3}c ]]  # match
[[ "abbbbc" =~ ab{2,3}c ]] # no match
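One place where counted repetition could come in handy is checking the shape of fixed-width input. The sketch below, with the hypothetical helper looks_like_date, uses the exact-count form {n} to accept strings shaped like YYYY-MM-DD. It only checks the shape: a string such as 2024-99-99 would pass as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the argument is shaped like YYYY-MM-DD.
looks_like_date() {
    local re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
    [[ $1 =~ $re ]]
}

looks_like_date '2024-01-15' && echo 'accepted: 2024-01-15'
looks_like_date '24-1-15'    || echo 'rejected: 24-1-15'
```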
Match an optional character (c?)
Use ? to match a single optional character:
[[ "ac" =~ ab?c ]]   # match
[[ "abc" =~ ab?c ]]  # match
[[ "abbc" =~ ab?c ]] # no match
Extract the top-level directory from any path
Allow for the path to be absolute, relative, or simply /:
# BASH_REMATCH[1] is:
[[ "/abc/..." =~ ^/?([^/]*)/ ]]    # 'abc'
[[ "abc/..." =~ ^/?([^/]*)/ ]]     # 'abc'
[[ "/index.html" =~ ^/?([^/]*)/ ]] # '' (empty; only BASH_REMATCH[0] is '/')
[[ "index.html" =~ ^/?([^/]*)/ ]]  # no match
Negating an expression
Use ! to negate an expression:
# exit status 0
[[ ! "a" =~ b ]]
Note that the following two statements are equivalent, if the expression is simple or contained in parentheses:
[[ ! expression ]]
# same as
! [[ expression ]]
This is also useful for negating a subexpression contained in a composite expression - see AND and OR.
Attention: !expression takes precedence over || and &&, but it is probably better to use possibly redundant parentheses if there is the slightest risk of confusion - a couple of years down the line, somebody might spend an unreasonable amount of time deciphering the original intent.
The BASH_REMATCH array is set as if the negation was not there (only the exit status changes), which I suppose is the least insane thing to do.
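A small sketch of that behaviour: the negated test below fails because the string does contain a b, yet the captured group is still available in the else branch. The variable names are illustrative.

```shell
#!/usr/bin/env bash
re='(b)'

if [[ ! 'abc' =~ $re ]]; then
    result='no b found'
else
    # The match itself succeeded, so BASH_REMATCH is populated as usual.
    result="found: ${BASH_REMATCH[1]}"
fi
echo "$result"    # prints "found: b"
```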
Matching alternatives
Use (a|b) to match a or b:
# check that the string ends in ".jpg" or ".png"
[[ "clean.png" =~ \.(jpg|png)$ ]]
[[ "clean.jpg" =~ \.(jpg|png)$ ]]
[[ ! "file.png.txt" =~ \.(jpg|png)$ ]]
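Since the alternatives sit inside parentheses, the one that actually matched is also available as a captured group. The sketch below uses that to report the image type; the helper name image_type and the 'unknown' fallback are assumptions for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print which of the alternatives matched.
image_type() {
    local re='\.(jpg|png)$'
    if [[ $1 =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo 'unknown'
    fi
}

image_type 'clean.png'      # prints "png"
image_type 'clean.jpg'      # prints "jpg"
image_type 'file.png.txt'   # prints "unknown"
```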
Examples
The previous sections contained code samples that were tailored to illustrate a point. These are examples of practical problems I have solved using regular expression matching. Note that these examples are not meant to be perfect solutions - making outrageous assumptions and imposing draconian restrictions on the input is a great way to get stuff done quicker.
Match strings which are not empty and do not contain slashes
Ensure that the argument ( dir_name) is not empty and contains neither slashes nor dots:
if ! [[ "${dir_name}" =~ ^[^/]+$ && "${dir_name}" =~ ^[^.]+$ ]]; then
    echo 'Invalid argument: '"${dir_name}"
fi
# dir_name="abc": OK
# dir_name="abc/def": KO
# dir_name="abc.def": KO
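The two tests could also be folded into a single bracket expression that excludes both characters at once. The following is a sketch of that variant, with a hypothetical helper name; it should accept and reject the same inputs as the version above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the name is non-empty and contains
# neither '/' nor '.'.
is_valid_dir_name() {
    local re='^[^/.]+$'
    [[ $1 =~ $re ]]
}

for name in 'abc' 'abc/def' 'abc.def' ''; do
    if is_valid_dir_name "$name"; then
        echo "\"${name}\": OK"
    else
        echo "\"${name}\": KO"
    fi
done
```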
Extract the base filename
Assuming the filename is preceded by at least one slash, and that there is an extension, extract the base filename:
if [[ "abc/def/test.html" =~ .*/(.+)\..* ]]; then
    echo "${BASH_REMATCH[1]}"
fi
# outputs "test"
Extract the top directory
Given a path, extract the top directory:
if [[ "${path}" =~ ^/?([^/]+)/? ]]; then
    echo "${BASH_REMATCH[1]}"
fi
# the above outputs "abc" for the following inputs:
# path="abc/def/test.html"
# path="/abc/def/test.html"
# path="abc"
# path="/abc"
# path="abc/"
# path="/abc/"
A simpler but less robust expression would be ([^/]+), which gets all the above cases right but also accepts paths like //abc. That is, of course, only a problem if the input is a Unix-style file path; in XPath, //abc is perfectly legal.
Filename must have an .html extension
Check that a filename has an .html extension:
if ! [[ "${filename}" =~ [^.]+\.html$ ]]; then
    echo -e "\n\"${filename}\": must have .html extension" >&2
    exit 42
fi
# filename="abc.html": OK
# filename="abc": KO
The expression matches "a non-empty sequence of anything that is not a dot", followed by ".html", followed by the end of the string.
Extract the value of an XML element attribute
Extract the value of an XML attribute:
line='<something att1="value1" />'
if [[ "${line}" =~ att1=\"([^\"]*)\" ]]; then
    echo "${BASH_REMATCH[1]}"
fi
This example does not handle attributes using single quotes.
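One possible way to cover both quoting styles is to combine alternation with nested groups, as in the sketch below. The helper name attr_value is hypothetical, the attribute name att1 is hard-coded as in the example above, and the sketch relies on an unmatched group expanding to an empty string, so that concatenating the two inner groups yields whichever one matched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of att1, whichever quotes are used.
attr_value() {
    local line=$1
    # Group 2 captures a double-quoted value, group 3 a single-quoted one.
    local re="att1=(\"([^\"]*)\"|'([^']*)')"
    if [[ $line =~ $re ]]; then
        echo "${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
    fi
}

attr_value '<something att1="value1" />'   # prints "value1"
attr_value "<something att1='value2' />"   # prints "value2"
```

Like the original, this remains a line-oriented shortcut rather than an XML parser.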