wstribizew

Extracting Text between Two Strings with Regular Expressions

Feb 06, 2021

An extremely common regex task is extracting a stretch of text between two strings, or between two patterns. Before you start conjuring up a regular expression for your scenario, consider answering the following questions:

  • What are your expected match boundaries? In other words, what are the left- and right-hand delimiters? When should a match start? When should it end, or where should it stop?

  • If you cannot think of a viable right-hand delimiter (i.e. if you do not know where to stop matching), can you come up with a clear description of the match itself, that is, of the characters or patterns it can consist of?

  • Can the matches span across multiple lines? If yes, are you passing the right string to the regex engine?

  • Do you need to include the boundaries (delimiters) in the match?

  • Can your matches overlap?

  • Do you need to get multiple (or all) matches from the input string?


As you can see, the problem is many-faceted, so it is no wonder that people keep asking this question on Stack Overflow time and again.

It is usually clear where a match should start: it is either the start of the string (then you use ^ at the start of your expression), or a specific word or pattern. The right-hand delimiter is often the end of a line. In these cases, (.*) should suffice: a capturing group defined with two unescaped parentheses (in POSIX BRE, \(.*\) should be used, but let's only cover NFA regex flavors here) and the .* pattern that matches any zero or more characters other than line break characters, as many as possible. So, keyword:\s*(.*) will match "keyword:", then zero or more whitespace characters, and then will capture the rest of the line into Group 1. All you need to do is grab it from the match data object using the means your programming language provides (here is how you do it in JavaScript).
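As a minimal Python sketch of the keyword:\s*(.*) pattern described above (the sample text and the variable names are made up for illustration):

```python
import re

# Hypothetical sample input; only the first line contains the keyword.
text = "keyword:   some value here\nanother line"

# .* does not cross line breaks, so Group 1 holds the rest of the first line only.
match = re.search(r"keyword:\s*(.*)", text)
rest_of_line = match.group(1) if match else None

print(rest_of_line)
```

Checking the result of re.search before calling group(1) matters, because re.search returns None when the keyword is absent.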

If your right-hand delimiter is unknown, but you know what kind of patterns to expect, building the right pattern is usually a piece of cake. If you want to match a number after "key":", you can use "key":"[-+]?\d*\.?\d+. If you do not know how to describe the pattern, but you know your expected match should stop at the first occurrence of some string or word, you can use the lazy dot pattern, .*?. For example, \bdog\b(.*?)\bcat\b will match and capture into Group 1 any zero or more characters other than line break characters, as few as possible, between the two whole words dog and cat (see a JavaScript-related SO answer).
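Both cases can be sketched in Python as follows. The input strings are invented for the example, and the capturing group around the number part is an addition to the pattern quoted above, so that the number can be read from Group 1:

```python
import re

# Hypothetical inputs for illustration.
json_like = '{"key":"-12.5", "other":"7"}'
sentence = "the dog chased the cat, then another dog met another cat"

# Known match shape: an optionally signed integer or decimal number.
number = re.search(r'"key":"([-+]?\d*\.?\d+)', json_like)
number_value = number.group(1) if number else None

# Unknown match shape: lazily take everything up to the first whole word "cat".
between = re.findall(r"\bdog\b(.*?)\bcat\b", sentence)

print(number_value)
print(between)
```

Because the dot is lazy, each match stops at the nearest following "cat" rather than running to the last one in the string.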

If your expected matches may span across multiple lines, do not forget to check how your regex engine treats the . metacharacter. See the "How do I match any character across multiple lines in a regular expression?" SO question. In most NFA regex flavors, use the DOTALL modifier, or add the (?s) inline modifier (or (?m), in Ruby only) at the start of patterns like cat.*?dog, or use common workarounds like [\s\S] / [\d\D] / [\w\W]. Although [\s\S] is used more often, it is only [\w\W] that works (at the time of writing) in Visual Studio Code, and [\s\S] does not (but [\s\S\r] does). It is also important to check how you are passing text to the regex method: if you expect multiline matches, make sure you are passing a multiline text to the regex engine. For example, if you iterate over a file with for line in file in Python, each line is matched on its own, so you cannot expect any multiline matches.
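Here is a small Python sketch of these options. The sample text is hypothetical, and io.StringIO merely stands in for a file object that is read line by line:

```python
import io
import re

# Hypothetical three-line input: the delimiters sit on different lines.
text = "cat\nsome text\ndog"
pattern = r"cat(.*?)dog"

# Without DOTALL the dot stops at line breaks, so nothing is found.
no_flag = re.search(pattern, text)

# Three equivalent ways to let the match cross line breaks.
with_flag = re.search(pattern, text, re.DOTALL)
inline_flag = re.search(r"(?s)cat(.*?)dog", text)
workaround = re.search(r"cat([\s\S]*?)dog", text)

# Feeding the engine one line at a time defeats even DOTALL.
per_line = [re.search(pattern, line, re.DOTALL) for line in io.StringIO(text)]

print(no_flag, repr(with_flag.group(1)), per_line)
```

The last line shows why the way the text is passed matters: none of the individual lines contains both delimiters, so every per-line search fails.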

Capturing groups are a universal mechanism to extract substrings in specific contexts only. Sometimes, frameworks or tools do not offer an easy way to process capturing groups, or it may be easier to process a whole match value rather than meddle with several groups. In these cases, lookarounds (lookaheads and lookbehinds) can be used, provided the regex library supports these constructs. They check the context without adding the text they match to the overall match value, i.e. they keep that text out of the match. Also, the regex index remains where it was before the lookaround pattern was tried, so you may chain lookarounds to set multiple conditions or contexts at one and the same location. A good example of how lookarounds work, and why they are sometimes the better choice, is matching whitespace in between digits. A good lookahead explanation is given in the "Find 'word' not followed by a certain character" question.
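A short Python sketch of both uses follows. The inputs and the bracket delimiters are assumptions chosen for the example; note that Python's built-in re module only accepts fixed-width lookbehind patterns:

```python
import re

# Remove whitespace only where it sits between two digits.
# The digits are checked by the lookarounds but are not part of the match.
raw = "1 234 567 apples and 2 pears"
compact = re.sub(r"(?<=\d)\s+(?=\d)", "", raw)

# Extract text between [ and ] as the whole match, with no capturing group.
bracketed = "a [b] c [de]"
inside = re.findall(r"(?<=\[)[^\]]*(?=\])", bracketed)

print(compact)
print(inside)
```

Since the delimiters are only asserted and never consumed, re.findall returns the inner text directly, which is handy when a tool can only hand back the whole match value.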

If your matches can overlap, you have to use lookarounds. In Python, if you use the PyPI regex module, you may pass an overlapped=True argument to regex.findall and regex.finditer, but the effect is the same as with a capturing group inside a lookahead: (?=(your_pattern_here)). E.g. (?=(\w+abc)) will return a [123abc, 23abc, 3abc] list for a 123abc string. See more examples of extracting overlapping matches in JavaScript, C#, Java, C++, Python, PHP, Ruby. Note that regex is not capable of extracting all possible matches that share the same start location: (?=(abc\w+)) will find a single abcedf match in the abcedf string and not [abcedf, abced, abce, abc]. There is one exception, though: Raku.
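The two examples above can be checked with Python's built-in re module, as a minimal sketch (the third-party regex module is deliberately not used here, so the snippet runs without extra installs):

```python
import re

# The lookahead consumes nothing, so the engine retries at every position
# and Group 1 captures a match starting at each of them.
overlapping = re.findall(r"(?=(\w+abc))", "123abc")

# Matches sharing the same start position are not enumerated:
# only the single greedy result from position 0 is returned.
same_start = re.findall(r"(?=(abc\w+))", "abcedf")

print(overlapping)
print(same_start)
```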

A regular expression is just a text string that is used by a regex library to search for matching strings. If you want to get the first, second or all matches, you need to use the specific regex functions or methods that your programming environment provides. Usually, programming languages provide at least two regex matching functions, one for getting the first match, and the other for getting all of them. There are Regex.Match and Regex.Matches in C#, re.search and re.finditer in Python, Regexp#match and String#scan in Ruby, preg_match and preg_match_all in PHP, stringr::str_extract and stringr::str_extract_all in R, etc. In JavaScript, in order to get multiple matches, the pattern must be defined with the /g global flag. With that flag, the String#match method will return all overall matches and drop all capturing group values if capturing groups were defined in the pattern, while String#matchAll will keep the information about all the captures, returning an iterator of match objects.
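In Python, the difference between the "first match" and "all matches" functions can be sketched like this (the id=... input is invented for the example):

```python
import re

# Hypothetical input with three occurrences of the same pattern.
text = "id=10; id=25; id=7"
pattern = re.compile(r"id=(\d+)")

# First match only: search returns one match object, or None.
first = pattern.search(text)
first_value = first.group(1) if first else None

# All matches: finditer yields one match object per occurrence,
# keeping both the whole match and the captured groups.
all_values = [m.group(1) for m in pattern.finditer(text)]
whole_matches = [m.group(0) for m in pattern.finditer(text)]

print(first_value, all_values, whole_matches)
```

Using finditer rather than findall keeps full match objects available, so the whole match, each group and the match positions can all be read from the same result.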

Happy regexing!
