Pick the Best RegEx Library to Discover Sensitive Data

When improving your product’s security, processing large quantities of information in log files and internet packets is inevitable. You might need to find a specific data type when analyzing logs or even search for digital evidence if a security incident occurs. One of the key ways to help your team quickly detect desired information within tons of data is by using regular expressions (RegEx).

However, creating efficient regular expressions and choosing the most suitable RegEx library (engine) is not that easy.

In this article, we briefly explore the advantages of working with regular expressions in cybersecurity and take a look at five popular RegEx libraries. We provide examples of RegEx to search for different types of data and use these examples to check how each RegEx engine performs. Apart from finding the fastest RegEx library, you’ll also find comparison tables with test results and our recommendations on which libraries to use for which occasions.

This article will be helpful for technical leaders whose development teams are working on improving product cybersecurity and enhancing the protection of sensitive data.

The role of regular expressions in a product’s cybersecurity

What are regular expressions?

A regular expression, also called RegEx, is a string of one or more characters that defines a search pattern. Developers use such patterns to find matches in input texts with the help of RegEx engines.

Most regular expressions consist of different combinations of constants (sets of strings) and operators (symbols that denote operations over these strings), making RegEx a powerful tool for pattern matching.

You can apply RegEx for various purposes like parsing large amounts of text to find specific character patterns or adding extracted strings to a collection to generate a report. Regular expressions also help engineers find all text files in a file manager, search for specific data types, and validate text to ensure it matches a predefined pattern. For example, you can use regular expressions for finding social security numbers.

Most popular scripting languages, including Perl, Python, PHP, and JavaScript, support RegEx. Apart from scripts, regular expressions can appear in the main code of a program, especially if your project is focused on cybersecurity. You can also use regular expressions when:

Working with Java
Searching text in word processors like Microsoft Word
Working directly from the command line and in text editors to find text within a file

However, RegEx isn’t suitable for parsing nested markup languages like HTML and XML.

How can you use RegEx for your product’s cybersecurity?

Let’s explore the most common ways your team can use RegEx to improve your product’s cybersecurity posture:

Ways to use regular expressions for cybersecurity purposes

Search for numerical patterns. Searching for numerical strings, such as credit card, social security, or phone numbers, is often tricky. The reason is that this data can be presented in different formats. But RegEx offers a convenient way to find matches. Later in this article, we show detailed examples of using regular expressions to search for credit card numbers and phone numbers.
Analyze log files. RegEx is useful for analyzing log files, which usually contain thousands of recorded events, especially when no automated search tool is available. Searching with regular expressions instead of multiple simple terms helps forensic experts get accurate results when investigating a security incident.
Configure firewall rules. Another way to use RegEx is to specify firewall rules. For instance, you can use RegEx to create rules to block requests for certain file types.
Set proxy rules. Thoroughly written RegEx can help you filter traffic on debugging proxies. Instead of going through numerous requests via your proxy windows, you can isolate requests going to a specific subdomain of your web application.
Scan for malware. Regular expressions can also assist your team with identifying malware. For example, some RegEx-driven tools are able to identify malware by creating descriptions that look for certain characteristics. RegEx patterns help detect specific text or binary patterns in files that might indicate a file is malicious.
Pinpoint relevant evidence. When an incident happens, cybersecurity specialists usually have to deal with large volumes of data in different formats. With regular expressions, you can define rules on what to match in a search operation, specifying metacharacters and quantifiers along with plain text. Thus, you can relatively quickly find digital evidence within tons of data.

Applying RegEx for forensic examination is a proven way to locate data because regular expressions are great at searching for data that matches a specific pattern. During such an examination, investigators use keywords to find exact string (word) matches and use regular expressions to find strings that match a pattern. To do that, they can apply tools that offer RegEx search engines.

Researchers also study different applications of regular expressions for forensic purposes. For example, A Regular Expression Search Primer for Forensic Analysts [PDF] investigates the use of regular expressions and some Linux commands for locating and extracting text. Another research paper, Regex: an experimental approach for searching in cyber forensic, explores ways to reduce the search space by identifying and filtering known files to speed up evidence identification.

Whatever your project’s purpose, you need to choose the most suitable RegEx library to efficiently use regular expressions. In the next section, we take a look at a few popular libraries and compare them using practical examples.

Practical comparison of RegEx libraries

For this article, we chose five popular C++ regular expression libraries. Before we start our practical comparison, let’s take a look at their main characteristics:

Boost (Boost.Xpressive) is an advanced, object-oriented RegEx template library for C++. It allows you to write regular expressions as strings that are parsed at runtime, or as expression templates that are parsed at compile time.
Regular expressions library (STD) is the RegEx engine that ships with the C++ standard library in the <regex> header. It provides a class that represents regular expressions, which are a kind of mini-language used to perform pattern matching within strings. Almost all operations with regexes can be characterized as operating on several objects: the target sequence, pattern, matched array, and replacement string.
Lightgrep is a RegEx engine designed for digital forensics. This library helps users search for many patterns simultaneously, search binary data as a stream (not as discrete lines of text), and search for patterns in many different encodings.
RE2 is a fast, secure, thread-friendly RegEx C++ library designed as an alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python.
JPCRE2 is a C++ wrapper for the PCRE2 library, which originally is a set of functions written in C that implement regular expression patterns. JPCRE2 provides some C++ wrapper classes and functions to perform RegEx operations such as match and replace.

Library	Operating system	License	Installation
Boost (Boost.Xpressive)	macOS, Linux, Windows	Boost Software License v1.0	By including header files
Regular expressions library (STD)	macOS, Linux, Windows	Apache License v2.0	By including header files
Lightgrep	macOS, Linux, Windows*	GNU General Public License v3.0	By installing dll/so files
RE2	macOS, Linux, Windows	BSD 3-Clause “New” or “Revised” License	By installing dll/so files
JPCRE2	macOS, Linux, Windows	BSD 3-Clause “New” or “Revised” License	By installing dll/so files

* complicated to build for Windows

Now that we know a bit more about these libraries, we need to install them. Boost.Xpressive and STD are the easiest to install, as you only need to include the necessary header files in your project. To use Lightgrep, RE2, and JPCRE2, you need to add their dll/so files to your project. Building these libraries on different operating systems is not challenging, except for building Lightgrep on Windows. According to the Lightgrep documentation on GitHub, Windows builds are easier with a Linux-hosted cross-compiler.

To compare how these libraries perform when searching for information using regular expressions and to find out which is the best C++ RegEx library, let’s:

Create several types of regular expressions
Run tests using the created RegEx

Creating regular expressions

Using RegEx to discover sensitive data is very convenient, as some types of confidential information have predictable search patterns. Common examples of such data types are passwords, addresses, biometric data, and keys (like SSH).

For this article, we’ve decided to try RegEx to search for credit card numbers, phone numbers, and email addresses.

To create our sample datasets, we’ll start with a piece of lorem ipsum text. In this text, we’ll add data of different types in a random order. Here are the four sample datasets we’ll use in our library comparison:

With added credit card numbers
With added email addresses
With added phone numbers
With no added data

To identify sensitive data of different types, you need to pick regular expressions that work with the confidential information you’re searching for. But note that specific search patterns can be complicated, slowing down performance of even the best RegEx library. Therefore, developers should test chosen patterns for their efficiency.

By pattern complexity, we mean how difficult it is to describe the pattern as a flowchart. A complex pattern usually has many conditions of different types, branches, loops, and filters.

Let’s move to creating simple and complex regular expressions for the chosen data types.

1. RegEx to search for credit card numbers

Payment card numbers can be composed of 8 to 19 digits, depending on the country and bank. For the purposes of this article, we’ll be working with 16-digit numbers, as they are the most commonly used, and with 13-digit numbers to add some variety.

Let’s pick a simple regular expression for a credit card number:

4[0-9]{15}

This describes a string pattern that starts with the digit 4 followed by 15 more digits with values from 0 to 9, for 16 digits in total.

Image 1. A simple RegEx for 16-digit credit card numbers

An example of a string that meets the conditions of this RegEx for credit card numbers is 4123456789012345.

Now, let’s use a complex regular expression to search for both 13-digit and 16-digit credit card numbers:

4[0-9]{12}(?:[0-9]{3})?

This matches strings that begin with the digit 4 followed by 12 more digits with values from 0 to 9. After this sequence, a string may optionally have three more digits with values from 0 to 9. Thus, we can find not only credit card numbers with 16 digits but those with 13 digits as well.

Image 2. A complex RegEx for 13-digit and 16-digit credit card numbers

Examples of strings that would meet this RegEx are 4123456789012 and 4123456789012321.
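
To make the difference between the two credit card patterns concrete, here is a minimal sketch that runs both of them over a short piece of text using the C++ standard <regex> header. The helper name find_all and the constant names are our own illustrative choices, not part of any library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every non-overlapping match of `pattern` in `text`.
// find_all is a hypothetical helper name used only for this illustration.
std::vector<std::string> find_all(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// The two credit card patterns from this section.
const std::string kSimpleCard = R"re(4[0-9]{15})re";
const std::string kComplexCard = R"re(4[0-9]{12}(?:[0-9]{3})?)re";
```

On a text that contains both 4123456789012345 and 4123456789012, the simple pattern should report only the 16-digit number, while the complex pattern should report both. Note that neither pattern checks what surrounds the digits, so both would also match inside a longer run of digits.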

2. RegEx to search for email addresses

As a simple expression to describe strings that have an email address format, we’ll use

\S+@\S+\.\S+

This RegEx searches for lines that have:

A sequence of symbols without spaces before the @ symbol
The @ symbol
A sequence of symbols without spaces after the @ symbol
A . symbol
A sequence of symbols without spaces

Image 3. A simple RegEx for email addresses

Examples of strings that would meet such conditions are user@example.com and asda?sd@asda0sd.com.
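
The simple email pattern is very permissive, which a quick check makes visible. The sketch below uses std::regex_match from the C++ standard library to test whether a whole token has the shape the pattern describes. The function name looks_like_email is a hypothetical name chosen for this example.

```cpp
#include <regex>
#include <string>

// True when the whole token has the shape described by the simple pattern:
// non-space characters, an @ symbol, non-space characters, a dot, non-space characters.
// looks_like_email is a hypothetical helper name, not a library function.
bool looks_like_email(const std::string& token) {
    static const std::regex re(R"re(\S+@\S+\.\S+)re");
    return std::regex_match(token, re);
}
```

With this helper, both user@example.com and asda?sd@asda0sd.com are accepted, while a token with no dot after the @ symbol, such as user@example, is rejected.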

A complex RegEx for email addresses is much longer:

(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This regular expression also matches strings that have an email address format, but it includes additional alternations, repetitions, and character filters. Here’s a description of several constructions used in this RegEx:

(?:) — Makes a grouping that cannot be referenced
[a-z] — Sets possible options for characters
? — Makes the expression optional
| — Sets alternation of two expressions on the left and right side of |
* — Means that an expression matches zero or more of the preceding character

Image 4. A complex RegEx for email addresses

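An example of a string that would meet the conditions of this complex regular expression is user@example.com. The other example we mentioned for the simple expression, asda?sd@asda0sd.com, also matches, because ? is one of the characters this expression allows before the @ symbol. The stricter rules show up on malformed input: user@@example.com, for instance, matches the simple expression but not the complex one.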

3. RegEx to search for phone numbers

For this data type, let’s use the following simple RegEx:

\(\d{3}\) \d{3}-\d{4}

It describes lines with the (###) ###-#### format, where ### stands for three digits, and #### stands for four digits.

Image 5. A simple RegEx for phone numbers

An example of a line that would match this regular expression is (123) 456-7890.

For a complex regular expression for phone numbers, let’s use the following:

\(([0-9]{1,4})\)([ .-]?)([0-9]{1,4})([ .-]?)([0-9]{1,4})

It describes a line in the (####)%####%#### format, where #### is a sequence of one to four digits, and the % symbol stands for an optional separator: a space, a dot/period, or a hyphen.

Image 6. A complex RegEx for phone numbers

A few examples of lines that meet such conditions are (123) 456-7890, (123) 456 7890, and (1234)-456-70.
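
Here is a small sketch that checks candidate strings against both phone number patterns with the C++ standard <regex> header. The literal parentheses around the first digit group are escaped as \( and \) so that they are not treated as grouping operators. The function names are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Simple pattern: exactly the (###) ###-#### format.
bool matches_simple_phone(const std::string& s) {
    static const std::regex re(R"re(\(\d{3}\) \d{3}-\d{4})re");
    return std::regex_match(s, re);
}

// Complex pattern: one to four digits in parentheses, then two more groups of
// one to four digits, each preceded by an optional space, dot, or hyphen.
bool matches_complex_phone(const std::string& s) {
    static const std::regex re(
        R"re(\(([0-9]{1,4})\)([ .-]?)([0-9]{1,4})([ .-]?)([0-9]{1,4}))re");
    return std::regex_match(s, re);
}
```

The simple pattern accepts only (123) 456-7890 from the examples above, while the complex pattern accepts all three. Neither accepts a number written without parentheses, such as 123-456-7890.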

4. RegEx to search for words

Apart from exploring the work of regular expressions designed to search for specific data types, we’d also like to check how different alternative RegEx conditions influence the search.

Let’s start with two alternative options for simple word searching:

1. A RegEx that uses a character range from a to z. It matches one or more lowercase letters, optionally followed by a hyphen and more lowercase letters:

[a-z]+-?[a-z]*

Image 7. A RegEx for words that contain a given letter from a to z

2. A RegEx that lists every letter from a to z explicitly inside square brackets, with the letters divided by the | symbol. Note that inside square brackets, | is a literal character rather than the alternation operator, so this character class matches any letter from a to z as well as the | symbol itself:

[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]+-?[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]*

Image 8. A complex RegEx for words search
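
The two word patterns behave almost identically on ordinary text, but they differ on input that contains the | symbol. The sketch below, which assumes the C++ standard <regex> engine, shows this. The helper first_match and the constant names are hypothetical.

```cpp
#include <regex>
#include <string>

// First match of `pattern` in `text`, or an empty string when nothing matches.
// first_match is a hypothetical helper name used only for this illustration.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

// Word pattern 1: a character range.
const std::string kRangeWord = "[a-z]+-?[a-z]*";

// Word pattern 2: every letter listed explicitly, divided by |.
const std::string kListWord =
    "[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]+-?"
    "[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]*";
```

On lorem ipsum text, both patterns return the same words. On the input a|b, however, the range pattern stops at a, while the list pattern consumes a|b, because | is part of its character class.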

Now, let’s move to RegEx examples that work with capture groups — mechanisms that allow you to highlight and save matching text for further use. When a RegEx matches the text, any content within a capture group is saved in temporary variables. You can use those variables later in code.
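
As a brief illustration of how captured text can be used later in code, here is a sketch based on the C++ standard <regex> header. It wraps two parts of a (###) ###-#### phone number in parentheses to form capture groups and reads them back from std::smatch. The function name split_phone is a hypothetical name, and the pattern is our own example rather than one of the tested expressions.

```cpp
#include <regex>
#include <string>
#include <utility>

// Split a phone number shaped like (###) ###-#### into its area code and local part.
// Returns a pair of empty strings when the input does not match.
// split_phone is a hypothetical helper name used only for this illustration.
std::pair<std::string, std::string> split_phone(const std::string& s) {
    // Group 1 captures the three digits inside the literal parentheses;
    // group 2 captures the ###-#### part.
    static const std::regex re(R"re(\((\d{3})\) (\d{3}-\d{4}))re");
    std::smatch m;
    if (!std::regex_match(s, m, re)) {
        return {};
    }
    return {m[1].str(), m[2].str()};
}
```

For the input (123) 456-7890, the first element of the returned pair holds 123 and the second holds 456-7890.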

Let’s create two more regular expressions to check whether the search will slow down if a RegEx includes a link to a capture group (marked as \w):

3. A RegEx that points directly to a capture group:

\w+-?\w*

4. A RegEx that includes a link to capture group №1:

\w+-?1*

Image 9. A RegEx that points directly to a capture group

Image 10. A RegEx with a link to a capture group

With all regular expressions and samples prepared, let’s finally move to testing and see the RegEx libraries in action.

Testing created RegEx in chosen libraries

To see how fast each of the chosen libraries works, we need to test their speed of processing data samples N times using regular expressions.

We’ve prepared four sample types and regular expressions for them:

Sample with added credit card numbers — simple and complex credit card expressions
Sample with added email addresses — simple and complex email expressions
Sample with added phone numbers — simple and complex phone number expressions
Sample with no sensitive data added — word expressions 1, 2, 3, and 4

We will now run the tests for each of the chosen libraries using this workflow. Below, we show the results in the comparison tables and discuss them.
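
For readers who want to reproduce this kind of measurement, processing a sample N times and timing it can be sketched with the standard library alone. The code below is an assumed minimal harness for the STD engine, not the exact code behind this article's results. The names BenchResult and run_benchmark are hypothetical, and compiling the pattern once outside the timed loop is a design choice made for this sketch.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Result of one benchmark run (hypothetical type for this illustration).
struct BenchResult {
    std::size_t matches;  // total number of matches across all iterations
    double seconds;       // elapsed wall-clock time
};

// Search `sample` for all matches of `pattern`, repeated `iterations` times.
BenchResult run_benchmark(const std::string& sample, const std::string& pattern,
                          std::size_t iterations) {
    const std::regex re(pattern);  // assumption: compile once, outside the timed loop
    std::size_t matches = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        for (std::sregex_iterator it(sample.begin(), sample.end(), re), end; it != end; ++it) {
            ++matches;
        }
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return {matches, elapsed.count()};
}
```

Counting matches inside the loop also gives a simple sanity check: the total should equal the number of matches in the sample multiplied by the number of iterations.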

Test 1. Running samples with sensitive data

Here’s a table with results, showing the speed at which each library processes the samples. We ran 1 million iterations.

Library name	Simple credit card RegEx, seconds	Complex credit card RegEx, seconds	Simple email RegEx, seconds	Complex email RegEx, seconds	Simple phone number RegEx, seconds	Complex phone number RegEx, seconds
Boost (Boost.Xpressive)	64.5	64.8	242.6	1902.5	72.4	76.5
Regular expressions library	403.4	403.4	1707.5	2284.7	406.5	475.6
Lightgrep	10.7	N/A*	241.3	N/A	9.6	14.9
RE2	2.8	2.8	12.3	12.1	3.3	3.6
JPCRE2	1.3	1.6	159.3	160.4	1.9	3.4

*Some results have a Not Applicable (N/A) value because the liblightgrep library doesn’t support non-capturing groups that are used in some RegEx we tested. For more details on that, check out the cheat sheet [PDF] from the liblightgrep library.

It’s hard to determine the exact number of steps required for a single RegEx iteration, as libraries don’t provide APIs for accessing this information. To test RegEx, we used regexlearn.com, and to find the approximate number of steps, we used regex101.com. Both of these utilities are free.

	Simple credit card RegEx	Complex credit card RegEx	Simple email RegEx	Complex email RegEx	Simple phone number RegEx	Complex phone number RegEx
Number of steps for one iteration	56	110	16926	23655	192	432

Now, let’s see the approximate results for each library in steps per second.

Library name	Simple credit card RegEx, steps per second	Complex credit card RegEx, steps per second	Simple email RegEx, steps per second	Complex email RegEx, steps per second	Simple phone number RegEx, steps per second	Complex phone number RegEx, steps per second
Boost (Boost.Xpressive)	0.868217M	1.697531M	69.76917M	12.43364M	2.651934M	5.647059M
Regular expressions library	0.13882M	0.272075M	9.912738M	10.35366M	0.472325M	0.908326M
Lightgrep	5.233645M	N/A	70.14505M	N/A	20M	28.99329M
RE2	20M	39.28571M	1376.098M	1954.959M	58.18182M	120M
JPCRE2	43.07692M	68.75M	106.2524M	147.4751M	101.0526M	127.0588M
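
The steps-per-second figures above are derived values: the approximate number of steps for one iteration is multiplied by the number of iterations (1 million) and divided by the measured time in seconds. A minimal sketch of that arithmetic, with a hypothetical function name, looks like this:

```cpp
// Approximate throughput in millions of RegEx steps per second.
// msteps_per_second is a hypothetical helper name used only for this illustration.
double msteps_per_second(double steps_per_iteration, double iterations, double seconds) {
    return steps_per_iteration * iterations / seconds / 1e6;
}
```

For example, RE2 needed 2.8 seconds for 1 million iterations of the simple credit card RegEx at 56 steps each, which gives 56 × 1,000,000 / 2.8 = 20 million steps per second.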

Results: On average, Boost (Boost.Xpressive) and Regular expressions library showed the worst results. RE2 delivered decent and fairly stable speed across all samples. JPCRE2 was a little faster than RE2 on the credit card and phone number samples but roughly 13 times slower on both email samples. You can also notice that all libraries slow down when processing samples with the simple and complex email regular expressions.

Test 2. Running samples with word expressions

Here’s a table showing the speed of each library at processing samples for word search. We ran 1 million iterations this time as well.

Library name	Word RegEx [a-z], seconds	Word RegEx [a|b…|y|z], seconds	Word RegEx that points to a capture group, seconds	Word RegEx with a link to a capture group, seconds
Boost (Boost.Xpressive)	406.5	402.7	370.2	378.2
Regular expressions library	766.2	774.7	792	792.3
Lightgrep	153.9	157.5	169.7	163
RE2	69	69.9	76.3	79.7
JPCRE2	42	41.7	46.6	47.4

Once again, let’s find the approximate number of steps for handling one iteration using regex101.com:

	Word RegEx [a-z]	Word RegEx [a|b…|y|z]	Word RegEx that points to a capture group	Word RegEx with a link to a capture group
Number of steps for one iteration	2380	2380	2584	2712

And here are the approximate results for each library in steps per second:

Library name	Word RegEx [a-z], steps per second	Word RegEx [a|b…|y|z], steps per second	Word RegEx that points to a capture group, steps per second	Word RegEx with a link to a capture group, steps per second
Boost (Boost.Xpressive)	5.854859M	5.910107M	6.980011M	7.170809M
Regular expressions library	3.106239M	3.072157M	3.262626M	3.422946M
Lightgrep	15.46459M	15.11111M	15.22687M	16.63804M
RE2	34.49275M	34.04864M	33.86632M	34.0276M
JPCRE2	56.66667M	57.07434M	55.45064M	57.21519M

Results: On average, the speed of processing the two word regular expressions related to capture groups is very similar to the speed of processing the first two word RegEx types. However, you can notice that Boost (Boost.Xpressive) works a little slower with the first two word RegEx types than with the other two. The rest of the libraries are slightly slower with the last two word RegEx types.

So, which RegEx library should you choose?

Let’s summarize the results of both tests and finalize our recommendations:

If your priority is speed, choose RE2 or JPCRE2, as they showed the fastest results during both tests.
If you don’t want to use third-party libraries and your project is based on Linux, consider Lightgrep.
If you need to quickly start working and speed is not a priority, Boost (Boost.Xpressive) and Regular expressions library are the simplest to connect to your project.

Conclusion

With a wisely chosen library, your team can leverage regular expressions to save lots of time, searching through huge datasets and quickly finding required data.

As we’ve shown in this article, some engines work slower than others, and some engines like Lightgrep can be challenging to install on Windows. Knowing these and many other nuances is essential before choosing a library and creating regular expressions.

At Apriorit, we have experienced cybersecurity specialists and seasoned C++ developers who will gladly assist you with improving your project’s security and efficiency. We’ll take care of all technical details so you have more time to pay attention to your business goals.
