When improving your product’s security, processing large quantities of information in log files and internet packets is inevitable. You might need to find a specific data type when analyzing logs or even search for digital evidence if a security incident occurs. One of the key ways to help your team quickly detect desired information within tons of data is by using regular expressions (RegEx).
However, creating efficient regular expressions and choosing the most suitable RegEx library (engine) is not that easy.
In this article, we briefly explore the advantages of working with regular expressions in cybersecurity and take a look at five popular RegEx libraries. We provide examples of RegEx to search for different types of data and use these examples to check how each RegEx engine performs. Apart from finding the fastest RegEx library, you’ll also find comparison tables with test results and our recommendations on which libraries to use for which occasions.
This article will be helpful for technical leaders whose development teams are working on improving product cybersecurity and enhancing the protection of sensitive data.
The role of regular expressions in a product’s cybersecurity
What are regular expressions?
A regular expression, also called RegEx, is a string of text consisting of one or more characters that creates a search pattern. Developers use such patterns to find matches in input texts with the help of RegEx engines.
Most regular expressions consist of different combinations of constants (sets of strings) and operators (symbols that denote operations over these strings), making RegEx a powerful tool for pattern matching.
You can apply RegEx for various purposes like parsing large amounts of text to find specific character patterns or adding extracted strings to a collection to generate a report. Regular expressions also help engineers find all text files in a file manager, search for specific data types, and validate text to ensure it matches a predefined pattern. For example, you can use regular expressions for finding social security numbers.
Most scripting languages, including Perl, Python, PHP, and JavaScript, support RegEx. Apart from scripts, regular expressions can appear in the main code of a program, especially if your project is focused on cybersecurity. You can also use regular expressions when:
- Working with Java
- Searching text in word processors like Microsoft Word
- Working directly from the command line and in text editors to find text within a file
However, RegEx isn’t suitable for parsing nested markup such as HTML and XML.
How can you use RegEx for your product’s cybersecurity?
Let’s explore the most common ways your team can use RegEx to improve your product’s cybersecurity posture:
- Search for numerical patterns. Searching for numerical strings, such as credit card, social security, or phone numbers, is often tricky. The reason is that this data can be presented in different formats. But RegEx offers a convenient way to find matches. Later in this article, we show detailed examples of using regular expressions to search for credit card numbers and phone numbers.
- Analyze log files. RegEx is useful for analyzing log files, which usually contain thousands of recorded events, when no automated search tool is available. Performing log analysis with regular expressions instead of searching for multiple simple terms can help forensic experts find accurate results in log files when investigating a security incident.
- Configure firewall rules. Another way to use RegEx is to specify firewall rules. For instance, you can use RegEx to create rules to block requests for certain file types.
- Set proxy rules. Thoroughly written RegEx can help you filter traffic on debugging proxies. Instead of going through numerous requests via your proxy windows, you can isolate requests going to a specific subdomain of your web application.
- Scan for malware. Regular expressions can also assist your team with identifying malware. For example, some RegEx-driven tools are able to identify malware by creating descriptions that look for certain characteristics. RegEx patterns help detect specific text or binary patterns in files that might indicate a file is malicious.
- Pinpoint relevant evidence. When an incident happens, cybersecurity specialists usually have to deal with large volumes of data in different formats. With regular expressions, you can define rules on what to match in a search operation, specifying metacharacters and quantifiers along with plain text. Thus, you can relatively quickly find digital evidence within tons of data.
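As a minimal sketch of the log analysis use case, the C++ snippet below pulls source addresses out of failed login records with the standard `<regex>` header. The log format and the function name `failed_login_sources` are assumptions made for illustration; real logs will need their own pattern.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical log format, assumed for illustration only:
//   "2024-01-01 12:00:00 Failed login for admin from 203.0.113.7"
// Returns the source address of every line that records a failed login.
std::vector<std::string> failed_login_sources(const std::vector<std::string>& lines) {
    // The single capture group holds the dotted address at the end of the line.
    static const std::regex pattern(R"(Failed login for \S+ from (\d{1,3}(?:\.\d{1,3}){3})$)");
    std::vector<std::string> sources;
    for (const std::string& line : lines) {
        std::smatch match;
        if (std::regex_search(line, match, pattern)) {
            sources.push_back(match[1].str());
        }
    }
    return sources;
}
```

One pattern replaces several plain keyword searches here: it checks the event type and extracts the address in a single pass over each line.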
Applying RegEx for forensic examination is a proven way to locate data because regular expressions are great at searching for data that matches a specific pattern. During such an examination, investigators use keywords to find exact string (word) matches and use regular expressions to find strings that match a pattern. To do that, they can apply tools that offer RegEx search engines.
Researchers also study different applications of regular expressions for forensic purposes. For example, A Regular Expression Search Primer for Forensic Analysts [PDF] investigates the use of regular expressions and some Linux commands for locating and extracting text. Another research paper, Regex: an experimental approach for searching in cyber forensic, explores ways to reduce the search space by identifying and filtering known files to speed up evidence identification.
Whatever your project’s purpose, you need to choose the most suitable RegEx library to efficiently use regular expressions. In the next section, we take a look at a few popular libraries and compare them using practical examples.
Practical comparison of RegEx libraries
For this article, we chose five popular C++ regular expression libraries. Before we start our practical comparison, let’s take a look at their main characteristics:
- Boost (Boost.Xpressive) is an advanced, object-oriented RegEx template library for C++. It allows you to write regular expressions as strings that are parsed at runtime, or as expression templates that are parsed at compile time.
- Regular expressions library is the RegEx engine of the C++ standard library (STD). It provides a class that represents regular expressions, which are a kind of mini-language used to perform pattern matching within strings. Almost all operations with regexes can be characterized as operating on several objects: the target sequence, the pattern, the matched array, and the replacement string.
- Lightgrep is a RegEx engine designed for digital forensics. This library helps users search for many patterns simultaneously, search binary data as a stream (not as discrete lines of text), and search for patterns in many different encodings.
- RE2 is a fast, secure, thread-friendly RegEx C++ library designed as an alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python.
- JPCRE2 is a C++ wrapper for the PCRE2 library, which originally is a set of functions written in C that implement regular expression patterns. JPCRE2 provides some C++ wrapper classes and functions to perform RegEx operations such as match and replace.
Library | Operating system | License | Installation |
---|---|---|---|
Boost (Boost.Xpressive) | macOS, Linux, Windows | Boost Software License v1.0 | By adding file headers |
Regular expressions library (STD) | macOS, Linux, Windows | Apache License v2.0 | By adding file headers |
Lightgrep | macOS, Linux, Windows* | GNU General Public License v3.0 | By installing dll/so files |
RE2 | macOS, Linux, Windows | BSD 3-Clause "New" or "Revised" License | By installing dll/so files |
JPCRE2 | macOS, Linux, Windows | BSD 3-Clause "New" or "Revised" License | By installing dll/so files |
Now that we know a bit more about these libraries, we need to install them. Boost.Xpressive and STD are the easiest to install, as you only need to add necessary file headers to your project. To use Lightgrep, RE2, and JPCRE2, you need to install dll/so files in your project. Building these libraries on different operating systems is not challenging, except for building Lightgrep on Windows. According to the Lightgrep documentation on GitHub, Windows builds are easier with a Linux-hosted cross-compiler.
To compare how these libraries perform when searching for information using regular expressions and to find out which is the best C++ RegEx library, let’s:
- Create several types of regular expressions
- Run tests using the created RegEx
Creating regular expressions
Using RegEx to discover sensitive data is very convenient, as some types of confidential information have predictable search patterns. Common examples of such data types are passwords, addresses, biometric data, and keys (like SSH).
For this article, we’ve decided to try RegEx to search for credit card numbers, phone numbers, and email addresses.
To create our sample datasets, we’ll start with a piece of lorem ipsum text. In this text, we’ll add data of different types in a random order. Here are the four sample datasets we’ll use in our library comparison:
- With added credit card numbers
- With added email addresses
- With added phone numbers
- With no added data
To identify sensitive data of different types, you need to pick regular expressions that work with the confidential information you’re searching for. But note that specific search patterns can be complicated, slowing down the performance of even the best RegEx library. Therefore, developers should test chosen patterns for their efficiency.
By pattern complexity, we mean how difficult it is to describe the pattern as a flowchart. A complex pattern usually has many conditions of different types, branches, loops, and filters.
Let’s move to creating simple and complex regular expressions for the chosen data types.
1. RegEx to search for credit card numbers
Payment card numbers can be composed of 8 to 19 digits, depending on the country and bank. For the purposes of this article, we’ll be working with 16-digit numbers, as they are the most commonly used, and with 13-digit numbers to add some variety.
Let’s pick a simple regular expression for a credit card number:
4[0-9]{15}
This describes a string pattern that starts with the digit 4 and is followed by 15 more digits, each from 0 to 9, for 16 digits in total.
An example of a string that meets the conditions of this RegEx for credit card numbers is 4123456789012345.
Now, let’s use a complex regular expression to search for both 13-digit and 16-digit credit card numbers:
4[0-9]{12}(?:[0-9]{3})?
This is relevant for strings that begin with the digit 4 and have 12 more digits with possible values from 0 to 9. After this sequence, a string can have or not have three more digits with values from 0 to 9. Thus, we can find not only credit card numbers with 16 digits but those with 13 digits as well.
Examples of strings that would meet this RegEx are 4123456789012 and 4123456789012321.
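The complex credit card pattern can be tried out with the C++ standard library. The sketch below is illustrative only: the function name `find_card_candidates` is hypothetical, and the code is not the test harness used for the measurements in this article.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every substring of the text that matches the 13- or 16-digit pattern.
std::vector<std::string> find_card_candidates(const std::string& text) {
    static const std::regex pattern(R"(4[0-9]{12}(?:[0-9]{3})?)");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

Because the pattern has no boundary checks, a longer run of digits can produce a partial match, so results are best treated as candidates that still need validation.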
2. RegEx to search for email addresses
As a simple expression to describe strings that have an email address format, we’ll use:
\S+@\S+\.\S+
This RegEx searches for lines that have:
- A sequence of symbols without spaces before the @ symbol
- The @ symbol
- A sequence of symbols without spaces after the @ symbol
- A . symbol
- A sequence of symbols without spaces
Examples of strings that would meet such conditions are [email protected] and [email protected].
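To see how permissive the simple pattern is, here is a small sketch using the standard `<regex>` header. The function name `find_email_candidates` and the sample addresses in the usage note are placeholders chosen for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every token of the form <non-spaces>@<non-spaces>.<non-spaces>.
std::vector<std::string> find_email_candidates(const std::string& text) {
    static const std::regex pattern(R"(\S+@\S+\.\S+)");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For the input `write to alice@example.com or bob@localhost today`, only `alice@example.com` is returned, because `bob@localhost` has no dot after the @ symbol. Since `\S` accepts any non-space character, punctuation attached to an address becomes part of the match.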
A complex RegEx for email addresses is much longer:
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This regular expression is also relevant for strings that have an email address format but includes additional bypasses, loops, and filters. Here’s a description of several constructions used in this RegEx:
- (?:) – Makes a grouping that cannot be referenced
- [a-z] – Sets possible options for a character
- ? – Makes the preceding element optional
- | – Sets alternation of two expressions on the left and right side of |
- * – Means that an expression matches zero or more of the preceding element
An example of a string that would meet the conditions of this complex regular expression is [email protected]. The other example we mentioned for the simple expression, [email protected], wouldn’t match the complex regular expression, as it has stricter rules.
3. RegEx to search for phone numbers
For this data type, let’s use the following simple RegEx:
\(\d{3}\) \d{3}-\d{4}
It describes lines in the (###) ###-#### format, where ### stands for three digits and #### stands for four digits.
An example of a line that would match this regular expression is (123) 456-7890.
For a complex regular expression for phone numbers, let’s use the following:
\(([0-9]{1,4})\)([ .-]?)([0-9]{1,4})([ .-]?)([0-9]{1,4})
It describes a line in the (####)%####%#### format, where #### is a sequence of one to four digits, and the % symbol stands for an optional separator: a space, a dot/period, or a hyphen.
A few examples of lines that meet such conditions are (123) 456-7890, (123) 456 7890, and (1234)-456-70.
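The parentheses in the complex phone number pattern are capture groups, so the digit groups can be read back after a match. A minimal sketch with the standard `<regex>` header follows; the helper name `split_phone_number` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the three digit groups of a phone number, or an empty vector
// when the whole candidate string does not match the pattern.
std::vector<std::string> split_phone_number(const std::string& candidate) {
    static const std::regex pattern(
        R"(\(([0-9]{1,4})\)([ .-]?)([0-9]{1,4})([ .-]?)([0-9]{1,4}))");
    std::smatch match;
    if (!std::regex_match(candidate, match, pattern)) {
        return {};
    }
    // Groups 2 and 4 hold the optional separators, so they are skipped here.
    return {match[1].str(), match[3].str(), match[5].str()};
}
```

`std::regex_match` requires the entire string to match, which suits validation. For scanning free text, `std::regex_search` or `std::sregex_iterator` would be the closer fit.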
4. RegEx to search for words
Apart from exploring the work of regular expressions designed to search for specific data types, we’d also like to check how different alternative RegEx conditions influence the search.
Let’s start with two alternative options for simple word searching:
1. A RegEx to search for matches with a given letter from a to z:
[a-z]+-?[a-z]*
2. A RegEx to search for matches with one of the symbols listed in square brackets, where the letters from a to z are separated by the | symbol. Outside square brackets, | represents an alternative between the part to its left and the part to its right. Inside square brackets, however, | is a literal character, so this expression also matches the | symbol itself.
[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]+-?[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]*
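The difference between | inside and outside square brackets is easy to check with the standard `<regex>` header. In this sketch, `matches_whole` is a hypothetical helper, and the short patterns are reduced versions of the expressions above.

```cpp
#include <regex>
#include <string>

// True when the whole text matches the pattern (ECMAScript syntax, the std::regex default).
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, `matches_whole("[a|b]", "|")` is true because the bracket expression contains a literal |, while `matches_whole("[ab]", "|")` and `matches_whole("(?:a|b)", "|")` are both false.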
Now, let’s move to RegEx examples that work with capture groups: mechanisms that allow you to highlight and save matching text for further use. When a RegEx matches the text, any content within a capture group is saved in temporary variables. You can use those variables later in code.
Let’s create two more regular expressions to check whether the search will slow down if a RegEx includes a link to a capture group (marked as \w):
3. A RegEx that points directly to a capture group:
\w+-?\w*
4. A RegEx that includes a link to capture group 1:
\w+-?1*
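For readers who want to experiment with capture groups and links to them, here is a sketch using the standard `<regex>` header. The parenthesized patterns `(\w+)-?\w*` and `(\w+)-\1` are assumptions introduced for this illustration and are not necessarily the exact expressions used in the tests. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the text saved by capture group 1 for the first match,
// or an empty string when nothing matches.
std::string first_captured_word(const std::string& text) {
    static const std::regex pattern(R"((\w+)-?\w*)");
    std::smatch match;
    return std::regex_search(text, match, pattern) ? match[1].str() : std::string();
}

// True when the text has the form word-word with the same word on both sides.
// \1 refers back to whatever capture group 1 matched.
bool is_doubled_word(const std::string& text) {
    static const std::regex pattern(R"((\w+)-\1)");
    return std::regex_match(text, pattern);
}
```

Engines differ in how they treat such links. RE2, for instance, documents backreferences as unsupported, so a pattern like the second one is specific to backtracking engines.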
With all regular expressions and samples prepared, let’s finally move to testing and see the RegEx libraries in action.
Testing created RegEx in chosen libraries
To see how fast each of the chosen libraries works, we need to test their speed of processing data samples N times using regular expressions.
We’ve prepared four sample types and regular expressions for them:
- Sample with added credit card numbers: simple and complex credit card expressions
- Sample with added email addresses: simple and complex email expressions
- Sample with added phone numbers: simple and complex phone number expressions
- Sample with no sensitive data added: word expressions 1, 2, 3, and 4
Here’s what the workflow for our tests will look like:
We will now run the tests for each of the chosen libraries using this workflow. Below, we show the results in the comparison tables and discuss them.
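A timing loop for such a workflow could look like the sketch below, shown here for the standard `<regex>` header only. This is not the harness behind the results in this article: the names `BenchmarkResult` and `run_benchmark` are hypothetical, and compiling the pattern once outside the timed loop is an assumption.

```cpp
#include <chrono>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

struct BenchmarkResult {
    std::size_t matches_per_pass;  // matches found in one pass over the sample
    double seconds;                // wall-clock time for all iterations
};

// Scans the sample `iterations` times and reports the total elapsed time.
BenchmarkResult run_benchmark(const std::string& sample,
                              const std::string& pattern,
                              std::size_t iterations) {
    const std::regex re(pattern);  // compiled once, outside the timed loop
    std::size_t matches = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        const std::sregex_iterator first(sample.begin(), sample.end(), re);
        matches = static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
    }
    const auto stop = std::chrono::steady_clock::now();
    return {matches, std::chrono::duration<double>(stop - start).count()};
}
```

Counting the matches on every pass keeps the work observable, which makes it harder for an optimizing compiler to discard the loop. Each of the other libraries would need an equivalent loop written against its own API.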
Test 1. Running samples with sensitive data
Here’s a table with results, showing the time each library takes to process the samples. We ran 1 million iterations.
Library name | Simple credit card RegEx, seconds | Complex credit card RegEx, seconds | Simple email RegEx, seconds | Complex email RegEx, seconds | Simple phone number RegEx, seconds | Complex phone number RegEx, seconds |
---|---|---|---|---|---|---|
Boost (Boost.Xpressive) | 64.5 | 64.8 | 242.6 | 1902.5 | 72.4 | 76.5 |
Regular expressions library | 403.4 | 403.4 | 1707.5 | 2284.7 | 406.5 | 475.6 |
Lightgrep | 10.7 | N/A* | 241.3 | N/A | 9.6 | 14.9 |
RE2 | 2.8 | 2.8 | 12.3 | 12.1 | 3.3 | 3.6 |
JPCRE2 | 1.3 | 1.6 | 159.3 | 160.4 | 1.9 | 3.4 |
It’s hard to determine the exact number of steps required for a single RegEx iteration, as libraries don’t provide APIs for accessing this information. To test RegEx, we used regexlearn.com, and to find the approximate number of steps, we used regex101.com. Both of these utilities are free.
Metric | Simple credit card RegEx | Complex credit card RegEx | Simple email RegEx | Complex email RegEx | Simple phone number RegEx | Complex phone number RegEx |
---|---|---|---|---|---|---|
Number of steps for one iteration | 56 | 110 | 16926 | 23655 | 192 | 432 |
Now, let’s see the approximate results for each library in steps per second.
Library name | Simple credit card RegEx, steps per second | Complex credit card RegEx, steps per second | Simple email RegEx, steps per second | Complex email RegEx, steps per second | Simple phone number RegEx, steps per second | Complex phone number RegEx, steps per second |
---|---|---|---|---|---|---|
Boost (Boost.Xpressive) | 0.868217M | 1.697531M | 69.76917M | 12.43364M | 2.651934M | 5.647059M |
Regular expressions library | 0.13882M | 0.272075M | 9.912738M | 10.35366M | 0.472325M | 0.908326M |
Lightgrep | 5.233645M | N/A | 70.14505M | N/A | 20M | 28.99329M |
RE2 | 20M | 39.28571M | 1376.098M | 1954.959M | 58.18182M | 120M |
JPCRE2 | 43.07692M | 68.75M | 106.2524M | 147.4751M | 101.0526M | 127.0588M |
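The figures in this table are consistent with multiplying the number of steps for one iteration by the 1 million iterations and dividing by the measured time. The sketch below shows that arithmetic. The function name is hypothetical, and the formula is inferred from the tables rather than stated by the tests themselves.

```cpp
// Converts a measured run into millions of steps per second.
// steps_per_iteration: approximate step count for one pass
// iterations: number of passes (1 million in these tests)
// seconds: total measured time for all passes
double millions_of_steps_per_second(double steps_per_iteration,
                                    double iterations,
                                    double seconds) {
    return steps_per_iteration * iterations / seconds / 1e6;
}
```

For example, 56 steps over 1 million iterations in 64.5 seconds gives about 0.868 million steps per second, which is the Boost.Xpressive value for the simple credit card RegEx.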
Results: On average, Boost (Boost.Xpressive) and Regular expressions library showed the worst results. RE2 performed at a consistently high speed across the samples. JPCRE2 was somewhat faster than RE2 on the credit card and phone number expressions, but RE2 was roughly 13 times faster on both email expressions (about 12 seconds versus about 160). You can also notice that all libraries slow down when they process samples with the simple and complex email regular expressions.
Test 2. Running samples with word expressions
Here’s a table showing the time each library takes to process the samples for word search. We ran 1 million iterations this time as well.
Library name | Word RegEx [a-z], seconds | Word RegEx [a|b…|y|z], seconds | Word RegEx that points to a capture group, seconds | Word RegEx with a link to a capture group, seconds |
---|---|---|---|---|
Boost (Boost.Xpressive) | 406.5 | 402.7 | 370.2 | 378.2 |
Regular expressions library | 766.2 | 774.7 | 792 | 792.3 |
Lightgrep | 153.9 | 157.5 | 169.7 | 163 |
RE2 | 69 | 69.9 | 76.3 | 79.7 |
JPCRE2 | 42 | 41.7 | 46.6 | 47.4 |
Once again, let’s find the approximate number of steps for handling one iteration using regex101.com:
Metric | Word RegEx [a-z] | Word RegEx [a|b…|y|z] | Word RegEx that points to a capture group | Word RegEx with a link to a capture group |
---|---|---|---|---|
Number of steps for one iteration | 2380 | 2380 | 2584 | 2712 |
And here are the approximate results for each library in steps per second:
Library name | Word RegEx [a-z], steps per second | Word RegEx [a|b…|y|z], steps per second | Word RegEx that points to a capture group, steps per second | Word RegEx with a link to a capture group, steps per second |
---|---|---|---|---|
Boost (Boost.Xpressive) | 5.854859M | 5.910107M | 6.980011M | 7.170809M |
Regular expressions library | 3.106239M | 3.072157M | 3.262626M | 3.422946M |
Lightgrep | 15.46459M | 15.11111M | 15.22687M | 16.63804M |
RE2 | 34.49275M | 34.04864M | 33.86632M | 34.0276M |
JPCRE2 | 56.66667M | 57.07434M | 55.45064M | 57.21519M |
Results: On average, the speed of processing both types of word regular expressions that use capture groups is very similar to the speed of processing the first two word RegEx types. However, you can notice that Boost (Boost.Xpressive) works a little slower with the first two word RegEx types than with other RegEx types. The rest of the libraries are slower when searching for word RegEx with a link to the capture group.
So, which RegEx library should you choose?
Let’s summarize the results of both tests and finalize our recommendations:
- If your priority is speed, choose RE2 or JPCRE2, as they showed the fastest results during both tests.
- If you don’t want to use third-party libraries and your project is based on Linux, consider Lightgrep.
- If you need to quickly start working and speed is not a priority, Boost (Boost.Xpressive) and Regular expressions library are the simplest to connect to your project.
Conclusion
With a wisely chosen library, your team can leverage regular expressions to save lots of time, searching through huge datasets and quickly finding required data.
As we’ve shown in this article, some engines work slower than others, and some engines like Lightgrep can be challenging to install on Windows. Knowing these and many other nuances is essential before creating regular expressions.
At Apriorit, we have experienced cybersecurity specialists and seasoned C++ developers who will gladly assist you with improving your project’s security and efficiency. We’ll take care of all technical details so you have more time to pay attention to your business goals.