Logo
blank Skip to main content

Boolean text search queries and their processing

C++

This article describes the development of the library for performing text search based on Boolean search queries. Boolean search query contains a list of search terms combined with the operators of Boolean logic (AND, OR , NOT). Examples:

โ€œtestโ€
โ€œfoo AND barโ€
โ€œfoo AND (bar OR test)โ€
โ€œfoo AND NOT barโ€

For example, this approach is used in the web search engines. The purpose of our library is to determine if given text matches given search query. Also, information about positions of found terms in input text may be useful.

Before evaluating search query, we must find all occurrences of search terms in input text. There are several algorithms for string matching. The next section of this article is dedicated to choosing an appropriate algorithm for our library.

Choosing string matching algorithm

Simple string matching

First, letโ€™s consider the simplest possible algorithm for string matching. Letโ€™s assume, we have a string S and a sought string (or pattern) M. The simplest string matching algorithm works as follows:

For each possible offset k within a string S (k is a number in range from 0 to len(S) – len(M))compare characters S[k] .. S[k + len(M) – 1] with corresponding characters from the sought string (M[0] .. M[len(M) – 1] ).

This algorithm is illustrated on the following picture.

simple string matching

The running time of simple string matching is O(len(M) * (len(S)len(M) – 1)).

The major problem with described algorithm is that it discards the information that was acquired for previous values of k. For example, if we are searching for the substring M = โ€œaaabโ€and we have found a match at offset k = 0, there certainly cannot be any match at offsets 1, 2, 3 becauseS[4] = โ€˜bโ€™.

There are several effective string matching algorithms that use the information about previous matches. All of them perform some kind of pre-processing of the sought string before the search phase.

Knuth-Morris-Pratt algorithm (KMP algorithm)

Knuth-Morris-Pratt algorithm is one of the most popular effective string matching algorithms. To use the information about previous matches, KMP algorithm uses a prefix function. Prefix function stores the information about how sought pattern matches itself after the shift. This information can be used to skip unnecessary comparisons.

Prefix function f(q) for the pattern M defines the longest proper prefix of string M[0..q-1] which matches its suffix.

Prefix function for string โ€œABABAAโ€:

q

1

2

3

4

5

6

f(q)

0

0

1

2

3

1

The following picture illustrates how prefix function is used to accelerate string matching:

prefix function

As shown on the picture, q = 5 characters from the pattern  were matched at offset s = 4, but the 6-th character wasnโ€™t matched. Knowing that q characters  were matched we can skip offsets which are certainly invalid. For example, offset s + 1 is certainly invalid, because we already know that character at offset s + 1 matches with second pattern character and it wonโ€™t match with the first pattern character. On the other hand, we cannot skip offset s + 2, because first 3 pattern characters match with 3 last checked characters from previous offset.

If q symbols were matched at offset k and (q + 1)-th symbol wasnโ€™t matched, then the next possible valid offset is k + (qf(q)), where f is a prefix function.

Knuth-Moriss-Pratt algorithm can be represented as a finite state machine:

Knuth-Moriss-Pratt algorithm

State 0 is an initial state, state 6 is an acceptance state (which corresponds to the discovery of the sought string). When k-th pattern character is matched, transition to state k + 1 occurs (green arrows).When k-th pattern character isnโ€™t matched, the prefix function determines the next state (red arrows). The function used to determine the next state after the failed match is sometimes called failure function.

Performance:

  • Computing prefix function –  O(len(M)), where M is a sought string.
  • Searching – O(len(S)) where S is an input string.

Aho-Corasick algorithm

Aho-Corasick algorithm is a generalization of Knuth-Morris-Pratt algorithm for searching multiple patterns in input string. The algorithm uses a finite state machine, which is based on the data structure called prefix tree (ot trie).

Prefix tree for a set of strings contains a node for each string that is a prefix for some string from that set. All children of a particular node have a common prefix of the string associated with that node.

Failure function for Aho-Corasick algorithm is defined as the longest suffix that is also a prefix of any string from a prefix tree. Next picture shows the prefix tree for strings โ€œhersโ€, โ€œhisโ€, โ€œsheโ€. Red arrows represent failure transitions.

Aho-Corasick algorithm failure function

Performance:

  • Computing failure function – O(n), where n is the sum of lengths of sought words
  • Searching – O(len(S)) where S is an input string.

Aho-Corasick algorithm effectively solves the problem of the multiple strings searching. It will be used in our library.

Implementation

Searching consists of 3 stages:

  • Building an expression tree from a search query. This tree is used for extracting search terms and for evaluating query result
  • Scanning input text for occurrences of search terms
  • Evaluation of a search query

Expression tree

Each node of the expression tree represents the sub-expression or individual search term.

expression tree nodes

Expression (defined in expressions.hpp) is the base abstract class for all expression tree nodes. It defines several methods and the most important of them are:

evaluate() – this method checks if the expression, which corresponds to this node, is true

childrenCount(), child() โ€“ these methods give access to nodeโ€™s child nodes.

There are several classes which inherit Expression:

SoughtExpression โ€“ represents a node which corresponds to individual search term. All leaf nodes in expression tree are instances of SoughtExpression. SoughtExpression contains a list of positions where corresponding search term was found in the input text. This list is filled during the text scanning stage. SoughtExpression::evaluate() returns true if match list isnโ€™t empty.

sought expression node

AndExpression, OrExpression, NotExpression โ€“ represent the nodes which correspond to the logical operators. Their evaluate() method calls evaluate() for all child nodes and combines them with a corresponding logical function.

NearExpression โ€“ checks if offsets for their sub-expressions is close enough.

Expression Parser

The purpose of Expression Parser class is to build an expression tree from a given query string.  It uses parser automatically generated by bison++ tool from the grammar definition (parser_impl.y). bison++ is an extension to GNU bison tool that can generate parser in C++ class. You can download bison++ for win32  here: http://www.kohsuke.org/flex++bison++/. Generated parser code is located in parser_impl.hpp and parser_impl.cpp files.

Search query language specification:

Supported operators:

a AND b

Search terms a and b are present in input text

a OR b

At least one search term is present in input text

NOT a

Search term a isnโ€™t present in input text

a NEAR /n b

a and b are both present in input text and distance between them is not greater than n

  

Operator priority:

1. NOT
2. AND
3. NEAR
4. OR

Other:

  • Brackets โ€œ()โ€ can be used to define priorities
  • Quotation marks are used to define the expression that should be found exactly as it is
  • Backslash is used to quote special symbols

String matcher implementation

Aho-Corasick pattern matching algorithm implementation consists of 2 classes:

C
typedef std::pair< std::wstring, Expression* > Match;
typedef std::set< Match > MatchList;
typedef stdext::hash_map< wchar_t, ACPMMachineState* > StatesHash;
typedef std::queue< ACPMMachineState* > StatesQueue;
struct ACPMMachineState {
    ACPMMachineState();
    ACPMMachineState( const ACPMMachineState* parent, wchar_t character );
    /* Character, associated with this state*/
    wchar_t character_;
    /* Child states */
    StatesHash transitions_;
    /* For acceptance states - contains patterns, associated with this state */
    MatchList output_;
    /* Parent state */
    const ACPMMachineState* parent_;
    /* Failure transition state */
    ACPMMachineState* failure_;
};
class ACPMMachine {
public:
    ACPMMachine();
    ~ACPMMachine();
    /* Adds a string to the prefix tree */
    void addPattern( const std::wstring& pattern, Expression* associatedObject );
    /* Computes failure transitions for each state */
    void computeStates();
    /* Changes current FSM state */
    bool feedCharacter( wchar_t );
    /* Returns list of patterns associated with current state*/
    const MatchList& getCurrentMatch() const;
    /* Resets current state to root state*/
    void reset();
    
private:
    ACPMMachine( const ACPMMachine& );
    ACPMMachine& operator= ( const ACPMMachine& );
private:
    static ACPMMachineState* FindTransition ( ACPMMachineState* state, wchar_t character );
    static void AddTransition ( ACPMMachineState* from, ACPMMachineState* to );
private:
    ACPMMachineState* currentState_;
    ACPMMachineState* root_;
}; 

Boolean Matcher

BooleanMatcher connects pattern matcher with expression tree. Each time when a match is found, it updates corresponding expression tree node. Also, it acts as a facade to represent the library interface. The following sequence diagram shows how BooleanMatcher interacts with other classes:

Boolean matcher class interaction

Usage

C
#include <cstdio>
#include "matcher.hpp"
int main(int argc, char** argv)
{
    const wchar_t query[] = L"foo AND ( bar OR quux )";
    // Parse search query
    ExpressionParser parser(query, wcslen(query), false); 
    
    // Initialize matcher
    BooleanMatcher matcher(parser, false);
    // Input text
    const wchar_t text[] = L"jdfhgusdhfgjksdghrfjhiodfjgbkldfhfootuxcklnvsdh sduibarrgysiudfhguisdhrt iv";
    
    // For each character of input text call feedCharacter
    const wchar_t* ptr = text;
    while (*ptr++)
    {
        matcher.feedCharacter(*ptr);
    }
    // Check if text matches our search query
    if (matcher.evaluate())
    {
        printf("evaluate() == true\n");
        // Get positions where search terms were found
        MatchType res;
        matcher.match(res);
        printf("Match positions:\n");
        for(int i = 0; i < res.size(); i++)
        {
            printf("offset - %I64d, size - %d\n", res[i]._pos, res[i]._size);
        }
    }
    else
    {
        printf("evaluate() == false\n");
    }
}

References

  • Thomas H. Cormen; Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2001). Introduction to Algorithms
  • Aho, Alfred V.; Sethi, Ravi; and Ullman, Jeffrey D., Compilers: Principles, Techniques and Tools

Download Sources (ZIP, 90.6KB)

Have a question?

Ask our expert!

Tell us about your project

Send us a request for proposal! Weโ€™ll get back to you with details and estimations.

Book an Exploratory Call

Do not have any specific task for us in mind but our skills seem interesting?

Get a quick Apriorit intro to better understand our team capabilities.

Book time slot

Contact us