This tutorial explains how to build a regex Pattern Matching Engine (PME) where ciphertext is the content that is evaluated.
A regex PME is an essential tool for programmers. It allows you to perform complex searches on content. A less powerful simple search on string can only find matches of the exact given sequence of characters (e.g., your browser's default search function). Regex PMEs are more powerful, allowing searches on certain structures of text, where a structure may take any form in multiple possible sequences of characters. The structure to be searched is defined with the regex, a very concise language.
Here are some example regexes to give you an idea of what is possible:
Regexes are powerful enough to be able to express structures like email address formats. This capability is what makes regexes useful for many programming solutions.
There are two main components identifiable in a PME:
- The pattern that is to be matched has to be parsed, translated from a
textual representation into a recursively structured object (an Abstract Syntax Tree, or AST).
- This AST must then be applied to the text that it is to be matched against,
resulting in a [.c-inline-code]yes[.c-inline-code] or[.c-inline-code]no[.c-inline-code] to whether the pattern has matched (in the case of our FHE implementation, this result is an encrypted [.c-inline-code]yes[.c-inline-code] or an encrypted [.c-inline-code]no[.c-inline-code]).
Parsing is a well understood problem. There are a couple of different approaches possible here. Regardless of the approach chosen, it starts with figuring out what language we want to support. That is, what are the kinds of sentences we want our regex language to include? A few example sentences we definitely want to support are, for example: [.c-inline-code]/a/[.c-inline-code], [.c-inline-code]/a?bc/[.c-inline-code], [.c-inline-code]/^ab$/[.c-inline-code], [.c-inline-code]/ab|cd/[.c-inline-code], however example sentences don't suffice as a specification because they can never be exhaustive (they're endless). We need something to specify exactly the full set of sentences our language supports.
There exists a language that can help us describe our own language's structure exactly: Grammar.
The Grammar and data structure
It is useful to start with defining the Grammar before starting to write code for the parser because the code structure follows directly from the Grammar. A Grammar consists of a generally small set of rules. For example, a very basic Grammar could look like this:
This describes a language that only contains the sentence [.c-inline-code]a[.c-inline-code]. Not a very interesting language.
We can make it more interesting though by introducing choice into the Grammar with [.c-inline-code]|[.c-inline-code] (called a 'pipe') operators. If we want the above Grammar to accept either [.c-inline-code]a[.c-inline-code] or [.c-inline-code]b[.c-inline-code]:
So far, only Grammars with a single rule have been shown. However, a Grammar can consist of multiple rules. Most languages require it. So let's consider a more meaningful language, one that accepts sentences consisting of one or more digits. We could describe such a language with the following Grammar:
The [.c-inline-code]+[.c-inline-code] after Digit is another Grammar operator. With it, we specify that Digit must be matched one or more times. Here are all the Grammar operators that are relevant for this tutorial:
In the case of the example PME, the Grammar is as follows (notice the unquoted [.c-inline-code]?[.c-inline-code] and quoted [.c-inline-code]?[.c-inline-code], etc. The unquoted characters are Grammar operators, and the quoted are characters we are matching in the parsing).
We will refer occasionally to specific parts in the Grammar listed above by [.c-inline-code]<rule name>.<variant index>[.c-inline-code] (where the first rule variant has index 1).
With the Grammar defined, we can start defining a type to parse into. In Rust, we have the enum kind of type that is perfect for this, as it allows you to define multiple variants that may recurse. I prefer to start by defining variants that do not recurse (i.e., that don't contain nested regex expressions):
With this, we can translate the following basic regexes:
Notice we're not yet able to sequence multiple components together. Let's define the first variant that captures recursive RegExpr for this:
With this Seq (short for sequence) variant, we allow translating patterns that contain multiple components:
Let's finish the RegExpr data structure by adding variants for [.c-inline-code]Optional[.c-inline-code] matching, [.c-inline-code]Not[.c-inline-code] logic in a range, and [.c-inline-code]Either[.c-inline-code] left or right matching:
Some features may make the most sense being implemented during post-processing of the parsed data structure. For example, the case insensitivity feature (the i Modifier) is implemented in the example implementation by taking the parsed RegExpr and mutating every character mentioned inside to cover both the lower case as well as the upper case variant (see function case_insensitive in parser.rs for the example implementation).
The modifier i in our Grammar (for enabling case insensitivity) was easiest to implement by applying a post-processing step to the parser.
We are now able to translate any complex regex into a RegExpr value. For example:
With both the Grammar and the data structure to parse into defined, we can now start implementing the actual parsing logic. There are multiple ways this can be done. For example, there exist tools that can automatically generate parser code by giving it the Grammar definition (these are called parser generators). However, you might prefer to write parsers with a parser combinator library. This may be the better option for you because the behavior in runtime is easier to understand for parsers constructed with a parser combinator library than of parsers that were generated with a parser generator tool.
Rust offers a number of popular parser combinator libraries. This tutorial used combine, but any other library would work just as well. Choose whichever appeals the most to you (including any parser generator tool). The implementation of our regex parser will differ significantly depending on the approach you choose, so we will not cover this in detail here. You may look at the parser code in the example implementation to get an idea of how this could be done. In general though, the Grammar and the datastructure are the important components, while the parser code follows directly from these.
Matching the RegExpr to encrypted content
The next challenge is to build the execution engine, where we take a RegExpr value and recurse into it to apply the necessary actions on the encrypted content. We first have to define how we actually encode our content into an encrypted state. Once that is defined, we can start working on how we will execute our RegExpr onto the encrypted content.
Encoding and encrypting the content.
It is not possible to encrypt the entire content into a single encrypted value. We can only encrypt numbers and perform operations on those encrypted numbers with FHE. Therefore, we have to find a scheme where we encode the content into a sequence of numbers that are then encrypted individually to form a sequence of encrypted numbers.
We recommend the following two strategies:
- to map each character of the content into the u8 ascii value, and then encrypt each bit of these u8 values individually.
- to, instead of encrypting each bit individually, encrypt each u8 ascii value in its entirety.
Strategy 1 requires more high-level TFHE-rs operations to check for a simple character match (we have to check each bit individually for equality as opposed to checking the entire byte in one, high-level TFHE-rs operation), though some experimentation did show that both options performed equally well on a regex like [.c-inline-code]/a/[.c-inline-code]. This is likely because bitwise FHE operations are relatively cheap compared to u8 FHE operations. However, option 1 falls apart as soon as you introduce [.c-inline-code][a-z][.c-inline-code] regex logic. With option 2, it is possible to complete this match with just three TFHE-rs operations: ge, le, and bitand.
If, on the other hand, we had encrypted the content with the first strategy, there would be no way to test for greater/equal than from and [.c-inline-code]less/equal than to[.c-inline-code]. We'd have to check for the potential equality of each character between from and to, and then join the results together with a sequence of sk.bitor; that would require far more cryptographic operations than in strategy 2.
Because FHE operations are computationally expensive, and strategy 1 requires significantly more FHE operations for matching on [.c-inline-code][a-z][.c-inline-code] regex logic, we should opt for strategy 2.
Matching with the AST versus matching with a derived DFA.
There are a lot of regex PMEs. It's been built many times and it's been researched thoroughly. There are different strategies possible here. A straight forward strategy is to directly recurse into our RegExpr value and apply the necessary matching operations onto the content. In a way, this is nice because it allows us to link the RegExpr structure directly to the matching semantics, resulting in code that is easier to understand, maintain, etc.
Alternatively, there exists an algorithm that transforms the AST (i.e., the RegExpr, in our case) into a Deterministic Finite Automata (DFA). Normally, this is a favorable approach in terms of efficiency because the derived DFA can be walked over without needing to backtrack (whereas the former strategy cannot prevent backtracking). This means that the content can be walked over from character to character, and depending on what the character is at this cursor, the DFA is conjunctively traveled in a definite direction which ultimately leads us to the [.c-inline-code]yes, there is a match[.c-inline-code] or the [.c-inline-code]no, there is no match[.c-inline-code]. There is a small upfront cost of having to translate the AST into the DFA, but the lack of backtracking during matching generally makes up for this, especially if the content that it is matched against is significantly big.
In our case though, we are matching on encrypted content. We have no way to know what the character at our cursor is, and therefore no way to find this definite direction to go forward in the DFA. Therefore, translating the AST into the DFA does not help us as it does in normal regex PMEs. For this reason, consider opting for the former strategy because it allows for matching logic that is easier to understand.
Matching.
In the previous section, we decided we'll match by traversing into the RegExpr value. This section will explain exactly how to do that. Similarly to defining the Grammar, it is often best to start with working out the non-recursive RegExpr variants.
We'll start by defining the function that will recursively traverse into the RegExpr value:
[.c-inline-code]sk[.c-inline-code] is the server key (aka, public key), [.c-inline-code]content[.c-inline-code] is what we'll be matching against, [.c-inline-code]re[.c-inline-code] is the RegExpr value we built when parsing the regex, and [.c-inline-code]c_pos[.c-inline-code] is the cursor position (the index in content we are currently matching against).
The result is a vector of tuples, with the first value of the tuple being the computed ciphertext result, and the second value being the content position after the regex components were applied. It's a vector because certain [.c-inline-code]RegExpr[.c-inline-code] variants require the consideration of a list of possible execution paths. For example, [.c-inline-code]RegExpr::Optional[.c-inline-code] might succeed by applying or and not applying the optional regex (notice that in the former case, [.c-inline-code]c_pos[.c-inline-code] moves forward whereas in the latter case it stays put).
On first call, a [.c-inline-code]match[.c-inline-code] of the entire regex pattern starts with [.c-inline-code]c_pos=0[.c-inline-code]. Then [.c-inline-code]match[.c-inline-code] is called again for the entire regex pattern with [.c-inline-code]c_pos=1[.c-inline-code], etc. until [.c-inline-code]c_pos[.c-inline-code] exceeds the length of the content. Each of these alternative match results are then joined together with [.c-inline-code]sk.bitor[.c-inline-code] operations (this works because if one of them results in [.c-inline-code]true[.c-inline-code] then, in general, our matching algorithm should return [.c-inline-code]true[.c-inline-code]).
The [.c-inline-code]...[.c-inline-code] within the match statement above is what we will be working out for some of the [.c-inline-code]RegExpr[.c-inline-code] variants now. Starting with [.c-inline-code]RegExpr::Char[.c-inline-code]:
Let's consider an example of the variant above. If we apply [.c-inline-code]/a/[.c-inline-code] to content [.c-inline-code]bac[.c-inline-code], we'll have the following list of [.c-inline-code]match[.c-inline-code] calls [.c-inline-code]re[.c-inline-code] and [.c-inline-code]c_pos[.c-inline-code] values (for simplicity,[.c-inline-code]re[.c-inline-code] is denoted in regex pattern instead of in RegExpr value):
And we would arrive at the following sequence of ciphertext operations:
AnyChar is a no operation:
The sequence iterates over its [.c-inline-code]re_xs[.c-inline-code], increasing the content position accordingly, and joins the results with [.c-inline-code]bitand[.c-inline-code] operations:
Other variants are similar, as they recurse and manipulate [.c-inline-code]re[.c-inline-code] and [.c-inline-code]c_pos[.c-inline-code] accordingly. Hopefully, the general idea is already clear.
Ultimately the entire pattern-matching logic unfolds into a sequence of the following set of FHE operations:
- [.c-inline-code]eq[.c-inline-code] (tests for an exact character match)
- [.c-inline-code]ge[.c-inline-code] (tests for [.c-inline-code]greater than[.c-inline-code] or [.c-inline-code]equal to[.c-inline-code] a character)
- [.c-inline-code]le[.c-inline-code] (tests for [.c-inline-code]less than[.c-inline-code] or [.c-inline-code]equal to[.c-inline-code] a character)
- [.c-inline-code]bitand[.c-inline-code] ([.c-inline-code]bitwise AND[.c-inline-code], used for sequencing multiple regex components)
- [.c-inline-code]bitor[.c-inline-code] ([.c-inline-code]bitwise OR[.c-inline-code], used for folding multiple possible execution variants results into a single result)
- [.c-inline-code]bitxor[.c-inline-code] ([.c-inline-code]bitwise XOR[.c-inline-code], used for the [.c-inline-code]not[.c-inline-code] logic in ranges)
Optimizations.
Generally, the included example PME follows the approach outlined above. However, there were two additional optimizations applied. Both of these optimizations involved reducing the number of unnecessary FHE operations. Given how computationally expensive these operations are, it makes sense to optimize for this (and to ignore any suboptimal memory usage of our PME, etc.).
The first optimization involved delaying the execution of FHE operations to after the generation of all possible execution paths to be considered. This optimization allows us to prune execution paths during execution path construction that are provably going to result in an encrypted false value, without having already performed the FHE operations up to the point of pruning. Consider the regex [.c-inline-code]/^a+b$/[.c-inline-code], and we are applying this to a content of size 4. If we are executing execution paths naively, we would go ahead and check for all possible amounts of a repetitions: [.c-inline-code]ab[.c-inline-code], [.c-inline-code]aab[.c-inline-code], [.c-inline-code]aaab[.c-inline-code]. However, while building the execution paths, we can use the fact that [.c-inline-code]a+[.c-inline-code] must begin at the beginning of the content, and that [.c-inline-code]b[.c-inline-code] must be the final character of the content. From this follows that we only have to check for the following sentence: [.c-inline-code]aaab[.c-inline-code]. Delaying execution of the FHE operations until after we've built the possible execution paths in this example reduced the number of FHE operations applied by approximately half.
The second optimization involved preventing the same FHE conditions to be re-evaluated. Consider the regex [.c-inline-code]/^a?ab/[.c-inline-code]. This would give us the following possible execution paths to consider:
- [.c-inline-code]content[0] == a && content[1] == a && content[2] == b[.c-inline-code] (we match the [.c-inline-code]a[.c-inline-code] in [.c-inline-code]a?[.c-inline-code])
- [.c-inline-code]content[0] == a && content[1] == b[.c-inline-code] (we don't match the [.c-inline-code]a[.c-inline-code] in [.c-inline-code]a?[.c-inline-code])
Notice that, for both execution paths, we are checking for [.c-inline-code]content[0] == a[.c-inline-code]. Even though we cannot see what the encrypted result is, we do know that it's either going to be an encrypted false for both cases or an encrypted true for both cases. Therefore, we can skip the re-evaluation of [.c-inline-code]content[0] == a[.c-inline-code] and simply copy the result from the first evaluation over. This optimization involves maintaining a cache of known expression evaluation results and reusing those where possible.
Trying out the example implementation
The implementation that guided the writing of this tutorial can be found here.
When compiling with [.c-inline-code]--example regex_engine[.c-inline-code], a binary is produced that serves as a basic demo. Simply call it with the content string as a first argument and the pattern string as a second argument. For example, [.c-inline-code]cargo run --release --features=x86_64-unix,integer --example regex_engine -- 'this is the content' '/^pattern$/'[.c-inline-code], note it's advised to compile the executable with [.c-inline-code]--release[.c-inline-code] flag as the key generation and homomorphic operations otherwise seem to experience a heavy performance penalty.
On execution, a private and public key pair are created. Then, the content is encrypted with the client key, and the regex pattern is applied onto the encrypted content string - with access given only to the server key. Finally, it decrypts the resulting encrypted result using the client key and prints the verdict to the console.
To get more information on exact computations and performance, set the [.c-inline-code]RUST_LOG[.c-inline-code] environment variable to [.c-inline-code]debug[.c-inline-code] or to [.c-inline-code]trace[.c-inline-code].
Additional links
- Star the TFHE-rs Github repository to endorse our work.
- Review the TFHE-rs documentation.
- Get support on our community channels.
- Learn FHE, help us advance the space and make money with The Zama Bounty Program.