In this tutorial we will go through the steps to turn a regular sha256 implementation into its homomorphic version. We first explain the basics of the sha256 function, and then show how to implement it homomorphically with performance in mind.

## Sha256

The first step in this experiment is actually implementing the sha256 function. We can find the specification here, but let's summarize the three main sections of the document.

#### Padding

The sha256 function processes the input data in blocks or chunks of 512 bits. Before actually performing the hash computations we have to pad the input in the following way:

• Append a single [.c-inline-code]1[.c-inline-code] bit

• Append a number of [.c-inline-code]0[.c-inline-code] bits such that exactly 64 bits are left to make the message length a multiple of 512

• Append the last 64 bits as a binary encoding of the original input length

Visually, the padded input is laid out as the original L-bit message, followed by a single 1 bit, k 0 bits and the 64-bit encoding of L, where L+1+k+64 (the length of the padded input) is a multiple of 512.
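The three padding steps can be sketched on plain booleans as follows. This is a minimal illustration rather than the tutorial's own listing: the name [.c-inline-code]pad_sha256_input[.c-inline-code] and the choice of returning a [.c-inline-code]Vec<bool>[.c-inline-code] are assumptions made for this example.

```rust
/// Pads `input` as described above and returns the padded message as bits,
/// with the most significant bit of each byte first.
/// (Hypothetical helper, named for this sketch only.)
fn pad_sha256_input(input: &[u8]) -> Vec<bool> {
    // Expand every byte into its 8 bits, most significant bit first.
    let mut bits: Vec<bool> = input
        .iter()
        .flat_map(|&byte| (0..8).rev().map(move |i| (byte >> i) & 1 == 1))
        .collect();
    let original_len = bits.len() as u64;

    // Step 1: append a single 1 bit.
    bits.push(true);

    // Step 2: append 0 bits until exactly 64 bits are missing
    // to reach a multiple of 512.
    while bits.len() % 512 != 448 {
        bits.push(false);
    }

    // Step 3: append the original length as a 64-bit big-endian integer.
    bits.extend((0..64).rev().map(|i| (original_len >> i) & 1 == 1));

    bits
}
```

For a 3-byte input such as "abc", L = 24 and k = 423, so the padded input is a single 512-bit chunk.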

#### Operations and functions

Let's take a look at the operations that we will use as building blocks for functions inside the sha256 computation. These are bitwise AND, XOR, NOT, addition modulo 2^{32} and the Rotate Right (ROTR) and Shift Right (SHR) operations, all working with 32-bit words and producing a new word.

We combine these operations inside the sigma (with 4 variations), [.c-inline-code]Ch[.c-inline-code] and [.c-inline-code]Maj[.c-inline-code] functions. When we later change sha256 to be computed homomorphically, we will mainly change the isolated code of each operation.

Here is the definition of each function:
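For reference, these are the standard definitions from the SHA-256 specification, written with the operations listed above. The symbols for AND, XOR and NOT are notation chosen for this block.

```latex
\begin{aligned}
\mathrm{Ch}(x, y, z)  &= (x \wedge y) \oplus (\neg x \wedge z) \\
\mathrm{Maj}(x, y, z) &= (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z) \\
\Sigma_0(x) &= \mathrm{ROTR}^{2}(x) \oplus \mathrm{ROTR}^{13}(x) \oplus \mathrm{ROTR}^{22}(x) \\
\Sigma_1(x) &= \mathrm{ROTR}^{6}(x) \oplus \mathrm{ROTR}^{11}(x) \oplus \mathrm{ROTR}^{25}(x) \\
\sigma_0(x) &= \mathrm{ROTR}^{7}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{SHR}^{3}(x) \\
\sigma_1(x) &= \mathrm{ROTR}^{17}(x) \oplus \mathrm{ROTR}^{19}(x) \oplus \mathrm{SHR}^{10}(x)
\end{aligned}
```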

There are some things to note about the functions. First, Maj can be simplified by applying the boolean distributive law (x AND y) XOR (x AND z) = x AND (y XOR z). So the new Maj function is Maj(x, y, z) = (x AND (y XOR z)) XOR (y AND z).

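Next, we can also see that Ch can be simplified by using a single bitwise multiplexer. Let's take a look at the truth table of the Ch expression.

| x | y | z | Ch(x, y, z) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |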

When [.c-inline-code]x = 0[.c-inline-code] the result is identical to [.c-inline-code]z[.c-inline-code], but when [.c-inline-code]x = 1[.c-inline-code] the result is identical to [.c-inline-code]y[.c-inline-code]. This is the same as saying [.c-inline-code]if x {y} else {z}[.c-inline-code]. Hence we can replace the 4 bitwise operations of Ch by a single bitwise multiplexer.
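Both simplifications can be checked on plain [.c-inline-code]u32[.c-inline-code] words. The sketch below is only an illustration on clear values: the [.c-inline-code]_spec[.c-inline-code] suffix is a naming assumption, and the multiplexer is written bit by bit to mirror the [.c-inline-code]if x {y} else {z}[.c-inline-code] reading.

```rust
/// Maj as written in the specification: 3 ANDs and 2 XORs.
fn maj_spec(x: u32, y: u32, z: u32) -> u32 {
    (x & y) ^ (x & z) ^ (y & z)
}

/// Maj after factoring out x: 2 ANDs and 2 XORs.
fn maj(x: u32, y: u32, z: u32) -> u32 {
    (x & (y ^ z)) ^ (y & z)
}

/// Ch as written in the specification: 2 ANDs, 1 NOT and 1 XOR.
fn ch_spec(x: u32, y: u32, z: u32) -> u32 {
    (x & y) ^ (!x & z)
}

/// Ch as a bitwise multiplexer: each bit of x selects between y and z.
fn ch(x: u32, y: u32, z: u32) -> u32 {
    (0..32).fold(0, |acc, i| {
        let select = (x >> i) & 1 == 1;
        let chosen = if select { y } else { z };
        // Keep only bit i of the selected word.
        acc | (chosen & (1 << i))
    })
}
```

In the homomorphic setting the saving matters more than on clear data, because every removed gate is one less homomorphic operation per bit.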

Note that all these operations can be evaluated homomorphically. ROTR and SHR can be evaluated by changing the index of each individual bit of the word, even if each bit is encrypted, without using any homomorphic operation. Bitwise AND, XOR and multiplexer can be computed homomorphically and addition modulo 2^{32} can be broken down into boolean homomorphic operations as well.

#### Sha256 computation

As we have mentioned, the sha256 function works with chunks of 512 bits. For each chunk, we will compute 64 32-bit words. 16 will come from the 512 bits and the rest will be computed using the previous functions. After computing the 64 words, and still within the same chunk iteration, a compression loop will compute a hash value (8 32-bit words), again using the previous functions and some constants to mix everything up. When we finish the last chunk iteration, the resulting hash values will be the output of the sha256 function.

Here is what this function looks like using arrays of 32 booleans to represent words:
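As a compact reference, the chunk loop described above can also be sketched on plain [.c-inline-code]u32[.c-inline-code] words, which keeps the listing short. This is an assumption-laden illustration, not the tutorial's boolean-array code: the names [.c-inline-code]sha256[.c-inline-code], [.c-inline-code]sigma0[.c-inline-code], [.c-inline-code]sigma1[.c-inline-code], [.c-inline-code]K[.c-inline-code] and [.c-inline-code]INITIAL_HASH[.c-inline-code] are chosen for this sketch, and the input is expected to be padded already.

```rust
// Round constants and initial hash values from the SHA-256 specification.
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];
const INITIAL_HASH: [u32; 8] = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

fn ch(x: u32, y: u32, z: u32) -> u32 { (x & y) ^ (!x & z) }
fn maj(x: u32, y: u32, z: u32) -> u32 { (x & (y ^ z)) ^ (y & z) }
fn sigma0(x: u32) -> u32 { x.rotate_right(7) ^ x.rotate_right(18) ^ (x >> 3) }
fn sigma1(x: u32) -> u32 { x.rotate_right(17) ^ x.rotate_right(19) ^ (x >> 10) }
fn sigma_upper_case_0(x: u32) -> u32 { x.rotate_right(2) ^ x.rotate_right(13) ^ x.rotate_right(22) }
fn sigma_upper_case_1(x: u32) -> u32 { x.rotate_right(6) ^ x.rotate_right(11) ^ x.rotate_right(25) }

/// Hashes an input that has already been padded to a multiple of 64 bytes.
fn sha256(padded_input: &[u8]) -> [u32; 8] {
    let mut hash = INITIAL_HASH;

    for chunk in padded_input.chunks_exact(64) {
        // The first 16 words come directly from the 512-bit chunk.
        let mut w = [0u32; 64];
        for (i, bytes) in chunk.chunks_exact(4).enumerate() {
            w[i] = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        }
        // The remaining 48 words are derived from earlier ones.
        for i in 16..64 {
            w[i] = sigma1(w[i - 2])
                .wrapping_add(w[i - 7])
                .wrapping_add(sigma0(w[i - 15]))
                .wrapping_add(w[i - 16]);
        }

        // Compression loop over the 64 words.
        let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = hash;
        for i in 0..64 {
            let temp1 = h
                .wrapping_add(sigma_upper_case_1(e))
                .wrapping_add(ch(e, f, g))
                .wrapping_add(K[i])
                .wrapping_add(w[i]);
            let temp2 = sigma_upper_case_0(a).wrapping_add(maj(a, b, c));
            h = g;
            g = f;
            f = e;
            e = d.wrapping_add(temp1);
            d = c;
            c = b;
            b = a;
            a = temp1.wrapping_add(temp2);
        }

        // Fold the working variables back into the running hash value.
        for (value, word) in hash.iter_mut().zip([a, b, c, d, e, f, g, h]) {
            *value = value.wrapping_add(word);
        }
    }

    hash
}
```

Every [.c-inline-code]wrapping_add[.c-inline-code], rotation and bitwise operator in this sketch corresponds to one of the operations that later has to be replaced by its homomorphic counterpart.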

## Making it homomorphic

The key idea is that we can replace each bit of [.c-inline-code]padded_input[.c-inline-code] with a Fully Homomorphic Encryption of the same bit value, and operate over the encrypted values using homomorphic operations. To achieve this we need to change the function signatures and deal with the borrowing rules of the Ciphertext type (which represents an encrypted bit) but the structure of the sha256 function remains the same. The part of the code that requires more consideration is the implementation of the sha256 operations, since they will use homomorphic boolean operations internally.

Homomorphic operations are really expensive, so we have to remove their unnecessary use and maximize parallelization in order to speed up the program. To simplify our code we use the Rayon crate which provides parallel iterators and efficiently manages threads. Let's now take a look at each sha256 operation!

#### Rotate Right and Shift Right

As we have highlighted, these two operations can be evaluated by changing the position of each encrypted bit in the word, thereby requiring 0 homomorphic operations. Here is our implementation:
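A minimal sketch of the idea, assuming that index 0 of the array holds the least significant bit and using a generic element type [.c-inline-code]T[.c-inline-code] as a stand-in for an encrypted bit. The [.c-inline-code]zero[.c-inline-code] parameter is an assumption of this sketch: in a homomorphic setting it would be some encryption of 0 supplied by the caller.

```rust
/// Rotates a 32-bit word to the right by `n` positions.
/// `x[0]` is the least significant bit (assumption of this sketch).
fn rotate_right<T: Clone>(x: &[T; 32], n: usize) -> [T; 32] {
    let mut result = x.clone();
    // Bit i of the result is bit (i + n) mod 32 of the input, which is a
    // left rotation of the array when index 0 is the least significant bit.
    result.rotate_left(n % 32);
    result
}

/// Shifts a 32-bit word to the right by `n` positions, filling the vacated
/// most significant positions with copies of `zero`.
fn shift_right<T: Clone>(x: &[T; 32], n: usize, zero: &T) -> [T; 32] {
    let mut result = rotate_right(x, n);
    // The n most significant positions hold wrapped-around bits: overwrite them.
    let n = n.min(32);
    result[32 - n..].fill(zero.clone());
    result
}
```

Neither function inspects the bits it moves, which is why no homomorphic operation is needed.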

#### Bitwise XOR, AND, Multiplexer

To implement these operations we will use the [.c-inline-code]xor[.c-inline-code], [.c-inline-code]and[.c-inline-code] and [.c-inline-code]mux[.c-inline-code] methods provided by the tfhe library to evaluate each boolean operation homomorphically. It's important to note that, since we will operate bitwise, we can parallelize the homomorphic computations. In other words, we can homomorphically XOR the bits at index 0 of two words using a thread, while XORing the bits at index 1 using another thread, and so on. This means we could compute these bitwise operations using up to 32 concurrent threads (since we work with 32-bit words).

Here is our implementation of the bitwise homomorphic XOR operation. The [.c-inline-code]par_iter[.c-inline-code] and [.c-inline-code]par_iter_mut[.c-inline-code] methods create a parallel iterator that we use to compute each individual XOR efficiently. The other two bitwise operations are implemented in the same way.
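The per-bit parallelism can be illustrated with the standard library alone. The sketch below is a stand-in, not the Rayon-based code the text describes: it uses [.c-inline-code]std::thread::scope[.c-inline-code] with one thread per bit, a generic [.c-inline-code]gate[.c-inline-code] closure in place of a homomorphic gate, and plain booleans in place of ciphertexts. The name [.c-inline-code]bitwise_parallel[.c-inline-code] is hypothetical.

```rust
use std::thread;

/// Applies `gate` to every pair of bits of `a` and `b`, one scoped thread per bit.
/// (Hypothetical helper; a thread pool would avoid spawning 32 threads per call.)
fn bitwise_parallel<T, F>(a: &[T; 32], b: &[T; 32], gate: F) -> Vec<T>
where
    T: Send + Sync,
    F: Fn(&T, &T) -> T + Sync,
{
    thread::scope(|scope| {
        // Spawn one worker per bit position; each worker borrows its two input bits.
        let workers: Vec<_> = a
            .iter()
            .zip(b.iter())
            .map(|(x, y)| {
                let gate = &gate;
                scope.spawn(move || gate(x, y))
            })
            .collect();

        // Join in spawn order so that output bit i comes from input bits i.
        workers
            .into_iter()
            .map(|worker| worker.join().expect("gate panicked"))
            .collect()
    })
}

/// Bitwise XOR on clear bits, standing in for the homomorphic gate.
fn xor(a: &[bool; 32], b: &[bool; 32]) -> Vec<bool> {
    bitwise_parallel(a, b, |x, y| x ^ y)
}
```

On clear booleans the threads cost far more than the gates they run; the pattern only pays off when each gate is as expensive as a homomorphic boolean operation.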

#### Addition modulo 2^{32}

This is perhaps the trickiest operation to efficiently implement in a homomorphic fashion. A naive implementation could use the Ripple Carry Adder algorithm, which is straightforward but cannot be parallelized because each step depends on the previous one.
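To make the sequential dependency concrete, here is a minimal ripple-carry sketch on plain booleans, with index 0 as the least significant bit (an assumption of this sketch). The function name [.c-inline-code]ripple_carry_add[.c-inline-code] is hypothetical.

```rust
/// Adds two 32-bit words modulo 2^32 with a ripple-carry adder.
/// Index 0 is the least significant bit.
fn ripple_carry_add(a: &[bool; 32], b: &[bool; 32]) -> [bool; 32] {
    let mut sum = [false; 32];
    let mut carry = false;

    for i in 0..32 {
        // Full adder built from XOR and AND gates only.
        let a_xor_b = a[i] ^ b[i];
        sum[i] = a_xor_b ^ carry;
        // The next carry needs the current one: this is the dependency
        // that prevents the loop iterations from running in parallel.
        carry = (a[i] & b[i]) ^ (a_xor_b & carry);
    }

    // Dropping the final carry gives addition modulo 2^32.
    sum
}
```

Each iteration waits for the carry of the previous one, so the 32 full adders run strictly one after another.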

A better choice would be the Carry Lookahead Adder, which allows us to use the parallelized AND and XOR bitwise operations. With this design, our adder is around 50% faster than the Ripple Carry Adder.

To improve performance even further, the function that computes the carry signals can also be parallelized using parallel prefix algorithms. These algorithms involve more boolean operations (so homomorphic operations for us) but may be faster because of their parallel nature. We have implemented the Brent-Kung and Ladner-Fischer algorithms, which entail different tradeoffs.

Brent-Kung has the fewest boolean operations we could find (140 when using grey cells, for 32-bit numbers), which makes it suitable when we can't process many operations concurrently and quickly. Our results confirm that it's indeed faster than both the sequential algorithm and Ladner-Fischer when run on regular computers.

On the other hand, Ladner-Fischer performs more boolean operations (209 using grey cells) than Brent-Kung, but they are performed in larger batches. Hence we can compute more operations in parallel and finish earlier, but we need more fast threads available or they will slow down the carry signals computation. Ladner-Fischer can be suitable when using cloud-based computing services, which offer many high-speed threads.

Our implementation uses Brent-Kung by default, but Ladner-Fischer can be enabled when needed by using the [.c-inline-code]--ladner-fischer[.c-inline-code] command line argument.
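The structure of a Brent-Kung carry computation can be sketched on plain booleans as follows. This is an illustration under several assumptions, not the implementation discussed above: index 0 is the least significant bit, every cell is a full "black" cell (so the operation count differs from the grey-cell figure quoted earlier), and the levels run sequentially here even though the cells inside one level are independent of each other. The names [.c-inline-code]combine[.c-inline-code] and [.c-inline-code]brent_kung_add[.c-inline-code] are hypothetical.

```rust
/// Combines two (generate, propagate) pairs, where `high` covers the more
/// significant bit range and `low` the adjacent less significant one.
fn combine(high: (bool, bool), low: (bool, bool)) -> (bool, bool) {
    (high.0 | (high.1 & low.0), high.1 & low.1)
}

/// Adds two 32-bit words modulo 2^32 using a Brent-Kung prefix network.
fn brent_kung_add(a: &[bool; 32], b: &[bool; 32]) -> [bool; 32] {
    // Propagate and generate signals are bitwise, hence independent per bit.
    let propagate: [bool; 32] = std::array::from_fn(|i| a[i] ^ b[i]);
    let mut prefix: [(bool, bool); 32] =
        std::array::from_fn(|i| (a[i] & b[i], propagate[i]));

    // Up-sweep: build prefixes over blocks of 2, 4, 8, 16 and 32 bits.
    for level in 0..5 {
        let half = 1 << level;
        for i in ((2 * half - 1)..32).step_by(2 * half) {
            prefix[i] = combine(prefix[i], prefix[i - half]);
        }
    }

    // Down-sweep: complete the remaining positions from the block prefixes.
    for level in (0..4).rev() {
        let half = 1 << level;
        for i in ((3 * half - 1)..32).step_by(2 * half) {
            prefix[i] = combine(prefix[i], prefix[i - half]);
        }
    }

    // prefix[i].0 is now the carry out of bit i, so the carry into bit i
    // is prefix[i - 1].0 (and there is no carry into bit 0).
    std::array::from_fn(|i| {
        let carry_in = if i == 0 { false } else { prefix[i - 1].0 };
        propagate[i] ^ carry_in
    })
}
```

The network has 9 levels (5 up, 4 down) instead of the 32 sequential steps of a ripple-carry adder, and the cells within a level could be evaluated concurrently.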

For more information about parallel prefix adders you can read this paper or this other paper.

Finally, with all these sha256 operations working homomorphically, our functions will be homomorphic as well, along with the whole sha256 function (after adapting the code to work with the Ciphertext type). Let's talk about other performance improvements we can make before we finish.

### More parallel processing

If we inspect the main [.c-inline-code]sha256_fhe[.c-inline-code] function, we will find operations that can be performed in parallel. For instance, within the compression loop, [.c-inline-code]temp1[.c-inline-code] and [.c-inline-code]temp2[.c-inline-code] can be computed concurrently. An efficient way to parallelize computations here is using the [.c-inline-code]rayon::join()[.c-inline-code] function, which uses parallel processing only when there are available CPUs. Recall that the two temporary values in the compression loop are the result of several additions, so we can use nested calls to [.c-inline-code]rayon::join()[.c-inline-code] to potentially parallelize more operations.

Another way to speed up consecutive additions is to use the Carry Save Adder, a very efficient adder that takes 3 numbers and returns a sum and a carry sequence. If our inputs are A, B and C, we can construct a CSA with our previously implemented Maj function and the bitwise XOR operation as follows: the sum sequence is A XOR B XOR C, and the carry sequence is Maj(A, B, C) shifted one position to the left.
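On plain [.c-inline-code]u32[.c-inline-code] words, a carry-save adder is only a few lines. The sketch below is an illustration on clear values; the function name [.c-inline-code]csa[.c-inline-code] is an assumption of this example.

```rust
fn maj(x: u32, y: u32, z: u32) -> u32 {
    (x & (y ^ z)) ^ (y & z)
}

/// Carry-save adder: reduces three words to a (sum, carry) pair such that
/// sum + carry == a + b + c (mod 2^32), without propagating any carry.
fn csa(a: u32, b: u32, c: u32) -> (u32, u32) {
    // Per-bit sum, ignoring carries.
    let sum = a ^ b ^ c;
    // A carry is produced wherever at least two inputs are 1, and it belongs
    // to the next more significant position, hence the shift.
    let carry = maj(a, b, c) << 1;
    (sum, carry)
}
```

Both outputs are bitwise, so every bit position can be computed independently; the only carry propagation left is in the final conventional addition of the sum and carry sequences.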

By chaining CSAs, we can input the sum and carry from a preceding stage along with another number into a new CSA. Finally, to get the result of the additions we add the sum and carry sequences using a conventional adder. At the end we are performing the same number of additions, but some of them are now CSAs, speeding up the process. Let's see all this together in the [.c-inline-code]temp1[.c-inline-code] and [.c-inline-code]temp2[.c-inline-code] computations.

The first closure of the outer call to join will return [.c-inline-code]temp1[.c-inline-code] and the second [.c-inline-code]temp2[.c-inline-code]. Inside the first outer closure we call join recursively until we reach the addition of the value [.c-inline-code]h[.c-inline-code], the current word [.c-inline-code]w[i][.c-inline-code] and the current constant [.c-inline-code]K[i][.c-inline-code] by using the CSA, while potentially computing in parallel the [.c-inline-code]ch[.c-inline-code] function. Then we take the sum, carry and ch values and add them again using the CSA.

All this is done while potentially computing the [.c-inline-code]sigma_upper_case_1[.c-inline-code] function. Finally we input the previous sum, carry and sigma values to the CSA and perform the final addition with [.c-inline-code]add[.c-inline-code]. Once again, this is done while potentially computing [.c-inline-code]sigma_upper_case_0[.c-inline-code] and [.c-inline-code]maj[.c-inline-code] and adding them to get [.c-inline-code]temp2[.c-inline-code], in the second outer closure.
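The nesting described above can be sketched on clear [.c-inline-code]u32[.c-inline-code] values. This is an illustration under stated assumptions rather than the original listing: the [.c-inline-code]join[.c-inline-code] helper below is a simple stand-in for [.c-inline-code]rayon::join()[.c-inline-code] built on [.c-inline-code]std::thread::scope[.c-inline-code] (it always spawns a thread, whereas Rayon only runs in parallel when CPUs are available), and [.c-inline-code]compute_temps[.c-inline-code], [.c-inline-code]csa[.c-inline-code] and the argument layout are hypothetical.

```rust
use std::thread;

/// Runs `first` on the current thread and `second` on a scoped thread.
fn join<A, B, RA, RB>(first: A, second: B) -> (RA, RB)
where
    A: FnOnce() -> RA,
    B: FnOnce() -> RB + Send,
    RB: Send,
{
    thread::scope(|scope| {
        let handle = scope.spawn(second);
        let result_a = first();
        (result_a, handle.join().expect("second task panicked"))
    })
}

fn ch(x: u32, y: u32, z: u32) -> u32 { (x & y) ^ (!x & z) }
fn maj(x: u32, y: u32, z: u32) -> u32 { (x & (y ^ z)) ^ (y & z) }
fn sigma_upper_case_0(x: u32) -> u32 { x.rotate_right(2) ^ x.rotate_right(13) ^ x.rotate_right(22) }
fn sigma_upper_case_1(x: u32) -> u32 { x.rotate_right(6) ^ x.rotate_right(11) ^ x.rotate_right(25) }

/// Carry-save adder: sum + carry == a + b + c (mod 2^32).
fn csa(a: u32, b: u32, c: u32) -> (u32, u32) {
    (a ^ b ^ c, maj(a, b, c) << 1)
}

/// Computes (temp1, temp2) for one compression round.
/// `state` holds the working variables a..h, `k` and `w` are K[i] and w[i].
fn compute_temps(state: &[u32; 8], k: u32, w: u32) -> (u32, u32) {
    let [a, b, c, _, e, f, g, h] = *state;

    join(
        // First outer closure: temp1 = h + Sigma1(e) + Ch(e, f, g) + K[i] + w[i].
        || {
            let ((sum, carry), sigma) = join(
                || {
                    // Innermost level: CSA of h, K[i], w[i] alongside Ch.
                    let ((sum, carry), ch_value) = join(|| csa(h, k, w), || ch(e, f, g));
                    csa(sum, carry, ch_value)
                },
                || sigma_upper_case_1(e),
            );
            let (sum, carry) = csa(sum, carry, sigma);
            // Only this last step is a conventional, carry-propagating addition.
            sum.wrapping_add(carry)
        },
        // Second outer closure: temp2 = Sigma0(a) + Maj(a, b, c).
        || {
            let (sigma, maj_value) = join(|| sigma_upper_case_0(a), || maj(a, b, c));
            sigma.wrapping_add(maj_value)
        },
    )
}
```

The four additions needed for [.c-inline-code]temp1[.c-inline-code] become three CSAs and one conventional addition.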

With some changes of this type, we finally get a homomorphic sha256 function that doesn't leave unused computational resources.

## How to use sha256_bool

First of all, the most important thing when running the program is using the [.c-inline-code]--release[.c-inline-code] flag. The use of sha256_bool would look like this, given the implementation of [.c-inline-code]encrypt_bools[.c-inline-code] and [.c-inline-code]decrypt_bools[.c-inline-code] (code available here):

By using [.c-inline-code]stdin[.c-inline-code] we can supply the data to hash using a file instead of the command line. For example, if our file [.c-inline-code]input.txt[.c-inline-code] is in the same directory as the project, we can use the following shell command after building with [.c-inline-code]cargo build --release[.c-inline-code]:

Our implementation also accepts hexadecimal inputs. To be considered as such, the input must start with "0x" and contain only valid hex digits (otherwise it's interpreted as text).
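The input rule can be sketched as a small parsing function. This is a hypothetical helper written for illustration, not the code of the example program: the name [.c-inline-code]parse_input[.c-inline-code], the left-padding of an odd number of hex digits with a zero, and the treatment of a bare "0x" as text are all assumptions.

```rust
/// Returns the bytes to hash: decoded hex if the input looks like "0x..."
/// followed only by hex digits, the raw text bytes otherwise.
fn parse_input(input: &str) -> Vec<u8> {
    let hex_digits = input
        .strip_prefix("0x")
        // Assumption: a bare "0x" with no digits is treated as text.
        .filter(|digits| !digits.is_empty() && digits.chars().all(|c| c.is_ascii_hexdigit()));

    match hex_digits {
        Some(digits) => {
            // Assumption: an odd number of digits is left-padded with a zero.
            let padded = if digits.len() % 2 == 1 {
                format!("0{digits}")
            } else {
                digits.to_string()
            };
            padded
                .as_bytes()
                .chunks(2)
                .map(|pair| {
                    let text = std::str::from_utf8(pair).expect("ASCII hex digits");
                    u8::from_str_radix(text, 16).expect("validated hex digits")
                })
                .collect()
        }
        // Anything else is interpreted as text.
        None => input.as_bytes().to_vec(),
    }
}
```

With this rule, "0x616263" and "abc" describe the same three bytes, while "0xhello" is hashed as seven characters of text.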

Finally, note that padding is executed on the client side. This has the advantage of hiding the exact length of the input from the server, which already knows nothing about its contents but could otherwise extract information from the length.

Another option would be to perform padding on the server side. The padding function would receive the encrypted input and pad it with trivial bit encryptions. We could then integrate the padding function inside the [.c-inline-code]sha256_fhe[.c-inline-code] function computed by the server.
