Bootstrapping TFHE ciphertexts in less than one millisecond

Buckle up. The team at Zama is thrilled to announce that we have broken the 1 millisecond barrier for TFHE bootstrapping; latency is now measured in microseconds on GPU, while maintaining the same security level and probability of failure, for 4-bit messages. In this blog post, we look back at how far we've come to reach this milestone.

Fully Homomorphic Encryption (FHE) makes it possible to apply an arbitrary number of operations on encrypted data. This is possible thanks to a special operation named bootstrapping, which is also the main performance bottleneck: for FHE to reach widespread use, its latency and throughput have to be pushed to their limits. Only then will computation on encrypted data reach latencies and throughput similar to cleartext computation.

At Zama, we have focused on accelerating the TFHE programmable bootstrap. It is at the heart of all the operations in TFHE-rs: not only does it reset the noise in a ciphertext, but it can also apply a function to the encrypted message. This makes it a powerful building block for general-purpose arithmetic on encrypted data.
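To make the "reset the noise and apply a function" idea concrete, here is a minimal plaintext model in Python. It is only a sketch: nothing is encrypted, there is no security, and it does not reflect how TFHE-rs implements the bootstrap. The modulus `Q`, the scaling factor `DELTA`, the noise bound and all function names are assumptions made for illustration. A 4-bit message sits in the top bits of a 64-bit value, noise occupies the low bits, and the toy bootstrap rounds the noise away, applies a lookup table, and re-encodes the result with small fresh noise.

```python
import random

Q = 1 << 64            # toy modulus (assumption, not a TFHE-rs parameter)
MSG_BITS = 4           # 4-bit messages, as in the post
DELTA = Q >> MSG_BITS  # scaling factor: the message lives in the top 4 bits


def encode(message, noise):
    """Place the message in the top bits and add noise in the low bits."""
    return (message * DELTA + noise) % Q


def decode(value):
    """Round to the nearest multiple of DELTA to recover the message."""
    return ((value + DELTA // 2) // DELTA) % (1 << MSG_BITS)


def toy_programmable_bootstrap(value, lut, fresh_noise_bound=1 << 20, rng=random):
    """Plaintext model of a programmable bootstrap.

    Removes the accumulated noise, applies the lookup table `lut` to the
    message, and re-encodes the result with small fresh noise. A real
    bootstrap does this homomorphically, without ever seeing the message.
    """
    message = decode(value)
    fresh_noise = rng.randint(-fresh_noise_bound, fresh_noise_bound)
    return encode(lut[message], fresh_noise)


if __name__ == "__main__":
    square_mod_16 = [(x * x) % 16 for x in range(16)]
    noisy = encode(5, 1 << 58)  # large noise, still below DELTA / 2
    refreshed = toy_programmable_bootstrap(noisy, square_mod_16)
    print(decode(refreshed))
```

In this model decoding is only correct while the absolute noise stays below `DELTA / 2`, which is why the noise has to be reset regularly during a long computation.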

The first bootstrap implementation at Zama took 53 ms on CPU, with 128 bits of security and a probability of failure of 2^-128 for 4-bit messages. Today, it is our pleasure to announce that the 1 millisecond frontier has been crossed, and the TFHE bootstrap latency is now measured in microseconds on GPU, while maintaining the same security level and probability of failure, for 4-bit messages. Let’s go back in time to see how this came about.

Accelerating the bootstrap

Bootstrapping was invented by Craig Gentry in 2009, relying on lattice-based cryptography: the main idea is to evaluate the decryption circuit homomorphically. At the time, the estimate was that a single bootstrap would take up to 30 minutes to compute.

In 2018 the TFHE bootstrapping was introduced [Chillotti2020], following up on a previous scheme named FHEW. With most other FHE schemes, one bootstrap deals with thousands of messages at once, so when all you need is to bootstrap one or a few messages, you still pay the price for the whole batch. The TFHE bootstrap latency, however, is very good: as mentioned in the introduction, the first Zama implementation of this bootstrap took 53 ms.

For FHE computations to become seamless, both the latency and the throughput of the bootstrap have to come as close as possible to those of cleartext calculations. TFHE opened the door to low latency for one or a few bootstraps, which was not possible before.

Zama has been working on GPU acceleration of the TFHE bootstrap almost from the start, betting that it could be made faster on GPU. The bootstrap algorithm is highly sequential, which makes it poorly suited to GPUs. Still, little by little, the bootstrap time was pushed down. In 2024, one TFHE bootstrap took only 2 ms on one H100 GPU, about 26 times faster than the original measurement on CPU.

This was achieved by using an alternative algorithm for the TFHE bootstrap: the multi-bit algorithm [Zhou2018][Joye2022], which offers more parallelism. The algorithm can also be implemented on CPU to achieve better latency, but throughput then degrades significantly. On GPU it is a good fit: it reduces latency significantly while maintaining throughput. After the first implementation, many low-level optimizations were added to make the best use of GPU resources and maximize parallelism, and performance improved step by step.

Between 2021 and 2024, the security model changed: TFHE-rs is now IND-CPAD secure, whereas in 2021 Concrete was only IND-CPA secure, as the related attack was not yet known. Covering IND-CPAD attacks with 128 bits of security required changing the cryptographic parameters and introducing new techniques to mitigate the attack [Bernard2025, Ruijter2025]. This had a strong effect on performance, which was offset by optimizations and by new cryptographic techniques that reduce the noise level after a bootstrap.

Still, 2 ms was too slow. For the past few months, the GPU team at Zama has focused on improving bootstrap performance further. In particular, an implementation specialized at compile time for the blockchain cryptographic parameters was introduced. Having more values known at compile time reduces register pressure on the GPU; combined with fine-tuned optimizations, this yielded a significant performance improvement.

The bootstrap now takes around 800 µs on one GPU, with 128 bits of IND-CPAD security.

This bootstrap operates on ciphertexts that encrypt two bits of message, to deal with booleans, and uses Gaussian noise: this setting is considered the reference in the literature. In practice, TFHE-rs uses a bootstrap over ciphertexts encrypting 4 bits of message with a TUniform noise distribution for blockchain: with these parameters the bootstrap takes 945 µs.

Benchmarks

Below is a comparison of the current GPU implementation of the bootstrap with the original CPU one from 2021. Parameters are for IND-CPAD security, i.e. 128 bits of security and a failure probability of 2^-128 or less, and a TUniform noise distribution.

Latency    Booleans    4-bit integers (what we use today)
2021       19 ms       53 ms
2025       796 µs      945 µs
Speedup    24×         56×

Latency of the bootstrap on CPU and GPU: the CPU latency is measured with Concrete-core 0.1.10 from 2021, to put the current GPU latency in perspective. Ciphertexts are encrypted using a Gaussian noise distribution, for 128 bits of security and a probability of failure of 2^-128. GPU results were measured on the Nebius platform with 1×H100, CPU results were measured on AWS on an hpc7a.96xlarge instance.

For the original TFHE boolean bootstrap, we achieve a 24× improvement. For 4-bit integers, which is what we use today in all our products, we have a 56× improvement.
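The speedups quoted above follow directly from the latencies in the table. As a quick sanity check, the short Python sketch below recomputes them; the latency values are taken from the table, while the function and variable names are only illustrative.

```python
def speedup(old_seconds, new_seconds):
    """Ratio of the old latency to the new latency."""
    return old_seconds / new_seconds


# (2021 CPU latency, 2025 GPU latency) in seconds, from the latency table.
LATENCY = {
    "booleans": (19e-3, 796e-6),
    "4-bit integers": (53e-3, 945e-6),
}

if __name__ == "__main__":
    for name, (cpu_2021, gpu_2025) in LATENCY.items():
        print(f"{name}: {speedup(cpu_2021, gpu_2025):.1f}x")
    # booleans: 23.9x
    # 4-bit integers: 56.1x
```

Rounded to the nearest integer, these are the 24× and 56× figures reported in the table.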

What’s also very interesting with TFHE is that computing large batches of bootstraps on multiple GPUs is straightforward: it’s as simple as copying chunks of inputs to different GPUs and bootstrapping them independently. Performing one bootstrap does not require synchronization or cooperation between GPUs. The throughput can thus reach 189,000 bootstraps per second (PBS/s) on a single node with 8×H100 GPUs for 4-bit integers, as shown in the table below.

Throughput     Booleans         4-bit integers (what we use today)
2021           135 PBS/s        74 PBS/s
2025           223,440 PBS/s    189,000 PBS/s
Improvement    1,655×           2,554×

Throughput of the bootstrap on CPU and GPU: the CPU throughput is measured with Concrete-core 0.1.10 from 2021. Ciphertexts are encrypted using a Gaussian noise distribution, for 128 bits of security and a probability of failure of 2^-128. GPU results were measured on the Nebius platform with 8×H100, CPU results were measured on AWS on an hpc7a.96xlarge instance.
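The multi-GPU batching described above, where chunks of inputs are copied to different GPUs and bootstrapped independently, can be sketched in a few lines of Python. This is only an illustration of the scheduling pattern: `bootstrap_chunk_on_device` is a hypothetical placeholder that does no GPU work, and none of these names come from TFHE-rs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_devices):
    """Split items into num_devices contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_devices)
    chunks, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def bootstrap_chunk_on_device(device_id, chunk):
    """Hypothetical placeholder for the per-GPU work.

    A real implementation would copy the chunk to GPU `device_id` and run
    the bootstrap there; here each input is only tagged with its device.
    """
    return [("bootstrapped", device_id, ciphertext) for ciphertext in chunk]


def bootstrap_batch(ciphertexts, num_devices=8):
    """Bootstrap a batch by giving each device its own independent chunk."""
    chunks = split_into_chunks(ciphertexts, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        # One task per device; tasks never communicate with each other.
        results = list(pool.map(bootstrap_chunk_on_device,
                                range(num_devices), chunks))
    # Concatenate per-device results, preserving the input order.
    return [item for chunk_result in results for item in chunk_result]


if __name__ == "__main__":
    outputs = bootstrap_batch(list(range(20)), num_devices=8)
    print(len(outputs))
```

Because each chunk is processed without any exchange between devices, the only coordination needed is the initial split and the final concatenation.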

The effect on large integer (FheUint) operations

The latency of one bootstrap is a good indicator of FHE performance, but real use cases rarely involve computing a single bootstrap. This is why the current GPU implementation in TFHE-rs is neither latency-oriented nor throughput-oriented, but provides a good tradeoff between the two. This is important for accelerating higher-level operations, such as the addition or multiplication of ciphertexts encrypting 32- or 64-bit messages. Further performance improvements could be achieved with implementations specialized for latency or for throughput. The advantage of the current approach is that it provides a strong basis to start this new journey.

With the current implementation, very good latencies can be achieved for the addition and multiplication of ciphertexts encrypting 64-bit integers. Currently, on a single node with 8×H100, the addition of two 64-bit encrypted messages takes 8.7 ms, and their multiplication takes 32 ms, as shown in the table below:

Latency        64-bit encrypted addition    64-bit encrypted multiplication
2022           2 s                          13 s
2025           8.7 ms                       32 ms
Improvement    230×                         406×

Latency of the 64-bit encrypted addition and multiplication on CPU and GPU: the CPU latency is measured with a version of Concrete from December 2022. Ciphertexts are encrypted using a TUniform noise distribution, for 128 bits of security and a probability of failure of 2^-128. GPU results were measured on the Nebius platform with 8×H100, CPU results were measured on AWS on an hpc7a.96xlarge instance.

The full table of benchmarks will be made public when the next TFHE-rs version is released. Stay tuned for updates!

We expect this latest achievement to have a tremendous impact on the adoption of FHE in industry, particularly in blockchain applications. Bear in mind that in such applications, FHE computation is not the only bottleneck: network communication, MPC protocols, data exchanges and zero-knowledge proofs also come into play. Still, FHE performance has never been closer to cleartext computation. And this is only the beginning, as dedicated accelerators are expected to go beyond GPU performance.

Bibliography

  • Zhou, T., Yang, X., Liu, L., Zhang, W. and Li, N., (2018) Faster bootstrapping with multiple addends, IEEE Access, volume 6, pages 49868-49876. https://eprint.iacr.org/2017/735.pdf
  • Joye, M., Paillier, P. (2022). Blind Rotation in Fully Homomorphic Encryption with Extended Keys. In: Dolev, S., Katz, J., Meisels, A. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2022. Lecture Notes in Computer Science, vol 13301. Springer, Cham. https://doi.org/10.1007/978-3-031-07689-3_1
  • Bernard, O., Joye, M., Smart, N. P. and Walter, M., (2025) Drifting towards better error probabilities in fully homomorphic encryption schemes, In S. Fehr and P.-A. Fouque, Eds., Advances in Cryptology – EUROCRYPT 2025, Part VIII, vol. 15608 of Lecture Notes in Computer Science, pp. 181-211, Springer, https://doi.org/10.1007/978-3-031-91101-9_7 
  • De Ruijter, T., D'Anvers, J.-P. and Verbauwhede, I. (2025) Don’t be mean: Reducing Approximation Noise in TFHE through Mean Compensation, https://eprint.iacr.org/2025/809
