Bootstrapping TFHE ciphertexts in less than one millisecond
Buckle up. The team at Zama is thrilled to announce that we have broken the 1 millisecond barrier for TFHE bootstrapping; latency is now measured in microseconds on GPU, while maintaining the same security level and probability of failure, for 4-bit messages. In this blog post, we look back at how far we've come to reach this milestone.

Fully Homomorphic Encryption (FHE) makes it possible to perform an arbitrary number of operations on encrypted data. This is possible thanks to a special operation named bootstrapping, which is also the main performance bottleneck: for FHE to reach widespread use, the latency and throughput of bootstrapping have to be pushed to their limits. Only then will computation on encrypted data approach the latency and throughput of cleartext computation.
At Zama, efforts have focused on accelerating the TFHE programmable bootstrap. It is at the heart of every operation in TFHE-rs: not only does it reset the noise in a ciphertext, it can also apply a function to the encrypted message. This makes it a powerful building block for general-purpose arithmetic on encrypted data.
The first bootstrap implementation at Zama took 53 ms on CPU, with 128 bits of security and a probability of failure of 2^-128 for 4-bit messages. Today, we are pleased to announce that the 1 millisecond barrier has been crossed: with the same security level and probability of failure, the TFHE bootstrap latency for 4-bit messages is now measured in microseconds on GPU. Let’s go back in time to see how this happened.
Accelerating the bootstrap
Bootstrapping was invented by Craig Gentry in 2009, in a construction based on lattice-based cryptography: the main idea is to evaluate the decryption circuit homomorphically. At the time, it was estimated that one bootstrap would take up to 30 minutes to compute.
In 2018, the TFHE bootstrap was introduced [Chillotti2020], building on an earlier scheme named FHEW. In most other FHE schemes, one bootstrap processes thousands of messages at once: when you only need to bootstrap one or a few, you still pay the price of the whole batch. The TFHE bootstrap, by contrast, operates on a single ciphertext with very good latency: as mentioned in the introduction, the first Zama implementation took 53 ms.
For FHE computations to become seamless, both the latency and the throughput of the bootstrap have to come as close as possible to those of cleartext computation. TFHE made low latency possible when only one or a few bootstraps are needed, which was not achievable before.
Zama has been working on GPU acceleration of the TFHE bootstrap almost since the company was founded, betting that it could be made faster on GPU. The bootstrap algorithm is highly sequential, which makes it poorly suited to GPUs. Still, little by little, the bootstrap time has been pushed down. In 2024, one TFHE bootstrap took only 2 ms on one H100 GPU, 26 times faster than the original 53 ms measured on CPU.
This was achieved thanks to an alternative algorithm for the TFHE bootstrap, the multi-bit algorithm [Zhou2018][Joye2022], which offers more parallelism. This algorithm can also be implemented on CPU to reduce latency, but at a significant cost in throughput. On GPU it is a much better fit, as it reduces latency significantly while maintaining throughput. After the first implementation, many low-level optimizations followed to make the best use of GPU resources and maximize parallelism, improving performance step by step.
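The source of that extra parallelism can be checked in cleartext. In a classic blind rotation, the accumulator is multiplied by one monomial X^(a_i s_i) per LWE mask coefficient (up to sign conventions), which gives n dependent steps. The grouping idea of [Zhou2018][Joye2022] handles d coefficients per step. It writes X^(sum a_i s_i) as 1 plus a sum over the 2^d - 1 nonempty subsets S of indicator(S = support(s)) * (X^(sum over S of a_i) - 1). In the encrypted setting, the indicators come from the bootstrapping key, and the 2^d - 1 terms of one step can be computed in parallel. The Python sketch below uses a hypothetical toy polynomial size N and a hypothetical LWE dimension of 840. It only verifies the algebraic identity and counts steps and key elements in this formulation. It says nothing about how TFHE-rs lays out its keys or GPU kernels.

```python
from itertools import product

N = 64  # toy polynomial size (hypothetical); exponents live in Z_{2N}


def monomial(e):
    """X^e reduced modulo X^N + 1, as a list of N integer coefficients."""
    e %= 2 * N
    poly = [0] * N
    if e < N:
        poly[e] = 1
    else:
        poly[e - N] = -1  # X^N = -1
    return poly


def poly_add(p, q):
    return [x + y for x, y in zip(p, q)]


def poly_scale(p, c):
    return [c * x for x in p]


def grouped_monomial(a_group, s_group):
    """Rebuild X^(sum a_i * s_i) from the 2^d - 1 nonempty subsets of a group of size d."""
    result = monomial(0)  # the constant polynomial 1
    minus_one = poly_scale(monomial(0), -1)
    for bits in product((0, 1), repeat=len(a_group)):
        if not any(bits):
            continue  # the empty subset is already covered by the leading 1
        # indicator is 1 exactly when the subset selected by bits equals the support of s
        indicator = 1
        for b, s in zip(bits, s_group):
            indicator *= s if b else 1 - s
        exponent = sum(a for a, b in zip(a_group, bits) if b)
        term = poly_add(monomial(exponent), minus_one)  # X^exponent - 1
        result = poly_add(result, poly_scale(term, indicator))
    return result


def step_and_key_counts(n, d):
    """Sequential accumulator updates and key elements in this formulation (d must divide n)."""
    assert n % d == 0, "this sketch assumes the grouping factor divides n"
    steps = n // d
    return steps, steps * (2**d - 1)


# Exhaustive check of the identity for a group of d = 3 mask coefficients.
a = [5, 77, 130]
for s in product((0, 1), repeat=3):
    assert grouped_monomial(a, s) == monomial(sum(x * y for x, y in zip(a, s)))

for d in (1, 2, 3, 4):
    print(d, step_and_key_counts(840, d))  # n = 840 is a hypothetical LWE dimension
```

The counts show the trade-off. The number of sequential steps is divided by d, while the key grows by a factor (2^d - 1) / d, so larger groups buy parallelism with memory and bandwidth.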
Between 2021 and 2024, the security model also changed: TFHE-rs is now IND-CPAD secure, whereas in 2021 Concrete was only IND-CPA secure, since the corresponding attacks were not yet known. Covering IND-CPAD attacks with 128 bits of security required changing the cryptographic parameters and introducing new techniques to mitigate these attacks [Bernard2025, Ruijter2025]. This had a strong negative effect on performance, which was offset by optimizations and by new cryptographic techniques that reduce the noise level after a bootstrap.
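IND-CPAD security goes hand in hand with a very small probability of decryption failure (2^-128 for the parameters discussed here). A simplified model shows why such a target constrains the parameters. Assume the noise at decryption is Gaussian with standard deviation sigma. This is a common simplification, not the exact analysis behind TFHE-rs or the techniques cited above. Also assume decoding fails when the noise magnitude exceeds a margin t, half the distance between two encoded messages. The failure probability is then erfc(t / (sigma * sqrt(2))). The Python sketch below finds how many standard deviations of margin a given target requires.

```python
import math


def log2_failure_probability(z):
    """log2 P(|e| > t) for e ~ N(0, sigma^2), where z = t / sigma is the margin in std devs."""
    return math.log2(math.erfc(z / math.sqrt(2)))


def required_margin(target_log2=-128.0, lo=0.0, hi=20.0):
    """Smallest z with P(|e| > z * sigma) <= 2^target_log2, found by bisection.

    hi = 20 keeps erfc well above floating-point underflow.
    """
    for _ in range(100):
        mid = (lo + hi) / 2
        if log2_failure_probability(mid) > target_log2:
            lo = mid  # failure still too likely: a larger margin is needed
        else:
            hi = mid
    return hi


for target in (-40, -64, -128):
    print(target, round(required_margin(target), 2))
```

In this model, a 2^-128 target needs a margin of a little over 13 standard deviations. That margin has to be paid for with larger parameters or lower noise, which is why noise-reduction techniques matter for performance.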
Still, 2 ms was too slow. For the past few months, the GPU team at Zama has focused on improving bootstrap performance further. In particular, it introduced an implementation specialized at compile time for the cryptographic parameters used in blockchain applications. Knowing more values at compile time reduces register pressure on the GPU; combined with fine-tuned optimizations, this brought a significant performance improvement.
The bootstrap now takes around 800 µs on one GPU, with 128-bit IND-CPAD security.
This bootstrap encrypts 2 bits of message, as used for booleans, with Gaussian noise: this is the reference configuration in the literature. In practice, for blockchain applications TFHE-rs uses a bootstrap that encrypts 4 bits of message with a TUniform noise distribution; with these parameters, the bootstrap takes 945 µs.
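For reference, here is the TUniform distribution with bound parameter b, as we understand its definition in TFHE-rs (treat the exact convention as an assumption). It takes integer values in [-2^b, 2^b]. Each interior value has probability 2^-(b+1), and the two endpoints have probability 2^-(b+2). Unlike a Gaussian, it is bounded and can be sampled exactly from uniform random bits. The Python sketch below maps b + 2 uniform bits to one sample. It then checks the probabilities by enumerating every possible input.

```python
import secrets
from collections import Counter
from fractions import Fraction


def tuniform_from_bits(u, b):
    """Map a uniform (b + 2)-bit integer u to a TUniform(b) value in [-2^b, 2^b]."""
    # (u >> 1) is uniform on [0, 2^(b+1)); adding the low bit makes each interior value
    # reachable in two ways and each endpoint in only one way.
    return (u >> 1) + (u & 1) - (1 << b)


def sample_tuniform(b):
    """Draw one TUniform(b) sample from the operating system's CSPRNG."""
    return tuniform_from_bits(secrets.randbits(b + 2), b)


def exact_distribution(b):
    """Exact probability of each value, by enumerating all 2^(b+2) inputs."""
    total = 1 << (b + 2)
    counts = Counter(tuniform_from_bits(u, b) for u in range(total))
    return {v: Fraction(c, total) for v, c in sorted(counts.items())}


print(exact_distribution(2))
# endpoints -4 and 4 get 1/16, every value in between gets 1/8
print([sample_tuniform(2) for _ in range(10)])
```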
Benchmarks
Below is a comparison of the current GPU implementation of the bootstrap with the original CPU implementation from 2021. Parameters target 128 bits of IND-CPAD security with a failure probability of 2^-128 or less, and use a TUniform noise distribution.
For the original TFHE boolean bootstrap, we achieve a 24x improvement. For 4-bit integers, which is what we use today in all our products, we have a 56x improvement.
Another very interesting property of TFHE is that computing large batches of bootstraps on multiple GPUs is straightforward: it is as simple as copying chunks of inputs to different GPUs and bootstrapping them independently, with no synchronization or cooperation between GPUs needed for any single bootstrap. Throughput can thus reach 189K bootstraps per second on a single node with 8xH100 GPUs for 4-bit integers, as shown in the Table below.
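The multi-GPU approach described above is an embarrassingly parallel map. The Python sketch below shows its structure only. bootstrap_chunk_on_device is a hypothetical stand-in that applies a lookup table in cleartext instead of launching GPU kernels, and threads play the role of devices. The input batch is split into contiguous chunks, and each chunk is processed independently. Results are then concatenated in device order, so that the output order matches the input order.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_devices):
    """Split items into num_devices contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        end = start + q + (1 if d < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def bootstrap_chunk_on_device(device_id, chunk, lut):
    """Hypothetical stand-in for 'copy chunk to GPU device_id and bootstrap it'.

    Here it just applies the lookup table in cleartext; no state is shared between devices.
    """
    return [lut[x] for x in chunk]


def multi_gpu_bootstrap(inputs, lut, num_devices=8):
    chunks = split_into_chunks(inputs, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(bootstrap_chunk_on_device, d, c, lut)
                   for d, c in enumerate(chunks)]
        # Gathering in device order keeps the output aligned with the input.
        return [y for f in futures for y in f.result()]


lut = [(3 * x) % 16 for x in range(16)]  # hypothetical function applied by the bootstrap
batch = [i % 16 for i in range(1000)]
out = multi_gpu_bootstrap(batch, lut, num_devices=8)
print(len(out), out[:5])  # 1000 [0, 3, 6, 9, 12]
```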
The effect on large integer (FheUint) operations
The latency of one bootstrap is a good indicator of FHE performance, but real use cases rarely involve a single bootstrap. This is why the current GPU implementation in TFHE-rs is neither purely latency-oriented nor purely throughput-oriented: it strikes a balance between the two. This balance is important to accelerate higher-level operations, such as the addition or multiplication of ciphertexts encrypting 32- or 64-bit messages. Specialized latency- and throughput-oriented implementations could bring further performance gains, and the current approach provides a strong basis to start this new journey.
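To see why higher-level operations depend on many bootstraps, here is a cleartext sketch of 64-bit addition on radix-decomposed integers. We assume, for illustration, 4-bit blocks holding 2 bits of message and 2 bits of carry, so a 64-bit integer spans 32 blocks. Adding blocks needs no bootstrap. Extracting each block's message and carry, however, is a lookup, which on ciphertexts would be one programmable bootstrap each. This naive version propagates carries sequentially. Faster implementations can propagate carries in parallel, trading extra bootstraps for lower latency, so the lookup count here is only indicative.

```python
MSG_BITS = 2             # assumed message bits per block (plus 2 carry bits)
MSG_MOD = 1 << MSG_BITS  # 4 possible message values per block
NUM_BITS = 64


def to_blocks(x, num_bits=NUM_BITS):
    """Little-endian radix decomposition of x into num_bits // MSG_BITS blocks."""
    return [(x >> (MSG_BITS * i)) % MSG_MOD for i in range(num_bits // MSG_BITS)]


def from_blocks(blocks):
    """Recompose an integer from its little-endian blocks."""
    return sum(blk << (MSG_BITS * i) for i, blk in enumerate(blocks))


def add_radix(xs, ys):
    """Add two radix integers, counting the lookups that would each be one PBS on ciphertexts.

    Returns (result_blocks, lookups). The final carry is dropped, i.e. addition is mod 2^64.
    """
    out, carry, lookups = [], 0, 0
    for x, y in zip(xs, ys):
        v = x + y + carry        # at most 3 + 3 + 1 = 7: fits in the 4-bit block, no PBS needed
        out.append(v % MSG_MOD)  # lookup 1: keep the message part
        carry = v // MSG_MOD     # lookup 2: extract the carry for the next block
        lookups += 2
    return out, lookups


a, b = 2**64 - 1, 2
blocks, lookups = add_radix(to_blocks(a), to_blocks(b))
print(len(blocks), from_blocks(blocks), lookups)  # 32 1 64
```

Each iteration depends on the carry from the previous one, so in this naive form the 32 blocks form a chain of dependent bootstraps. Bootstrap latency therefore matters even when many lookups can run in parallel.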
With the current implementation, very good latencies can be achieved for the addition and multiplication of ciphertexts encrypting 64 bit integers. Currently, on a single node with 8xH100, the addition of two 64-bit encrypted messages takes 8.7 ms, and their multiplication takes 32 ms, as shown in the Table below:
The full table of benchmarks will be made public when the next TFHE-rs version is released; stay tuned for updates!
We expect this latest achievement to have a tremendous impact on the adoption of FHE in the industry, particularly in blockchain applications. Bear in mind that in such applications, FHE computation is not the only bottleneck: network communication, MPC protocols, data exchanges and zero-knowledge proofs also come into play. Still, FHE performance has never been closer to cleartext computation. And this is only the beginning, as dedicated accelerators are expected to go beyond GPU performance.
Bibliography
- Chillotti, I., Gama, N., Georgieva, M. et al. (2020). TFHE: Fast Fully Homomorphic Encryption Over the Torus. Journal of Cryptology, 33, pp. 34–91. https://doi.org/10.1007/s00145-019-09319-x
- Zhou, T., Yang, X., Liu, L., Zhang, W. and Li, N. (2018). Faster bootstrapping with multiple addends. IEEE Access, 6, pp. 49868–49876. https://eprint.iacr.org/2017/735.pdf
- Joye, M. and Paillier, P. (2022). Blind Rotation in Fully Homomorphic Encryption with Extended Keys. In: Dolev, S., Katz, J. and Meisels, A. (eds.), Cyber Security, Cryptology, and Machine Learning (CSCML 2022), Lecture Notes in Computer Science, vol. 13301, Springer, Cham. https://doi.org/10.1007/978-3-031-07689-3_1
- Bernard, O., Joye, M., Smart, N. P. and Walter, M. (2025). Drifting towards better error probabilities in fully homomorphic encryption schemes. In: Fehr, S. and Fouque, P.-A. (eds.), Advances in Cryptology – EUROCRYPT 2025, Part VIII, Lecture Notes in Computer Science, vol. 15608, pp. 181–211, Springer. https://doi.org/10.1007/978-3-031-91101-9_7
- De Ruijter, T., D'Anvers, J.-P. and Verbauwhede, I. (2025). Don’t be mean: Reducing Approximation Noise in TFHE through Mean Compensation. Cryptology ePrint Archive, Paper 2025/809. https://eprint.iacr.org/2025/809