This is the first post in a series dedicated to making large language models (LLMs) encrypted end-to-end with homomorphic encryption. We will publish more details on how to achieve it technically as we make progress towards this goal in the coming years.
Large language models (LLMs) are probably the biggest breakthrough in AI over the last decade. While there were successes using other architectures, the recent advances displayed by the likes of ChatGPT and Bing have completely changed the game. It is now clear that AI will transform society as deeply as the internet did before.
While LLMs trained on public data can be used for anything from writing a blog post to writing code, the real power comes from the ability to contextualize models, either by fine-tuning or by bootstrapping the prompt with some additional information. In both cases, you will need to feed the LLM custom data, which could include highly sensitive things such as your messaging history, your company’s internal documents, your Slack messages, etc. This lack of privacy protection is precisely what led Italy to temporarily ban ChatGPT and drew scrutiny from regulators in other countries.
But what if we could have encrypted conversations with LLMs in the same way that we have encrypted conversations with our friends on messaging apps? Being able to use LLMs without revealing our personal data would unleash the true power of AI, while making both users and regulators happy. And as it turns out, this is becoming possible thanks to a powerful encryption technique called Fully Homomorphic Encryption (FHE).
The idea behind FHE is that you can process encrypted data without ever having to decrypt it. In the context of AI, this is how it would work:
- Encrypt your context and query using a secret key that only you know.
- Send the encrypted prompt to the service provider running the LLM.
- The service provider evaluates the LLM directly on the encrypted data, producing an encrypted response. At no point does the LLM or the service provider see the data in the clear.
- Receive an encrypted response that you decrypt with your key to reveal the output.
Similarly, if you want to fine-tune a model using some sensitive data, you would encrypt the entire dataset under your secret key, then send it to the service provider, who would train the model blindly, without ever seeing the data.
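The encrypt, send, compute, decrypt flow described above can be illustrated with a small, self-contained toy. This is only a sketch under loud assumptions: it uses the Paillier scheme, which is additively homomorphic (ciphertexts can be added and multiplied by clear constants) and is not fully homomorphic, so it can only evaluate a linear layer with clear weights, not a whole LLM. The primes are tiny and insecure, and the function names (`keygen`, `encrypt`, `decrypt`, `server_dot`) are hypothetical, chosen for this illustration only.

```python
import math
import secrets

# Toy parameters (NOT secure): two well-known Mersenne primes.
P = 2**31 - 1
Q = 2**61 - 1

def keygen(p=P, q=Q):
    """Return (public modulus n, secret key). Uses the generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m):
    """Client side: encrypt the integer m (0 <= m < n) with fresh randomness."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """Client side: recover the plaintext using the secret key."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def server_dot(n, enc_xs, weights):
    """Provider side: weighted sum of encrypted inputs, no secret key needed.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to a clear
    non-negative integer w multiplies its plaintext by w.
    """
    n2 = n * n
    acc = 1  # trivial encryption of 0
    for c, w in zip(enc_xs, weights):
        acc = (acc * pow(c, w, n2)) % n2
    return acc

# 1. The user encrypts a (toy) input under their own key.
n, sk = keygen()
enc_inputs = [encrypt(n, x) for x in [3, 5, 7]]
# 2-3. The provider computes on ciphertexts only.
enc_result = server_dot(n, enc_inputs, [2, 4, 1])
# 4. Only the key holder can read the answer.
result = decrypt(n, sk, enc_result)
print(result)  # 3*2 + 5*4 + 7*1 = 33
```

The provider only ever handles `enc_inputs` and `enc_result`, both of which are meaningless without the secret key. Evaluating non-linear functions on ciphertexts is what requires a fully homomorphic scheme rather than this additive one.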
The reason FHE wasn’t used before is simply that it wasn’t ready: it was too slow to be practical, too limited in what it could compute, and too difficult to use. This is exactly what we are solving at Zama, by providing tools that enable data scientists and developers to use FHE without having to know cryptography. Our technology enables any computation to be carried out over encrypted data, regardless of its complexity.
But while performance has improved by 20x in the last 3 years, we are still very far from being able to run LLMs in FHE in a cost-effective way. A simple back-of-the-envelope calculation shows that for an average-sized LLM, generating one encrypted token would require up to 1 billion high-precision PBS operations. PBS stands for “programmable bootstrapping”, the most costly operation in FHE, which is used to compute functions on encrypted data (such as activation functions in neural networks).
On a modern CPU, we can compute around 200 8-bit PBS per second, and that second of compute costs about $0.001. At up to 1 billion PBS per token, generating a single token would thus cost around $5,000. To make this economically viable, a token should cost at most $0.01, which means we need a 500,000x improvement.
While a 500,000x improvement may sound impossible, it is actually on the horizon. There are three major trends at play here:
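The back-of-the-envelope estimate can be reproduced in a few lines. This sketch uses only the figures quoted above and assumes that the $0.001 is the price of one CPU-second (that is, of 200 PBS); the constant names are made up for the example. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

PBS_PER_TOKEN = 10**9                     # upper bound quoted in the text
PBS_PER_CPU_SECOND = 200                  # 8-bit PBS throughput on a modern CPU
COST_PER_CPU_SECOND = Fraction(1, 1000)   # $0.001 (assumed to be per CPU-second)
TARGET_COST_PER_TOKEN = Fraction(1, 100)  # $0.01

cpu_seconds_per_token = Fraction(PBS_PER_TOKEN, PBS_PER_CPU_SECOND)
cost_per_token = cpu_seconds_per_token * COST_PER_CPU_SECOND
required_improvement = cost_per_token / TARGET_COST_PER_TOKEN

print(f"CPU-seconds per token: {int(cpu_seconds_per_token):,}")  # 5,000,000
print(f"Cost per token: ${float(cost_per_token):,.2f}")          # $5,000.00
print(f"Required improvement: {int(required_improvement):,}x")   # 500,000x
```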
- LLMs are getting faster thanks to compression techniques. This means less data to compute homomorphically. While this is hard to evaluate, it’s likely to bring at least a 2x performance improvement.
- The cryptography behind FHE is getting better, and we can expect to get at least a 5x speedup within 5 years.
- The biggest speedup, however, will come from dedicated hardware acceleration. Several companies are currently working on this, and are targeting a 1,000x speedup for their first generation (planned for 2025) and up to 10,000x for their second generation. Combined with the 2x and 5x gains above, a second-generation accelerator would deliver roughly a 100,000x improvement, so you would eventually only need about 5 FHE accelerators to close the 500,000x gap and run an encrypted LLM, on par with the number of GPUs needed today for non-encrypted ones.
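To see where the figure of about 5 accelerators comes from, the three trends can be combined as follows. This is a rough sketch that assumes the speedups multiply independently and that work spreads evenly across accelerators, neither of which the text guarantees; `accelerators_needed` is a hypothetical helper.

```python
REQUIRED_IMPROVEMENT = 500_000  # gap between $5,000 and $0.01 per token
MODEL_COMPRESSION = 2           # faster, compressed LLMs
CRYPTO_SPEEDUP = 5              # better FHE cryptography

def accelerators_needed(hardware_speedup):
    """Accelerators required to close the gap, assuming linear scaling."""
    per_accelerator = MODEL_COMPRESSION * CRYPTO_SPEEDUP * hardware_speedup
    return -(-REQUIRED_IMPROVEMENT // per_accelerator)  # ceiling division

first_gen = accelerators_needed(1_000)    # first-generation target
second_gen = accelerators_needed(10_000)  # second-generation target
print(first_gen, second_gen)  # 50 5
```

Under these assumptions, first-generation hardware would still need around 50 accelerators per encrypted LLM, while the second generation brings that down to 5.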
Since most of the challenges in FHE are already solved (or will be in the near future), we can confidently expect to have end-to-end encrypted AI within 5 years. I strongly believe that when this happens, nobody will care about privacy anymore, not because it’s unimportant, but because it will be guaranteed by design.
- Chat with the author @randhindi and follow @zama_fhe on Twitter.
- Star the Concrete ML GitHub repository to endorse our work.
- Review the Concrete ML documentation.
- Get support on our community channels.
- Help advance the FHE space with the Zama Bounty Program.
- Try Zama's latest demo: an encrypted image filtering app using FHE.