Hybrid Large Language Models To Improve On-premise Deployments with Concrete ML

October 30, 2023

—

Jordan Frery

The Trade-Off: Cloud vs On-Premise

Large Language Models (LLMs) can enable large productivity increases when unleashed on confidential data that companies store in their knowledge bases. However, since users may leak personal or confidential data, companies and sometimes governments have put in place policies that forbid access to cloud-based LLMs. The alternative is on-premise deployment, where the model is deployed fully user-side. In this setting, developers of proprietary LLMs want to ensure that their model IP is protected, that the usage of their LLM complies with license agreements, and that per-token revenue tracking is streamlined. Weighing the pros and cons of cloud and on-premise deployment can be a brain teaser, but a new hybrid FHE LLM feature in Concrete ML addresses many concerns of on-premise deployments.

You can find the full code for this use case at https://github.com/zama-ai/concrete-ml/tree/main/use_case_examples/hybrid_model

Hybrid Deployment: Best of Both Worlds

FHE can preserve the privacy of the user even when remote servers process their sensitive data. Thus, some computation tasks can be offloaded to the cloud. For on-premise LLM deployment, that means running some LLM layers on an untrusted cloud machine, on encrypted data. It’s a win-win: user privacy and the performance of on-premise deployment are maintained, while sensitive LLM weights are not shared with the user. Because users need to query the server to generate each token, this setup also allows usage and license compliance monitoring.

Hybrid Deployment with Concrete ML

Concrete ML provides a HybridFHEModel class that converts any PyTorch model into a hybrid one. Layers that are kept local are executed on-premise, as before, using accelerators such as one or more GPUs. The user can specify which layers are to be moved to the cloud where they will run on encrypted data.

Putting it into Practice: Model conversion

At model development time, the first step is to load a model, for example from the Hugging Face repository.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Then the model developer will call the Concrete ML hybrid model converter, specifying which layers are to be computed by the server on encrypted data.

from concrete.ml.torch.hybrid_model import HybridFHEModel

module_names = "layers.1.mixer.WqKV"
hybrid_model = HybridFHEModel(model, module_names)

It’s now time to turn the PyTorch LLM into a Concrete ML hybrid LLM. A hybrid model contains:

a copy of the original model, with the selected layers removed, to be deployed on-premise
FHE circuits that compute the selected layers on encrypted data, on the server.

To produce the circuits that execute on the server, just call the compile_model method.

hybrid_model.compile_model(
    inputs,
    n_bits=8,
)

The compile_model method takes two arguments: inputs, a set of different prompts used to calibrate the model to FHE requirements, and n_bits, the precision used for quantized weights and activations. Using 8 bits is quite common for LLMs.
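
To make the n_bits parameter more concrete, here is a minimal sketch of generic uniform quantization in plain Python. It only illustrates the idea of mapping floating-point values onto 2^n_bits integer levels. It is not Concrete ML's actual quantization code, and the function names and sample values are hypothetical.

```python
def quantize(values, n_bits=8):
    """Map floats to integers in [0, 2**n_bits - 1] on a uniform grid."""
    lo, hi = min(values), max(values)
    levels = 2**n_bits - 1
    # Guard against a constant input, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo


def dequantize(q, scale, offset):
    """Recover approximate floats from the integer representation."""
    return [qi * scale + offset for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical sample values
q, scale, offset = quantize(weights, n_bits=8)
restored = dequantize(q, scale, offset)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

With 8 bits, the rounding error per value is bounded by half a quantization step, which is why the quantized model usually stays close to the floating-point one.
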
It is easy to check if the hybrid FHE version of the LLM is correct with respect to the original model. Simply compare predictions between FHE simulation and floating point inference of the full model!

simulated_predictions = hybrid_model(x, fhe="simulate")
original_predictions = hybrid_model(x, fhe="disable")
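
One simple way to quantify the agreement is to compare the most likely token picked by each run. The sketch below assumes the predictions have already been converted to plain nested lists of logits, one row per position (for example with .tolist() on a tensor). The helper name and the toy values are hypothetical.

```python
def top1_agreement(logits_a, logits_b):
    """Fraction of positions where both runs pick the same most likely token."""
    if not logits_a or len(logits_a) != len(logits_b):
        raise ValueError("both runs must cover the same, non-zero number of positions")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches / len(logits_a)


# Toy logits for 3 positions over a 4-token vocabulary (hypothetical values).
original = [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
simulated = [[0.2, 1.9, 0.3, 0.1], [1.4, 0.3, 0.1, 0.0], [0.0, 0.1, 2.9, 0.3]]
print(top1_agreement(original, simulated))
```

A value close to 1.0 suggests that quantization did not change the tokens the model would generate.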

Finally, check the execution time when FHE is enabled:

fhe_predictions = hybrid_model(x, fhe="execute")
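
The call above does not measure anything by itself. A minimal timing helper, using only the Python standard library, could look like the sketch below. The helper name is hypothetical and the workload is a stand-in for the model call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; with the hybrid model the call would look like
# timed(hybrid_model, x, fhe="execute").
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f}s")
```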

Hybrid Model Deployment

To deploy a hybrid model, the model developer zeros out the sensitive weights and sends the resulting model to the client. The client instantiates the HybridFHEModel class as above, but this time loads weights only for the layers that are computed on-premise, and configures the remote endpoint that performs the computation in FHE:

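import torch
from transformers import AutoTokenizer
from concrete.ml.torch.hybrid_model import HybridFHEModel

model_name = "microsoft/phi-1_5"
module_names = "layers.1.mixer.WqKV"
sanitized_model_name = "on_premise_FHE_phi-1_5"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model whose sensitive weights were zeroed out by the developer
model = torch.load(sanitized_model_name)

hybrid_model = HybridFHEModel(
    model,
    module_names,
    server_remote_address="http://0.0.0.0:8000",
    model_name=model_name,
    verbose=False,
)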

Next, the client can run the LLM to generate tokens with the familiar syntax:

# prompt, device, num_tokens and streamer are assumed to be defined by the caller
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device=device)
output_ids = model.generate(
    input_ids, max_new_tokens=num_tokens, use_cache=True, streamer=streamer
)
generated = tokenizer.decode(output_ids[0])
print(generated)

Some things to keep in mind:

One or more layers can be chosen to be converted to FHE. To convert several, simply pass their names as a list to the HybridFHEModel class.
It might seem tempting to make full attention layers private. However, such layers contain computationally expensive operations. Instead, moving only the projection heads of these layers to FHE provides a good trade-off.
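
As a sketch of the first point, the list of module names can be built programmatically. The helper below is hypothetical, and it assumes that the other layers follow the same naming pattern as the layer used earlier in this post.

```python
def projection_module_names(layer_indices, template="layers.{}.mixer.WqKV"):
    """Build the dotted module names for the given layer indices."""
    return [template.format(i) for i in layer_indices]


module_names = projection_module_names(range(1, 4))
print(module_names)
# The list would then be passed in place of the single string:
# hybrid_model = HybridFHEModel(model, module_names)
```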

Results

You can find the full code for this use case on the Concrete ML GitHub repository.

This implementation was used to benchmark the phi-1.5 model in a hybrid deployment setting. The model is available on Hugging Face, and the experiment was run on two machines with the same hardware configuration: EC2 m6i.metal with 64 CPU cores.

In this experiment, the ciphertext of a token's 2048-dimensional embedding vector has a size of 20MB, and the encrypted output of the projections is 60MB. Transferring data between client and server took 1s, and the FHE execution time was around 1.5s.

This scenario protects a single attention layer which contains about 1% of the total weights. Extra layers could be easily protected by specifying them at compilation time.

A generation latency of 2.5 seconds per token was achieved. For comparison, fully on-premise inference needs around 50ms per token on the same machine. The slow-down introduced by the FHE computation and communication will be reduced in later Concrete ML versions. With ciphertext seeding, the size of the ciphertexts will decrease by a factor of 1000. Furthermore, using GPUs for the server-side computation should reduce its latency many times over.
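
The figures reported in this section fit together as follows. The snippet is only a worked arithmetic check of the numbers above, several of which are approximate. The variable names are illustrative.

```python
# Figures reported above, in integer units to keep the arithmetic exact.
transfer_ms = 1000      # client <-> server transfer time per token
fhe_ms = 1500           # FHE execution time per token (approximate)
on_premise_ms = 50      # fully on-premise latency per token (approximate)
input_ct_mb = 20        # encrypted embedding sent to the server
output_ct_mb = 60       # encrypted projections sent back to the client

hybrid_ms = transfer_ms + fhe_ms          # total latency per token
slowdown = hybrid_ms / on_premise_ms      # relative to on-premise inference
traffic_mb = input_ct_mb + output_ct_mb   # data exchanged per token

print(hybrid_ms, slowdown, traffic_mb)
```

In other words, the hybrid setup is roughly 50 times slower per token than fully on-premise inference in this experiment, and it exchanges about 80MB of ciphertext per generated token.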

The hybrid LLM discussed in this blog post addresses many of the issues of pure on-premise deployment: it adds IP protection of model weights and easy usage monitoring.

Additional links

Star the Concrete ML GitHub repository to endorse our work.
Review the Concrete ML documentation.
Get support on our community channels.
Help advance the FHE space with the Zama Bounty Program.
