Training Predictive Models on Encrypted Data using Fully Homomorphic Encryption

March 14, 2024


Jordan Frery and Luis Montero

The transformative power of data across sectors like healthcare, finance, advertising, and genomics cannot be overstated. Yet, as valuable as this data can be, it is often laden with sensitive information that can include personally identifiable details, making its security paramount. Herein lies the potential of Fully Homomorphic Encryption (FHE) – a groundbreaking technology that secures data while preserving its utility, allowing data owners to process it even in its encrypted state.

The implications of FHE stretch far into the future of machine learning, offering a path to unlock use-cases where data privacy isn't just a requirement but a cornerstone. By enabling the training of machine learning models on encrypted data, FHE introduces a new era of privacy protections in collaborative environments. Imagine a scenario where entities can enrich their models by leveraging the data of others, without ever compromising the integrity and confidentiality of the information shared. This not only safeguards privacy but also fosters a culture of trust and cooperation.

Moreover, in such collaborative settings, the ability to train interpretable models on encrypted data can be revolutionary. It offers clear insights into how external data enhances model performance, thereby validating the value of collaboration. This transparency and understanding are crucial, creating a solid foundation of confidence among parties that their partnership is not only secure but also mutually beneficial.

In essence, Fully Homomorphic Encryption is not just a tool for data security; it is a catalyst for innovation, enabling safer, more productive collaborations across industries where privacy concerns have traditionally hindered progress. Its application in training machine learning models on encrypted data promises a future where privacy and utility go hand in hand, unlocking unprecedented potential for growth and advancement.

Training on encrypted data with Concrete ML

With its most recent release, v1.4, Concrete ML supports training Logistic Regression models on encrypted data. This class of models is well known for its simplicity, robustness, and interpretability. The implementation trains the model with stochastic gradient descent, so the data set is split into batches during training.
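To make the batching concrete, here is a minimal cleartext sketch of mini-batch stochastic gradient descent for logistic regression, written in plain Python. It is not Concrete ML's implementation: the function names, the learning rate, the batch size, and the toy data are all assumptions chosen for illustration, and nothing here is encrypted.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic_regression(X, y, batch_size=8, epochs=50, lr=0.5, seed=0):
    """Fit weights and bias by mini-batch gradient descent on the log-loss."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    indices = list(range(len(X)))

    for _ in range(epochs):
        rng.shuffle(indices)
        # Each update only sees one batch of samples
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                z = sum(w * x for w, x in zip(weights, X[i])) + bias
                error = sigmoid(z) - y[i]
                for j in range(n_features):
                    grad_w[j] += error * X[i][j]
                grad_b += error
            # Step against the gradient averaged over the batch
            weights = [w - lr * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b / len(batch)
    return weights, bias


def predict(X, weights, bias):
    """Return the predicted class (0 or 1) for each row."""
    return [
        int(sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= 0.5)
        for row in X
    ]


# Tiny linearly separable toy data set (hypothetical values)
X_toy = [[-1.0, -0.8], [-0.9, -0.5], [-0.7, -0.9], [-0.6, -0.4],
         [0.6, 0.5], [0.7, 0.9], [0.8, 0.4], [1.0, 0.8]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = sgd_logistic_regression(X_toy, y_toy, batch_size=4)
predictions = predict(X_toy, weights, bias)
```

Because every update depends on a single batch, the same loop structure lends itself to processing one encrypted batch at a time.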

The current encrypted training approach in Concrete ML performs encryption, training, and decryption jointly. This is useful during development for data scientists who want to explore the accuracy of models trained on encrypted data. In a real client-server application, these steps would be separate so that the server only receives and sends encrypted data.

First, instantiate the Logistic Regression training class, SGDClassifier.

from concrete.ml.sklearn import SGDClassifier
model = SGDClassifier(fit_encrypted=True)

Next, simply call the fit function while specifying that the training should use FHE. This function quantizes and encrypts the training data and labels and, after training, decrypts the resulting model.

model.fit(X_binary, y_binary, fhe="execute")

The model parameters, stored in the clear in the model object, can now be used to predict on new clear or encrypted data.
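The quantization step mentioned above can be pictured with a small sketch. This is a generic uniform quantizer written in plain Python, not Concrete ML's own scheme: the function names, the 6-bit width, and the [-1, 1] value range are assumptions made for illustration.

```python
def quantize(values, n_bits, value_min, value_max):
    """Map floats in [value_min, value_max] to integers in [0, 2**n_bits - 1]."""
    levels = 2 ** n_bits - 1
    scale = (value_max - value_min) / levels
    quantized = []
    for v in values:
        # Clip to the supported range before rounding to the nearest level
        clipped = min(max(v, value_min), value_max)
        quantized.append(round((clipped - value_min) / scale))
    return quantized, scale


def dequantize(quantized, scale, value_min):
    """Map quantized integers back to approximate float values."""
    return [q * scale + value_min for q in quantized]


# Hypothetical feature values, already scaled to [-1, 1]
features = [-1.0, -0.25, 0.0, 0.4, 1.0]
q_features, scale = quantize(features, n_bits=6, value_min=-1.0, value_max=1.0)
recovered = dequantize(q_features, scale, value_min=-1.0)

# The round trip loses at most half a quantization step per value
max_error = max(abs(a - b) for a, b in zip(features, recovered))
```

FHE schemes compute on integers, which is why floating-point data and labels are mapped to a small integer range before encryption. Fewer bits generally make encrypted computation cheaper at the cost of a larger rounding error.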

You can use the above code to learn a classifier on the well-known breast-cancer dataset. The following code shows how to train batch by batch in order to monitor the model accuracy throughout the training process.

First, download, split and scale the dataset.

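import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the data and hold out 30% for testing, keeping class proportions
X2, y2 = datasets.load_breast_cancer(return_X_y=True)
x2_train, x2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, stratify=y2)

# Scale features to [-1, 1]; fit the scaler on the training set only
scaler = MinMaxScaler(feature_range=(-1, 1))
x2_train = scaler.fit_transform(x2_train)
x2_test = scaler.transform(x2_test)

# Shuffle the training samples and labels with the same permutation
rng = np.random.default_rng()
perm = rng.permutation(x2_train.shape[0])
x2_train = x2_train[perm]
y2_train = y2_train[perm]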

Now, train the classifier by encrypting each batch of data individually and running it through the training algorithm. The model is decrypted after each batch so that it can be evaluated.

clf = SGDClassifier(
    random_state=42,
    max_iter=50,
    fit_encrypted=True,
    warm_start=True,
)

# Accuracy measured after each batch, kept for plotting
accuracies = []

# Go through the training batches (a final partial batch is skipped)
for idx in range(x2_train.shape[0] // clf.batch_size):
    batch_range = range(idx * clf.batch_size, (idx + 1) * clf.batch_size)
    x_batch = x2_train[batch_range, ::]
    y_batch = y2_train[batch_range]

    # Fit on a single batch with partial_fit
    clf.partial_fit(x_batch, y_batch, fhe="execute")

    # Measure accuracy of the model with FHE simulation
    clf.compile(x2_train)
    y_pred_fhe = clf.predict(x2_test, fhe="simulate")
    accuracy = (y_pred_fhe == y2_test).mean()
    accuracies.append(accuracy)
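Because the loop above uses integer division to count batches, any samples left over after the last full batch are never used for training. The sketch below shows that arithmetic in isolation. The helper name is hypothetical, and the sample count (398) and batch size (8) are illustrative assumptions rather than values read from Concrete ML.

```python
def full_batches(n_samples, batch_size):
    """Return (start, stop) index pairs for full batches only."""
    return [
        (i * batch_size, (i + 1) * batch_size)
        for i in range(n_samples // batch_size)
    ]


# Illustrative values: 398 training samples, batches of 8
batches = full_batches(n_samples=398, batch_size=8)

# Index just past the last sample that lands in a full batch
n_used = batches[-1][1]

# Samples left over after the last full batch
n_dropped = 398 - n_used
```

With these numbers the loop runs 49 times and leaves 6 samples unused. Since the training set was shuffled beforehand, the skipped samples are a random subset.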

Plotting the above accuracy after each batch was processed gives the following graph:

The training time for this model is around 11 seconds per batch on a large AWS server. For the entire breast-cancer dataset the time to train the model is 13 minutes. For more results, check out this research paper that will be presented at FHE.org 2024.

Future work

Currently, only the single-user experimentation use case is supported, but the natural use case for training is the client-server setting. Future versions of Concrete ML will add this deployment setting, following the API for deploying encrypted inference.

Collaboration between multiple parties is where training on encrypted data really shines. Using threshold protocols, multiple parties can generate keys that keep their individual data secure while allowing joint training. Furthermore, future versions of Concrete ML will enable encrypted training for more complex models such as neural networks.

Additional links

Star the Concrete ML GitHub repository to endorse our work.
Review the Concrete ML documentation.
Join the Zama Bounty and Grant Program.
Get support on our community channels.
