Privacy-preserving insurance quotes

January 19, 2022

—

Andrei Stoian

Last week we released Concrete Numpy, a Python package that contains the tools data scientists need to compile various NumPy functions into their Fully Homomorphic Encryption (FHE) equivalents.

Today, we will show you how to build an FHE-enabled insurance incident predictor. You can reproduce the results shown here by running the associated notebook in the Concrete Numpy documentation.

Let’s say you’re looking for a new car insurance policy and would like to compare some quotes. You can compare quotes online, but this requires you to fill in some personal information, such as your home address, date of birth, and driver’s license number. This data could be stored by the websites, with your implicit permission buried somewhere in a very long Terms of Service contract. You might be aware of data leaks in which large databases of personal data were obtained by malicious parties and used for identity theft, so you’re quite hesitant to use such a service.

How can technology overcome this issue of trust? The answer is Fully Homomorphic Encryption (FHE), a technology that allows third parties to run applications on your data while it stays encrypted during processing. The service computes the quote without ever seeing your information, and you are the only one who can read the result by decrypting it. The remote service can only store encrypted data, which would be useless to an attacker.


Building a simple insurance incident predictor

Let’s now take an application developer’s perspective: how can I use my insurance quote predictor model on encrypted data?

At Zama, we are developing Concrete Numpy to allow data scientists and developers to train or convert machine learning models into the FHE paradigm. Let’s look at an example for car insurance.

First, we’ll train a model based on some open data from OpenML, following a tutorial from the popular scikit-learn framework. This data set contains 670,000 examples giving the frequency of car accidents for drivers of various ages, past accident history, car type, car color, geographical region, and so on.


Some of the features are continuous, some ordinal, and some nominal. The target feature, which is the frequency of car accidents (ClaimNb/Exposure), has a Poisson distribution.

This suggests that, if we are to model the frequency based on the other features, we should use Poisson regression. This technique is a generalized linear model with a log-link function:


ln(lambda) = w . x + b

Frequency ~ Poisson(lambda)

Let’s first build a model with a single predictor variable, for which we can easily visualize the regression trend line. We will then build a full model with all the predictors. With scikit-learn, training a model only takes a line of code, and with a second line we will predict on the independent test set.


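from sklearn.linear_model import PoissonRegressor

# Fit a Poisson GLM on a single feature, the driver's age
reg = PoissonRegressor(alpha=1e-12, max_iter=300)
reg.fit(df_train["DrivAge"].values.reshape(-1, 1), df_train["Frequency"])
# Predict on the independent test set (test_data holds the same single feature)
predictions = reg.predict(test_data)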


Converting your model to FHE

So, how can you run this model in FHE? It’s simple if you use Concrete Numpy. We first need to quantize the model’s parameters to integers. Our implementation of FHE cannot use floating-point operations, but a floating-point function applied to an integer input can be precomputed into a lookup table. These table lookups are executed with what is known as Programmable Bootstrapping (PBS), a unique FHE bootstrapping approach.
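To make the lookup-table idea concrete, here is a minimal sketch in plain Python. It is not the Concrete Numpy implementation: it assumes a simple uniform quantizer, and the helper names quantize and build_table are hypothetical. The point is only that an n-bit integer can take 2^n values, so any floating-point function of it can be tabulated ahead of time.

```python
import math


def quantize(values, n_bits):
    """Map floats to integers in [0, 2**n_bits - 1] on a uniform grid.

    Hypothetical helper for illustration. Returns the integer codes plus
    the scale and offset needed to map them back to floats.
    """
    lo, hi = min(values), max(values)
    levels = 2**n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def build_table(func, n_bits, scale, offset):
    """Precompute func on every value an n_bits-wide integer can encode."""
    return [func(i * scale + offset) for i in range(2**n_bits)]


# Quantize a few linear-predictor values to 5 bits, then evaluate exp
# purely through table lookups on the integer codes.
linear_part = [-3.0, -2.2, -1.5, -0.4, 0.0]
codes, scale, offset = quantize(linear_part, n_bits=5)
exp_table = build_table(math.exp, 5, scale, offset)
approx = [exp_table[c] for c in codes]
```

Here the lookup runs on clear integers; in the setting described above, the same kind of table is what the PBS evaluates on encrypted values.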

For Poisson regression, prediction requires computing the linear part of the model, w . x + b, and then applying the inverse of the link function. In our case the link function is the natural logarithm, so the inverse link is the exponential function.
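The two prediction steps can be sketched in a few lines of plain Python. The function name poisson_predict and the weights below are made up for illustration; they are not taken from the trained model.

```python
import math


def poisson_predict(w, x, b):
    """Expected frequency of a Poisson GLM with a log link."""
    # Step 1: the linear part, w . x + b
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Step 2: the inverse link, here the exponential
    return math.exp(linear)


# Hypothetical weights for a two-feature model, for illustration only
w = [0.01, -0.5]
b = -2.0
x = [40.0, 1.0]
rate = poisson_predict(w, x, b)
```

Because the exponential is always positive, the predicted frequency is positive whatever the sign of the linear part.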

Using the Generalized Linear Model from Concrete Numpy, converting the model to FHE is easy, but it requires a calibration set, which is used to convert the parameters w from floating point to integers and to build the PBS lookup table.



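import numpy as np

# Calibration data: the DrivAge column as an (n_samples, 1) array
calib_data = np.expand_dims(df_calib["DrivAge"].values, 1)
# Quantize inputs and weights to 5 bits
n_bits = 5
q_glm = QuantizedGLM(n_bits, reg, calib_data)
q_test_data = q_glm.quantize_input(test_data)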

Next, you compile the model to FHE with Concrete Numpy:


engine = q_glm.compile(q_test_data)

And you are ready to predict on encrypted test data:


for i, q_sample in enumerate(q_test_data.qvalues):
    q_pred_fhe = engine.run(q_sample)

‍
Let’s compare the results of the model in the clear, working on floating-point values, with those of the model compiled to FHE, working with 5-bit quantized inputs and parameters:


Analysis of the univariate model

It appears that this univariate model based on age is not a good predictor: the Poisson deviance, which measures the goodness of fit, is quite high. We also note a slight degradation in the Poisson deviance (let’s call it d in the following) with the quantized FHE model. This is because we quantized the inputs and the weights to 5 bits. Finally, we note some noise in the FHE quantized model, which is caused by the encryption scheme.
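For readers unfamiliar with the metric, here is a minimal, unweighted sketch of the mean Poisson deviance in plain Python. It follows the standard definition, with the convention that y * log(y / mu) is 0 when y is 0. The sample values are made up, and the scikit-learn tutorial additionally weights samples by exposure, which this sketch leaves out.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Average Poisson deviance; lower means a better fit.

    Predictions must be strictly positive. Uses the convention
    y * log(y / mu) = 0 when y == 0.
    """
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - y + mu)
    return total / len(y_true)


# Made-up observations, for illustration only
observed = [0.0, 1.0, 0.0, 2.0]
perfect = mean_poisson_deviance([1.0, 2.0], [1.0, 2.0])
constant = mean_poisson_deviance(observed, [0.75] * 4)
```

A perfect prediction gives a deviance of zero, and any mismatch between prediction and observation increases it.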


Building a more complex model

Following the scikit-learn tutorial, we will now build a more complex model that uses all the predictor features. We proceed by transforming the raw features into ones that can be input to a regression model. Thus, the categorical features are transformed into one-hot encoding, and we also bin vehicle and person age. Transforming the data this way, we end up with a total of 57 continuous features (instead of the initial 11).

Here is where we encounter one of the limitations of our framework. We perform a dot product in the prediction, w . x, but in our framework the maximum integer size is, for now, limited to 7 bits. As every multiplication doubles the number of bits of precision of the inputs, performing 57 multiplication-additions of integers to compute w . x would quickly overflow 7 bits.

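We can compute the number of accumulator bits necessary for n dimensions of w and x when their values are represented as unsigned integers in b bits. In the worst case every product equals (2^b - 1)^2, so the dot product is at most n (2^b - 1)^2 and the accumulator needs:

accumulator bits = ceil(log2(n (2^b - 1)^2 + 1))

With 2 bits for the values in w and x, to keep the accumulator at 7 bits, we can afford up to 14 dimensions: 14 x 9 = 126 fits in 7 bits, while 15 x 9 = 135 does not. Luckily we have tools, such as Principal Component Analysis (PCA), to reduce data dimensionality while minimizing the loss of information.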
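The arithmetic behind these numbers can be checked with a short sketch. It assumes unsigned integer values and ignores the bias term; the function names accumulator_bits and max_dims are hypothetical and not part of Concrete Numpy.

```python
def accumulator_bits(n_dims, n_bits):
    """Bits needed to hold a dot product of unsigned n_bits-wide integers."""
    # Worst case: every one of the n_dims products equals (2**n_bits - 1)**2
    max_value = n_dims * (2**n_bits - 1) ** 2
    return max_value.bit_length()


def max_dims(n_bits, acc_bits=7):
    """Largest dimension whose worst-case dot product fits in acc_bits."""
    return (2**acc_bits - 1) // (2**n_bits - 1) ** 2


bits_for_57 = accumulator_bits(57, 2)  # all 57 features at 2 bits
bits_for_14 = accumulator_bits(14, 2)  # 14 features at 2 bits
limit = max_dims(2)                    # most dimensions that fit in 7 bits
```

Under these assumptions, 57 features at 2 bits would need a 10-bit accumulator, while 14 features fit exactly within the 7-bit limit.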

Let’s first perform a PCA to reduce dimensionality from 57 to 14 dimensions, and normalize the transformed features to have zero-mean. We create a scikit-learn pipeline to transform the features, do PCA, then fit the GLM:


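from sklearn.decomposition import PCA
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline

# linear_model_preprocessor is the feature transformer from the scikit-learn tutorial
poisson_glm_pca = Pipeline(
    [
        ("preprocessor", linear_model_preprocessor),
        # Keep 14 components so that the accumulator stays within 7 bits
        ("pca", PCA(n_components=14, whiten=True)),
        ("regressor", PoissonRegressor(alpha=1e-12, max_iter=300)),
    ]
)
poisson_glm_pca.fit(
    df_train,
    df_train["Frequency"],
    regressor__sample_weight=df_train["Exposure"],
)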

Performing PCA in this case very slightly reduces our model’s predictive power:

Now let’s examine the effects of quantization and FHE compilation on the model. Note that when running the compiled FHE program, the results are identical to those of the quantized model in the clear: our FHE compiler does not introduce any loss of precision, thanks to the exact approach we have taken! We’ll now make a graph of the Poisson deviance for multiple quantization bit-widths.

You’ll notice in this graph that the quality of the prediction starts getting worse (the deviance rises) around 3–4 bits. With 14 features we are forced to work in 2 bits. This will improve soon as we improve the quantization tools in Concrete Numpy.

We can now complete the comparison:


Conclusion

In this article, we looked at how you can use Concrete Numpy to convert a scikit-learn-based Poisson regression model to FHE. We have shown that, with the proper choice of pipeline and parameters, we can do the conversion with little loss of precision. The small decrease in prediction quality is due to the quantization of model weights and input data, and some minor noise can appear due to FHE.

All the results of this article can be reproduced by running the associated Python notebook.

Thank you to Andrei Stoian for his contribution to this article.
