# Linear Regression Over Encrypted Data With Homomorphic Encryption

In this tutorial, we show how to create, train, and evaluate a LinearSVR regression model using the Concrete ML library, our open-source, privacy-preserving machine learning framework based on Fully Homomorphic Encryption (FHE).

For the sake of simplicity, we will only consider a single explanatory variable, making it easy to plot its relationship with the target variable.

In order to identify the best set of hyperparameters for the LinearSVR, we perform a grid search on the following:

- $C$: (inverse) strength of the l2 penalization
- $\epsilon$: margin for the support vectors

Please refer to the scikit-learn documentation on LinearSVR for more details.

## Import libraries

We also import some helpers for visualization.

## Generate a data-set

We display the data-set to visualize the data distribution.
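A minimal way to generate such a data-set with a single explanatory variable; the parameter values here are illustrative, not necessarily those used in the tutorial:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# One explanatory variable with additive noise, so the relationship
# with the target can be plotted in 2D
X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (150, 1) (50, 1)
```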

## Identify best set of hyperparameters

**Sklearn LinearSVR.**

Create a scorer with the Mean Squared Error.

Train the scikit-learn LinearSVR model on clear data.

A parameter grid with several values for $\epsilon$ and $C$ is used.
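A sketch of this grid search; the grid values and data below are illustrative assumptions, not the tutorial's exact choices:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=42)

# Lower MSE is better, hence greater_is_better=False
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

param_grid = {
    "C": [0.5, 1.0, 100.0, 100000.0],  # inverse strength of the l2 penalty
    "epsilon": [0.0, 0.5, 1.0],        # margin for the support vectors
}

grid_search = GridSearchCV(
    LinearSVR(max_iter=10_000), param_grid, scoring=mse_scorer, cv=5
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```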

**Concrete ML quantized LinearSVR.**

The typical development flow of a Concrete ML model is the following:

- train the model on clear data, as with any scikit-learn model;
- quantize the trained model into an equivalent integer model;
- compile the quantized model into an FHE circuit;
- run inference on encrypted data.

We use the same grid of parameter values. We also test several values for `n_bits`: `[6, 8, 12]` (see the explanation in the following sections).

As seen in the graph, this fairly simple data-set has only a single feature and few points, and the model itself has only a few parameters in its decision rule. The amount of information to represent as integers is therefore small. As a result, the `n_bits` value has less of an impact on the performance of the model.

You can see that, for the best model, performances with `n_bits` equal to `6`, `8`, or `12` are quite close. Performances only differ for models with `C = 100000.0`, for which the `l2` penalty is weak and the decision rule can adjust more easily to the data.

## Compare sklearn and Concrete ML Quantized best models

**Performance.**

**Hyperparameters.**

**Train with the best hyperparameter set on the complete training data-set.**

## Concrete ML Quantized LinearSVR with FHE

**Prerequisite.**

Some prerequisites should be reviewed before you dive in.

Quantization is a technique that converts continuous data (e.g., 32-bit floating point values) to discrete numbers within a fixed range (in our case, either 6, 8, or 12 bits). This means that some information is lost during the process. However, the larger the integers' range, the smaller the error becomes, making it acceptable in some cases.

To learn more about quantization, please refer to this page.
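To make the idea concrete, here is a minimal uniform-quantization sketch in plain NumPy (not Concrete ML's actual scheme): the maximum round-trip error shrinks as the bit-width grows.

```python
import numpy as np

def quantize_dequantize(x, n_bits):
    """Map floats onto a uniform n_bits integer grid, then back to floats."""
    scale = (x.max() - x.min()) / (2**n_bits - 1)
    q = np.round((x - x.min()) / scale)  # integers in [0, 2**n_bits - 1]
    return q * scale + x.min()

x = np.linspace(-1.0, 1.0, 1000)
errors = {}
for n_bits in [6, 8, 12]:
    errors[n_bits] = float(np.abs(quantize_dequantize(x, n_bits) - x).max())
print(errors)  # maximum error decreases as n_bits increases
```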

Regarding FHE, input data must be represented exclusively as integers, making the use of quantization necessary. A linear model trained on floats is quantized into an equivalent integer model using *Post-Training Quantization*. This operation can lead to a loss of accuracy compared to standard floating point models working on clear data.

In practice, this loss is usually very limited with linear FHE models as they can consider very large integers with up to 50 bits in some cases. This means these models can quantize their inputs and weights over a large number of bits while still considering data-sets containing many features (e.g. 1000). We often observe almost identical performance scores between float, quantized, and FHE models.

To learn more about the relation between the maximum bit-width reached within a model, the bits of quantization used, and the data-set's number of features, please refer to this page.

**Compilation.**

To perform homomorphic inference, we take the above trained quantized model and compile it to get an FHE model.

The compiler requires a set of data covering the input range in order to evaluate the maximum integer bit-width reached within the graph, which must be known before any FHE predictions can run.

You can either provide the entire training data-set or a smaller but representative subset.

**Generate the key.**

The compiler returns a circuit, which can then be used for key generation and predictions. More precisely, it generates:

- a Secret Key, used for the encryption and decryption processes. This key should remain accessible only to the user.
- an Evaluation Key, used to evaluate the circuit on encrypted data. Anyone can access this key without breaching the model's security.

**Now let's predict using the FHE model on encrypted data.**

Notice the `fhe="execute"` argument, which triggers the job under the hood: before the data is sent to the model, it is encrypted with the client's secret key, generated above.

For the comparison below, we also run predictions on the very same data with both:

- the sklearn model;
- the Concrete ML quantized model without FHE.


## Compare

**Display performance spreads.**

We can observe that the scikit-learn and Concrete ML (quantized clear) models output **very close** scores. This demonstrates that the quantization process has a very limited impact on performance.

We can also observe that the performance difference between Concrete ML (quantized clear) and Concrete ML (FHE) is negligible. This demonstrates that the compilation process has a very limited impact on performance. If a more significant difference were observed, `n_bits` could be increased to offer more degrees of freedom during the compilation process.

**Visualize the decision rule.**

In the above graph, you can see that the test data-set contains a point whose `X` value lies outside the range of the training data-set. Since the quantized model was compiled using `X_train`, this `X` value was never seen by the compilation process, and the decision rule generalizes poorly outside the boundaries observed on `X_train`. One way to correct this would be to provide a wider range of values for `X` at model compilation time.

## Conclusion

We have shown how easy it is to train and execute a LinearSVR regression model in FHE using Concrete ML.

We have also discussed the development flow of an FHE model: training, quantization, compilation, and inference.

Prediction performances are quasi-identical. The tiny difference is explained by the quantization process and the compilation of the trained model.

**Additional links**

- Star the Concrete ML GitHub repository to endorse our work.
- Review the Concrete ML documentation.
- Get support on our community channels.
- Learn FHE, help us advance the space and make money with the Zama Bounty Program.