In this tutorial, we will compare the performance of different **Concrete ML** regressors with scikit-learn regressors. Concrete ML regressors have an API that is very similar to that of scikit-learn regressors, with two additional steps:

- compiling the model to FHE
- predicting in FHE

To test Fully Homomorphic Encryption (FHE) regressors, we can use a simulated FHE environment that is much faster than actual FHE execution. Although it does not operate over encrypted data, it is useful for designing and training FHE-compatible regressors, as it lets the user check whether the FHE constraints are met at design time.
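The workflow can be sketched as follows. This is a minimal sketch assuming the scikit-learn-style Concrete ML API (`LinearRegression` with an `n_bits` parameter, `compile()`, and `predict()` with an `fhe` argument, as in recent Concrete ML releases); if the library is not installed, the sketch falls back to plain scikit-learn so it still runs.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=0)

try:
    # Assumed API: Concrete ML mirrors scikit-learn and adds compile()/fhe=
    from concrete.ml.sklearn import LinearRegression

    model = LinearRegression(n_bits=8)
    model.fit(X, y)
    model.compile(X)  # build the FHE circuit from a representative input set
    # "simulate" runs the quantized circuit in the clear: fast, no encryption
    y_pred = model.predict(X, fhe="simulate")
except ImportError:
    # Fallback so the sketch runs without Concrete ML installed
    from sklearn.linear_model import LinearRegression

    model = LinearRegression().fit(X, y)
    y_pred = model.predict(X)

print(f"R2 score: {r2_score(y, y_pred):.3f}")
```

Switching `fhe="simulate"` to `fhe="execute"` would run the same prediction over encrypted data, at a much higher runtime cost.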

### FHE runtime considerations and simulation

Here, a single test sample is executed in FHE to measure the execution time, while the decision function values for the domain grid are computed using FHE simulation. Thus:

- the R2 score reported is computed in simulated mode

This notebook is the regressor version of the following tutorial: Tutorial: Comparison of Concrete ML Classifier.

### Prerequisites

Before diving deep into the topic, it's important to review some prerequisites.

Quantization is a technique that discretizes continuous data, such as floating point numbers, into a fixed range of integers. This process may result in some loss of information, but a larger integer range can reduce error, making it acceptable in some cases.
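As an illustration, here is a minimal uniform quantization sketch in NumPy (not Concrete ML's actual quantizer) showing that a larger integer range shrinks the reconstruction error:

```python
import numpy as np

def quantize(values, n_bits):
    """Uniformly map floats to integers in [0, 2**n_bits - 1] and back."""
    vmin, vmax = values.min(), values.max()
    scale = (vmax - vmin) / (2**n_bits - 1)
    q = np.round((values - vmin) / scale).astype(np.int64)  # integer codes
    return q * scale + vmin  # dequantized approximation

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

for n_bits in (2, 4, 8, 16):
    err = np.abs(x - quantize(x, n_bits)).max()
    print(f"{n_bits:2d} bits -> max error {err:.6f}")
```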

To learn more about quantization, you can refer to this page.

In the context of FHE, input data must be represented exclusively as integers, requiring the use of quantization. As a result:

- For linear models, quantization is performed after training by finding the best integer representations of the weights, based on the input and weight distributions. Users can manually set the n_bits parameter. Linear FHE models can handle large integers of up to 50 bits, enabling quantization of inputs and weights over many bits (e.g., 16) while handling data-sets with many features (e.g., 1000). They therefore typically exhibit minimal quantization loss, with performance scores (e.g., R2 score) close to those of their float counterparts.
- For tree-based models, both training and test data are quantized. A maximum accumulator bit-width of n+1 bits is needed for models trained with n_bits=n. A value of 5 or 6 bits gives the same accuracy as training in floating point, while values above 7 bits do not improve model performance and slow down inference.
- Built-in neural networks use several linear layers and Quantization Aware Training. The maximum accumulator bit-width is controlled by the number of weight and activation bits, as well as a pruning factor. The pruning factor is automatically determined from the desired accumulator bit-width, and a multiplier factor can optionally be specified.
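The linear-model case above can be illustrated with a hypothetical integer dot product: quantize inputs and weights, accumulate in integers, and observe that the accumulator needs roughly the sum of the input and weight bit-widths plus log2 of the feature count. The symmetric quantization scheme below is illustrative, not Concrete ML's actual one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_bits = 1000, 16

x = rng.uniform(-1, 1, n_features)
w = rng.uniform(-1, 1, n_features)

def quantize(v, n_bits):
    """Symmetric quantization onto signed n_bits integers."""
    scale = np.abs(v).max() / (2 ** (n_bits - 1) - 1)
    return np.round(v / scale).astype(np.int64), scale

qx, sx = quantize(x, n_bits)
qw, sw = quantize(w, n_bits)

acc = np.dot(qx, qw)    # integer accumulator, as computed in FHE
approx = acc * sx * sw  # dequantized result
exact = np.dot(x, w)

# Bits actually used by the accumulator, plus a sign bit
bit_width = int(np.ceil(np.log2(abs(acc) + 1))) + 1
print(f"accumulator uses ~{bit_width} bits, error {abs(approx - exact):.2e}")
```

Even with 16-bit inputs and weights and 1000 features, the accumulator stays below the 50-bit limit mentioned above, which is why linear models quantize with little loss.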

To learn more about the relationship between the maximum bit-width reached within a model, the bits of quantization used, and the number of features in the data-set, please refer to this page.

### Regression model with Concrete ML

The development flow for a Concrete ML Regressions model includes the following steps:

- The model is trained on plaintext data, as only FHE inference is supported.
- Depending on the model type, quantization is performed with the associated scheme. The available quantization schemes can be found at the bottom of the notebook.
- The quantized model is compiled into an FHE equivalent in three steps: creating an executable operation graph; checking that the graph is FHE-compatible by verifying the maximum bit-width required to execute the model; and determining the cryptographic parameters needed to generate the secret and evaluation keys. If compilation fails because no suitable parameters can be found, the user can lower the value of n_bits for linear models, or decrease the number of features in the data-set (using techniques such as PCA), and repeat the development flow.
- Inference can be performed on encrypted data.
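For the failure-recovery path mentioned in the flow above, here is a short sketch of reducing the feature count with scikit-learn's PCA before re-running the development flow (the choice of 10 components is arbitrary, for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA

# A dataset with many features, as might exceed the FHE bit-width budget
X, y = make_regression(n_samples=200, n_features=100, random_state=0)

# Project onto fewer components, then retrain, quantize, and recompile
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print(X_reduced.shape)  # (200, 10)
```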

**Neural net-based regressor.**

**Linear regressor.**

### Tree and tree ensemble regressors

## Conclusion

We compared the performance of different regression models using the Concrete ML and scikit-learn libraries in Python. The purpose of this tutorial was to show the effectiveness of regression models in Fully Homomorphic Encryption (FHE) and to provide a comparison between the different models.

We evaluated three types of regression models using the R2 score:

- linear (Support Vector Regressor, Linear Regression)
- neural networks (multi-layer non-linear models)
- tree-based (Decision Tree, Random Forest, XGBoost)

The R2 score of the Concrete ML regressors is reported in simulated mode, as noted above. These regressors work with parameters and inputs that are heavily quantized and can thus show some R2 score loss (especially for XGBRegressor):

- linear regression models: linear regression models in FHE are fast and perform well. They lose very little accuracy to quantization, so their performance is almost identical to that of their fp32 counterparts.
- tree-based regression models: tree-based regression models achieve a good R2 score in both fp32 and quantized FHE mode, thanks to their unique computations. In fact, using a Random Forest model on a polynomial of degree 3 can even improve performance further. However, XGBRegressor is an exception: its quantized performance is not as good as in fp32, because the best hyperparameters were identified for the fp32 model and then applied unchanged to the quantized version, resulting in a lower R2 score.
- neural network regressors: as seen above, neural network regressors perform well in FHE despite heavy quantization, thanks to Quantization Aware Training (QAT) techniques.

One way to reduce the performance gap between FHE and their fp32 counterparts for complex models like neural networks or XGBRegressor, which require more hyperparameter optimization work, is to use the GridSearch method separately on both FHE and fp32 models. This is recommended instead of using the ".fit_benchmark" method, which forces the use of the same hyperparameters on both models. For an example using GridSearch in the context of Concrete ML, please refer to this [page](https://github.com/zama-ai/concrete-ml/blob/release/0.6.x/docs/advanced_examples/LinearRegression.ipynb).
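A separate grid search as suggested above can be sketched with scikit-learn's GridSearchCV; the decision tree model and the parameter grid here are illustrative choices, and the same search would be repeated independently on the FHE model:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Run the same search independently on the fp32 model and on the FHE model
# (only the fp32 scikit-learn tree is shown here)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]},
    scoring="r2",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, f"{grid.best_score_:.3f}")
```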

It should be noted that the number of samples is relatively low for runtime purposes, especially for tree-based models and neural networks, where feature engineering was not performed.

Refer to the classifier version of this notebook.

**Additional links**

- Star the Concrete ML GitHub repository to endorse our work.
- Review the Concrete ML documentation.
- Get support on our community channels.
- Learn FHE, help us advance the space and make money with The Zama Bounty Program.