Linear Regression Over Encrypted Data With Homomorphic Encryption

June 13, 2023
  -  
The Zama Team

× During the second season of The Zama Bounty Program, we asked users to create a tutorial showing how to perform a Linear regression with Concrete ML. Github user Oboulant successfully completed this bounty, and this blog post is based on his contribution.

Learn more about the zama bounty program on Github

In this tutorial, we show how to create, train, and evaluate a LinearSVR regression model using the Concrete ML library, our open-source, privacy-preserving, machine learning framework based on Fully Homomorphic Encryption (FHE).

For the sake of simplicity, we will only consider a single explanatory variable, making it easy to plot its relationship with the target variable.

In order to identify the best set of hyperparameters for the LinearSVR, we perform a grid search on the following:

  • $C$: (inverse) strength of the l2 penalization
  • $e$: margin for the support vectors

Please refer to Scikit-Learn documentation on LinearSVR for more details

Import libraries

import time

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import LinearSVR as SklearnLinearSVR

from concrete.ml.sklearn.svm import LinearSVR as ConcreteLinearSVR

And some helpers for visualization.

%matplotlib inline

import matplotlib.pyplot as plt
from IPython.display import display

train_plot_config = {"c": "black", "marker": "D", "s": 15, "label": "Train data"}
test_plot_config = {"c": "red", "marker": "x", "s": 15, "label": "Test data"}


def get_sklearn_plot_config(mse_score=None):
    label = "scikit-learn"
    if mse_score is not None:
        label += f", {'
'}={mse_score:.4f}"
    return {"c": "blue", "linewidth": 2.5, "label": label}


def get_concrete_plot_config(mse_score=None):
    label = "Concrete-ML"
    if mse_score is not None:
        label += f", {'
'}={mse_score:.4f}"
    return {"c": "orange", "linewidth": 2.5, "label": label}

Generate a data-set

# Load the diabetes data-set
X, y = load_diabetes(return_X_y=True)
# Use only one feature for educational purpose
X = X[:, np.newaxis, 2]

# We split the data-set into a training and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=23)

# We sort the test set for a better visualization
sorted_indexes = np.argsort(np.squeeze(X_test))
X_test = X_test[sorted_indexes, :]
y_test = y_test[sorted_indexes]

We display the data-set to visualize the data distribution.

plt.ioff()

plt.clf()
fig, ax = plt.subplots(1, figsize=(10, 5))
fig.patch.set_facecolor("white")
ax.scatter(X_train, y_train, **train_plot_config)
ax.scatter(X_test, y_test, **test_plot_config)
ax.legend()
display(fig)

Identify best set of hyperparameters

Sklearn LinearSVR.

Create scorer with the Mean Squared Error.

grid_scorer = make_scorer(mean_squared_error, greater_is_better=False)

Train the scikit-learn LinearSVR model on clear data.

A parameter grid with several values for $e$ and $C$ is used.

param_grid = {
    "epsilon": [0.0, 1.0, 10.0, 20.0],
    "C": [0.1, 100.0, 10000.0, 100000.0],
}

sklearn_rgs = SklearnLinearSVR()
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=13)

gs_sklearn = GridSearchCV(
    sklearn_rgs,
    param_grid,
    cv=kfold_cv,
    scoring=grid_scorer,
    verbose=1,
).fit(X_train, y_train)
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Concrete ML quantized LinearSVR.

The typical development flow of a Concrete ML model is the following:

  • The model is trained on clear (plaintext) data, as only FHE inference is currently supported.
  • The resulting trained model is quantized using a [.c-inline-code]n_bits[.c-inline-code] parameter set by the user (see documentation here). This parameter can either be:
    1. a dictionary composed of [.c-inline-code]op_inputs[.c-inline-code] and [.c-inline-code]op_weights[.c-inline-code] keys. These parameters are given as integers representing the number of bits over which the associated data should be quantized.
    2. an integer, representing the number of bits over which each input and weight should be quantized. Default is 8. We try several values to test the various precisions gained for quantization.
  • The quantized model is compiled to an FHE-equivalent, following 3 steps:
    1. Create an executable operation graph.
    2. Check that the op-graph is FHE-compatible by checking the maximum bit-width needed to execute the model.
    3. Determine cryptographic parameters that will help to generate the secret keys and evaluation keys. If no parameters can be found, the compilation process can't be completed and an error is thrown. The user then can either lower the value(s) chosen for [.c-inline-code]n_bits[.c-inline-code] or decrease the number of features found in the data-set (using techniques such as PCA) and run the development flow once again.
  • WInference can then be done on encrypted data (using [.c-inline-code]fhe="execute"[.c-inline-code] when calling [.c-inline-code].predict()[.c-inline-code] method. See below)

We use the same grid of parameter values. We test several values for [.c-inline-code]n_bits[.c-inline-code] : [.c-inline-code][6, 8, 12][.c-inline-code] (see explanation in following sections).

param_grid = {
    "n_bits": [6, 8, 12],
    "epsilon": [0.0, 1.0, 10.0, 20.0],
    "C": [0.1, 100.0, 10000.0, 100000.0],
}

concrete_rgs = ConcreteLinearSVR()

gs_concrete = GridSearchCV(
    concrete_rgs,
    param_grid,
    cv=kfold_cv,
    scoring=grid_scorer,
    verbose=1,
).fit(X_train, y_train)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
plt.ioff()

results_df = pd.DataFrame(gs_concrete.cv_results_)

fig, ax = plt.subplots(1, figsize=(12, 8))
(l1,) = ax.plot(
    np.arange(16), -results_df.loc[results_df["param_n_bits"] == 6, "mean_test_score"], "-o"
)
(l2,) = ax.plot(
    np.arange(16), -results_df.loc[results_df["param_n_bits"] == 8, "mean_test_score"], "-o"
)
(l3,) = ax.plot(
    np.arange(16), -results_df.loc[results_df["param_n_bits"] == 12, "mean_test_score"], "-o"
)
ax.legend((l1, l2, l3), ("n_bits = 6", "n_bits = 8", "n_bits = 12"), loc="upper right", shadow=True)
ax.set_xlabel("Different models with fixed values of C and epsilon")
ax.set_ylabel("Mean MSE accros CV folds")
ax.set_title("Impact of `n_bits` on Cross Validation performances")
display(fig)

As seen in the graph, this fairly simple data-set has only a single feature and few points, in addition to a fairly simple model with only few parameters for the decision rule. The complexity of the information to be represented as integers is not huge. This results in [.c-inline-code]n_bits[.c-inline-code] values having less of an impact on the perfomance of the model.

You can see that for the best model, performances with [.c-inline-code]n_bits[.c-inline-code] equal to [.c-inline-code]6[.c-inline-code], [.c-inline-code]8[.c-inline-code] or [.c-inline-code]12[.c-inline-code] are quite close. Performances differ for models with [.c-inline-code]C = 100000.0[.c-inline-code], meaning the [.c-inline-code]l2[.c-inline-code] penalty is weak and the decision rule can adjust more easily to the data.

Compare sklearn and Concrete ML Quantized best models.

Performance.

# Print mean time fit and std time fit for both models
print(
    f"Mean time fit sklearn: {np.mean(gs_sklearn.cv_results_['mean_fit_time']):.3f}s,"
    f" std time fit sklearn: {np.std(gs_sklearn.cv_results_['mean_fit_time']):.3f}s"
)
print(
    f"Mean time fit concrete: {np.mean(gs_concrete.cv_results_['mean_fit_time']):.3f}s,"
    f"std time fit concrete: {np.std(gs_concrete.cv_results_['mean_fit_time']):.3f}s"
)

# Print best score for both models
print(f"Best MSE score sklearn: {-gs_sklearn.best_score_:.2f}")
print(f"Best MSE score concrete: {-gs_concrete.best_score_:.2f}")
Mean time fit sklearn: 0.003s, std time fit sklearn: 0.003s

Mean time fit concrete: 0.061s,std time fit concrete: 0.008s
Best MSE score sklearn: 4073.18
Best MSE score concrete: 4094.31

Hyperparameters.

# Get best hyperparameters out of gs_concrete
best_params_concrete = gs_concrete.best_params_
print(f"Best parameters for Concrete: {best_params_concrete}")
best_params_sklearn = gs_sklearn.best_params_
print(f"Best parameters for Sklearn: {best_params_sklearn}")
Best parameters for Concrete: {'C': 10000.0, 'epsilon': 20.0, 'n_bits': 12}

Best parameters for Sklearn: {'C': 10000.0, 'epsilon': 20.0}

Train with best hyperparameter set on the complete training data-set.

# Train concrete and sklearn LinearSVR with best hyper parameters
concrete_rgs = ConcreteLinearSVR(**best_params_concrete)

concrete_rgs, sklearn_rgs = concrete_rgs.fit_benchmark(X_train, y_train)

Concrete ML Quantized LinearSVR with FHE

Prerequisite.

Some prerequisites should be reviewed before you dive in.

Quantization is a technique that converts continuous data (floating point, e.g., in 32-bits) to discrete numbers within a fixed range (in our case either 6, 8, or 12 bits). This means that some information is lost during the process. However, the larger the integers' range, the smaller the error becomes, making it acceptable in some cases.

To learn more about quantization, please refer to this page.

Regarding FHE, the input data type must be represented exclusively as integers, making the use of quantization necessary. A linear model trained on floats is quantized into an equivalent integer model using Post-Training Quantization. This operation can lead to a loss of accuracy compared to the standard floating point models working on clear data.

In practice, this loss is usually very limited with linear FHE models as they can consider very large integers with up to 50 bits in some cases. This means these models can quantize their inputs and weights over a large number of bits while still considering data-sets containing many features (e.g. 1000). We often observe almost identical performance scores between float, quantized, and FHE models.

To learn more about the relation between the maximum bit-width reached within a model, the bits of quantization used, and the data-set's number of features, please refer to this page.

Compilation.

To perform homomorphic inference, we take the above trained quantized model and we compile it to get a FHE model.

The compiler requires an exhaustive set of data to evaluate the maximum integer bit-width within the graph, which is needed during FHE computations before running any predictions.

You can either provide the entire trained data-set or a smaller but representative subset.

# Compile the model using the training data
circuit = concrete_rgs.compile(X_train)

Generate the key.

The compiler returns a circuit, which can then be used for key generation and predictions. More precisely, it generates:

  • a Secret Key, used for the encryption and decryption processes. This key should remain accessible only to the user.
  • an Evaluation Key, used to evaluate the circuit on encrypted data. Anyone can access this key without breaching the model's security.
# Generate the key
print(f"Generating a key for an {circuit.graph.maximum_integer_bit_width()}-bit circuit")
Generating a key for an 13-bit circuit
time_begin = time.time()
circuit.client.keygen(force=False)
print(f"Key generation time: {time.time() - time_begin:.2f} seconds")
Key generation time: 0.00 seconds

Now let's predict using the FHE model on encrypted data.

Please notice [.c-inline-code]fhe="execute"[.c-inline-code], which creates the job under the hood: Before the data is sent to be applied to the model, it is encrypted with the client secret key, generated above.

As for future comparison below, we also predict on the very same data on both:

  • the sklearn model;
  • the Concrete ML quantized model without FHE.
# Now predict using the FHE-quantized model on the testing set
time_begin = time.time()
y_pred_fhe = concrete_rgs.predict(X_test, fhe="execute")
print(f"Execution time: {(time.time() - time_begin) / len(X_test):.4f} seconds per sample")

# Now predict using the Sklearn model on the testing set
time_begin = time.time()
y_pred_sklearn = sklearn_rgs.predict(X_test)
print(f"Execution time: {(time.time() - time_begin) / len(X_test):.4f} seconds per sample")

# Now predict using clear quantized Concrete-ML model on testing set
time_begin = time.time()
y_preds_quantized = concrete_rgs.predict(X_test)
print(f"Execution time: {(time.time() - time_begin) / len(X_test):.4f} seconds per sample")
Execution time: 0.0003 seconds per sample

Execution time: 0.0000 seconds per sample
Execution time: 0.0000 seconds per sample


Compare

Display performance spreads.

# Print all MSE a string to explain

mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
mse_clear = mean_squared_error(y_test, y_preds_quantized)
mse_fhe = mean_squared_error(y_test, y_pred_fhe)

print(
    f"Clear FP32 sklearn model MSE: {mse_sklearn:.3f}\n"
    f"Clear quantized model MSE: {mse_clear:.3f}\n"
    f"FHE model MSE: {mse_fhe:.3f}"
)

# Measure the error of the FHE-quantized model with respect to quantized clear Concrete ML model
concrete_score_difference = abs(mse_fhe - mse_clear) * 100 / mse_clear
print(
    "\nRelative difference between Concrete-ml (quantized clear) and Concrete-ml (FHE) scores:",
    f"{concrete_score_difference:.2f}%",
)


# Measure the error of the FHE quantized model with respect to the sklearn float model
score_difference = abs(mse_fhe - mse_sklearn) * 100 / mse_sklearn
print(
    "Relative difference between scikit-learn (clear) and Concrete-ml (FHE) scores:",
    f"{score_difference:.2f}%",
)
Clear FP32 sklearn model MSE: 3895.391

Clear quantized model MSE: 3884.349
FHE model MSE: 3884.349
Relative difference between Concrete-ml (quantized clear) and Concrete-ml (FHE) scores: 0.00%
Relative difference between scikit-learn (clear) and Concrete-ml (FHE) scores: 0.28%

We can observe that scikit-learn and Concrete ML (quantized clear) models output very close scores. This demonstrates how the quantization process has a very limited impact on performances.

We can observe as well that the performance difference between Concrete ML (quantized clear) and Concrete ML (FHE) is quasi null. This demonstrates that the compilation process has a very limited impact on performance. If one observed a more significant difference, then n_bits can be increased to offer more degrees of freedom during the compilation process.

Visualize the decision rule.

# We densify the space representation of the original X,
# to better visualize the resulting step function in the following figure
x_space = np.linspace(X_test.min(), X_test.max(), num=300)
x_space = x_space[:, np.newaxis]
y_pred_q_space = concrete_rgs.predict(x_space)
plt.ioff()

plt.clf()
fig, ax = plt.subplots(1, figsize=(12, 8))
fig.patch.set_facecolor("white")
ax.scatter(X_train, y_train, **train_plot_config)
ax.scatter(X_test, y_test, **test_plot_config)
ax.plot(X_test, y_pred_sklearn, **get_sklearn_plot_config(mse_sklearn))
ax.plot(x_space, y_pred_q_space, **get_concrete_plot_config(mse_clear))
ax.legend()
display(fig)

In the above graph, you can see that the test data-set has a point for which [.c-inline-code]X[.c-inline-code] value is outside the range of the trained data-set. Since, when we compiled the quantized model we used [.c-inline-code]X_train[.c-inline-code], this [.c-inline-code]X[.c-inline-code] value in the test data-set was not seen by the compilation process, the decision rule poorly generalizes on those values outside the boundaries observed on [.c-inline-code]X_tain[.c-inline-code]. An alternative to correct this would be to give at model compilation time, a wider range of values for [.c-inline-code]X[.c-inline-code].

Conclusion

We have shown how easy it is to train and execute a LinearSVR regression model in FHE using Concrete ML.

We have also discussed the development flow of an FHE model: training, quantization, compilation, and inference.

Prediction performances are quasi identical. The tiny difference is explained by the quantization process and the compilation of the trained model.

Additional links

Read more related posts

Encrypted Key-value Database Using Homomorphic Encryption

How to create an encrypted key-value database using Fully Homomorphic Encryption.

Read Article

Encrypted Image Filtering Using Homomorphic Encryption

This tutorial introduces a way to apply image filters, such as the use of blurring and sharpening tools, on encrypted images.

Read Article

[Video tutorial] How To Convert a scikit-learn Model Into Its Homomorphic Equivalent

This video tutorial shows you how easy it is to turn a scikit-learn model into its FHE equivalent

Read Article