Data normalization is a common data pre-processing technique. It refers to scaling data values so that they fall within a specific range (typically -1 to 1, or 0 to 1). This step often comes before ML model fitting and helps to remove bias between features measured on different scales.
Let's take an example. Imagine we have a dataset of houses, with columns including price, number of bedrooms, size in square meters and so on. These attributes have values of very different magnitudes: the house price might range from 100,000 to 2 million euros, while the number of bedrooms might range from 1 to 5. This can bias a model into giving features with larger values more weight than features with smaller values, which can hurt its performance.
There are various scaling methods which solve this problem by scaling all values to the same range and in BastionLab we have implemented some of the most common methods for you: z-score or standard scaling (also referred to as standardization), min/max scaling (also referred to as normalization), mean scaling, maximum absolute scaling and median and quantile scaling.
In this tutorial, we are going to take a look at each of these methods and how to use them.
Pre-requisites¶
Installation¶
In order to run this notebook, we need to:
- Have Python 3.7 (or greater) and Python Pip already installed
- Install BastionLab
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# install pip packages
!pip install bastionlab
!pip install bastionlab_server
Launch and connect to the server¶
# launch bastionlab_server test package
import bastionlab_server
srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
# connecting to the server
from bastionlab import Connection
connection = Connection("localhost")
client = connection.client
Create and upload the dataframe to the server¶
For this tutorial, we'll create a dataset with four columns and rows of numerical values, including a column of floats.
import polars as pl
# create Polars DataFrame
df1 = pl.DataFrame(
{
"Col A": [180000, 360000, 230000, 60000],
"Col B": [110, 905, 230, 450],
"Col C": [18.9, 23.4, 14.0, 13.5],
"Col D": [1400, 1800, 1300, 1500],
}
)
We'll quickly upload the dataset to the server with an open safety policy. This will allow us to print out the dataset in full for demonstration purposes without having to approve any data access requests. You can check out how to define a safe privacy policy here.
from bastionlab.polars.policy import Policy, TrueRule, Log
# create custom open policy
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
# send Polars DataFrame with policy and get RemoteLazyFrame instance back
rdf = client.polars.send_df(df1, policy=policy)
Scaling methods¶
Z-score (standardization) scaling¶
Applying Z-score scaling to data is a common practice before training ML algorithms on a dataset. The z-score method (also called standardization) rescales the data by subtracting the mean from all the data points and then dividing the result by the standard deviation of the data.
Mathematically, this is written as: x_scaled = (x - u) / s, where u refers to the mean value and s refers to the standard deviation of the data.
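To make the formula concrete, here is the same computation sketched locally in plain NumPy on the values of Col A from our dataset. Note that reproducing the server's output requires the sample standard deviation (ddof=1); that detail is our assumption based on the results shown later in this section, not documented API behavior.

```python
import numpy as np

# values from "Col A" of our example dataset
col_a = np.array([180000, 360000, 230000, 60000], dtype=float)

# z-score: subtract the mean, then divide by the standard deviation
# (ddof=1 gives the sample standard deviation)
z_scores = (col_a - col_a.mean()) / col_a.std(ddof=1)
print(z_scores.round(6).tolist())  # → [-0.221422, 1.227884, 0.181163, -1.187625]
```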
We can use the z-score method in BastionLab by calling the `zscore_scale` method on our RemoteLazyFrame, passing the name of a column (or a list of column names) to be scaled as the only argument. If we want to scale all columns, we can pass the `rdf.columns` attribute to quickly get a list of all the columns in our dataset.
# create list of columns to scale
columns = ["Col A", "Col B", "Col C", "Col D"]
# scale using z-score scaler and display data
z = rdf.zscore_scale(columns)
# Collecting the resulting RemoteDataFrame
z.collect().fetch()
| Col A | Col B | Col C | Col D |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| -0.221422 | -0.895492 | 0.311486 | -0.46291 |
| 1.227884 | 1.373564 | 1.278167 | 1.38873 |
| 0.181163 | -0.552993 | -0.741122 | -0.92582 |
| -1.187625 | 0.074922 | -0.848531 | 0.0 |
We can visualize the scaled data using our barplot method.
But first we need to manipulate the data to get it into the right format for the visualization.
The aim is to see each column represented by a different color and group them by their row. This means we need to convert our table to one with:
- a `Row` column, listing the original row each value was part of (0, 1, 2 or 3),
- a `Column` column, listing which column each value originally belonged to (Col A, Col B, Col C or Col D),
- a `Value` column with the values we want to display.
So we should end up with a table like this:
```
Row   Column    Value
0     "Col A"   -0.221422
1     "Col A"   1.227884
2     "Col A"   0.181163
3     "Col A"   -1.187625
...   ...       ...
```
To do this, we will first add an index using the `with_row_count()` method, specifying the name of this new column as `Row`. Then we'll use the `melt()` function, with our new `Row` column as the `id_vars` index. The `value_vars`, whose values will end up in our `Value` column, are the four original columns: Col A, Col B, Col C and Col D.
Once we've done all the formatting, we can call the `barplot` method with our `Row` column as the `x` value, our `Value` column as the `y` value and, importantly, our `Column` column as the `hue` value.
# transform scaled data into table suited for barplot
z = z.with_row_count("Row").melt(
id_vars="Row",
value_vars=["Col A", "Col B", "Col C", "Col D"],
variable_name="Column",
value_name="Value",
)
z.barplot(x="Row", y="Value", hue="Column")
Min/max (normalization) scaling¶
The min/max method (often called normalization) rescales the feature to a range of [0,1] by subtracting the overall minimum value of the data and then dividing the result by the difference between the minimum and maximum values.
The min/max method can be written as: x_scaled = (x − min(x)) / (max(x) − min(x))
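As a quick sanity check, the same arithmetic can be sketched locally with plain NumPy on the values of Col A (this just illustrates the formula above, it is not BastionLab's API):

```python
import numpy as np

# values from "Col A" of our example dataset
col_a = np.array([180000, 360000, 230000, 60000], dtype=float)

# min/max scaling: shift by the minimum, divide by the range
scaled = (col_a - col_a.min()) / (col_a.max() - col_a.min())
print(scaled.round(6).tolist())  # → [0.4, 1.0, 0.566667, 0.0]
```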
The min/max method is commonly used for data scaling where the maximum and minimum values for data points are known in advance - for example, image pixels with values between 0 and 255.
We can apply min/max normalization to one or multiple columns by calling the `minmax_scale` method on our `RemoteLazyFrame` with either the string name of one column or a list of string names of multiple columns.
# scale using min/max scaler and display data
mm = rdf.minmax_scale(rdf.columns)
mm.collect().fetch()
| Col A | Col B | Col C | Col D |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| 0.4 | 0.0 | 0.545455 | 0.2 |
| 1.0 | 1.0 | 1.0 | 1.0 |
| 0.566667 | 0.150943 | 0.050505 | 0.0 |
| 0.0 | 0.427673 | 0.0 | 0.4 |
We will again visualize this as a barplot (you can check the Z-score section to get more details on the code below).
# transform scaled data into table suited for barplot
mm = mm.with_row_count("Row").melt(
id_vars="Row",
value_vars=["Col A", "Col B", "Col C", "Col D"],
variable_name="Column",
value_name="Value",
)
mm.barplot(x="Row", y="Value", hue="Column")
Maximum absolute scaling¶
Maximum absolute scaling rescales each feature to the range [-1, 1] by dividing each data point by the maximum absolute value of that feature.
It can be written as: x_scaled = x / max(|x|)
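The formula can be sketched locally in plain NumPy; here on the values of Col B from our dataset (again, this only illustrates the arithmetic, not BastionLab's API):

```python
import numpy as np

# values from "Col B" of our example dataset
col_b = np.array([110, 905, 230, 450], dtype=float)

# maximum absolute scaling: divide by the largest absolute value
scaled = col_b / np.abs(col_b).max()
print(scaled.round(6).tolist())  # → [0.121547, 1.0, 0.254144, 0.497238]
```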
We can apply maximum absolute scaling to one or multiple columns by calling the `max_abs_scale` method on our `RemoteLazyFrame` with either the string name of one column or a list of string names of multiple columns.
# scale using maximum absolute scaler and display data
ma = rdf.max_abs_scale(columns)
ma.collect().fetch()
| Col A | Col B | Col C | Col D |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| 0.5 | 0.121547 | 0.807692 | 0.777778 |
| 1.0 | 1.0 | 1.0 | 1.0 |
| 0.638889 | 0.254144 | 0.598291 | 0.722222 |
| 0.166667 | 0.497238 | 0.576923 | 0.833333 |
Let's visualize this again with barplot (you can check the Z-score section to get more details on the code below).
# transform scaled data into table suited for barplot
ma = ma.with_row_count("Row").melt(
id_vars="Row",
value_vars=["Col A", "Col B", "Col C", "Col D"],
variable_name="Column",
value_name="Value",
)
ma.barplot(x="Row", y="Value", hue="Column")
Mean scaling¶
Mean scaling is the same as min/max scaling, except that it is the mean value that is subtracted from data points, rather than the minimum value. Like with min/max scaling, mean scaling is commonly used where the maximum and minimum values for data points are known in advance.
The formula for mean scaling is: x_scaled = (x − mean(x)) / (max(x) − min(x))
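As with the previous methods, the arithmetic can be sketched locally in plain NumPy on the values of Col A to see what the formula does:

```python
import numpy as np

# values from "Col A" of our example dataset
col_a = np.array([180000, 360000, 230000, 60000], dtype=float)

# mean scaling: subtract the mean, divide by the range
scaled = (col_a - col_a.mean()) / (col_a.max() - col_a.min())
print(scaled.round(6).tolist())  # → [-0.091667, 0.508333, 0.075, -0.491667]
```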
We can apply mean scaling to one or multiple columns by calling the `mean_scale` method on our `RemoteLazyFrame` with either the string name of one column or a list of string names of multiple columns.
# scale using mean scaler and display data
mean = rdf.mean_scale(columns)
mean.collect().fetch()
| Col A | Col B | Col C | Col D |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| -0.091667 | -0.394654 | 0.146465 | -0.2 |
| 0.508333 | 0.605346 | 0.60101 | 0.6 |
| 0.075 | -0.243711 | -0.348485 | -0.4 |
| -0.491667 | 0.033019 | -0.39899 | 0.0 |
Now, let's visualize the result of this with a barplot (you can check the Z-score section to get more details on the code below).
# transform scaled data into table suited for barplot
mean = mean.with_row_count("Row").melt(
id_vars="Row",
value_vars=["Col A", "Col B", "Col C", "Col D"],
variable_name="Column",
value_name="Value",
)
mean.barplot(x="Row", y="Value", hue="Column")
Median and quantile (robust) scaling¶
The final scaling method we provide is median and quantile scaling (also known as robust scaling). This involves subtracting the median value from data points and dividing the result by the IQR (inter-quartile range).
The formula for this method can be written as: x_scaled = (x − median(x)) / (Q3 − Q1)
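Here is a local NumPy sketch of the formula on the values of Col A. One caveat: the quantile interpolation method used by the server is our assumption; the "higher" interpolation mode reproduces the output shown in this section, but other modes would give different IQR values.

```python
import numpy as np

# values from "Col A" of our example dataset
col_a = np.array([180000, 360000, 230000, 60000], dtype=float)

# median and quantile scaling: subtract the median, divide by the IQR.
# "higher" interpolation is an assumption on our part; it matches the
# output shown in this tutorial.
q1 = np.quantile(col_a, 0.25, method="higher")
q3 = np.quantile(col_a, 0.75, method="higher")
scaled = (col_a - np.median(col_a)) / (q3 - q1)
print(scaled.round(6).tolist())  # → [-0.138889, 0.861111, 0.138889, -0.805556]
```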
This method is commonly used for datasets with large numbers of outliers.
To use this method, as with the other scaling methods, we call the relevant method, `median_quantile_scale`, on our `RemoteLazyFrame` with either the string name of one column or a list of string names of multiple columns!
# scale using robust scaler and display data
mq = rdf.median_quantile_scale(columns)
mq.collect().fetch()
| Col A | Col B | Col C | Col D |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| -0.138889 | -0.340741 | 0.260638 | -0.125 |
| 0.861111 | 0.837037 | 0.739362 | 0.875 |
| 0.138889 | -0.162963 | -0.260638 | -0.375 |
| -0.805556 | 0.162963 | -0.31383 | 0.125 |
Once again, let's visualize the result of this with a barplot (you can check the Z-score section to get more details on the code below).
# transform scaled data into table suited for barplot
mq = mq.with_row_count("Row").melt(
id_vars="Row",
value_vars=["Col A", "Col B", "Col C", "Col D"],
variable_name="Column",
value_name="Value",
)
mq.barplot(x="Row", y="Value", hue="Column")
This brings us to the end of this introduction to BastionLab's normalization features. We can now close our connection with the server and stop the server instance.
# close connection
connection.close()
# stop server
bastionlab_server.stop(srv)