Data cleaning

Data cleaning is a crucial first step in the data anlysis which ensures data is ready to be analysed.

To illustrate just how important this step is, imagine you run a query to get the total Covid-19 cases in 'France' in 2022 without realising that some rows relating to France have been labelled 'france' with a lower case 'f', while the rest have been labelled as 'France' with an upper case 'F'. You could end up with a completely inaccurate result which could easily have been avoided by replacing 'france' labels with the upper case 'France' during the data cleaning process.

This is why we have made sure to integrate a wide range of data cleaning methods, all while you work with dataframes remotely with safety guarantees. As well as these find-and-replace operations, other common steps of data cleaning include adding or removing certain columns, filling null values or dropping them and changing data types. We'll show you how to do them all in BastionLab with this tutorial.

Let's dive in!

Pre-requisites¶

Installation and dataset¶

In order to run this notebook, we need to:

Have Python3.7 (or greater) and Python Pip installed
Install BastionLab
Download the dataset we will be using in this tutorial.

We'll do so by running the code block below.

If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.

In [ ]:

            
                Copied!
                
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# downloading the modified Titanic dataset
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# downloading the modified Titanic dataset
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'

Our dataset is based on the Titanic dataset, one of the most popular datasets used for understanding machine learning which contains information relating to the passengers aboard the Titanic. However, it has been modified by data scientist XX to contain some values that need cleaning up before we can start running queries!

Launch and connect to the server¶

In [ ]:

            
                Copied!
                
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.

In [1]:

            
                Copied!
                
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Upload the dataframe to the server¶

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy here.

In [3]:

            
                Copied!
                
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

Out[3]:

FetchableLazyFrame(identifier=1d829896-7df7-4c32-a8ad-df34128c5a6f)

Important!

This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [4]:

            
                Copied!
                
rdf.columns
rdf.columns

Out[4]:

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked',
 'OnboardTime']

We're good to go. Let's get started!

Dropping columns¶

Let's imagine that we don't need the column "Fare". You can drop the column by using RemoteLazyFrame's drop() method, which takes the name of a column or a list of column names as a parameter and returns a RemoteLazyFrame which no longer includes this/those columns.

In [5]:

            
                Copied!
                
rdf = rdf.drop("Fare")
rdf.columns
rdf = rdf.drop("Fare")
rdf.columns

Out[5]:

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Cabin',
 'Embarked',
 'OnboardTime']

As you can see, this fare is now longer in our rdf RemoteDataFrame instance.

Cleaning null values¶

The next problem we want to address is null values in the database. First of all, let's see how many null values we have in the "Age".

We can do this by selecting the column 'Age' and using the is_null() method. This will output a RemoteLazyFrame with each original cell in the 'Age' column that was null now replaced by the value 1, and each original cell that was not null replacing by the value 0. We then use sum() to count up all these 1 values, which gives us the total null values in the 'Age' column.

Note - To access and show the data in this RemoteDataFrame, we always need to use the collect() and fetch() methods. In this case, this will trigger a request for the data owner's approval: please respond to this request by inputting 'y' in your terminal running the docker image to accept this request.

In [6]:

            
                Copied!
                
total_nulls = rdf.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()
total_nulls = rdf.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Out[6]:

shape: (1, 1)

Age
u32
178

Replacing null values¶

Now that we know there are 178 null values in the 'Age' column, we can ask the next question: what do we want to do with them?

One method to deal with null values would be to replace them with another value. To do that, we can use the fill_null() function to replace all null 'Age' cells with the value 100.

To verify this has worked, we'll check how many null values we have in our new RemoteLazyFrame instance called swap:

In [7]:

            
                Copied!
                
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Out[7]:

shape: (1, 1)

Age
u32
0

The result is 0!

Let's also check how many cells contain 100. We can do this by filtering the values in 'Age' down to those equal to 100 and then counting all the cells in that column.

In [8]:

            
                Copied!
                
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())

total_100s.collect().fetch()
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())

total_100s.collect().fetch()

Out[8]:

shape: (1, 1)

Age
u32
178

The output is, of course, 178!

Converting column types¶

As you may have noticed, our "Age" column contains strings, not integers. If we wanted to change that, we could use the .cast() method with strict set to False to convert out string values to numerical ones!

In [9]:

            
                Copied!
                
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))

total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())

total_num_100s.collect().fetch()
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))

total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())

total_num_100s.collect().fetch()

Out[9]:

shape: (1, 1)

Age
u32
178

Deleting null values¶

Another method for handling null values is... to delete them! We can do this by using RemoteLazyFrame's drop_nulls() method.

In [10]:

            
                Copied!
                
drop = rdf.drop_nulls()

total_nulls = drop.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()
drop = rdf.drop_nulls()

total_nulls = drop.select(pl.col("Age").is_null().sum())

total_nulls.collect().fetch()

Out[10]:

shape: (1, 1)

Age
u32
0

As you can see our drop instance of the original rdf RemoteLazyFrame now also has zero null values.

Cleaning near-matching labels¶

We'll now handle near-matching labels. We can filter down a column or a whole dataset to contain only unique cells by using the unique() method:

In [11]:

            
                Copied!
                
df = rdf.unique()
df.select(pl.col("Sex")).collect().fetch()
df = rdf.unique()
df.select(pl.col("Sex")).collect().fetch()

Out[11]:

shape: (7, 1)

Sex
str
"male"
"female"
"m"
"m "
"M"
"F"
"f"

As you can see, only unique values are left - since they all have some sort of variation. But this leads us to our final cleaning topic: how can we map all alternative forms for one value, like "m", "M" and "male" for male, to one same value?

One way to achieve this is using a polar's "when-then-otherwise" statement to replace alternative forms of "male" and "female" with one chosen form.

In [12]:

            
                Copied!
                
                    
                    
                
                

        
new_rdf = (
    df.select(
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf
new_rdf = (
    df.select(
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf

Out[12]:

shape: (7, 1)

literal
str
"male"
"female"
"male"
"male"
"male"
"female"
"female"

All our variations have been replaced by one unique option!

You now know how to clean your data using BastionLab. All that's left to do now is to close your connection to the server and stop the server:

In [13]:

            
                Copied!
                
# connection.close()
# bastionlab_server.stop(srv)
# connection.close()
# bastionlab_server.stop(srv)