Data cleaning is a crucial first step in data analysis: it ensures the data is ready to be analysed.
To illustrate just how important this step is, imagine you run a query to get the total Covid-19 cases in 'France' in 2022 without realising that some rows relating to France have been labelled 'france' with a lower case 'f', while the rest have been labelled as 'France' with an upper case 'F'. You could end up with a completely inaccurate result which could easily have been avoided by replacing 'france' labels with the upper case 'France' during the data cleaning process.
This is why BastionLab integrates a wide range of data cleaning methods that you can use while working with dataframes remotely, with safety guarantees. Alongside find-and-replace operations like the one above, other common data cleaning steps include adding or removing columns, filling or dropping null values, and changing data types. We'll show you how to do all of them in BastionLab in this tutorial.
Let's dive in!
Installation and dataset
In order to run this notebook, we need to:
- Have Python 3.7 (or greater) and pip installed
- Install BastionLab
- Download the dataset we will be using in this tutorial.
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
```python
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the modified Titanic dataset
!wget 'https://raw.githubusercontent.com/chingjunetao/medium-article/master/simple-guide-to-data-cleaning/modified_titanic_data.csv'
```
Our dataset is based on the Titanic dataset, one of the most popular datasets for learning machine learning, which contains information about the passengers aboard the Titanic. However, it has been modified by data scientist XX to contain some values that need cleaning up before we can start running queries!
```python
# launch the bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
```
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
```python
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
```
```python
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
rdf
```
This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.
We'll check that we're properly connected and that we have the authorizations by running a simple query:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'OnboardTime']
We're good to go. Let's get started!
Dropping columns
Let's imagine that we don't need the "Fare" column. We can drop it using RemoteLazyFrame's drop() method, which takes the name of a column, or a list of column names, as a parameter and returns a RemoteLazyFrame that no longer includes those columns.
```python
rdf = rdf.drop("Fare")
rdf.columns
```
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'OnboardTime']
As you can see, the "Fare" column is no longer in our rdf RemoteLazyFrame instance.
Cleaning null values
The next problem we want to address is null values in the dataset. First of all, let's see how many null values we have in the "Age" column.

We can do this by selecting the 'Age' column and using the is_null() method. This outputs a RemoteLazyFrame in which each cell of the 'Age' column that was null is replaced by the value 1, and each cell that was not null is replaced by the value 0. We then use sum() to add up all these 1 values, which gives us the total number of null values in the 'Age' column.
Note - To access and show the data in this RemoteLazyFrame, we always need to use the fetch() method. In this case, this will trigger a request for the data owner's approval: please respond to this request by inputting 'y' in the terminal running the Docker image to accept it.
```python
total_nulls = rdf.select(pl.col("Age").is_null().sum())
total_nulls.collect().fetch()
```
Replacing null values
Now that we know there are 178 null values in the 'Age' column, we can ask the next question: what do we want to do with them?
One method to deal with null values is to replace them with another value. To do that, we can use the fill_null() method to replace all null 'Age' cells with the value "100".
To verify this has worked, we'll check how many null values we have in our new RemoteLazyFrame instance, called swap:
```python
swap = rdf.fill_null("100")
total_nulls = swap.select(pl.col("Age").is_null().sum())
total_nulls.collect().fetch()
```
The result is 0: there are no longer any null values in the 'Age' column.
Let's also check how many cells now contain 100. We can do this by filtering the values in 'Age' down to those equal to 100 and then counting all the cells in that column.
```python
total_100s = swap.filter(pl.col("Age") == "100").select(pl.col("Age").count())
total_100s.collect().fetch()
```
The output is, of course, 178: the number of null values we just replaced.
Converting column types
As you may have noticed, our "Age" column contains strings, not integers. If we wanted to change that, we could use the cast() method with strict set to False to convert our string values to numerical ones!
```python
swap = swap.with_column(pl.col("Age").cast(pl.Int64, strict=False))
total_num_100s = swap.filter(pl.col("Age") == 100).select(pl.col("Age").count())
total_num_100s.collect().fetch()
```
Deleting null values
Another method for handling null values is... to delete them! We can do this by using RemoteLazyFrame's drop_nulls() method.
```python
drop = rdf.drop_nulls()
total_nulls = drop.select(pl.col("Age").is_null().sum())
total_nulls.collect().fetch()
```
As you can see, our drop instance, derived from the original rdf RemoteLazyFrame, now also has zero null values.
Cleaning near-matching labels
We'll now handle near-matching labels. We can filter a column, or a whole dataset, down to only unique cells by using the unique() method.
```python
df = rdf.unique()
df.select(pl.col("Sex")).collect().fetch()
```
As you can see, all the variant spellings remain after deduplication, since each one is technically unique. This leads us to our final cleaning topic: how can we map all the alternative forms of one value, like "M" or "m" for male, to one single value?
One way to achieve this is to use Polars' when-then-otherwise expression to replace alternative forms of "male" and "female" with one chosen form.
```python
new_rdf = (
    df.select(
        pl.when(pl.col("Sex") == "M")
        .then("male")
        .when(pl.col("Sex") == "m")
        .then("male")
        .when(pl.col("Sex") == "m ")
        .then("male")
        .when(pl.col("Sex") == "F")
        .then("female")
        .when(pl.col("Sex") == "f")
        .then("female")
        .otherwise(pl.col("Sex"))
    )
    .collect()
    .fetch()
)
new_rdf
```
All our variations have been replaced by one unique option!
You now know how to clean your data using BastionLab. All that's left to do now is to close your connection to the server and stop the server:
```python
# close the connection and stop the server
connection.close()
bastionlab_server.stop(srv)
```