Saving dataframes

In this tutorial, we will save a dataframe and then restart the server to test whether the dataframe remains available on the server.

Pre-requisites¶

Installation and dataset¶

In order to run this notebook, we need to:

Have Python3.7 (or greater) and Python Pip installed
Install BastionLab
Download the dataset we will be using in this tutorial.

We'll do so by running the code block below.

If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.

In [28]:

            
                Copied!
                
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the Titanic dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the Titanic dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

Our dataset is based on the Titanic dataset, one of the most popular datasets used for understanding machine learning which contains information relating to the passengers aboard the Titanic.

Launch and connect to the server¶

In [ ]:

            
                Copied!
                
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.

It's important to note that in a typical workflow, the data owner would send a set of keys to the server, so that authorization can be required for all users at the point of connection. BastionLab offers the authorization feature, but as it's not the focus of this visualization tutorial, we will not use it. You can refer to the authentication tutorial if you want to set it up.

In [1]:

            
                Copied!
                
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Upload the dataframe to the server¶

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy here.

In [2]:

            
                Copied!
                
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("modified_titanic_data.csv")

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

rdf

Out[2]:

FetchableLazyFrame(identifier=43eabca3-e2e9-4600-b0f2-fb09e3422548)

Important!

This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.

We'll check that we're properly connected and that we have the authorizations by running a simple query:

In [3]:

            
                Copied!
                
                    
                    
                
                

        
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)

Saving the data frame¶

Now we save the resulting dataframe from our previous operation.

In [4]:

            
                Copied!
                
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)

4c6edbf1-acd7-49ae-82fe-95eab559536e

Let us also fetch the rdf so we can compare it to the reloaded dataframe later.

In [5]:

            
                Copied!
                
per_class_rates.fetch()
per_class_rates.fetch()

Out[5]:

shape: (3, 2)

Pclass	Survived
i64	f64
1	0.633028
2	0.475676
3	0.24187

We will now restart the server and check which dataframes persist in the server.

Testing persistence of data frames¶

Terminate the running bastionlab server, then restart it.

If you are not running this Notebook in Colab or if you do not use the pip packaged server, you can kill the server by issuing Ctrl+C in your terminal and then launch it again using the same command you used to start it.

In [ ]:

            
                Copied!
                
bastionlab_server.stop(srv)
srv = bastionlab_server.start()
bastionlab_server.stop(srv)
srv = bastionlab_server.start()

Reconnect to the server and list the available dataframes.

In [6]:

            
                Copied!
                
connection = Connection("localhost")
client = connection.client

client.polars.list_dfs()
connection = Connection("localhost")
client = connection.client

client.polars.list_dfs()

Out[6]:

[FetchableLazyFrame(identifier=4c6edbf1-acd7-49ae-82fe-95eab559536e)]

As you can see, the saved dataframe persists on the server.

You can print the dataframe to be sure it's the same one you saved.

In [7]:

            
                Copied!
                
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()

Out[7]:

shape: (3, 2)

Pclass	Survived
i64	f64
1	0.633028
2	0.475676
3	0.24187

Finally, close the connection.

In [8]:

            
                Copied!
                
# connection.close()
# bastionlab_server.stop(srv)
# connection.close()
# bastionlab_server.stop(srv)