In this tutorial, we will save a dataframe and then restart the server to test whether the dataframe remains available on the server.
Installation and dataset
In order to run this notebook, we need to:
- Have Python 3.7 (or later) and pip installed
- Install BastionLab
- Download the dataset we will be using in this tutorial.
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the Titanic dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
Our dataset is based on the Titanic dataset, one of the most popular datasets for learning machine learning. It contains information about the passengers aboard the Titanic.
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
In a typical workflow, the data owner would send a set of keys to the server, so that every user has to be authorized when connecting. BastionLab offers this authorization feature, but since it is not the focus of this tutorial, we will not use it. You can refer to the authentication tutorial if you want to set it up.
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

# read the CSV file downloaded above
df = pl.read_csv("titanic.csv")

# savable=True allows dataframes derived from this one to be saved on the server
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
rdf
This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.
We'll check that we're properly connected and authorized by running a simple query:
per_class_rates = (
    rdf.select([pl.col("Pclass"), pl.col("Survived")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").mean())
    .sort("Survived", reverse=True)
    .collect()
)
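For readers less familiar with the polars expression API, the query above groups rows by Pclass, averages Survived within each group, and sorts the groups by that average in descending order. The plain-Python sketch below shows the same computation on a handful of made-up rows. The rows and the helper name `survival_rate_per_class` are illustrative assumptions, not data from the Titanic file or part of BastionLab.

```python
from collections import defaultdict

# Toy (Pclass, Survived) rows, made up for illustration only.
rows = [(1, 1), (1, 1), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1), (1, 0)]


def survival_rate_per_class(rows):
    """Group by Pclass, average Survived, and sort by rate, highest first."""
    totals = defaultdict(lambda: [0, 0])  # Pclass -> [sum of Survived, row count]
    for pclass, survived in rows:
        totals[pclass][0] += survived
        totals[pclass][1] += 1
    rates = [(pclass, s / n) for pclass, (s, n) in totals.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


per_class = survival_rate_per_class(rows)
print(per_class)
```

The result is a list of (Pclass, rate) pairs, which mirrors the two columns of the remote result.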
per_class_rates.save()
saved_identifier = per_class_rates.identifier
print(saved_identifier)
Let us also fetch the rdf so we can compare it to the reloaded dataframe later.
We will now restart the server and check which dataframes persist in the server.
Testing persistence of data frames
Terminate the running bastionlab server, then restart it.
If you are not running this notebook in Colab, or if you are not using the pip-packaged server, you can stop the server by pressing Ctrl+C in its terminal and then launch it again with the same command you used to start it.
bastionlab_server.stop(srv)
srv = bastionlab_server.start()
Reconnect to the server and list the available dataframes.
connection = Connection("localhost")
client = connection.client
client.polars.list_dfs()
As you can see, the saved dataframe persists on the server.
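To make the idea concrete, here is a toy model of this behaviour, written with the standard library only. It is a sketch under stated assumptions and not BastionLab's implementation: `ToyServer`, its JSON-on-disk storage, and the sample values are all hypothetical. It only illustrates that frames kept in memory are lost on restart, while frames that were explicitly saved are reloaded.

```python
import json
import tempfile
import uuid
from pathlib import Path


class ToyServer:
    """Hypothetical stand-in for a server: frames live in memory unless saved."""

    def __init__(self, storage_dir):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        # On startup, reload every frame that was written to disk earlier.
        self.frames = {
            path.stem: json.loads(path.read_text())
            for path in self.storage_dir.glob("*.json")
        }

    def send_df(self, columns):
        """Register a frame in memory only and return its identifier."""
        identifier = str(uuid.uuid4())
        self.frames[identifier] = columns
        return identifier

    def save(self, identifier):
        """Write one frame to disk so that it survives a restart."""
        path = self.storage_dir / f"{identifier}.json"
        path.write_text(json.dumps(self.frames[identifier]))

    def list_dfs(self):
        """Return the identifiers of all frames currently known to the server."""
        return sorted(self.frames)


with tempfile.TemporaryDirectory() as storage:
    server = ToyServer(storage)
    # Made-up values, for illustration only.
    saved_id = server.send_df({"Pclass": [1, 2, 3], "Survived": [0.6, 0.5, 0.2]})
    unsaved_id = server.send_df({"Pclass": [1], "Survived": [1.0]})
    server.save(saved_id)

    # "Restart": drop the in-memory state and start again from the same directory.
    server = ToyServer(storage)
    frames_after_restart = server.list_dfs()
    restored = server.frames[saved_id]

print(frames_after_restart == [saved_id])
```

Only the identifier passed to `save` is listed after the simulated restart. The frame that was sent but never saved is gone.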
You can print the dataframe to be sure it's the same one you saved.
retrieved_rdf = client.polars.get_df(saved_identifier)
retrieved_rdf.fetch()
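Rather than comparing the two printed tables by eye, the check can be automated. The helper below is a minimal sketch that compares two frames given as plain dictionaries mapping column names to lists of values. It assumes that the result of `fetch()` is a polars DataFrame, which could be converted with `to_dict(as_series=False)`. The name `frames_match` and the sample values are hypothetical.

```python
def frames_match(before, after, tol=1e-12):
    """Return True if two column dictionaries hold the same columns and values."""
    if list(before) != list(after):
        return False  # different columns or different column order
    for name in before:
        left, right = before[name], after[name]
        if len(left) != len(right):
            return False
        for a, b in zip(left, right):
            both_float = isinstance(a, float) and isinstance(b, float)
            if both_float and abs(a - b) <= tol:
                continue  # equal up to the floating-point tolerance
            if a != b:
                return False
    return True


# Made-up values standing in for the frame fetched before and after the restart.
before = {"Pclass": [1, 2, 3], "Survived": [0.6, 0.5, 0.2]}
same = {"Pclass": [1, 2, 3], "Survived": [0.6, 0.5, 0.2]}
changed = {"Pclass": [1, 2, 3], "Survived": [0.6, 0.5, 0.3]}

print(frames_match(before, same), frames_match(before, changed))
```

The small tolerance on floats is a design choice: averaged columns such as Survived may differ in the last bits after a round trip through storage, while integer and string columns are compared exactly.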
Finally, close the connection.
# connection.close()
# bastionlab_server.stop(srv)