SQL is the most important querying language in data science - around 70% of data scientists using it in their work! That is why we wanted to be able to accept SQL queries in BastionLab, enabling us to combine their familiarity with security guarantees.
In the following notebook, we will show you how to run basic SQL queries on RemoteLazyFrames.
But before we dive in, let's get everything set up! If you already know how to do this from previous tutorials or our quick tour, feel free to skip ahead to the SQL queries section - we'll be using the Titanic dataset!
Pre-requisites¶
Installation and dataset¶
In order to run this notebook, we need to:
- Have Python3.7 (or greater) and Python Pip installed
- Install BastionLab
- Download the dataset we will be using in this tutorial.
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# pip packages
!pip install bastionlab
!pip install bastionlab_server
# download the dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
Our dataset is based on the Titanic dataset, one of the most popular ressource used for understanding machine learning, which contains information relating to the passengers aboard the Titanic.
Launch and connect to the server¶
# launch bastionlab_server test package
import bastionlab_server
srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
# connect to the server
from bastionlab import Connection
connection = Connection("localhost")
client = connection.client
import polars as pl
from bastionlab.polars.policy import Policy, Aggregation, Log
df = pl.read_csv("titanic.csv")
policy = Policy(
safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=False
)
rdf = client.polars.send_df(df, policy=policy)
rdf
FetchableLazyFrame(identifier=b57df989-7cf7-46e4-bf9a-0a6fd8979e97)
The server returns a RemoteLazyFrame
which we will be working with throughout the rest of this tutorial!
SQL method¶
SQL queries in BastionLab work by using the sql()
static method of the RemoteLazyFrame class.
sql()
takes two arguments:
query
: a string containing your query,rdfs
: your RemoteLazyFrame(s) provided as*args
.
Select¶
Let's start by looking at an example of how to select
columns.
We first create our query string, q
. Usually, you would name a table to work with after the from
keyword. Instead, we'll leave a placeholder {}
that will take in the second argument given to the sql()
method - here, the RemoteLazyFrame rdf
.
Note that you can user upper or lower case for the instructions in your SQL queries.
from bastionlab.polars import RemoteLazyFrame
# select the Sex and Age columns, limit output to 3 columns
q = "select Name, Sex, Age from {} limit 3"
# give the RemoteDataFrame to be read instead of {}
RemoteLazyFrame.sql(q, rdf).collect().fetch()
Warning: non privacy-preserving query. Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame b57df989-7cf7-46e4-bf9a-0a6fd8979e97. This incident will be reported to the data owner.
Name | Sex | Age |
---|---|---|
str | str | f64 |
"Braund, Mr. Ow... | "male" | 22.0 |
"Cumings, Mrs. ... | "female" | 38.0 |
"Heikkinen, Mis... | "female" | 26.0 |
Filter with WHERE¶
We can add a WHERE
clause into our query.
We can use IS NOT NULL
to filter out any null results.
# select passengers who are between 10 and 18, embarked from Southampton, and have a cabin number
RemoteLazyFrame.sql(
"SELECT * FROM {} WHERE Age BETWEEN 10 AND 18 AND Embarked = 'S' AND Cabin IS NOT NULL",
rdf,
).collect().fetch()
Warning: non privacy-preserving query. Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame b57df989-7cf7-46e4-bf9a-0a6fd8979e97. This incident will be reported to the data owner.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
436 | 1 | 1 | "Carter, Miss. ... | "female" | 14.0 | 1 | 2 | "113760" | 120.0 | "B96 B98" | "S" |
505 | 1 | 1 | "Maioni, Miss. ... | "female" | 16.0 | 0 | 0 | "110152" | 86.5 | "B79" | "S" |
690 | 1 | 1 | "Madill, Miss. ... | "female" | 15.0 | 0 | 1 | "24160" | 211.3375 | "B5" | "S" |
782 | 1 | 1 | "Dick, Mrs. Alb... | "female" | 17.0 | 1 | 0 | "17474" | 57.0 | "B20" | "S" |
803 | 1 | 1 | "Carter, Master... | "male" | 11.0 | 1 | 2 | "113760" | 120.0 | "B96 B98" | "S" |
854 | 1 | 1 | "Lines, Miss. M... | "female" | 16.0 | 0 | 1 | "PC 17592" | 39.4 | "D28" | "S" |
Aggregated queries¶
To test out aggregated queries, let's get the oldest and youngest passengers on the titanic by using Max()
and Min()
. We'll also get the average age of passengers using Avg()
.
The AS
keyword will let us give a custom name to the column resulting from the operation.
# get the oldest, yougest and average age of the passengers.
RemoteLazyFrame.sql(
"SELECT Max(Age) AS Oldest, Min(Age) AS Youngest, Avg(Age) AS Average FROM {}", rdf
).collect().fetch()
Oldest | Youngest | Average |
---|---|---|
f64 | f64 | f64 |
80.0 | 0.42 | 29.699118 |
You might think 0.42 is a strange result to get for an age query, but you can check: there is a 0.42 entry in the Titanic dataset!
Group by¶
Next up, we are going to get the number of survivors in each class (1, 2 and 3) by using GROUP BY
.
# select how many passengers survived in each each ticket class
RemoteLazyFrame.sql(
'SELECT Pclass, count("Survived") FROM {} GROUP BY Pclass ORDER BY Pclass', rdf
).collect().fetch()
Pclass | Survived |
---|---|
i64 | u32 |
1 | 216 |
2 | 184 |
3 | 491 |
Join¶
Now let's take a look at an example of a join
. It would be complicated to do a quick example using the Titanic dataset, so let's just make our own two little dataframes.
The first one, df_actor
, will have the names of the two main characters of the movie Titanic and the actors playing them. The second, df_class
, will have the names of the two main characters and the class of their ticket.
We will do an inner join
to combine both and get all these informations in one rdf_TitanicMovie
RemoteLazyFrame. We use the ON
keyword to indicate that the "Name"
column should be taken as the joining point.
You'll notice that we need to send our RemoteLazyFrames as arguments to sql()
twice because we use our placeholders twice in our SQL query!
# creating and uploading a DataFrame with the actors
# playing the main characters in Titanic to the server
df_actor = pl.DataFrame(
{
"Name": ["Rose DeWitt Bukater", "Jack Dawson"],
"Actor": ["Kate Winslet", "Leonardo Di Caprio"],
}
)
# creating and uploading a DataFrame with the ticket
# class of the main characters in Titanic to the server
df_class = pl.DataFrame(
{
"Name": ["Rose DeWitt Bukater", "Jack Dawson"],
"Class": [1, 3],
}
)
# getting the corresponding RemoteLazyFrames of those DataFrames
rdf_actor = connection.client.polars.send_df(df_actor, policy=policy)
rdf_class = connection.client.polars.send_df(df_class, policy=policy)
# joining both RemoteDataFrames based on their "Name" column
rdf_TitanicMovie = RemoteLazyFrame.sql(
"SELECT * FROM {} inner join {} ON {}.Name = {}.Name",
rdf_actor,
rdf_class,
rdf_actor,
rdf_class,
)
# fetching the result
rdf_TitanicMovie.collect().fetch()
Warning: non privacy-preserving query. Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame a480f4c0-7c76-4292-a3db-424c978b80fb. This incident will be reported to the data owner.
Name | Actor | Class |
---|---|---|
str | str | i64 |
"Rose DeWitt Bu... | "Kate Winslet" | 1 |
"Jack Dawson" | "Leonardo Di Ca... | 3 |
This little movie adventure brings our tutorial on SQL queries to an end. We have learnt how to select data, filter data down using WHERE
, use GROUP BY
, aggregate functions and do joins
using BastionLab's sql()
functionality.
Important!
Be aware, though, that not all SQL functionnalities work with BastionLab. There can be two reasons:- First, for security reasons, the query cannot be executed. You will get a message explaining why.
- Second, some features are not implemented in polars-sql, the package we use with BastionLab. For example,
DELETE
,UPDATE
,CASE
,INSERT
andFULL OUTER
andRIGHT JOIN
are not currently available.
Let's now close the connection and shutdown the server.
# connection.close()
# bastionlab_server.stop(srv)