Combining datasets allows us to train models or run queries on multiple datasets from different parties, which can lead to much more powerful results.
Let's take an example: say 100 different hospitals around the world want to take part in a project to train a machine learning model to determine if a patient has Covid-19 based on a chest X-ray scan. The model will almost certainly be much more accurate and relevant to a more varied range of patients if this model is trained on a combined dataset from the 100 hospitals than from any one of these hospital.
In this tutorial, we are going to explore how we can combine remote datasets securely in BastionLab. The vital advantage of doing this with our remote privacy features is that a data scientist can combine all of these datasets without having direct access to any one of them, enabling a level of collaboration which may previously have been deemed too risky in terms of data privacy.
So let's take a look at the steps required to combine datasets.
Pre-requisites¶
Installation¶
In order to run this notebook, we need to:
- Have Python3.7 (or greater) and Python Pip installed
- Install BastionLab
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# pip packages
!pip install bastionlab
!pip install bastionlab_server
Launch and connect to the server¶
# launch bastionlab_server test package
import bastionlab_server
srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
# connecting to the server
from bastionlab import Connection
connection = Connection("localhost")
client = connection.client
Upload the dataframe to the server¶
For this tutorial, we'll create three simple Polars dataframes with "Element" and "Melting Point (K)" columns.
import polars as pl
df1 = pl.DataFrame(
{
"Element": ["Copper", "Silver", "Silver", "Gold"],
"Melting Point (K)": [1357.77, 1234.93, 1234.93, 1337.33],
}
)
df2 = pl.DataFrame(
{"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)
df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})
We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy here.
from bastionlab.polars.policy import Policy, TrueRule, Log
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)
Important!
This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.
Appending datasets¶
We can now move onto exploring how to append RemoteLazyFrames using vstack
. vstack()
can be used to append any RemoteLazyFrame to another where the column names and types match.
You can learn about dataset preparation, including changing column types, names and adding and removing columns, in our Data cleaning tutorial
We call the vstack()
method on the first RemoteLazyFrame and then give the RemoteLazyFrame we want to append to it as an argument. vstack
returns the resulting combined RemoteLazyFrame.
Here, for example, rdf2
, containing Platinum and Palladium, is appended to rdf1
, containing Copper, Silver and Gold. We set rdf1
to equal combined RemoteLazyFrame returned by vstack
, and so when we .collect().fetch()
rdf1
, we see the resulting combined dataset.
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()
Element | Melting Point (K) |
---|---|
str | f64 |
"Copper" | 1357.77 |
"Silver" | 1234.93 |
"Silver" | 1234.93 |
"Gold" | 1337.33 |
"Platinum" | 2041.4 |
"Palladium" | 1828.05 |
You can do this as many times as you want. For example, let's add our third RemoteLazyFrame containing Titanium to our first RemoteDataFrame, twice.
As you can see, rdf1
now has all the previous elements, plus two lots of Titanium at the end.
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()
Element | Melting Point (K) |
---|---|
str | f64 |
"Copper" | 1357.77 |
"Silver" | 1234.93 |
"Silver" | 1234.93 |
"Gold" | 1337.33 |
"Platinum" | 2041.4 |
"Palladium" | 1828.05 |
"Titanium" | 1945.0 |
"Titanium" | 1945.0 |
Joining datasets¶
We also implement join
and join_asof
to enable users to combine rows from two or more tables, based on a related column between the tables.
join()
¶
Let's start by examining the join
function further. The join
function makes use of Polars own join
function for LazyFrames and allows joins on RemoteLazyFrames on the same server.
The arguments accepted by join
are the same as those accepted by Polars LazyFrame join
method:
other (mandatory argument)
: The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.left_on
: The name or list of names of the left join column(s). (Note: you must specify either theleft_on
andright_on
columns or oneon
column.)right_on
: The name or list of names of of the right join column(s).on
: The name or list of names of join columns for both RemoteLazyFrames.how
: Thehow
argument is where you specify your join strategy. See the 'strategies' section below for more info on the available options.allow_parallel
: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.force_parallel
: Boolean value for forcing parallel the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
strategies¶
The following strategies are supported as values of the how
keyword. They all will let you join your tables following a specific set of instructions.
inner (default)
: Returns records that have matching values in both tables.left
: Returns all records from the left table, and the matched records from the right table.outer
: Returns all records when there is a match in either left or right table.semi
: Similar to theinner
strategy but returns results with only the columns from the left-hand RemoteLazyFrame.anti
: Like thesemi
join, but the opposite! It shows only the left-hand values that do not match with the right-hand RemoteLazyFrame values (based on theon
column).cross
: Returns a paired combination of each row of the first table with each row of the second table.suffix
: Suffix to append to columns with a duplicate name.
To learn more about the more common
inner
,left
andouter
join methods, check out this article or this visual helper tool.To learn more about
cross
joins, check out this article.To learn more about the
semi
,anti
and join strategies, check out the example in Polars User Guide.
Examples¶
We now know about join's arguments and join strategies, so let's take a look at a couple of examples!
Let's start by creating a new RemoteLazyFrame which will contain an element
column to join with our previous rdf1
RemoteLazyFrame on, plus a Symbol
and Number
column. It will include some of the elements already in our rdf1
RemoteLazyFrame, plus some new elements.
rdf1.collect().fetch()
Element | Melting Point (K) |
---|---|
str | f64 |
"Copper" | 1357.77 |
"Silver" | 1234.93 |
"Silver" | 1234.93 |
"Gold" | 1337.33 |
"Platinum" | 2041.4 |
"Palladium" | 1828.05 |
"Titanium" | 1945.0 |
"Titanium" | 1945.0 |
new_df = pl.DataFrame(
{
"Element": ["Magnesium", "Silver", "Gold", "Platinum"],
"Symbol": ["Mg", "Ag", "Au", "Pt"],
"Number": [12, 47, 79, 78],
}
)
new_rdf = client.polars.send_df(new_df, policy=policy)
new_rdf.collect().fetch()
Element | Symbol | Number |
---|---|---|
str | str | i64 |
"Magnesium" | "Mg" | 12 |
"Silver" | "Ag" | 47 |
"Gold" | "Au" | 79 |
"Platinum" | "Pt" | 78 |
For our first example, we'll join the two RemoteLazyFrames by their column in common, "Element"
, using the inner
method. This will leave us with a combined table, but will drop any elements which were not found in both tables.
join = rdf1.join(new_rdf, on="Element", how="inner")
join.collect().fetch()
Element | Melting Point (K) | Symbol | Number |
---|---|---|---|
str | f64 | str | i64 |
"Silver" | 1234.93 | "Ag" | 47 |
"Silver" | 1234.93 | "Ag" | 47 |
"Gold" | 1337.33 | "Au" | 79 |
"Platinum" | 2041.4 | "Pt" | 78 |
For our second example, we will us the anti
join which will give us only the elements from the right-hand table which do not match with any elements in the the other
table.
join = rdf1.join(new_rdf, on="Element", how="anti")
join.collect().fetch()
Element | Melting Point (K) |
---|---|
str | f64 |
"Copper" | 1357.77 |
"Palladium" | 1828.05 |
"Titanium" | 1945.0 |
"Titanium" | 1945.0 |
join_asof()
¶
join_asof()
works similarly to a left-join
, except that we match on nearest key rather than equal keys. For this to work, both RemoteLazyFrames must be sorted by the join_asof
key.
Like with join
, join_asof
also makes use of Polars own join_asof
function for LazyFrames and allows join_asof
to be performed on RemoteLazyFrames on the same server.
The arguments accepted are the same as those accepted by the Polars LazyFrame join_asof
method.
other (mandatory argument)
: The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.left_on
: The name(s) of the left join column(s). (Note: you must specify either theleft_on
andright_on
columns or oneon
column.)right_on
: The name(s) of of the right join column(s).on
: Name(s) of join columns for both RemoteLazyFrames.by_left
: Join on these columns before doing asof join.by_right
: Join on these columns before doing asof join.by
: Join on these columns before doing asof join.strategy
: Join strategy:'backward'
or'forward'
. See the next section, 'strategy' for more details.suffix
: Suffix to append to columns with a duplicate name.tolerance
: Numeric tolerance. By setting this the join will only be done if the near keys are within this distance.allow_parallel
: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.force_parallel
: Boolean value for forcing parallel the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
strategy¶
- If you select
backward
, search selects the last row in the right DataFrame whoseon
key is less than or equal to the left’s key. - If you select
forward
, search selects the first row in the right DataFrame whoseon
key is greater than or equal to the left’s key.
Examples¶
This was a lot of theory, now let's dive into an example.
First, we will create two RemoteLazyFrames. They both have a distance
column that they can join on. The first dataframe has a names columns, with runners names, and a distance column, wiht how far they have ran over a week. The second dataframe has set "levels" associated with having ran a certain distance over a week: like running 50+km is classed as 'Pro' level.
df1 = pl.DataFrame(
{
"distance": [7, 16, 24, 49],
"name": ["Laura", "Charles", "Kwabena", "Shannon"],
}
)
rdf1 = client.polars.send_df(df1, policy=policy)
rdf1.collect().fetch()
distance | name |
---|---|
i64 | str |
7 | "Laura" |
16 | "Charles" |
24 | "Kwabena" |
49 | "Shannon" |
df2 = pl.DataFrame(
{
"distance": [1, 10, 25, 50],
"level unlocked": ["Amateur", "Intermediate", "Excellent", "Pro"],
}
)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf2.collect().fetch()
distance | level unlocked |
---|---|
i64 | str |
1 | "Amateur" |
10 | "Intermediate" |
25 | "Excellent" |
50 | "Pro" |
We can use the join_asof
function to join the two columns, giving each runer an associated 'level unlocked' value based on how much they ran.
joined = rdf1.join_asof(rdf2, on="distance")
joined.collect().fetch()
distance | name | level unlocked |
---|---|---|
i64 | str | str |
7 | "Laura" | "Amateur" |
16 | "Charles" | "Intermediate" |
24 | "Kwabena" | "Intermediate" |
49 | "Shannon" | "Excellent" |
For further examples of join_asof, check out the examples in Polars User Guide!
With this, you now know multiple ways of how to combine datasets using BastionLab. We can close the connection and stop the server.
# connection.close()
# bastionlab_server.stop(srv)