Combining datasets

Combining datasets allows us to train models or run queries on multiple datasets from different parties, which can lead to much more powerful results.

Let's take an example: say 100 different hospitals around the world want to take part in a project to train a machine learning model to determine if a patient has Covid-19 based on a chest X-ray scan. The model will almost certainly be much more accurate and relevant to a more varied range of patients if this model is trained on a combined dataset from the 100 hospitals than from any one of these hospital.

In this tutorial, we are going to explore how we can combine remote datasets securely in BastionLab. The vital advantage of doing this with our remote privacy features is that a data scientist can combine all of these datasets without having direct access to any one of them, enabling a level of collaboration which may previously have been deemed too risky in terms of data privacy.

So let's take a look at the steps required to combine datasets.

Pre-requisites¶

Installation¶

In order to run this notebook, we need to:

Have Python3.7 (or greater) and Python Pip installed
Install BastionLab

We'll do so by running the code block below.

If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.

In [ ]:

            
                Copied!
                
# pip packages
!pip install bastionlab
!pip install bastionlab_server
# pip packages
!pip install bastionlab
!pip install bastionlab_server

Launch and connect to the server¶

In [ ]:

            
                Copied!
                
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()

Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.

In [1]:

            
                Copied!
                
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
# connecting to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client

Upload the dataframe to the server¶

For this tutorial, we'll create three simple Polars dataframes with "Element" and "Melting Point (K)" columns.

In [2]:

            
                Copied!
                
                    
                    
                
                

        
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})
import polars as pl

df1 = pl.DataFrame(
    {
        "Element": ["Copper", "Silver", "Silver", "Gold"],
        "Melting Point (K)": [1357.77, 1234.93, 1234.93, 1337.33],
    }
)

df2 = pl.DataFrame(
    {"Element": ["Platinum", "Palladium"], "Melting Point (K)": [2041.4, 1828.05]}
)

df3 = pl.DataFrame({"Element": ["Titanium"], "Melting Point (K)": [1945.0]})

We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. It will allows us to demonstrate features without having to approve any data access requests. You can check out how to define a safe privacy policy here.

In [3]:

            
                Copied!
                
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)
from bastionlab.polars.policy import Policy, TrueRule, Log

policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)

rdf1 = client.polars.send_df(df1, policy=policy)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf3 = client.polars.send_df(df3, policy=policy)

Important!

This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.

Appending datasets¶

We can now move onto exploring how to append RemoteLazyFrames using vstack. vstack() can be used to append any RemoteLazyFrame to another where the column names and types match.

You can learn about dataset preparation, including changing column types, names and adding and removing columns, in our Data cleaning tutorial

We call the vstack() method on the first RemoteLazyFrame and then give the RemoteLazyFrame we want to append to it as an argument. vstack returns the resulting combined RemoteLazyFrame.

Here, for example, rdf2, containing Platinum and Palladium, is appended to rdf1, containing Copper, Silver and Gold. We set rdf1 to equal combined RemoteLazyFrame returned by vstack, and so when we .collect().fetch() rdf1, we see the resulting combined dataset.

In [4]:

            
                Copied!
                
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()
rdf1 = rdf1.vstack(rdf2)
rdf1.collect().fetch()

Out[4]:

shape: (6, 2)

Element	Melting Point (K)
str	f64
"Copper"	1357.77
"Silver"	1234.93
"Silver"	1234.93
"Gold"	1337.33
"Platinum"	2041.4
"Palladium"	1828.05

You can do this as many times as you want. For example, let's add our third RemoteLazyFrame containing Titanium to our first RemoteDataFrame, twice.

As you can see, rdf1 now has all the previous elements, plus two lots of Titanium at the end.

In [5]:

            
                Copied!
                
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()
rdf1 = rdf1.vstack(rdf3)
rdf1 = rdf1.vstack(rdf3)
rdf1.collect().fetch()

Out[5]:

shape: (8, 2)

Element	Melting Point (K)
str	f64
"Copper"	1357.77
"Silver"	1234.93
"Silver"	1234.93
"Gold"	1337.33
"Platinum"	2041.4
"Palladium"	1828.05
"Titanium"	1945.0
"Titanium"	1945.0

Joining datasets¶

We also implement join and join_asof to enable users to combine rows from two or more tables, based on a related column between the tables.

`join()`¶

Let's start by examining the join function further. The join function makes use of Polars own join function for LazyFrames and allows joins on RemoteLazyFrames on the same server.

The arguments accepted by join are the same as those accepted by Polars LazyFrame join method:

other (mandatory argument): The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.
left_on: The name or list of names of the left join column(s). (Note: you must specify either the left_on and right_on columns or one on column.)
right_on: The name or list of names of of the right join column(s).
on: The name or list of names of join columns for both RemoteLazyFrames.
how: The how argument is where you specify your join strategy. See the 'strategies' section below for more info on the available options.
allow_parallel: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
force_parallel: Boolean value for forcing parallel the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.

strategies¶

The following strategies are supported as values of the how keyword. They all will let you join your tables following a specific set of instructions.

inner (default): Returns records that have matching values in both tables.
left: Returns all records from the left table, and the matched records from the right table.
outer: Returns all records when there is a match in either left or right table.
semi: Similar to the inner strategy but returns results with only the columns from the left-hand RemoteLazyFrame.
anti: Like the semi join, but the opposite! It shows only the left-hand values that do not match with the right-hand RemoteLazyFrame values (based on the on column).
cross: Returns a paired combination of each row of the first table with each row of the second table.
suffix: Suffix to append to columns with a duplicate name.

To learn more about the more common inner, left and outer join methods, check out this article or this visual helper tool.

To learn more about cross joins, check out this article.

To learn more about the semi, anti and join strategies, check out the example in Polars User Guide.

Examples¶

We now know about join's arguments and join strategies, so let's take a look at a couple of examples!

Let's start by creating a new RemoteLazyFrame which will contain an element column to join with our previous rdf1 RemoteLazyFrame on, plus a Symbol and Number column. It will include some of the elements already in our rdf1 RemoteLazyFrame, plus some new elements.

In [6]:

            
                Copied!
                
rdf1.collect().fetch()
rdf1.collect().fetch()

Out[6]:

shape: (8, 2)

Element	Melting Point (K)
str	f64
"Copper"	1357.77
"Silver"	1234.93
"Silver"	1234.93
"Gold"	1337.33
"Platinum"	2041.4
"Palladium"	1828.05
"Titanium"	1945.0
"Titanium"	1945.0

In [7]:

            
                Copied!
                
                    
                    
                
                

        
new_df = pl.DataFrame(
    {
        "Element": ["Magnesium", "Silver", "Gold", "Platinum"],
        "Symbol": ["Mg", "Ag", "Au", "Pt"],
        "Number": [12, 47, 79, 78],
    }
)
new_rdf = client.polars.send_df(new_df, policy=policy)
new_rdf.collect().fetch()
new_df = pl.DataFrame(
    {
        "Element": ["Magnesium", "Silver", "Gold", "Platinum"],
        "Symbol": ["Mg", "Ag", "Au", "Pt"],
        "Number": [12, 47, 79, 78],
    }
)
new_rdf = client.polars.send_df(new_df, policy=policy)
new_rdf.collect().fetch()

Out[7]:

shape: (4, 3)

Element	Symbol	Number
str	str	i64
"Magnesium"	"Mg"	12
"Silver"	"Ag"	47
"Gold"	"Au"	79
"Platinum"	"Pt"	78

For our first example, we'll join the two RemoteLazyFrames by their column in common, "Element", using the inner method. This will leave us with a combined table, but will drop any elements which were not found in both tables.

In [8]:

            
                Copied!
                
join = rdf1.join(new_rdf, on="Element", how="inner")
join.collect().fetch()
join = rdf1.join(new_rdf, on="Element", how="inner")
join.collect().fetch()

Out[8]:

shape: (4, 4)

Element	Melting Point (K)	Symbol	Number
str	f64	str	i64
"Silver"	1234.93	"Ag"	47
"Silver"	1234.93	"Ag"	47
"Gold"	1337.33	"Au"	79
"Platinum"	2041.4	"Pt"	78

For our second example, we will us the anti join which will give us only the elements from the right-hand table which do not match with any elements in the the other table.

In [9]:

            
                Copied!
                
join = rdf1.join(new_rdf, on="Element", how="anti")
join.collect().fetch()
join = rdf1.join(new_rdf, on="Element", how="anti")
join.collect().fetch()

Out[9]:

shape: (4, 2)

Element	Melting Point (K)
str	f64
"Copper"	1357.77
"Palladium"	1828.05
"Titanium"	1945.0
"Titanium"	1945.0

`join_asof()`¶

join_asof() works similarly to a left-join, except that we match on nearest key rather than equal keys. For this to work, both RemoteLazyFrames must be sorted by the join_asof key.

Like with join, join_asof also makes use of Polars own join_asof function for LazyFrames and allows join_asof to be performed on RemoteLazyFrames on the same server.

The arguments accepted are the same as those accepted by the Polars LazyFrame join_asof method.

other (mandatory argument): The other RemoteLazyFrame you want to join your current RemoteLazyFrame with.
left_on: The name(s) of the left join column(s). (Note: you must specify either the left_on and right_on columns or one on column.)
right_on: The name(s) of of the right join column(s).
on: Name(s) of join columns for both RemoteLazyFrames.
by_left: Join on these columns before doing asof join.
by_right: Join on these columns before doing asof join.
by: Join on these columns before doing asof join.
strategy: Join strategy: 'backward' or 'forward'. See the next section, 'strategy' for more details.
suffix: Suffix to append to columns with a duplicate name.
tolerance: Numeric tolerance. By setting this the join will only be done if the near keys are within this distance.
allow_parallel: Boolean value for allowing the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.
force_parallel: Boolean value for forcing parallel the physical plan to evaluate the computation of both RemoteLazyFrames up to the join in parallel.

strategy¶

If you select backward, search selects the last row in the right DataFrame whose on key is less than or equal to the left’s key.
If you select forward, search selects the first row in the right DataFrame whose on key is greater than or equal to the left’s key.

Examples¶

This was a lot of theory, now let's dive into an example.

First, we will create two RemoteLazyFrames. They both have a distance column that they can join on. The first dataframe has a names columns, with runners names, and a distance column, wiht how far they have ran over a week. The second dataframe has set "levels" associated with having ran a certain distance over a week: like running 50+km is classed as 'Pro' level.

In [10]:

            
                Copied!
                
                    
                    
                
                

        
df1 = pl.DataFrame(
    {
        "distance": [7, 16, 24, 49],
        "name": ["Laura", "Charles", "Kwabena", "Shannon"],
    }
)
rdf1 = client.polars.send_df(df1, policy=policy)
rdf1.collect().fetch()
df1 = pl.DataFrame(
    {
        "distance": [7, 16, 24, 49],
        "name": ["Laura", "Charles", "Kwabena", "Shannon"],
    }
)
rdf1 = client.polars.send_df(df1, policy=policy)
rdf1.collect().fetch()

Out[10]:

shape: (4, 2)

distance	name
i64	str
7	"Laura"
16	"Charles"
24	"Kwabena"
49	"Shannon"

In [11]:

            
                Copied!
                
                    
                    
                
                

        
df2 = pl.DataFrame(
    {
        "distance": [1, 10, 25, 50],
        "level unlocked": ["Amateur", "Intermediate", "Excellent", "Pro"],
    }
)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf2.collect().fetch()
df2 = pl.DataFrame(
    {
        "distance": [1, 10, 25, 50],
        "level unlocked": ["Amateur", "Intermediate", "Excellent", "Pro"],
    }
)
rdf2 = client.polars.send_df(df2, policy=policy)
rdf2.collect().fetch()

Out[11]:

shape: (4, 2)

distance	level unlocked
i64	str
1	"Amateur"
10	"Intermediate"
25	"Excellent"
50	"Pro"

We can use the join_asof function to join the two columns, giving each runer an associated 'level unlocked' value based on how much they ran.

In [12]:

            
                Copied!
                
joined = rdf1.join_asof(rdf2, on="distance")
joined.collect().fetch()
joined = rdf1.join_asof(rdf2, on="distance")
joined.collect().fetch()

Out[12]:

shape: (4, 3)

distance	name	level unlocked
i64	str	str
7	"Laura"	"Amateur"
16	"Charles"	"Intermediate"
24	"Kwabena"	"Intermediate"
49	"Shannon"	"Excellent"

For further examples of join_asof, check out the examples in Polars User Guide!

With this, you now know multiple ways of how to combine datasets using BastionLab. We can close the connection and stop the server.

In [13]:

            
                Copied!
                
# connection.close()
# bastionlab_server.stop(srv)
# connection.close()
# bastionlab_server.stop(srv)

Combining datasets

Pre-requisites¶

Installation¶

Launch and connect to the server¶

Upload the dataframe to the server¶

Appending datasets¶

Joining datasets¶

join()¶

strategies¶

Examples¶

join_asof()¶

strategy¶

Examples¶

`join()`¶

`join_asof()`¶