When collaborating with data scientists, data owners often have to manually sanitize the extracts of the datasets they share. This is unsafe, due to the large risk of human error, and costly in both time and resources.
Implementing a data access policy is the solution we found to automate this process while making it safer and less of a headache. Our privacy policy defines the kinds of operations that can be run on a RemoteDataFrame (the main object you'll manipulate with BastionLab). It ensures that data scientists are unable to fetch individual rows or run any operation that leaks information. The policy must be set based on the sensitivity of the dataset.
In this tutorial, we'll show how it works, which options you can customize to your needs, and how to implement it on your dataset.
Pre-requisites¶
Installation¶
In order to run this notebook, we need to:
- Have Python 3.7 (or greater) and Python Pip installed
- Install BastionLab
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# pip packages
!pip install bastionlab
!pip install bastionlab_server
Launch the server¶
# launch bastionlab_server test package
import bastionlab_server
srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
Privacy policy options¶
A privacy policy is a set of rules describing the kind of operations that can be run on your data.
Technically, they are defined at the RemoteDataFrame level (BastionLab's main object), which means that every RemoteDataFrame
produced (output) will inherit their policy from the input. When there is more than one input, the new policy is a combination of all the input policies using the AND
combinator. In this section, we'll cover the different inputs you can define.
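The AND combination described above can be pictured with ordinary Python predicates. The helpers below are purely illustrative (they are not part of the BastionLab API); they sketch how an output policy only accepts a result that every input policy accepts:

```python
# Illustrative sketch only: modeling a policy rule as a predicate over a
# query result. These helpers are hypothetical, not BastionLab internals.

def aggregation_rule(min_agg_size):
    # Passes when the result aggregates at least `min_agg_size` rows
    return lambda result: result["rows_aggregated"] >= min_agg_size

def and_combine(*rules):
    # A RemoteDataFrame derived from several inputs must satisfy
    # every input policy: the AND combinator
    return lambda result: all(rule(result) for rule in rules)

# Two inputs with different minimum aggregation sizes
combined = and_combine(aggregation_rule(10), aggregation_rule(20))

print(combined({"rows_aggregated": 25}))  # True: satisfies both input policies
print(combined({"rows_aggregated": 15}))  # False: violates the 20-row policy
```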
A policy has various sections:
- `safe_zone`: contains the rules specifying whether the result of a query is safe to return to a data scientist.
- `unsafe_handling`: specifies the type of action that must be taken if a query breaks the rules of the safe zone.
- `savable`: accepts a boolean value. If `true`, the `RemoteDataFrame` itself and all its derived RemoteDataFrames can be saved on the server.
Now, let's import all the policy options we'll demonstrate in this tutorial:
from bastionlab import Identity, Connection
from bastionlab.polars.policy import (
Policy,
AtLeastNOf,
Aggregation,
UserId,
Log,
Review,
Reject,
)
safe_zone¶
The safe zone contains the rules specifying whether the result of a query is safe to return to a user.
Aggregation()¶
The `Aggregation()` rule ensures that the returned dataframe aggregates, at minimum, the specified number of rows from the original dataset.
In the following example, if the result of a query does not aggregate at least 10 rows, it violates the safe zone.
policy = Policy(
safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=False
)
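To make the rule concrete: every value in the returned dataframe must be computed from at least `min_agg_size` rows of the original dataset. The plain-Python sketch below illustrates the idea on a toy groupby (it is not BastionLab's actual check):

```python
from collections import Counter

# Toy "Pclass" column: 12, 11 and 4 rows per class
pclass = [1] * 12 + [2] * 11 + [3] * 4
group_sizes = Counter(pclass)

min_agg_size = 10
# A groupby result is only safe if every group aggregates enough rows
safe = all(size >= min_agg_size for size in group_sizes.values())

print(dict(group_sizes))  # {1: 12, 2: 11, 3: 4}
print(safe)               # False: class 3 only aggregates 4 rows
```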
UserId()¶
The `UserId()` rule lets a data owner grant access to a dataframe to one particular user. The `user_id` is the hash of the user's public key.
Note - We explain what Identities are and how they work in our Authentication tutorial.
The workflow is as follows: on one side, the data scientist (or user) should create their Identity and obtain their `user_id`. Then they should share it with the other side, the data owner, who can add it to the safe zone.
Here's how:
# Data scientist side
data_scientist = Identity.create("./data_scientist")
user_id = data_scientist.pubkey.hash.hex()  # the hash of the public key, hex-encoded
# Data owner side
policy = Policy(safe_zone=UserId(user_id), unsafe_handling=Log(), savable=False)
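In essence, the `user_id` is just a hex-encoded hash of the user's public key. The sketch below shows what such an identifier looks like; the key bytes are fake and SHA-256 is an assumption made purely for illustration (the exact key encoding and hash function are BastionLab implementation details):

```python
import hashlib

# Fake, stand-in public-key bytes -- not real key material
fake_pubkey_bytes = b"\x04" + bytes(range(64))

# Assumption for illustration: SHA-256 over the raw public-key bytes
user_id = hashlib.sha256(fake_pubkey_bytes).hexdigest()

print(user_id)       # a 64-character hex string identifying the user
print(len(user_id))  # 64
```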
*Important - `UserId()` will only work if authentication is enabled on the server.*
AtLeastNOf()¶
`AtLeastNOf()` is a collection of rules which ensures that the result of a query passes at least `n` of the listed rules.
You can use this to specify, for example, different rules for different users.
A possible scenario would be:
- Our main user, a data scientist, is trusted by the data owner. They can run any query they want on the dataset and retrieve the results.
- Other users are untrusted by the data owner. They must aggregate a minimum of 20 rows in the resulting dataframe.
When a query is run on the dataframe with this policy, the `AtLeastNOf()` rule will check that at least `n` of the rules listed in `of` are matched. Another way of understanding it is that either the user connecting is the `trusted_data_scientist`, or they have to aggregate a minimum of 20 rows.
data_scientist = Identity.create("./data_scientist")
trusted_data_scientist_id = data_scientist.pubkey.hash.hex()
policy = Policy(
safe_zone=AtLeastNOf(
n=1, of=[UserId(trusted_data_scientist_id), Aggregation(min_agg_size=20)]
),
unsafe_handling=Log(),
savable=False,
)
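The at-least-n-of semantics are easy to picture in plain Python. The helpers below are hypothetical stand-ins mirroring the policy above, not the BastionLab implementation:

```python
def at_least_n_of(n, rules):
    # Passes when at least `n` of the rules pass for a given query context
    return lambda ctx: sum(1 for rule in rules if rule(ctx)) >= n

# Hypothetical stand-ins for UserId() and Aggregation(min_agg_size=20)
is_trusted = lambda ctx: ctx["user_id"] == "trusted_data_scientist_id"
aggregates_20 = lambda ctx: ctx["rows_aggregated"] >= 20

policy_check = at_least_n_of(1, [is_trusted, aggregates_20])

# The trusted user passes even with a small aggregation...
print(policy_check({"user_id": "trusted_data_scientist_id", "rows_aggregated": 5}))  # True
# ...while any other user must aggregate at least 20 rows
print(policy_check({"user_id": "someone_else", "rows_aggregated": 5}))   # False
print(policy_check({"user_id": "someone_else", "rows_aggregated": 30}))  # True
```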
If you changed `n` to 2 in the code above, the policy would enforce that both rules match: access would only be allowed to the `trusted_data_scientist`, and their queries would also need to aggregate a minimum of 20 rows.
unsafe_handling¶
The `unsafe_handling` parameter is where the data owner specifies the action that must be taken when a query violates the safe zone.
Log()¶
Important - This action is unsafe! It is only suitable for development and testing. The server will return the dataframe that violates the safe zone to the user.
The `Log()` action logs every query that violates the safe zone. It is the default action.
For example, if the following policy (which requires a minimum of 10 aggregated rows) is violated because an operation only aggregates 5 rows, the server will log that query.
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log(), savable=False)
Review()¶
The `Review()` action requires the data owner's approval before any dataframe that violates the safe zone is returned: the data owner reviews the operation and either accepts or rejects the query.
If approved, the dataframe is returned to the user. If rejected, the user will be notified that the data owner has rejected their query.
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Review(), savable=False)
Reject()¶
The `Reject()` action automatically rejects any query that violates the safe zone.
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=False)
savable¶
The `savable` parameter is where the data owner specifies whether this RemoteDataFrame can be saved on the server and allowed to persist even after a server restart.
If set to `true`, any user can save this RemoteDataFrame and any RemoteDataFrames derived from it.
If set to `false`, neither this RemoteDataFrame nor any RemoteDataFrame derived from it can be saved.
# this dataframe can be saved
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=True)
Set-up a privacy policy¶
Now that we know how all the rules work, let's play with an example and see how to implement it when uploading our dataset. We'll use the Titanic dataset, which contains information relating to the passengers aboard the Titanic. We can download it by running the code block below:
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
We'll set up a minimum of 10 aggregated rows for any query and reject the queries that don't follow this rule:
import polars as pl
# we open the connection to BastionLab server
connection = Connection("localhost")
# we create a dataframe with the dataset
df = pl.read_csv("titanic.csv")
policy = Policy(
safe_zone=Aggregation(10), unsafe_handling=Reject(), savable=False
) # we define the policy
# we upload our dataset along with the policy; this returns a RemoteDataFrame instance
rdf = connection.client.polars.send_df(df, policy=policy)
To test that it works, let's run a safe query that aggregates at least 10 rows:
per_class_rates = (
rdf.select([pl.col("Pclass"), pl.col("Survived")])
.groupby(pl.col("Pclass"))
.agg(pl.col("Survived").mean())
.sort("Survived", reverse=True)
.collect()
.fetch()
)
print(per_class_rates)
shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘
Let's now try an unsafe query that doesn't aggregate the minimum number of rows:
unsafe_df = (
rdf.select([pl.col("Age"), pl.col("Survived")])
.groupby(pl.col("Age"))
.agg(pl.col("Survived").mean())
.sort("Survived", reverse=True)
.collect()
.fetch()
)
print(unsafe_df)
The query has been rejected by the data owner. None
Sanitization of columns¶
We're now handling many options automatically, but what about columns that are never safe to expose? For example, a column of names. We want to make sure those are removed from the dataframe when it is fetched.
We can do this by using the sanitized_columns
parameter in the send_df()
call:
policy = Policy(safe_zone=Aggregation(10), unsafe_handling=Log(), savable=False)
# We add a step in the send_df() call:
rdf = connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
rdf.head(5).collect().fetch()
Warning: non privacy-preserving query. Reason: Cannot fetch a result DataFrame that does not aggregate at least 10 rows of DataFrame a3f0a488-7da6-4874-9ba0-933aad1f41a9. This incident will be reported to the data owner.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | str | str | f64 | i64 | i64 | str | f64 | str | str |
1 | 0 | 3 | null | "male" | 22.0 | 1 | 0 | "A/5 21171" | 7.25 | null | "S" |
2 | 1 | 1 | null | "female" | 38.0 | 1 | 0 | "PC 17599" | 71.2833 | "C85" | "C" |
3 | 1 | 3 | null | "female" | 26.0 | 0 | 0 | "STON/O2. 31012... | 7.925 | null | "S" |
4 | 1 | 1 | null | "female" | 35.0 | 1 | 0 | "113803" | 53.1 | "C123" | "S" |
5 | 0 | 3 | null | "male" | 35.0 | 0 | 0 | "373450" | 8.05 | null | "S" |
When we print the first five rows of the dataset, the `Name` column has been replaced by `null` values!
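Conceptually, sanitization replaces every value of the listed columns with null before the dataframe ever leaves the server. Here is a plain-Python sketch of that idea on a toy column-oriented table (not BastionLab's actual code):

```python
# Toy column-oriented "dataframe"
frame = {
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
}

def sanitize(frame, columns):
    # Replace every value of each sanitized column with None (null)
    return {
        col: [None] * len(values) if col in columns else values
        for col, values in frame.items()
    }

clean = sanitize(frame, ["Name"])
print(clean["Name"])  # [None, None]
print(clean["Age"])   # [22.0, 38.0] -- other columns are untouched
```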
We have successfully set up a privacy policy. Now let's terminate the connection and stop the server:
connection.close()
bastionlab_server.stop(srv)