Authentication¶
In this tutorial, we'll explain how to start the BastionLab server with user authentication enabled and connect the client to it. We'll execute simple queries on BastionLab's central object, the RemoteDataFrame, and we'll show that authentication works by trying to connect to the server with a non-authenticated identity.
It is important to note that BastionLab has authentication enabled by default. You can also use it without authentication, which is a perfectly fine setting when deploying locally. To do that, you need to export an environment variable before you run the server, by running export DISABLE_AUTHENTICATION=1.
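As a minimal sketch of what "exporting the variable before running the server" amounts to when the server is launched from Python rather than from a shell: the child process only sees the variable if it is present in the environment it inherits. The helper name `env_without_auth` is hypothetical, and the child process below is a Python one-liner standing in for the server; only the variable name comes from the text above.

```python
import os
import subprocess
import sys


def env_without_auth(base_env=None):
    """Return a copy of the environment with authentication disabled.

    Hypothetical helper: only the variable name DISABLE_AUTHENTICATION
    comes from the tutorial text.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DISABLE_AUTHENTICATION"] = "1"
    return env


# Stand-in for the server process: a child that echoes the variable it sees.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DISABLE_AUTHENTICATION'))"],
    env=env_without_auth(),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

Setting the variable after the server has started has no effect on the running process, which is why the export has to come first.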
If you're not deploying locally, you'll need authentication to help secure access to the server. This means only known users will be able to access and use it.
To do so with BastionLab, we'll use public key authentication. We will also learn how to set up authentication and create 'identities', which are BastionLab's abstraction for authentication. The Identity interface creates both users' public and private keys.
If you want to know more about both the queries and RemoteDataFrames, you can check out the Data Scientist's side in our Quick tour.
Pre-requisites¶
Installation¶
In order to run this notebook, we need to:
- Have Python 3.7 (or greater) and pip installed
- Have Docker installed (here's the official tutorial)
- Install BastionLab
We'll do so by running the code block below.
You can see our Installation page to check if PyPi and Docker are the best method for you.
# pip packages
!pip install bastionlab
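Before going further, it can help to confirm that the prerequisites listed above are actually present. The following is a small sketch using only the standard library; the helper name `check_prerequisites` is hypothetical, and it only checks that the executables are on the PATH, not that Docker is running.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 7)):
    """Return a dict describing which prerequisites are met (hypothetical helper)."""
    return {
        "python": sys.version_info >= min_python,
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        "docker": shutil.which("docker") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```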
Launch the server¶
!docker pull mithrilsecuritysas/bastionlab:latest
Setting up the keys¶
In an authentication-enabled environment, BastionLab only accepts requests from verified users (known users whose public keys have been registered to the server at start-up).
Authentication is done with asymmetric cryptography:
- First, the data owner provides a list of authorized public keys to the server at start-up.
- Then all users must provide their corresponding private key to the client when they connect to the server.
Identity creation¶
BastionLab provides a utility module to manage the keys. We'll show how to create the public and private keys for each user. To create and manage key pairs (the corresponding public and private keys), we'll use the Identity class of the client.
In this section, we will create two identities, plus a fake one that we'll use later to test authentication:
- one for the data_owner
- one for the data_scientist
Note - The keys generated by the Identity class are placed in the current working directory.
from bastionlab import Identity
# Create `Identity` for data owner.
data_owner = Identity.create("data_owner")
# Create `Identity` for data scientist.
data_scientist = Identity.create("data_scientist")
# Fake `Identity` used for testing authentication
fake_scientist = Identity.create("fake_scientist")
Now that we have set up our identities, we will have to start the server with the public keys of the data owner and the data scientist.
Note - This step will have to be done by the party setting up the server, commonly the data owner. They will have to get all the public keys of the interested parties.
BastionLab server public keys structure¶
The directory structure the BastionLab server uses for public keys is illustrated below.
keys/
├─ owners/
└─ users/
By convention, keys is used as the default directory to store public keys.
For ease of use, it's best to have a directory structure for your public keys similar to that of the BastionLab server.
To that end, run the following commands that will create the relevant directory structure.
!mkdir -p keys/owners keys/users
For the purpose of this tutorial, we'll copy the public keys of both the data owner and the data scientist into these directories.
!cp data_owner.pub keys/owners
!cp data_scientist.pub keys/users
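The same layout can be produced from Python instead of shell commands, which is convenient when the notebook runs on a system without `mkdir -p` or `cp`. This is a sketch under the assumption that the public key files are named as in the commands above; the helper name `install_public_keys` is hypothetical, and the demo uses placeholder files in a temporary directory rather than real keys.

```python
import shutil
import tempfile
from pathlib import Path


def install_public_keys(key_dir, owners, users):
    """Create key_dir/owners and key_dir/users, then copy the given .pub files."""
    key_dir = Path(key_dir)
    for role, pub_files in (("owners", owners), ("users", users)):
        target = key_dir / role
        target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        for pub in pub_files:
            shutil.copy(pub, target / Path(pub).name)
    return key_dir


# Demo in a temporary directory with placeholder files, so no real key is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "data_owner.pub").write_text("placeholder owner key")
    (tmp / "data_scientist.pub").write_text("placeholder scientist key")
    keys = install_public_keys(
        tmp / "keys",
        owners=[tmp / "data_owner.pub"],
        users=[tmp / "data_scientist.pub"],
    )
    layout = sorted(p.relative_to(keys).as_posix() for p in keys.rglob("*.pub"))
    print(layout)
```

In the tutorial's setting, the call would be `install_public_keys("keys", ["data_owner.pub"], ["data_scientist.pub"])` from the directory where the identities were created.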
Starting BastionLab server with public keys¶
!docker run -it -p 50056:50056 -v $(pwd)/keys:/app/bin/keys mithrilsecuritysas/bastionlab:latest
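The container may take a moment before it accepts connections. A simple way to wait for it, sketched here with the standard library only, is to poll the published port until a TCP connection succeeds. The helper name `wait_for_port` and its default timeout are assumptions; a successful TCP connection shows that something is listening on the port, not that the BastionLab service itself has finished starting.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the host is unreachable): retry.
            time.sleep(interval)
    return False
```

With the command above, the call would be `wait_for_port("localhost", 50056)`, since 50056 is the port published by `docker run`.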
Setting up an authenticated connection¶
This section is essentially the same as the Quick tour, but uses an authenticated connection.
Note - Please remember that we use the data_owner identity from Identity creation.
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
--2023-01-04 16:03:34--  https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’

titanic.csv 100%[===================>] 58.89K --.-KB/s in 0.1s

2023-01-04 16:03:35 (486 KB/s) - ‘titanic.csv’ saved [60302/60302]
Data Owner's side¶
Upload the data frame to the BastionLab Client¶
First, we load the csv file:
import polars as pl
df = pl.read_csv("titanic.csv")
We then open an authenticated connection to the server by providing its hostname and the identity:
from bastionlab import Connection
connection = Connection("localhost", identity=data_owner)
For more details on both operations, you can check the Data Owner's side of the Quick tour tutorial.
The Connection class¶
The Connection class accepts as arguments the hostname and identity.
- hostname: This is the address of the BastionLab server to which we are connecting. Since we host the server locally, we use localhost.
- identity: The Identity.create method returns a SigningKey (a private key), which is used to establish a connection with the server.
  - The BastionLab client uses the SigningKey to sign a special message (this message contains a unique challenge message requested from the server and some other metadata).
  - The signed message is then sent to the server along with the hash of the public key of the client.
  - If the server has that public key (either in keys/owners or keys/users), it verifies the signed message using the corresponding public key.
  - If the verification passes, a session is created between the server and client and then a session token is sent back to the client.
  - If it fails, a connection isn't established and the server throws an error that "The user isn't authenticated".
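The lookup step described above (finding a registered public key from the hash sent by the client) can be sketched as follows. This is an illustration only, not BastionLab's server code: it assumes the hash is SHA-256 over the raw bytes of each key file, uses placeholder bytes instead of real keys, and leaves out the signature verification itself. The names `load_registry` and `lookup` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def load_registry(key_dir):
    """Map sha256(public key bytes) -> (role, file name) for every key in key_dir."""
    registry = {}
    for role in ("owners", "users"):
        for pub in sorted((Path(key_dir) / role).glob("*.pub")):
            digest = hashlib.sha256(pub.read_bytes()).hexdigest()
            registry[digest] = (role, pub.name)
    return registry


def lookup(registry, pubkey_hash):
    """Return the role for a known key hash, or raise PermissionError."""
    if pubkey_hash not in registry:
        raise PermissionError(f'"{pubkey_hash}" not authenticated!')
    return registry[pubkey_hash][0]


# Demo with placeholder bytes standing in for real public keys.
with tempfile.TemporaryDirectory() as tmp:
    keys = Path(tmp) / "keys"
    (keys / "owners").mkdir(parents=True)
    (keys / "users").mkdir(parents=True)
    (keys / "owners" / "data_owner.pub").write_bytes(b"owner-key-placeholder")
    (keys / "users" / "data_scientist.pub").write_bytes(b"scientist-key-placeholder")
    registry = load_registry(keys)

owner_hash = hashlib.sha256(b"owner-key-placeholder").hexdigest()
fake_hash = hashlib.sha256(b"fake-key-placeholder").hexdigest()
print(lookup(registry, owner_hash))
```

Calling `lookup(registry, fake_hash)` raises `PermissionError`, mirroring what happens when an identity whose public key was never copied into keys/owners or keys/users tries to connect.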
To avoid having to authenticate every single operation, which would be tedious and would not significantly improve security, BastionLab allows the user to remain authenticated for as long as the session doesn't expire.
To do so, a token is sent from the server to the user after the creation of a connection and silently appended to each subsequent request to the server. For example, once authenticated, the user can send requests such as send_df to any of BastionLab's endpoints without needing to provide their identity again.
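The session mechanism can be pictured with a toy in-memory store: a token is issued once after the signature check, and later requests are accepted as long as the token is known and has not expired. This is a sketch under stated assumptions, not BastionLab's implementation; the class name `SessionStore`, the token format and the expiry duration are all hypothetical.

```python
import secrets
import time


class SessionStore:
    """Toy in-memory session store (hypothetical; not BastionLab's implementation)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # token -> expiry timestamp

    def create_session(self):
        """Called once, after the signature check has succeeded."""
        token = secrets.token_hex(32)
        self._expiry[token] = self._clock() + self._ttl
        return token

    def is_valid(self, token):
        """Called on every later request instead of re-checking the signature."""
        expires_at = self._expiry.get(token)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expiry[token]  # expired: the client must authenticate again
            return False
        return True


# Demo with a controllable clock so expiry can be shown without waiting.
now = [0.0]
store = SessionStore(ttl_seconds=10.0, clock=lambda: now[0])
token = store.create_session()
print(store.is_valid(token))  # still within the time-to-live
now[0] = 11.0
print(store.is_valid(token))  # past the time-to-live
```

The trade-off is the usual one for session tokens: the expensive check happens once per session, and the expiry bounds how long a leaked token stays useful.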
from bastionlab.polars.policy import Policy, Aggregation, Log
policy = Policy(
safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=True
)
connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
FetchableLazyFrame(identifier=92f6366d-461e-4b55-acf1-e054dfdce06e)
Deleting a dataframe¶
The data owner can also delete dataframes on the server. It's important to note that they are the only one with the right to perform a deletion.
Let's see how it works by sending a dataframe and then deleting it. We'll use the delete_df() method, which takes the RemoteDataFrame identifier as its argument (in the code below, the deletion is done by calling delete() on the RemoteDataFrame itself). We'll list the dataframes available on the server before and after the deletion to test it.
# We create a RemoteLazyFrame with our dataset
rdf = connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
# We test it's been created
print(connection.client.polars.list_dfs())
# We delete the dataframe using the delete method
rdf.delete()
# We can't find it in the list now: it's been deleted!
print(connection.client.polars.list_dfs())
[FetchableLazyFrame(identifier=fbc57a76-39e7-43dc-8d0a-a595f2c752a8), FetchableLazyFrame(identifier=92f6366d-461e-4b55-acf1-e054dfdce06e)]
[FetchableLazyFrame(identifier=92f6366d-461e-4b55-acf1-e054dfdce06e)]
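The rule that only a data owner may delete can be illustrated with a toy store that checks the caller's role before removing an entry. Everything here is hypothetical (the class names, the `role` argument, the identifiers); it only shows the shape of the check, assuming the role is derived from whether the public key sits in keys/owners or keys/users.

```python
class PermissionDenied(Exception):
    """Raised when a non-owner attempts a restricted operation."""


class DataFrameStore:
    """Toy in-memory store; names are hypothetical."""

    def __init__(self):
        self._frames = {}

    def add(self, identifier, frame):
        self._frames[identifier] = frame

    def list_ids(self):
        return sorted(self._frames)

    def delete(self, identifier, role):
        # Only identities registered under keys/owners may delete.
        if role != "owners":
            raise PermissionDenied("only a data owner can delete a dataframe")
        del self._frames[identifier]


store = DataFrameStore()
store.add("df-1", object())
store.add("df-2", object())
try:
    store.delete("df-1", role="users")
except PermissionDenied as err:
    print(f"refused: {err}")
store.delete("df-1", role="owners")
print(store.list_ids())
```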
Data Scientist's side¶
The data owner is not the only one that has to connect. Using the same method as in the previous section, here's how the data scientist can set up a connection to the server using their Identity.
First, we'll ask the server for a list of all the data frames, then select the first RemoteDataFrame to be used for the rest of the analysis:
connection = Connection("localhost", identity=data_scientist)
client = connection.client
all_rdfs = client.polars.list_dfs()
rdf = all_rdfs[0]
all_rdfs
[FetchableLazyFrame(identifier=92f6366d-461e-4b55-acf1-e054dfdce06e)]
Note - If you want to know more about those queries, they are described in detail in the Data Scientist's side section of the Quick tour.
Running Queries¶
As an example, let's do a simple operation: read the first five (5) elements of the RemoteDataFrame.
rdf1 = rdf.head(5)
print(rdf1)
rdf2 = rdf1.collect()
print(rdf2)
RemoteLazyFrame
FetchableLazyFrame(identifier=2f9eb018-e429-4220-9a71-38467eafa896)
It works!
Testing a non-authenticated user¶
Now, let's try to connect to the server with an Identity unknown to the server.
connection.close()
connection = Connection("localhost", identity=fake_scientist)
policy = Policy(
safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=True
)
connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
Cell In [8], line 7
      3 connection = Connection("localhost", identity=fake_scientist)
      4 policy = Policy(
      5     safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Log(), savable=True
      6 )
----> 7 connection.client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])

File ~/Documents/bastionlab/env/lib/python3.10/site-packages/bastionlab/client.py:223, in Connection.client(self)
    221     return self._client
    222 else:
--> 223     return self.__enter__()

File ~/Documents/bastionlab/env/lib/python3.10/site-packages/bastionlab/client.py:247, in Connection.__enter__(self)
    244 connection_options = (("grpc.ssl_target_name_override", self.server_name),)
    246 # Verify user by creating session
--> 247 self.token = Connection._verify_user(
    248     server_target, server_creds, connection_options, self.identity
    249 )
    251 auth_plugin = AuthPlugin()
    253 channel_cred = (
    254     server_creds
    255     if self.identity is None
   (...)
    260     )
    261 )

File ~/Documents/bastionlab/env/lib/python3.10/site-packages/bastionlab/client.py:208, in Connection._verify_user(server_target, server_creds, options, signing_key)
    205     signed = signing_key.sign(to_sign)
    206     metadata += ((f"signature-{(pubkey_hex)}-bin", signed),)
--> 208     token = session_stub.CreateSession(CLIENT_INFO, metadata=metadata).token
    210     return token
    211 else:

File ~/Documents/bastionlab/env/lib/python3.10/site-packages/grpc/_channel.py:946, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    937 def __call__(self,
    938              request,
    939              timeout=None,
   (...)
    942              wait_for_ready=None,
    943              compression=None):
    944     state, call, = self._blocking(request, timeout, metadata, credentials,
    945                                   wait_for_ready, compression)
--> 946     return _end_unary_response_blocking(state, call, False, None)

File ~/Documents/bastionlab/env/lib/python3.10/site-packages/grpc/_channel.py:849, in _end_unary_response_blocking(state, call, with_call, deadline)
    847     return state.response
    848 else:
--> 849     raise _InactiveRpcError(state)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.PERMISSION_DENIED
    details = ""79067ba35423057fce2950b2f685733ded4d98b60fdde618d24db8a583522b9a" not authenticated!"
    debug_error_string = "{"created":"@1673019761.451517774","description":"Error received from peer ipv4:127.0.0.1:50056","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":""79067ba35423057fce2950b2f685733ded4d98b60fdde618d24db8a583522b9a" not authenticated!","grpc_status":7}"
>
# connection.close()
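The traceback above is long, but the useful part is the status code and the details string. A small helper can turn such an error into a one-line message. This sketch relies only on the `code()` and `details()` methods that gRPC call errors expose; the helper name `explain_rpc_error` is hypothetical, and the stand-in classes exist only so the example runs without a server.

```python
from enum import Enum


def explain_rpc_error(exc):
    """Turn a gRPC-style error into a short message.

    Relies only on the `code()` and `details()` methods of the error;
    `code().name` is the status name, e.g. PERMISSION_DENIED.
    """
    status = exc.code().name
    if status == "PERMISSION_DENIED":
        return f"Authentication failed: {exc.details()}"
    return f"RPC failed with {status}: {exc.details()}"


# Stand-ins so the sketch runs without a server; real errors come from gRPC.
class FakeStatus(Enum):
    PERMISSION_DENIED = 7


class FakeRpcError(Exception):
    def code(self):
        return FakeStatus.PERMISSION_DENIED

    def details(self):
        return '"<public key hash>" not authenticated!'


message = explain_rpc_error(FakeRpcError())
print(message)
```

In the notebook, the failing call would be wrapped in a `try` block that catches `grpc.RpcError` and passes the exception to the helper.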