For data scientists to use BastionLab across the whole workflow, from data exploration to machine learning model fitting and deep learning training, they need to be able to convert remote data between its different representations.
This tutorial shows how to convert a RemoteDataFrame to a RemoteTensor, going through an intermediate RemoteArray, so that you can use it to train a deep learning model.
Pre-requisites¶
Installation and dataset¶
In order to run this notebook, we need to:
- Have Python 3.7 (or later) and pip installed
- Install BastionLab
- Install PyTorch 1.13.1
- Download the dataset we will be using in this tutorial.
We'll do so by running the code block below.
If you are running this notebook on your machine instead of Google Colab, you can see our Installation page to find the installation method that best suits your needs.
# pip packages
!pip install bastionlab
!pip install bastionlab_server
!pip install torch==1.13.1
# download the dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
Our dataset is based on the Titanic dataset, one of the most popular resources for learning machine learning, which contains information about the passengers aboard the Titanic.
Launch and connect to the server¶
# launch bastionlab_server test package
import bastionlab_server
srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
# connect to the server
from bastionlab import Connection
connection = Connection("localhost")
client = connection.client
Upload the dataframe to the server¶
We'll quickly upload the dataset to the server with an open safety policy, since setting up BastionLab is not the focus of this tutorial. This allows us to demonstrate features without having to approve any data access requests. You can check out how to define a privacy policy here.
We'll also limit the size of the dataset sent to the server with Polars' df.limit() method, so that this tutorial runs faster and uses fewer resources. We are only performing data conversion here, not full data exploration.
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log
df = pl.read_csv("titanic.csv")
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=False)
rdf = client.polars.send_df(df.limit(100), policy=policy)
rdf
FetchableLazyFrame(identifier=3354b658-09a1-48c7-827d-4fccf9797a0b)
Convert RemoteDataFrame to RemoteArray¶
To convert BastionLab's main data exploration object, the RemoteDataFrame, to its main AI training object, the RemoteTensor, we'll need to go through an intermediary step: the RemoteArray.
Since NumPy arrays are commonly used in machine learning training, we decided to make our user interface and experience similar. What we'll show in this tutorial will be as straightforward as fitting a Scikit-learn LinearRegression model on a NumPy array.
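from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(array, target)  # array: NumPy feature matrix, target: vector of values to predict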
Except that, in BastionLab, array will be a RemoteArray, which is a pointer to a RemoteDataFrame. When to_tensor() is called, it converts the RemoteDataFrame to a RemoteTensor.
# Converting a RemoteDataFrame to a RemoteArray
rdf.to_array()
---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
Cell In[20], line 2
      1 # Converting a RemoteDataFrame to a RemoteArray
----> 2 rdf.to_array()

File ~/Projects/bastionlab/client/src/bastionlab/polars/remote_polars.py:1011, in FetchableLazyFrame.to_array(self)
   1004 """
   1005 Converts a FetchableLazyFrame into a RemoteArray
   1006
   1007 Returns:
   1008     RemoteArray
   1009 """
   1010 client = self._meta._polars_client.client
-> 1011 res = client._converter._stub.ConvToArray(
   1012     PbRemoteDataFrame(identifier=self._identifier)
   1013 )
   1014 return RemoteArray(client, res.identifier)

File ~/base/lib/python3.8/site-packages/grpc/_channel.py:946, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    937 def __call__(self,
    938              request,
    939              timeout=None,
   (...)
    942              wait_for_ready=None,
    943              compression=None):
    944     state, call, = self._blocking(request, timeout, metadata, credentials,
    945                                   wait_for_ready, compression)
--> 946     return _end_unary_response_blocking(state, call, False, None)

File ~/base/lib/python3.8/site-packages/grpc/_channel.py:849, in _end_unary_response_blocking(state, call, with_call, deadline)
    847     return state.response
    848 else:
--> 849     raise _InactiveRpcError(state)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.ABORTED
    details = "DataFrame with str columns cannot be converted directly to RemoteArray. Please tokenize strings first"
    debug_error_string = "{"created":"@1675413761.568726972","description":"Error received from peer ipv4:127.0.0.1:50056","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"DataFrame with str columns cannot be converted directly to RemoteArray. Please tokenize strings first","grpc_status":10}"
>
But wait, it didn't work! The server aborted the request with the following error message: DataFrame with str columns cannot be converted directly to RemoteArray. Please tokenize strings first.
This means we need to make sure our RemoteDataFrame only has numerical fields (ints, floats) before we convert it into a RemoteArray. This makes sense because tensors only accept numerical values, and arrays are here to prepare that next conversion step.
# We use Polars' pl.col() to keep only the numerical columns
rdf = rdf.select(pl.col([pl.Float64, pl.Float32, pl.Int64, pl.Int32])).collect()
rdf
FetchableLazyFrame(identifier=14c440b4-705f-4660-978d-c44effbf3731)
Let's try to convert our RemoteDataFrame once more to a RemoteArray.
# Converting RemoteDataFrame to RemoteArray
rdf.to_array()
---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
Cell In[13], line 2
      1 # Converting RemoteDataFrame to RemoteArray
----> 2 rdf.to_array()

File ~/Projects/bastionlab/client/src/bastionlab/polars/remote_polars.py:1011, in FetchableLazyFrame.to_array(self)
   1004 """
   1005 Converts a FetchableLazyFrame into a RemoteArray
   1006
   1007 Returns:
   1008     RemoteArray
   1009 """
   1010 client = self._meta._polars_client.client
-> 1011 res = client._converter._stub.ConvToArray(
   1012     PbRemoteDataFrame(identifier=self._identifier)
   1013 )
   1014 return RemoteArray(client, res.identifier)

File ~/base/lib/python3.8/site-packages/grpc/_channel.py:946, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    937 def __call__(self,
    938              request,
    939              timeout=None,
   (...)
    942              wait_for_ready=None,
    943              compression=None):
    944     state, call, = self._blocking(request, timeout, metadata, credentials,
    945                                   wait_for_ready, compression)
--> 946     return _end_unary_response_blocking(state, call, False, None)

File ~/base/lib/python3.8/site-packages/grpc/_channel.py:849, in _end_unary_response_blocking(state, call, with_call, deadline)
    847     return state.response
    848 else:
--> 849     raise _InactiveRpcError(state)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.ABORTED
    details = "DataTypes for all columns should be the same"
    debug_error_string = "{"created":"@1675413901.986313646","description":"Error received from peer ipv4:127.0.0.1:50056","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"DataTypes for all columns should be the same","grpc_status":10}"
>
Again, our to_array() method raises an error! This time the message is: DataTypes for all columns should be the same.
This means we need to cast all our columns first before converting them into an array. Here, we'll choose Float64 to capture all numerical values.
It is very important that we cast all our columns into a single datatype to make our RemoteArray compatible with other libraries and machine learning applications, since an array is supposed to be a collection of objects of the same type.
# Converting all values of the RemoteDataFrame to Float64
rdf = rdf.select(pl.all().cast(pl.Float64)).collect()
rdf
FetchableLazyFrame(identifier=be18403f-7bec-4a73-ba12-9ece528497ec)
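A note on choosing Float64 as the common type: Python's built-in float is also a 64-bit IEEE 754 double, so plain Python is enough to sketch what such a cast preserves. The snippet below is only an illustration and does not use BastionLab; the helper name survives_float64 is made up for this example. It shows that integers are represented exactly up to 2**53 in magnitude, which comfortably covers small integer columns such as passenger IDs or counts, while larger Int64 values could be rounded.

```python
# Illustrative only: plain Python floats are 64-bit IEEE 754 doubles,
# the same representation as a Float64 column.

LIMIT = 2**53  # every integer up to this magnitude is exactly representable


def survives_float64(value: int) -> bool:
    """Return True if an integer round-trips through a 64-bit float unchanged."""
    return int(float(value)) == value


print(survives_float64(891))        # a small ID-like value: True
print(survives_float64(LIMIT))      # still exact at the boundary: True
print(survives_float64(LIMIT + 1))  # rounded to a neighbouring value: False
```

For a dataset like this one, where the integer columns hold small values, the cast to Float64 should therefore be lossless.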
We'll try again to convert RemoteDataFrame into RemoteArray.
# Converting RemoteDataFrame to RemoteArray
rdf.to_array()
RemoteArray(identifier=923b1f40-7e0f-4857-ba2d-aab315ef3e89)
It's a success!
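The two errors we ran into suggest two preconditions for to_array(): every column must be numerical, and all columns must share the same dtype. The sketch below walks a schema through the same three steps locally. It is a hypothetical pre-check written for this tutorial, not part of the BastionLab API: the helper conversion_blocker, the dtype names written as plain strings, and the assumed schema of titanic.csv (as Polars would typically infer it) are all assumptions.

```python
# Hypothetical local pre-check; not part of the BastionLab API.
# Dtype names are plain strings standing in for Polars dtypes.

NUMERIC = {"Int32", "Int64", "Float32", "Float64"}

# Assumed schema of titanic.csv, as Polars would typically infer it.
TITANIC_SCHEMA = {
    "PassengerId": "Int64", "Survived": "Int64", "Pclass": "Int64",
    "Name": "Utf8", "Sex": "Utf8", "Age": "Float64", "SibSp": "Int64",
    "Parch": "Int64", "Ticket": "Utf8", "Fare": "Float64",
    "Cabin": "Utf8", "Embarked": "Utf8",
}


def conversion_blocker(schema):
    """Return a reason the schema cannot become an array, or None if it can."""
    if any(dtype not in NUMERIC for dtype in schema.values()):
        return "non-numerical columns present"
    if len(set(schema.values())) > 1:
        return "columns have different dtypes"
    return None


# Step 1: the raw schema is blocked by its string columns.
print(conversion_blocker(TITANIC_SCHEMA))

# Step 2: keeping only numerical columns still leaves mixed Int64/Float64.
numeric_only = {c: d for c, d in TITANIC_SCHEMA.items() if d in NUMERIC}
print(conversion_blocker(numeric_only))

# Step 3: casting everything to Float64 removes the last blocker.
all_float = {c: "Float64" for c in numeric_only}
print(conversion_blocker(all_float), len(all_float))
```

Under these assumptions, seven columns remain after the selection step, which is consistent with the second dimension of the tensor shape printed in the next section.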
Convert RemoteArray to RemoteTensor¶
Now that we have converted our RemoteDataFrame to a RemoteArray, we'll convert the RemoteArray to a RemoteTensor to be able to train our model. This shouldn't run into problems, since the RemoteArray step has already taken care of any potential conversion issues.
# Convert the RemoteDataFrame into a RemoteTensor
# (going through the intermediate RemoteArray step)
remote_tensor = rdf.to_array().to_tensor()
Once the RemoteTensor has been created, we can go ahead and print its available properties, which are dtype and shape.
print(remote_tensor)
RemoteTensor(identifier=8e90ece5-11b1-4c88-b877-0e9c9b4152b8, dtype=torch.float64, shape=torch.Size([100, 7]))
We chose to only show you those two properties (the type of the tensor and its shape) to protect the privacy of the data - but still give you the vital information you need to train your model.
You can refer to our Covid 19 deep learning how-to-guide to see how we use RemoteTensors in training a PyTorch Linear Regression model.
Updating the dtype of RemoteTensor¶
We'll use to(), just like with a regular torch tensor, to change the dtype of the tensor.
This is the only method you can use on a RemoteTensor, because we need to limit access to guarantee the privacy of the stored data.
import torch
# Using the to() method to update the dtype of the RemoteTensor
remote_tensor.to(torch.int64)
RemoteTensor(identifier=8e90ece5-11b1-4c88-b877-0e9c9b4152b8, dtype=torch.int64, shape=torch.Size([100, 7]))
The dtype for the RemoteTensor has been updated to int64!
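Keep in mind that converting a floating-point tensor to an integer dtype discards the fractional part. In regular PyTorch, a float-to-integer cast truncates toward zero; we assume here that a RemoteTensor behaves the same way, which this tutorial does not confirm. Python's int() truncates in the same manner, so the effect can be sketched without a server, using a few fare-like sample values:

```python
# Illustrative only: sample values, with int() standing in for a float -> int64 cast.
fares = [7.25, 71.2833, 8.05]

truncated = [int(fare) for fare in fares]  # the fractional part is dropped
print(truncated)  # [7, 71, 8]
```

If the fractional part matters for your model, it is probably better to keep a floating-point dtype such as torch.float32 or torch.float64.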
You now know how to convert a RemoteDataFrame to a RemoteTensor.
All that's left to do now is to close your connection to the server and stop the server:
connection.close()
bastionlab_server.stop(srv)