Data visualization methods such as graphs and charts can be a great tool to help make data more accessible and tell the stories hidden in numbers.
Our implementations of key visualization methods focus on privacy. We extract data only when it's truly necessary for the visualization function, and we perform remote aggregation before extracting data where possible. This works in addition to all the other safety benefits and features, such as data access policies.
In this tutorial, we'll introduce the data visualization functions available in BastionLab and see how to use them with our
Installation and dataset
In order to run this notebook, we need to:
- Have Python 3.7 (or greater) and pip installed
- Install BastionLab
- Download the dataset we will be using in this tutorial.
We'll do so by running the code block below.
# pip packages
!pip install bastionlab
!pip install bastionlab_server

# download the dataset
!wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
Our dataset is based on the Titanic dataset, one of the most popular datasets used for understanding machine learning, which contains information about the passengers aboard the Titanic.
# launch bastionlab_server test package
import bastionlab_server

srv = bastionlab_server.start()
Note that the bastionlab_server package we install here was created for testing purposes. You can also install BastionLab server using our Docker image or from source (especially for non-test purposes). Check out our Installation Tutorial for more details.
# connect to the server
from bastionlab import Connection

connection = Connection("localhost")
client = connection.client
import polars as pl
from bastionlab.polars.policy import Policy, TrueRule, Log

df = pl.read_csv("titanic.csv")
policy = Policy(safe_zone=TrueRule(), unsafe_handling=Log(), savable=True)
rdf = client.polars.send_df(df, policy=policy)

rdf
This policy is not suitable for production. Please note that we only use it for demonstration purposes, to avoid having to approve any data access requests in the tutorial.
Since we are using the classic Titanic dataset, let's list the columns to verify we got the right dataset and give you an idea of the data we will be handling in this tutorial.
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Now, let's take a look at the first of our data visualization options, `histplot`.
The `histplot()` function accepts `x` and `y` arguments, which are strings referring to a column name, plus a `bins` integer value, which is set to 10 by default. You must provide either an `x` or a `y` argument; the rest is optional. We also accept any `kwargs` arguments accepted by Seaborn's `barplot` function, when you supply only an `x` or a `y` argument, or Seaborn's `heatmap` function, when you supply both `x` and `y` arguments. (This is possible because we call these functions internally once we have run the relevant aggregated query and applied bins to our dataframe.)
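To make the "aggregate first, then plot" idea concrete, here is a minimal local sketch in plain Python. It is not BastionLab's implementation: the helper `binned_counts` and its `bin_width` parameter are hypothetical names, and treating bins as fixed-width intervals is an assumption made only for this illustration. The point is that only per-bin counts, not individual ages, are needed to draw the chart.

```python
from collections import Counter


def binned_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Hypothetical helper: bins are keyed by their lower bound.
    Missing values (None) are skipped.
    """
    counts = Counter()
    for value in values:
        if value is None:
            continue
        lower_bound = (value // bin_width) * bin_width
        counts[lower_bound] += 1
    return dict(sorted(counts.items()))


# Made-up sample ages, including one missing value
ages = [22, 38, 26, 35, None, 54, 2, 27, 14, 4]
age_counts = binned_counts(ages, bin_width=15)
print(age_counts)
```

A plotting library would then receive only `age_counts`, which is far less revealing than the raw column.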
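Let's take a look at an example. Here, we plot the number of passengers in each age category, with bins of 15, against the `Survived` column. Since both `x` and `y` are supplied, the result is drawn with Seaborn's `heatmap` function.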
rdf.histplot(x="Age", y="Survived", bins=15)
The `barplot` function filters data down to necessary columns only, runs an aggregated query to get the mean (or other estimator function) `y` value for each `x` value, and then calls Seaborn's `barplot` function to draw the bar chart, forwarding on just this aggregated and filtered-down dataset.
It accepts `x` and `y` values, which are strings referring to a column name. It also accepts an optional `estimator` string argument, where you can change the default estimator (`"mean"`) to another supported estimator, such as `"sum"`. There is also an optional `hue` argument where you can specify the name of a column to group results by. We also accept any `kwargs` arguments accepted by Seaborn's `barplot` function.
Accepted options are listed in Seaborn's barplot documentation.
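The aggregation step described above can be sketched locally in plain Python. This is an illustration under assumptions, not BastionLab's code: `aggregate` and `ESTIMATORS` are hypothetical names, and only the two estimators named in this tutorial (`"mean"` and `"sum"`) are included.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lookup table; only estimators named in the text are listed
ESTIMATORS = {"mean": mean, "sum": sum}


def aggregate(rows, x, y, estimator="mean"):
    """Group rows by column x and reduce column y with the chosen estimator."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[x]].append(row[y])
    reduce_fn = ESTIMATORS[estimator]
    return {key: reduce_fn(values) for key, values in sorted(groups.items())}


# Made-up sample rows
rows = [
    {"Pclass": 1, "Fare": 80.0},
    {"Pclass": 1, "Fare": 60.0},
    {"Pclass": 2, "Fare": 13.0},
    {"Pclass": 3, "Fare": 8.0},
    {"Pclass": 3, "Fare": 7.0},
]
mean_fares = aggregate(rows, x="Pclass", y="Fare")
total_fares = aggregate(rows, x="Pclass", y="Fare", estimator="sum")
print(mean_fares)
print(total_fares)
```

Only one number per group is handed to the plotting step, which is what makes the bar chart compatible with aggregation-based policies.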
Here, we create a bar chart showing the mean fare paid per passenger class, grouped by sex:
rdf.barplot(x="Pclass", y="Fare", hue="Sex")
The `scatterplot` function displays a scatter diagram based on `x` and `y` arguments, which can be used to look for correlations between the `x` and `y` columns.
It accepts `x` and `y` string arguments referring to the names of the columns to be used for the x and y axes in the scatterplot.
It first narrows down the `RemoteLazyFrame` to the columns needed for the function call, before calling Seaborn's `scatterplot` function to plot your scatter graph.
This function also accepts the same optional arguments as Seaborn's `scatterplot` function as `kwargs`.
You can find those in Seaborn's scatterplot documentation.
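The "narrow down to necessary columns" step is a simple projection. The sketch below shows the idea on plain Python dictionaries; `select_columns` is a hypothetical helper and the rows are made up, so treat it as an illustration rather than how `RemoteLazyFrame` does it.

```python
def select_columns(rows, columns):
    """Keep only the requested columns of each row (a projection)."""
    return [{name: row[name] for name in columns} for row in rows]


# Made-up rows with more columns than the plot needs
passengers = [
    {"Name": "Passenger A", "Age": 22, "Fare": 7.25, "Sex": "male", "Cabin": None},
    {"Name": "Passenger B", "Age": 38, "Fare": 71.28, "Sex": "female", "Cabin": "C85"},
]

# A scatterplot of Age against Fare, colored by Sex, needs only three columns
plot_data = select_columns(passengers, ["Age", "Fare", "Sex"])
print(plot_data)
```

Columns such as `Name` and `Cabin` never reach the plotting call, which limits what is extracted to what the chart actually uses.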
rdf.scatterplot("Age", "Fare", hue="Sex")
The `lineplot` function filters our dataframe down to necessary columns only and then draws a line graph using Seaborn's `lineplot` function.
It accepts `x` and `y` string arguments that refer to the names of the columns to be used for the x and y axes.
It also accepts `hue`, `size` and `style` string arguments. These are the names of the columns to be used as grouping variables, which will produce lines with different colors, widths, and dashes and/or markers respectively.
Additionally, `lineplot` accepts a `units` argument, which is the name of a column to be used as a grouping variable identifying sampling units. Note that you must also set the `estimator` keyword to `None` if you wish to use the `units` argument.
Finally, the function also accepts the same optional arguments as Seaborn's `lineplot` function as `kwargs`.
You can find those in Seaborn's lineplot documentation.
Let's have a look at an example.
rdf.lineplot(x="Age", y="Fare", hue="Sex")
`boxplot()` plots a boxplot by running aggregated queries to retrieve the necessary data and then working with Matplotlib's `bxp` function to plot the data.
Boxplots tell us several things about data:
- The `min` value represented by the bottom-most `whisker` line
- The `max` value represented by the upper-most `whisker` line
- The `25th percentile` represented by the box's bottom line
- The `75th percentile` represented by the box's top line
- The `median` represented by an additional line running through the box.
Some boxplots also show additional outliers, but we have not included them: outliers are not private by definition and so are likely to breach any aggregation-based security policy.
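The five values listed above can be computed without keeping any individual data point. Here is a minimal sketch using Python's standard library; `box_stats` is a hypothetical helper, the fares are made up, and the "inclusive" quantile method is an assumption (the method BastionLab uses is not stated here). The dictionary keys follow the names Matplotlib's `Axes.bxp()` expects for precomputed statistics.

```python
from statistics import quantiles


def box_stats(values):
    """Return the five summary statistics a boxplot without outliers needs."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "whislo": data[0],   # min: bottom-most whisker
        "q1": q1,            # 25th percentile: bottom of the box
        "med": med,          # median: line through the box
        "q3": q3,            # 75th percentile: top of the box
        "whishi": data[-1],  # max: upper-most whisker
    }


# Made-up sample fares
fares = [7.25, 8.05, 13.0, 26.0, 30.0, 53.1, 71.28, 80.0, 512.33]
stats = box_stats(fares)
print(stats)
```

Because no `fliers` entry is produced, no individual outlier value is exposed by the summary.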
`boxplot()` has the following optional arguments:
- `x`: name of the column to be used for the `x` axis. (There must be *at least* an `x` or a `y` argument supplied.)
- `y`: name of the column to be used for the `y` axis. (There must be *at least* an `x` or a `y` argument supplied.)
- `colors`: the color(s), or the name of a builtin BastionLab color palette, to be used for boxes, provided as a string or list of strings.
- `vertical`: boolean option for vertical (`True`) or horizontal (`False`) orientation
- `ax`: matplotlib.axes to plot on. A new axes is created if none is given.
- `widths`: boxes' widths
- `median_linestyle`: linestyle for median line
- `median_color`: color for median line
- `median_linewidth`: line width for median line
`boxplot` also forwards any additional `kwargs` arguments to Matplotlib's `Axes.bxp()` function, which is called internally to plot our boxes.
You can find those in Matplotlib's Axes.bxp() documentation.
Let's take a look at an example using a single axis, fare. This will show us the `min`, `max`, `25th percentile`, `75th percentile` and `median` fare paid on the Titanic.
Ah, we see the plot is stretched out quite a lot by the large difference between the majority of values and the largest values.
We can counter this by clipping our `Fare` column to contain values only within a certain range.
Let's try that out now: we clip the data down to a range of 0-100 and then create a new single-axes boxplot of fares paid on the Titanic.
# clip data to include fare values from 0-100 to avoid stretched out plot
clipped = rdf.select(pl.col("Fare").clip(min_val=0, max_val=100))

# boxplot our clipped version of data
clipped.boxplot(y="Fare")
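As a reminder of what clipping does to the values, here is a small plain-Python sketch; the `clip` helper and the sample fares are hypothetical and only illustrate the behavior. Values outside the range are not dropped: they are replaced by the nearest bound, so the row count stays the same while the maximum shrinks.

```python
def clip(values, min_val, max_val):
    """Replace values below min_val or above max_val with the nearest bound."""
    return [min(max(value, min_val), max_val) for value in values]


# Made-up sample fares, two of them far above the rest
fares = [7.25, 26.0, 71.28, 263.0, 512.33]
clipped_fares = clip(fares, 0, 100)
print(clipped_fares)
print(max(fares), "->", max(clipped_fares))
```

This is why the upper whisker of the clipped boxplot ends at 100 rather than at the true maximum.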
Next, let's take a look at an example using both an `x` and a `y` argument.
Here, the boxplot shows the `min`, `max`, `25th percentile`, `75th percentile` and `median` fare paid on the Titanic for each `Pclass`.
# Collect fare and Pclass data and clip data down to fares under 100
clipped = rdf.select([pl.col("Fare").clip(min_val=0, max_val=100), pl.col("Pclass")])

# create boxplot
clipped.boxplot(y="Fare", x="Pclass")
Finally, let's do the same plot again but with some of the additional options we have.
We set `vertical` to `False`, to view the data horizontally.
We set `colors` to BastionLab's builtin `"light"` palette. BastionLab has four builtin palettes. You can also apply your own colors by sending `boxplot` a list of strings containing different hex codes!
Finally, we set `widths` to 0.2 to make our boxes narrower.
# create same boxplot as previously but with different aesthetic options
clipped.boxplot(y="Fare", x="Pclass", vertical=False, colors="light", widths=0.2)
The `pieplot` function draws a pie chart where segment values are stored in one column and labels are provided. We calculate each individual cell in the column as a percentage of the sum of values in that column before calling Matplotlib's pie function.
This is particularly useful after running aggregated queries, which will become clear in the following example, but first, let's take a look at its arguments:
- A mandatory `parts` string argument, which is the name of the column containing the values for each segment in the pie chart.
- A `labels` argument, which is either the name of the column containing label values, or a List[str] of the labels. In both cases, the order of the labels should follow the same order as the values in the `parts` column.
- An `ax` argument, which allows you to send your own matplotlib axis if required. Note that if you do this, the following `fig_kwargs` arguments will not be used.
- A `fig_kwargs` dictionary argument, which is where you can add any `kwargs` you wish to be forwarded on to `matplotlib.pyplot.subplots()` when creating the figure that this pie chart will be displayed on.
- A `pie_labels` boolean value, which you can set to `False` if you do not wish to label the segments of your pie chart.
- A `key` boolean value, where you can specify if you want a color map key placed to the side of your pie chart.
- `key_bbox` and related key options, where you can specify the location, title and bbox options for your color map key. These are forwarded on to matplotlib's legend function.
Now, let's take a look at an example of where we might use `pieplot`. We will run an aggregated query to get the number of deceased per passenger class on the Titanic.
We first filter the dataset to those who did not survive the Titanic, then we select all necessary columns and group data by `Pclass`, before we get a count of values for each class and sort the output by `Pclass`.
We can then call our `pieplot` function on this dataset, specifying the `"Survived"` column as our `parts` argument, a title for our pie chart and the `"Pclass"` column to be used for labelling, to get our pie chart.
rdf_ex = (
    rdf.filter(pl.col("Survived") == 0)
    .select([pl.col("Survived"), pl.col("Pclass")])
    .groupby(pl.col("Pclass"))
    .agg(pl.col("Survived").count())
    .sort(pl.col("Pclass"))
)
print(rdf_ex.collect().fetch())
rdf_ex.pieplot(parts="Survived", title="Titanic: Deceased by Class", labels="Pclass")
shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ u32      │
╞════════╪══════════╡
│ 1      ┆ 80       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 97       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 372      │
└────────┴──────────┘
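The percentage step that `pieplot` is described as performing can be reproduced by hand on the counts printed above. This is a plain-Python sketch of the arithmetic only; `as_percentages` is a hypothetical helper, not part of BastionLab.

```python
def as_percentages(parts):
    """Express each value as a percentage of the sum of all values."""
    total = sum(parts)
    return [100 * part / total for part in parts]


# Deceased passengers per class, taken from the query output above
deceased = {1: 80, 2: 97, 3: 372}

shares = as_percentages(list(deceased.values()))
rounded = [round(share, 1) for share in shares]
print(dict(zip(deceased, rounded)))
```

Each pie segment is sized from one of these shares, so third class accounts for roughly two thirds of the chart.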
Facet grid plots
The `facet` function lets you create a grid of plots. It accepts a `col` and/or a `row` argument. You can then call plotting functions such as `histplot` or `curveplot` on it to decide how you want to plot your data in the columns and rows of the grid.
For example, if you have a facet with a `col` value of `"Pclass"` and you call `my_facet.histplot(x="Age")`, you will see three histplots: one showing the age of passengers in class 1, one for passengers in class 2 and the final one for class 3.
Before we continue any further, let's see the code for this example:
my_facet = rdf.facet(col="Pclass")
my_facet.histplot(x="Age")
Now that we have seen an example with a column, let's add a row! We will also specify `figsize`, the size of the figure we want for our grid.
new_facet = rdf.facet(col="Pclass", row="Survived", figsize=(15, 10))
new_facet.histplot(x="Age", bins=15)
The grid now splits results into all the possible combinations of the column and row values.
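To see why a `col` of `Pclass` and a `row` of `Survived` gives a grid of six panels, the sketch below builds the panel layout in plain Python. The `facet_panels` helper and the sample rows are hypothetical; this illustrates how the combinations arise and is not BastionLab's implementation.

```python
from itertools import product


def facet_panels(rows, col, row):
    """Split rows into one panel per (row value, col value) combination."""
    col_values = sorted({r[col] for r in rows})
    row_values = sorted({r[row] for r in rows})
    # Create every combination up front so empty panels are kept too
    panels = {key: [] for key in product(row_values, col_values)}
    for r in rows:
        panels[(r[row], r[col])].append(r)
    return panels


# Made-up sample rows
passengers = [
    {"Pclass": 1, "Survived": 1, "Age": 38},
    {"Pclass": 3, "Survived": 0, "Age": 22},
    {"Pclass": 3, "Survived": 1, "Age": 26},
    {"Pclass": 2, "Survived": 0, "Age": 54},
    {"Pclass": 1, "Survived": 0, "Age": 40},
]
panels = facet_panels(passengers, col="Pclass", row="Survived")
print(len(panels))
for key, members in panels.items():
    print(key, [m["Age"] for m in members])
```

Three classes times two survival outcomes gives six panels, and each plotting call is then applied to one panel's subset of the data.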
As previously mentioned, this feature works with all the visualization functions except for the pieplot function.
Important note: the `units` keyword cannot be used for `lineplot` when plotting on a facet grid.
Here's a facet grid with `scatterplot()`, for example:
new_facet = rdf.facet(col="Pclass", row="Survived")
new_facet.scatterplot(x="Age", y="Fare")