# Altair-viz 3: Data transforms with altair: why, what and how

Since Altair is a data visualization framework, talking about data transforms may seem odd, especially if you are used
to *pandas*. You are probably saying to yourself: “But why the heck should I care about
data transformations, 🐼 is all about that! 🧐”

# Introduction

And yes, I agree. But before trying to change your mind about it, let’s remember together why we need data
transformations in our *life*:

- Filtering rows
- Creating new features of any kind (from simple type conversions to advanced transformations)
- Reshaping data (possibly with aggregations)
- Combining data

*Pandas* is a powerful library thanks to an extensive API that permits basically everything. That capacity, however,
comes at the cost of complexity. Not that it is hard to aggregate data or to add a new column to a dataframe, but the
readability of pandas code degrades as transformations pile up, especially in *notebooks*:

- DataFrames are regularly copied, and variable names become longer and longer
- Adding columns, filtering and grouping data become less and less clear, even when the transformations themselves stay simple

When performing data visualization, some of those transformations belong to only one chart.

Altair acknowledges this: some data transformations should be bound to a single chart, and should be easy to write and to understand.

Altair transforms exist precisely to apply this philosophy.

In this article, we will cover most transforms associated with the four needs described above, with code examples and visual results.

# Filtering

Associated transforms:

- transform_filter
- transform_sample

Let’s assume the following codebase:

```
import altair as alt
import pandas as pd
import numpy as np

theta = np.linspace(-4 * np.pi, 4 * np.pi, 100)
y = np.cos(theta) + np.random.randn(100) * 0.1
data = pd.DataFrame({
    "theta": theta,
    "y": y
})
chart = alt.Chart(data).encode(
    x="theta:Q",
    y="y:Q"
).mark_point()
```

That results in:

### transform_filter

Imagine that I want to restrict the study to the interval [0, π].

One solution is to create a temporary dataframe (the 🐼 classic):

```
data_filtered_on_theta = data[(data.theta <= np.pi) & (data.theta >= 0)]
```

Otherwise, you just apply the `transform_filter` method on the chart, specifying how you want to filter the data:

```
chart = chart.transform_filter(
    (alt.datum.theta >= 0) &
    (alt.datum.theta <= np.pi)
)
```

By doing that, you realise that you no longer need to reference the name of the dataframe or to create a new one;
instead, you reference a `datum` object. What is it?

Well, according to the Vega documentation, datum is:

> The current input data object, available within data transform and event handler expressions.

It also tells us that:

> To lookup object properties, use normal JavaScript syntax: `datum.value` or `datum['My Value']`.

### transform_sample

I kinda dislike this transform because all it does is pick a uniformly sampled subset of the data of the desired size. The only thing to remember is that

```
chart.transform_sample(100)
```

takes 100 rows in the dataset to display.

# Creating new features

## Adding columns

Associated transforms:

- transform_calculate
- transform_joinaggregate
- transform_window
- transform_stack
- transform_impute
- transform_regression
- transform_loess

### transform_calculate

If, by any chance, you were about to add a new column containing theta**2 + y + 1, the ol’ fashioned way would look like this:

```
data_filtered_with_column = data_filtered_on_theta.copy()
data_filtered_with_column["feature"] = data_filtered_with_column.apply(
    lambda r: r.theta ** 2 + r.y + 1, axis=1
)
```

Then, using Altair’s `transform_calculate` and Vega expressions, you can do it in a powerful one-liner:

```
chart = chart.transform_calculate(
    feature="pow(datum.theta, 2) + datum.y + 1"  # any name can be given to the feature
)
```

You’ll be able to use the `feature` column in any of the encodings (`y` here).

I recently discovered that you can also use the datum object directly to add the column, in a more Pythonic way. However, it does not work for all expressions (conditional expressions, for example, are not supported):

```
chart = chart.transform_calculate(feature=alt.datum.theta ** 2 + alt.datum.y + 1)
```

### transform_joinaggregate

Now, imagine that you need to compute an aggregate and add it as a new column to your dataframe. First, let’s make up two categories: A (theta < π/2) and B (theta ≥ π/2).

In this example, we assume that you want to compute a category-wise aggregate: the sum of *y* values for each value of
the column *category*. How do you do it in pandas?

Well, probably this way:

```
# Create the column category
data_filtered_with_column_and_category = data_filtered_with_column.copy()
data_filtered_with_column_and_category["category"] = data_filtered_with_column_and_category.theta.apply(
    lambda value: 'A' if value < np.pi / 2 else 'B'
)
# Create the aggregate
data_filtered_with_column_and_category["aggregate"] = data_filtered_with_column_and_category.groupby(
    "category"
).y.transform("sum")
```

Tedious, isn’t it? If you would rather do it with Altair, here is how:

```
chart.transform_calculate(
    category="datum.theta < PI/2 ? 'A' : 'B'"
).transform_joinaggregate(
    groupby=["category"],
    aggregate="sum(y)"
)
```

By setting the `size` encoding to the newly created `aggregate` column, one can obtain the following graph:

The fact that you can **chain transforms** makes this mechanism really powerful. In a single chained expression you can declare, in
a very clear way, which columns you want to add.

The other awesome thing is that you can create multiple columns at once in each transform:

```
chart.transform_calculate(
    category="datum.theta < PI/2 ? 'A' : 'B'",
    feature="pow(datum.theta, 2) + datum.y + 1",
    # ...
).transform_joinaggregate(
    groupby=["category"],
    aggregate="sum(y)",
    aggregate_stdev="stdev(y)",
    aggregate_feature="sum(feature)"
    # ...
)
```

I’m sure you like it, your eyes don’t lie 👀

**⚠️ Remark**: columns are created in the order of the transforms. You can therefore reference a column in a transform
if it was created in a previous transform.

### transform_window

Let’s go quickly over this one and the remaining transforms; we have already discussed the philosophy of pandas vs. Altair data transforms at length.

Let’s assume that you want to compute a cumulative sum of *y* values for each *category*, along the *theta* axis:

```
chart.transform_window(
    window_feature="sum(y)",     # Aggregate to compute
    groupby=["category"],        # Equivalent of a PARTITION BY clause
    sort=[{"field": "theta"}],   # Equivalent of an ORDER BY clause
    frame=[None, 0],             # Equivalent of UNBOUNDED PRECEDING until CURRENT ROW
)
```

Whoo! It looks like SQL, which is really convenient for creating windowed columns. If you need, say, the 5 rows
preceding the current one (current row excluded), you just need to play with the *frame* argument: `frame=[-5, -1]`

### transform_stack

Stacking is used primarily for bar charts, pie charts and area plots. Instead of superposing graph objects (bars) so that they overlap, each graph object is given an offset so that the displayed quantity is entirely visible.

Assume having data like this:

```
import pandas as pd

data = pd.DataFrame({
    "category": ["M", "F", "M", "F", "M", "F"],
    "day": ["Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"],
    "amount": [10, 12, 4, 10, 8, 9]
})
```

By default, bars (and pies) are stacked, but not if we specify `stack=False` on the axis where *values* are projected:

```
alt.Chart(data).mark_bar().encode(
    x="day:N",
    y="amount:Q",
    color="category:N"
)
```

```
alt.Chart(data).mark_bar().encode(
    x="day:N",
    y=alt.Y("amount:Q", stack=False),
    color="category:N"
)
```

This is generally enough to decide whether you want to stack on the Y axis or not. But if you need to use the edge values of each bar, transform_stack is your friend; to reproduce the same stacked chart as above:

```
alt.Chart(data).mark_bar().encode(
    x="day:N",
    y="amount_low:Q",
    y2="amount_high:Q",
    color="category:N"
).transform_stack(
    stack="amount",
    groupby=["day"],
    as_=["amount_low", "amount_high"]
)
```

### transform_impute

Very quick here: it’s Altair’s *fillna*.

I never used it, but the cool thing is that you can fill values using moving averages:

```
chart.transform_impute(
    impute="amount",        # Change null values of "amount"
    method="mean",          # By computing the mean of its non-null values
    key="day",              # Over the "day" axis
    frame=[None, 0],        # Taking rows from the beginning until the current row
    groupby=["category"],   # Rows must have the same "category" as the current row
)
```

For simple use cases I would definitely go for pandas’ *fillna*, but for advanced cases I would hesitate.
Nonetheless, if I need new features that I want to visualize quickly, I’ll give *transform_impute* a shot!

I advise you to check the documentation if you want to know more about it.

### transform_regression, transform_loess

Both of these transforms compute the regression of a given variable as a function of another.

The former has six methods (`linear`, `log`, `exp`, `pow`, `quad`, `poly`) and seems to permit regression on a
**per-group** basis.

The latter is a *Locally Estimated Scatterplot Smoothing* technique; which is basically a regression technique that
involves multiple regression models, each being trained on segments of nearby data points instead of using the whole
dataset.

In the context of data visualization, this technique is useful when you have a lot of data points to summarize without spending time finding the right regression method, or when computing the average of the data points (y) for each value of the explanatory variable (x) is not sufficient to represent your dataset well.

## Aggregated features

Associated transforms:
- transform_density
- transform_bin
- transform_timeunit
- transform_aggregate
- transform_quantile

### transform_density

In *seaborn*, the *kdeplot* function displays the density curve of a given variable by performing a
*Kernel Density Estimation* (KDE). It is easy to set up:

```
kdeplot(data, x=<x_column>)
```

Using Altair, here is how we do it:

```
import pandas as pd
import altair as alt
import numpy as np

data = pd.DataFrame({"y": np.random.randn(300)})
alt.Chart(data).transform_density(
    "y",
    as_=["values_of_y", "density"]
).mark_area(color="lightblue").encode(
    x="values_of_y:Q",
    y="density:Q"
)
```

You should remember that, in the mind of Altair, a distribution plot is a plot like any other, but with KDE features
computed using *transform_density*. Therefore, one can display the results with an area using `mark_area` (as shown
here), but also using only lines (`mark_line`) or whatever mark you want.

### transform_bin

This transform is similar to transform_stack: it bins a column following a certain specification, producing two columns corresponding to the bin start and end.

You usually don’t need it, unless you want to make it explicit that you are using a binned column throughout your chart.

Binning forces you to use aggregations, as each bin of a given column corresponds to a bucket of values.

You can compute the histogram of `y` this way (using the data defined in the transform_density section):

```
alt.Chart(data).transform_bin(
    as_="bin",
    bin=True,  # bin parameters, or True
    field="y"
).encode(
    # Ordinal type is better when using only one bin column
    x="bin:O",
    y="count(bin):Q"
).mark_bar()
```

```
alt.Chart(data).transform_bin(
    as_=["bin_start", "bin_end"],
    bin=True,  # bin parameters, or True
    field="y"
).encode(
    x="bin_start:Q",
    x2="bin_end:Q",
    y="count(bin):Q"
).mark_bar()
```

### transform_timeunit

I’m gonna be honest with you folks, I never used this transform.

But it looks like the same thing as transform_bin, except that it is dedicated to temporal data.
The difference is how you control bin sizes: it discretizes data into *known* time units, from
milliseconds to years.

Similarly to transform_bin, you can obtain the start and end columns of each time interval.

More info can be found in the Altair and Vega-Lite documentation.

### transform_aggregate

The *groupby* of Altair is *transform_aggregate*. I find it very readable, and it covers the basic needs you have most
of the time.

To compute the mean, the sum and the standard deviation of one column grouped by another, one would write:

```
alt.Chart(
    <data>
).transform_aggregate(
    groupby=[<column_groupby>],
    mean_col="mean(<column>)",
    sum_col="sum(<column>)",
    stdev_col="stdev(<column>)"
)
```

Like in transform_calculate, you can compute several aggregates within one transform_aggregate call. It’s crystal clear and quick to setup.

### transform_quantile

I have never used this transform either, but it basically computes two columns corresponding to (quantile probability, value) pairs.

transform_quantile therefore computes a normalized cumulative histogram.

In other words, the first column corresponds to the cumulative count of values of a given field (relative to the total count), while the second column corresponds to binned values of the same field.

Let’s check the difference with:

#### transform_quantile

```
import pandas as pd
import altair as alt
import numpy as np

df = pd.DataFrame({'x': np.random.randn(200)})
alt.Chart(df).transform_quantile(
    'x',
    step=0.01
).mark_point().encode(
    x='prob:Q',
    y='value:Q'
)
```

#### composition of transforms

```
alt.Chart(df).transform_bin(
    "value", "x", bin=alt.BinParams(step=0.01)
).transform_aggregate(
    groupby=["value"], count_x="count(x)"
).transform_joinaggregate(
    count_x_total="sum(count_x)"
).transform_window(
    cumsum_x="sum(count_x)",
    frame=[None, 0],
    sort=[{"field": "value"}]
).transform_calculate(
    prob="datum.cumsum_x / datum.count_x_total"
).mark_point(color="orange").encode(
    y="value:Q", x="prob:Q"
)
```

You can see that the two curves are almost the same (except for the edge points), but one of them is far easier to build 😉

# Reshaping data

Associated transforms:

- transform_flatten
- transform_fold
- transform_pivot

### transform_flatten

If any of your columns has an array-like object type, you may want to create one row per value in the array:

```
import pandas as pd

pd.DataFrame({
    "category": ['A', 'B', 'C'],
    "nested_values": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
```

With a transform_flatten (using `as_` to rename the flattened column), your dataframe will behave like:

```
pd.DataFrame({
    "category": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    "unnested_values": [1, 2, 3, 4, 5, 6, 7, 8, 9]
})
```

The typical use case where this is appropriate is loading nested JSON data; see the documentation for details.

### transform_fold

You have too many columns? You want more rows? Then transform_fold is for you.
What it does is convert *wide-form data to long-form*. Say
you have N columns that represent a similar thing, a metric for example:

```
data = pd.DataFrame({"Count": [19], "Sum": [10], "Stdev": [5], "Mean": [1.5]})
```

Now, if I fold those columns:

```
alt.Chart(data).transform_fold(
    fold=["Count", "Sum", "Stdev", "Mean"],
    as_=["Aggregate", "Value"]
)
```

I will have the following table:

```
data = pd.DataFrame({
    "Aggregate": ["Count", "Sum", "Stdev", "Mean"],
    "Value": [19, 10, 5, 1.5]
})
```

It is useful for displaying several different metrics on the same chart: you fold the metric columns, then encode the “Aggregate” field as the mark color so that one curve of “Value” is displayed per metric.

### transform_pivot

transform_pivot is to transform_fold what Yin is to Yang ☯

Need more explanations?

If you start with this data:

```
data = pd.DataFrame({
    "Aggregate": ["Count", "Sum", "Stdev", "Mean"],
    "Value": [19, 10, 5, 1.5]
})
```

Then by applying transform_pivot on the “Aggregate” column:

```
alt.Chart(data).transform_pivot(
    pivot="Aggregate",
    value="Value"
)
```

You will obtain:

```
data = pd.DataFrame({"Count": [19], "Sum": [10], "Stdev": [5], "Mean": [1.5]})
```

There is a little more cool stuff, as you can compute groupings and aggregations during the pivot transform, but it is not relevant in this scenario.

# Combining data

Associated transform:
- transform_lookup

### transform_lookup

transform_lookup is your friend if you need to join dataframes together in the scope of a given chart. Again, if you need to use the merged dataframe later in your notebook, you should prioritize pandas’ API over Altair transforms.

Basically, here is an example of how you use it. Given the following data:

```
import pandas as pd
import altair as alt

main_data = pd.DataFrame({
    "category": ["A", "B", "C", "A", "B", "C"],
    "values": [1, 2, 5, 4, 7, 2]
})
auxiliary_data = pd.DataFrame({
    "category": ["A", "B", "C"],
    "size": [10, 10, 15]
})
```

You can then use the transform that way:

```
chart = alt.Chart(main_data).transform_lookup(
    from_=alt.LookupData(        # Join main_data with
        data=auxiliary_data,     # auxiliary_data
        fields=["size"],         # By selecting "size" from the lookup data
        key="category"           # Using "category" as the right key
    ),
    lookup="category"            # And "category" as the left key
).encode(
    x="category:N",
    y="values:Q",
    size=alt.Size(
        "size:Q",
        # Unnecessary legend customization
        legend=alt.Legend(values=auxiliary_data["size"].drop_duplicates().tolist())
    )
).mark_point()
```

That would result in:

I’m not sure whether I would prefer this method over `DataFrame.merge`, but for the sake of consistency I would give
this Altair transform a try.
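For comparison, the pandas equivalent of the lookup above is a plain left merge:

```python
import pandas as pd

main_data = pd.DataFrame({
    "category": ["A", "B", "C", "A", "B", "C"],
    "values": [1, 2, 5, 4, 7, 2]
})
auxiliary_data = pd.DataFrame({
    "category": ["A", "B", "C"],
    "size": [10, 10, 15]
})

# Left join on "category", like transform_lookup does
merged = main_data.merge(auxiliary_data, on="category", how="left")
```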

# Conclusion

By now, you should have a broad overview of Altair data transformations.

They let you add *new columns*, compute *aggregates*, and *reshape* or *combine* dataframes.

Some are more useful in practice than others, but keep in mind that the legitimate use of Altair transforms is when you need a data transformation for one chart only, and when your data is small enough.

In the next episode, we will talk about layout handling: how to manage multiple plots for a single chart.

Stay tuned!