Data transforms

Data transforms are explicit pre-processing steps applied to your data before rendering. They let you filter, reshape, aggregate, and compute derived columns directly in the chart specification — without mutating your source DataFrame.

This is different from stat transforms (like mark_smooth, mark_density, mark_histogram), which are implicit: they're triggered by the mark you choose and computed automatically by the engine. Data transforms are declared explicitly via .transform() and run before any mark-level statistics.

When to use data transforms

Use data transforms when you need to:

  • Filter rows before visualization (e.g. only show values above a threshold)
  • Compute derived columns from expressions (e.g. ratios, differences)
  • Reshape data between wide and long format
  • Aggregate or bin data for summary views
  • Apply window functions (running averages, rankings, lag/lead)
  • Sample large datasets down to a manageable size

If the operation is purely about how marks are drawn (KDE shape, regression line, bin counts), prefer a stat mark. If it's about which data or what columns exist before any mark logic runs, use a data transform.

Basic usage

Attach transforms to a chart with .transform():

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "age": [25, 17, 34, 16, 42],
    "score": [88.0, 72.0, 95.0, 61.0, 79.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_filter("datum.age >= 18"))
    .mark_bar()
    .encode(x="name:N", y="score:Q")
)

The transform runs first: only rows where age >= 18 survive to the mark stage. The bar chart renders three bars (Alice, Carol, Eve).

Filtering and calculation

transform_filter — row filtering

Filter rows using an expression string:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "country": ["US", "UK", "DE", "FR", "JP"],
    "gdp": [21.0, 2.8, 3.8, 2.7, 5.1],
    "population": [331, 67, 83, 67, 126],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_filter("datum.gdp > 3"))
    .mark_bar()
    .encode(x="country:N", y="gdp:Q")
)

You can also pass a dict for common filter patterns:

# Keep only specific categories
fm.transform_filter({"country": ["US", "DE", "JP"]})

# Comparison operators
fm.transform_filter({"gdp": {">": 3, "<": 20}})
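
To make the two dict forms concrete, here is a plain-Python sketch of the predicates they appear to describe. It does not call ferrum: matches is a hypothetical helper, and treating several operators on one field as a logical AND is an assumption rather than documented behavior.

```python
import operator

# Hypothetical helper that mimics the two dict filter forms shown above.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(row, spec):
    """Return True if `row` satisfies every condition in `spec`."""
    for field, cond in spec.items():
        if isinstance(cond, list):
            # Membership form: the value must be one of the listed categories.
            if row[field] not in cond:
                return False
        else:
            # Comparison form: assumed here to require all operators to hold.
            if not all(OPS[op](row[field], bound) for op, bound in cond.items()):
                return False
    return True

rows = [
    {"country": "US", "gdp": 21.0},
    {"country": "UK", "gdp": 2.8},
    {"country": "DE", "gdp": 3.8},
    {"country": "FR", "gdp": 2.7},
    {"country": "JP", "gdp": 5.1},
]

by_category = [r["country"] for r in rows if matches(r, {"country": ["US", "DE", "JP"]})]
by_range = [r["country"] for r in rows if matches(r, {"gdp": {">": 3, "<": 20}})]

print(by_category)  # ['US', 'DE', 'JP']
print(by_range)     # ['DE', 'JP']
```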

transform_calculate — derived columns

Add a new column computed from an expression:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "city": ["NYC", "LA", "Chicago"],
    "revenue": [500.0, 300.0, 200.0],
    "cost": [350.0, 180.0, 150.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_calculate("profit", "datum.revenue - datum.cost"))
    .mark_bar()
    .encode(x="city:N", y="profit:Q")
)

The first argument is the output column name; the second is the expression.
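
As a rough illustration of what the calculation produces, the sketch below evaluates the same arithmetic per row in plain Python. It is a stand-in for the expression, not ferrum's evaluator.

```python
rows = [
    {"city": "NYC", "revenue": 500.0, "cost": 350.0},
    {"city": "LA", "revenue": 300.0, "cost": 180.0},
    {"city": "Chicago", "revenue": 200.0, "cost": 150.0},
]

# Plain-Python stand-in for the expression "datum.revenue - datum.cost".
# Each output row is a new dict, so the source rows are left untouched.
with_profit = [{**row, "profit": row["revenue"] - row["cost"]} for row in rows]

print([row["profit"] for row in with_profit])  # [150.0, 120.0, 50.0]
```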

Reshaping

transform_fold — wide to long

Fold (melt) multiple columns into key-value pairs:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "year": [2020, 2021, 2022],
    "revenue": [100.0, 120.0, 150.0],
    "expenses": [80.0, 95.0, 110.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_fold(["revenue", "expenses"], as_=("metric", "amount")))
    .mark_line()
    .encode(x="year:O", y="amount:Q", color="metric:N")
)

Each row in the original becomes two rows in the output — one for revenue, one for expenses. The as_ parameter names the key and value columns.
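
The reshaping can be sketched in plain Python as follows. The fold function here is a hypothetical helper written for illustration, and dropping the folded columns from the output is an assumption about the result's shape.

```python
def fold(rows, fields, as_=("key", "value")):
    """Turn each input row into one output row per folded field."""
    key_name, value_name = as_
    out = []
    for row in rows:
        # Assumption: columns that are not folded are carried over unchanged,
        # and the folded columns themselves are dropped from the output.
        kept = {k: v for k, v in row.items() if k not in fields}
        for field in fields:
            out.append({**kept, key_name: field, value_name: row[field]})
    return out

rows = [
    {"year": 2020, "revenue": 100.0, "expenses": 80.0},
    {"year": 2021, "revenue": 120.0, "expenses": 95.0},
    {"year": 2022, "revenue": 150.0, "expenses": 110.0},
]

long_rows = fold(rows, ["revenue", "expenses"], as_=("metric", "amount"))

print(len(long_rows))  # 6
print(long_rows[0])    # {'year': 2020, 'metric': 'revenue', 'amount': 100.0}
```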

transform_pivot — long to wide

The inverse of fold — spread a categorical column into multiple columns:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "date": ["Mon", "Mon", "Tue", "Tue"],
    "category": ["A", "B", "A", "B"],
    "sales": [10.0, 20.0, 15.0, 25.0],
})

# Pivot so each category becomes its own column
fm.transform_pivot("category", "sales", groupby=["date"])
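
For the sample data above, the pivoted result would plausibly look like the output of this plain-Python sketch. The pivot helper is hypothetical; it keeps the last value seen for each cell, whereas a real pivot may aggregate duplicates.

```python
def pivot(rows, pivot_field, value_field, groupby):
    """Spread the values of `pivot_field` into columns, one row per group."""
    out = {}
    for row in rows:
        key = tuple(row[g] for g in groupby)
        # Start each output row with its grouping columns.
        wide = out.setdefault(key, {g: row[g] for g in groupby})
        # Assumption: one value per (group, category) pair; later rows overwrite.
        wide[row[pivot_field]] = row[value_field]
    return list(out.values())

rows = [
    {"date": "Mon", "category": "A", "sales": 10.0},
    {"date": "Mon", "category": "B", "sales": 20.0},
    {"date": "Tue", "category": "A", "sales": 15.0},
    {"date": "Tue", "category": "B", "sales": 25.0},
]

wide_rows = pivot(rows, "category", "sales", groupby=["date"])
print(wide_rows)
# [{'date': 'Mon', 'A': 10.0, 'B': 20.0}, {'date': 'Tue', 'A': 15.0, 'B': 25.0}]
```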

transform_flatten — expand list columns

Unnest array/list columns into individual rows:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "person": ["Alice", "Bob"],
    "tags": [["python", "rust"], ["go", "python", "java"]],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_flatten(["tags"]))
    .mark_bar()
    .encode(x="tags:N", y="count():Q")
)
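
A plain-Python sketch of the same unnesting, using the data above, shows the rows the bar chart would count. This only illustrates the reshaping and is not ferrum's implementation.

```python
from collections import Counter

rows = [
    {"person": "Alice", "tags": ["python", "rust"]},
    {"person": "Bob", "tags": ["go", "python", "java"]},
]

# One output row per list element; the other columns are repeated.
flat = [{**row, "tags": tag} for row in rows for tag in row["tags"]]

# What a count() encoding would then tally per tag.
tag_counts = Counter(row["tags"] for row in flat)

print(len(flat))             # 5
print(tag_counts["python"])  # 2
```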

Aggregation

transform_aggregate — group-by aggregation

Collapse rows into group summaries. Each aggregate spec is a dict with field, fn, and as keys:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "department": ["eng", "eng", "sales", "sales", "sales"],
    "salary": [120.0, 130.0, 80.0, 90.0, 85.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_aggregate(
        {"field": "salary", "fn": "mean", "as": "avg_salary"},
        {"field": "salary", "fn": "count", "as": "headcount"},
        groupby=["department"],
    ))
    .mark_bar()
    .encode(x="department:N", y="avg_salary:Q")
)

Supported aggregation functions: "sum", "mean", "median", "min", "max", "count", "variance", "stdev", "q1", "q3", "distinct".
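
For reference, the grouped summary in the example above can be reproduced with the standard library. This is a sketch of the expected result, assuming "mean" and "count" behave like their usual statistical definitions.

```python
from collections import defaultdict
from statistics import mean

rows = [
    ("eng", 120.0), ("eng", 130.0),
    ("sales", 80.0), ("sales", 90.0), ("sales", 85.0),
]

# Collect salaries per department, then collapse each group to one row.
groups = defaultdict(list)
for department, salary in rows:
    groups[department].append(salary)

summary = [
    {"department": dept, "avg_salary": mean(salaries), "headcount": len(salaries)}
    for dept, salaries in groups.items()
]

print(summary)
# [{'department': 'eng', 'avg_salary': 125.0, 'headcount': 2},
#  {'department': 'sales', 'avg_salary': 85.0, 'headcount': 3}]
```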

transform_join_aggregate — aggregate without collapsing

Adds aggregate columns to every row (like a SQL window function without ordering). The original rows are preserved:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "region": ["East", "East", "West", "West"],
    "sales": [100.0, 150.0, 200.0, 80.0],
})

# Add region_total (the sum of sales per region) to each row, then compute each row's share
chart = (
    fm.Chart(df)
    .transform(fm.transform_join_aggregate(
        {"field": "sales", "fn": "sum", "as": "region_total"},
        groupby=["region"],
    ))
    .transform(fm.transform_calculate("pct", "datum.sales / datum.region_total"))
    .mark_bar()
    .encode(x="region:N", y="pct:Q")
)
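
The difference from a collapsing aggregate is easiest to see in the row count. The plain-Python sketch below mirrors the two steps above (join the group total, then divide) and keeps all four input rows.

```python
rows = [
    {"region": "East", "sales": 100.0},
    {"region": "East", "sales": 150.0},
    {"region": "West", "sales": 200.0},
    {"region": "West", "sales": 80.0},
]

# Step 1: compute one total per region.
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["sales"]

# Step 2: attach the total to every row instead of collapsing the group.
joined = [{**row, "region_total": totals[row["region"]]} for row in rows]

# Step 3: the derived share, as in the calculate step.
with_pct = [{**row, "pct": row["sales"] / row["region_total"]} for row in joined]

print(len(with_pct))                             # 4
print([round(r["pct"], 3) for r in with_pct])    # [0.4, 0.6, 0.714, 0.286]
```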

transform_top_k — keep top groups

Keep only the top-k groups ranked by an aggregate:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
    "revenue": [50.0, 30.0, 80.0, 10.0, 45.0, 55.0, 25.0, 90.0, 15.0, 40.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_top_k(3, field="revenue", op="sum"))
    .mark_bar()
    .encode(x="product:N", y="sum(revenue):Q")
)

This keeps only the three products with the highest total revenue.
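
The selection can be checked by hand with a short plain-Python sketch. It assumes the groups are the product values, which the prose implies but the call above does not spell out.

```python
products = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]
revenue = [50.0, 30.0, 80.0, 10.0, 45.0, 55.0, 25.0, 90.0, 15.0, 40.0]

# Total revenue per product (assumed grouping field).
totals = {}
for product, amount in zip(products, revenue):
    totals[product] = totals.get(product, 0.0) + amount

# Rank the groups by their total and keep the best three.
top3 = sorted(totals, key=totals.get, reverse=True)[:3]

# Rows belonging to any other product are dropped.
kept = [(p, r) for p, r in zip(products, revenue) if p in top3]

print(top3)       # ['C', 'A', 'E']
print(len(kept))  # 6
```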

transform_bin — binning

Bin a continuous field into discrete intervals:

import ferrum as fm
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
df = pl.DataFrame({"value": rng.normal(0, 1, 200).tolist()})

chart = (
    fm.Chart(df)
    .transform(fm.transform_bin("value", maxbins=15))
    .mark_bar()
    .encode(x="value_bin:Q", y="count():Q")
)

Parameters: maxbins sets the target bin count, step sets an explicit bin width, and nice=True (the default) rounds boundaries to clean values. In the example above, the binned values are read from the value_bin column.
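
For intuition about the step parameter, here is a minimal fixed-width binning sketch in plain Python. The bin_start helper is hypothetical, and labelling each bin by its lower edge is an assumption about how bins are represented.

```python
import math
from collections import Counter

def bin_start(value, step):
    """Lower edge of the fixed-width bin that contains `value`."""
    return math.floor(value / step) * step

values = [-1.3, -0.2, 0.1, 0.4, 1.7]

# Count how many values fall into each bin of width 0.5.
counts = Counter(bin_start(v, step=0.5) for v in values)

print(sorted(counts.items()))
# [(-1.5, 1), (-0.5, 1), (0.0, 2), (1.5, 1)]
```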

Window functions

transform_window — rolling operations, rank, lag/lead

Window transforms compute values over a sliding or partitioned frame:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "day": list(range(1, 11)),
    "sales": [12.0, 15.0, 13.0, 18.0, 20.0, 17.0, 22.0, 19.0, 25.0, 23.0],
})

# Centered 3-day rolling average
chart = (
    fm.Chart(df)
    .transform(fm.transform_window(
        {"op": "mean", "field": "sales", "as": "rolling_avg"},
        sort=["day"],
        frame=(-1, 1),  # 1 preceding, current, 1 following
    ))
    .mark_line()
    .encode(x="day:Q", y="rolling_avg:Q")
)

Window operations include:

Operation                      Description
"row_number"                   Row index within the window
"rank"                         Rank with ties
"dense_rank"                   Rank without gaps
"lag"                          Previous row value (use param for offset)
"lead"                         Next row value (use param for offset)
"sum", "mean", "min", "max"    Rolling aggregates
"first_value", "last_value"    Frame boundary values

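The frame parameter specifies (preceding, following) bounds as offsets relative to the current row: in the examples here, -1 reaches one row back, 1 reaches one row forward, and 0 stops at the current row. Use None for unbounded: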

# Cumulative sum (unbounded preceding to current row)
fm.transform_window(
    {"op": "sum", "field": "sales", "as": "cumulative"},
    sort=["day"],
    frame=(None, 0),
)

# Rank within each group
fm.transform_window(
    {"op": "rank", "as": "rank"},
    sort=["score"],
    groupby=["department"],
)
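
The frame semantics can be sketched in a few lines of plain Python. The window helper below is hypothetical, assumes the values are already sorted, and truncates the frame at the edges of the data, which may differ from how ferrum treats boundary rows.

```python
def window(values, op, frame):
    """Apply `op` to the slice of `values` selected by `frame` around each row."""
    lo, hi = frame
    out = []
    for i in range(len(values)):
        # None means unbounded; otherwise offsets are relative to row i.
        start = 0 if lo is None else max(0, i + lo)
        stop = len(values) if hi is None else min(len(values), i + hi + 1)
        out.append(op(values[start:stop]))
    return out

sales = [12.0, 15.0, 13.0, 18.0, 20.0, 17.0, 22.0, 19.0, 25.0, 23.0]

# frame=(-1, 1): one row back, the current row, one row forward.
rolling_avg = window(sales, lambda v: sum(v) / len(v), frame=(-1, 1))

# frame=(None, 0): everything up to and including the current row.
cumulative = window(sales, sum, frame=(None, 0))

print(rolling_avg[0], rolling_avg[-1])  # 13.5 24.0
print(cumulative[-1])                   # 184.0
```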

Statistical transforms

These transforms produce new datasets from statistical computations. They replace the original rows with computed output.

transform_density — kernel density estimation

Compute a KDE curve from a single field:

import ferrum as fm
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
df = pl.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "value": rng.normal(0, 1, 100).tolist() + rng.normal(2, 0.5, 100).tolist(),
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_density("value", groupby=["group"]))
    .mark_area(opacity=0.5)
    .encode(x="value:Q", y="density:Q", color="group:N")
)

The output columns default to "value" and "density" (configurable via as_). Use groupby to compute separate densities per category.

The kernel= parameter selects the kernel function. Supported values: "gaussian" (default), "epanechnikov", "tophat", "cosine".

chart = (
    fm.Chart(df)
    .transform(fm.transform_density("value", kernel="epanechnikov", groupby=["group"]))
    .mark_area(opacity=0.5)
    .encode(x="value:Q", y="density:Q", color="group:N")
)
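
As background on what a Gaussian KDE computes, here is a small self-contained sketch. The fixed bandwidth and evaluation grid are choices made for this illustration; how ferrum picks its bandwidth and grid is not covered here.

```python
import math

def gaussian_kde(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

samples = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
step = 0.05
grid = [-6.0 + step * i for i in range(241)]  # covers -6 to 6

density = gaussian_kde(samples, grid, bandwidth=0.5)

# A density should integrate to about 1; check with a simple Riemann sum.
area = sum(density) * step
print(round(area, 3))  # 1.0
```

A smaller bandwidth would make each bump narrower and the curve spikier; a larger one blurs the samples together.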

transform_regression — regression line

Fit a regression model and output the fitted curve:

import ferrum as fm
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
df = pl.DataFrame({"x": x.tolist(), "y": (2 * x + rng.normal(0, 1, 50)).tolist()})

chart = (
    fm.Chart(df)
    .transform(fm.transform_regression("x", "y", method="linear"))
    .mark_line()
    .encode(x="x:Q", y="y:Q")
)

Methods: "linear", "poly" (set order= for degree), "exp", "log", "pow".
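
For the "linear" case, the fitted curve is an ordinary least-squares line. The sketch below computes one with the closed-form formulas; fit_line is a hypothetical helper and the small dataset is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)

# The fitted curve: one predicted y per x, replacing the raw points.
fitted = [intercept + slope * x for x in xs]

print(round(slope, 2), round(intercept, 2))  # 1.97 1.06
```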

mark_smooth method aliases

mark_smooth(method=) accepts friendly aliases in addition to its primary names. The primary names are "lm" for linear fits and "loess" (alias "lowess") for locally weighted regression. "linear", "quadratic", and "cubic" select OLS polynomial fits of degree 1, 2, and 3 respectively; "log" and "sqrt" fit logarithmic and square-root curves. Valid values: "lm", "linear", "quadratic", "cubic", "log", "sqrt", "loess", "lowess". These aliases are described for the mark_smooth stat mark; transform_regression has its own method list, given above.

transform_loess — LOESS smoothing

Non-parametric local regression:

import ferrum as fm
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 4 * np.pi, 80)
df = pl.DataFrame({"x": x.tolist(), "y": (np.sin(x) + rng.normal(0, 0.3, 80)).tolist()})

chart = (
    fm.Chart(df)
    .transform(fm.transform_loess("x", "y", bandwidth=0.3))
    .mark_line()
    .encode(x="x:Q", y="y:Q")
)

The bandwidth parameter controls smoothness — smaller values follow the data more closely; larger values produce smoother curves.
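
To show why bandwidth has this effect, here is a compact local-linear smoother with tricube weights in plain Python. It is a simplified sketch, not ferrum's algorithm: the loess function is hypothetical, and reading bandwidth as the fraction of points used in each local fit is an assumption.

```python
import math

def loess(xs, ys, bandwidth):
    """Local linear fit at each x, weighting nearby points with a tricube kernel."""
    n = len(xs)
    k = max(2, math.ceil(bandwidth * n))  # neighbours per local fit (assumed meaning)
    fitted = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the local window width.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

xs = [float(i) for i in range(20)]
# A slow sine wave with deterministic zig-zag noise added.
ys = [math.sin(x / 3) + 0.2 * (-1) ** i for i, x in enumerate(xs)]

smooth = loess(xs, ys, bandwidth=0.5)
print(len(smooth))  # 20
```

With a larger bandwidth, more points contribute to each local fit, so the zig-zag noise averages out more strongly.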

Utilities

transform_sample — random subsample

Downsample large datasets for faster rendering:

import ferrum as fm
import polars as pl
import numpy as np

rng = np.random.default_rng(42)
df = pl.DataFrame({
    "x": rng.normal(0, 1, 10000).tolist(),
    "y": rng.normal(0, 1, 10000).tolist(),
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_sample(500, seed=42))
    .mark_point(opacity=0.5)
    .encode(x="x:Q", y="y:Q")
)

The seed parameter ensures deterministic output across renders.
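
The role of the seed can be illustrated with the standard library's random module. This sketch shows seeded sampling in general; it makes no claim about which random generator ferrum uses.

```python
import random

rows = list(range(10_000))

def sample(rows, n, seed):
    """Draw n rows without replacement using a dedicated, seeded generator."""
    return random.Random(seed).sample(rows, n)

first = sample(rows, 500, seed=42)
second = sample(rows, 500, seed=42)

# Same seed, same input: the two draws are identical.
print(len(first), first == second)  # 500 True
```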

transform_impute — fill missing values

Replace nulls in a column using a specified strategy:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "sales": [100.0, None, 120.0, None, 140.0, 150.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_impute("sales", method="median"))
    .mark_line()
    .encode(x="month:O", y="sales:Q")
)

Methods: "value" (constant, set value=), "mean", "median", "min", "max". Use groupby to impute within groups.
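
For the example above, median imputation would be expected to fill both gaps with 130.0. The plain-Python sketch below shows the arithmetic, assuming the median is taken over the non-null values only.

```python
from statistics import median

sales = [100.0, None, 120.0, None, 140.0, 150.0]

# Median of the observed values: (120 + 140) / 2.
fill = median(v for v in sales if v is not None)

# Replace each null with the fill value; observed values are unchanged.
imputed = [fill if v is None else v for v in sales]

print(fill)     # 130.0
print(imputed)  # [100.0, 130.0, 120.0, 130.0, 140.0, 150.0]
```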

transform_stack — stacking positions

Compute cumulative start/end positions for stacked bar or area charts:

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region": ["East", "West", "East", "West"],
    "revenue": [100.0, 80.0, 120.0, 90.0],
})

fm.transform_stack("revenue", groupby=["quarter", "region"], offset="normalize")

The offset parameter controls stacking strategy: "zero" (standard cumulative), "normalize" (100% stack), or "center" (streamgraph).
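
The "zero" and "normalize" strategies can be sketched in plain Python as below. Everything named here is hypothetical: the stack helper, its stack_within parameter (the columns that define one stack, here the quarter), and the y0/y1 output names. How ferrum's own groupby argument maps onto this is not confirmed by the sketch, and "center" is omitted.

```python
def stack(rows, field, stack_within, offset="zero"):
    """Add start/end positions (y0, y1) for each row, stacking within groups."""
    # Total per stack, needed to rescale for the "normalize" strategy.
    totals = {}
    for row in rows:
        key = tuple(row[c] for c in stack_within)
        totals[key] = totals.get(key, 0.0) + row[field]

    running = {}
    out = []
    for row in rows:
        key = tuple(row[c] for c in stack_within)
        start = running.get(key, 0.0)
        end = start + row[field]
        running[key] = end
        scale = totals[key] if offset == "normalize" else 1.0
        out.append({**row, "y0": start / scale, "y1": end / scale})
    return out

rows = [
    {"quarter": "Q1", "region": "East", "revenue": 100.0},
    {"quarter": "Q1", "region": "West", "revenue": 80.0},
    {"quarter": "Q2", "region": "East", "revenue": 120.0},
    {"quarter": "Q2", "region": "West", "revenue": 90.0},
]

zero = stack(rows, "revenue", stack_within=["quarter"])
normalized = stack(rows, "revenue", stack_within=["quarter"], offset="normalize")

print([(r["y0"], r["y1"]) for r in zero])
# [(0.0, 100.0), (100.0, 180.0), (0.0, 120.0), (120.0, 210.0)]
print(normalized[1]["y1"])  # 1.0
```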

transform_timeunit — temporal extraction

Extract a unit from a datetime field:

import ferrum as fm
import polars as pl
from datetime import date

df = pl.DataFrame({
    "date": [date(2023, 1, 15), date(2023, 3, 20), date(2023, 6, 10), date(2023, 11, 5)],
    "value": [10.0, 25.0, 18.0, 30.0],
})

chart = (
    fm.Chart(df)
    .transform(fm.transform_timeunit("date", "month"))
    .mark_bar()
    .encode(x="month_date:O", y="value:Q")
)

Units: "year", "month", "day", "hour", "minute", "second", "day_of_week", "week", "quarter".
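
For the dates in the example, the "month" and "quarter" units correspond to the following values, computed here with the standard library. Whether ferrum emits plain integers or truncated dates for these units is not specified above, so treat the integer form as an assumption.

```python
from datetime import date

dates = [date(2023, 1, 15), date(2023, 3, 20), date(2023, 6, 10), date(2023, 11, 5)]

# Month number of each date.
months = [d.month for d in dates]

# Quarter number derived from the month (1-3 -> 1, 4-6 -> 2, ...).
quarters = [(d.month - 1) // 3 + 1 for d in dates]

print(months)    # [1, 3, 6, 11]
print(quarters)  # [1, 1, 2, 4]
```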

Expression syntax

Transform expressions (used in transform_filter and transform_calculate) follow a Vega-style syntax:

Syntax            Example
Field access      datum.field_name
Bracket access    datum["field with spaces"]
Arithmetic        datum.x * 2 + datum.y
Comparison        datum.age >= 18
Logical AND       datum.x > 0 && datum.y > 0
Logical OR        datum.x < 0 || datum.x > 100
Logical NOT       !datum.active
Ternary           datum.x > 0 ? 'positive' : 'non-positive'
String literals   datum.status == 'active'
Membership        indexof([1, 2, 3], datum.id) >= 0
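
For readers more used to Python than to Vega-style expressions, the sketch below pairs three of the forms above with rough Python equivalents, evaluated per row. The sample rows are made up, and the correspondence is informal.

```python
rows = [
    {"id": 1, "x": 5, "y": 2},
    {"id": 4, "x": -3, "y": 7},
]

# datum.x > 0 && datum.y > 0
both_positive = [r["x"] > 0 and r["y"] > 0 for r in rows]

# datum.x > 0 ? 'positive' : 'non-positive'
sign = ["positive" if r["x"] > 0 else "non-positive" for r in rows]

# indexof([1, 2, 3], datum.id) >= 0
known_id = [r["id"] in [1, 2, 3] for r in rows]

print(both_positive)  # [True, False]
print(sign)           # ['positive', 'non-positive']
print(known_id)       # [True, False]
```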

Chaining transforms

Multiple transforms execute in sequence — each one operates on the output of the previous. You can chain separate .transform() calls or pass multiple transforms in a single call (.transform(a, b, c) is equivalent to .transform(a).transform(b).transform(c)):

import ferrum as fm
import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "C", "D", "E"] * 20,
    "region": (["North", "South"] * 50),
    "revenue": [float(i * 7 % 13 + 5) for i in range(100)],
})

chart = (
    fm.Chart(df)
    # Step 1: keep only top-3 products by total revenue
    .transform(fm.transform_top_k(3, field="revenue", op="sum"))
    # Step 2: aggregate by product and region
    .transform(fm.transform_aggregate(
        {"field": "revenue", "fn": "sum", "as": "total"},
        groupby=["product", "region"],
    ))
    # Step 3: compute percentage within each product
    .transform(fm.transform_join_aggregate(
        {"field": "total", "fn": "sum", "as": "product_total"},
        groupby=["product"],
    ))
    .transform(fm.transform_calculate("pct", "datum.total / datum.product_total"))
    .mark_bar()
    .encode(x="product:N", y="pct:Q", color="region:N")
)

The order matters: filtering before aggregation reduces the data that gets summarized; aggregating before filtering lets you filter on computed values.
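
A small plain-Python sketch makes the ordering effect visible. It uses made-up salary data and a hypothetical threshold of 85, and compares filtering rows before averaging with averaging first and then filtering the group means.

```python
from statistics import mean

rows = [
    ("eng", 120.0), ("eng", 130.0),
    ("sales", 80.0), ("sales", 90.0), ("sales", 85.0),
]

def group_mean(rows):
    """Mean salary per department."""
    groups = {}
    for dept, salary in rows:
        groups.setdefault(dept, []).append(salary)
    return {dept: mean(values) for dept, values in groups.items()}

# Filter, then aggregate: low salaries never reach the average.
filter_first = group_mean([(d, s) for d, s in rows if s > 85])

# Aggregate, then filter: whole departments are kept or dropped by their mean.
aggregate_first = {d: m for d, m in group_mean(rows).items() if m > 85}

print(filter_first)     # {'eng': 125.0, 'sales': 90.0}
print(aggregate_first)  # {'eng': 125.0}
```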

Where to go next

  • Marks & encodings for the mark types and encoding channels that consume transform output.
  • Composition for combining multiple transformed views into compound charts.
  • Figure-level helpers for one-line chart functions that handle common transform patterns internally.
  • The API Reference for the full signatures of every transform function.