import ibis
import ibis.selectors as s
= True
ibis.options.interactive repr.interactive.max_rows = 3 ibis.options.
Overview
Ibis 8.0 marks the first release of stream processing backends in Ibis! This enhances the composable data ecosystem vision by allowing users to implement data transformation logic in a standard Python dataframe API and execute it against either batch or streaming systems.
This release includes Apache Flink, a streaming backend, and RisingWave, a streaming database backend. We’ve also added a new batch backend with Exasol, bringing the total number of backends Ibis supports to 20.
Most geospatial operations are now supported in the DuckDB backend, making Ibis a great local option for geospatial analytics.
What is stream processing?
Stream processing systems are designed to handle high-throughput, low-latency data processing with time semantics. They are used to process data in real-time with minimum latency and are often used in applications such as fraud detection, real-time analytics, and IoT. Systems using stream processing are increasingly common in modern data applications.
Apache Flink is the most popular open-source stream processing framework, with numerous cloud options. RisingWave is an open-source Postgres-compatible streaming database with a cloud offering that is gaining popularity and simplifies the streaming experience.
Ibis now supports both and going forward can add more streaming backends to unify the Python user experience across batch and streaming systems.
Unifying batch and streaming UX in Python
Whether you’re using a batch or streaming data platform – and the lines are continually blurring between them – you’ll need a frontend to interact with as a data engineer, analyst, or scientist. If you’re using Python, that frontend is likely a dataframe API.
Standards benefit individual users by reducing the cognitive load of learning and understanding new data systems. Organizations benefit from this in the form of lower onboarding costs, easier collaboration between teams, and better interfaces for data systems.
We saw in the recent one billion row challenge post how even CSV reader keyword arguments can differ greatly between APIs. This is compounded by tightly coupling a dataframe API to every query engine, whether batch or streaming.
Ibis aims to solve this dilemma by providing a standard dataframe API that can work across data systems, whether batch or streaming. This is a long-term vision and we’re excited to take the first steps toward it in Ibis 8.0 with the launch of two streaming backends (and one more batch backend).
This allows a user to leverage DuckDB or Polars or DataFusion locally, then scale out batch processing to Snowflake or BigQuery or ClickHouse in the cloud, then switch from batch to stream processing with Apache Flink or RisingWave, all without changing their dataframe code. As Ibis adds new features and implements them across backends, users can take advantage of these features without needing to learn new APIs.
Backends
Three new backends were added in this release.
Apache Flink
In collaboration with Claypot AI (recently acquired by Voltron Data), we’ve added the first streaming backend with Apache Flink. You can check out the blog post and tutorial to get started with this new backend.
RisingWave
RisingWave has contributed second streaming backend with RisingWave. This backend is earlier in development, but we’re excited to have it in Ibis and it will continue to improve it.
Exasol
Exasol has contributed the Exasol backend. This is a traditional batch backend and brings another great option for fast batch analytics to Ibis.
Breaking changes
You can view the full changelog for additional breaking changes. There have been few that we expect to affect most users.
The PM for the team was distracted playing with LLMs and didn’t write a v7 blog post, so we’re covering breaking changes and features from both below.
If you’re new to Ibis, see how to install and the getting started tutorial.
To follow along with this blog, ensure you’re on 'ibis-framework>=8,<9'
. First, we’ll setup Ibis and fetch some sample data to use.
Now, fetch the penguins dataset.
= ibis.examples.penguins.fetch()
t t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
rename
The largest breaking change in Ibis 7/8 is the deprecation of relabel
in favor of rename
, swapping the order of the arguments. This change was made to be consistent with the rest of the Ibis API. We apologize for any inconvenience this may cause, but we believe this change will make Ibis a better and more consistent dataframe standard going forward.
In the past, you would use relabel
like this:
"species": "SPECIES"}) t.relabel({
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ SPECIES ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Now, you would use rename
like this:
"SPECIES": "species"}) t.rename({
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ SPECIES ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
or this:
="species") t.rename(SPECIES
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ SPECIES ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Functionality
A lot of new functionality has been added in Ibis 7/8.
pandas batches
The .to_pandas_batches()
method can be used to output batches of pandas dataframes:
= t.to_pandas_batches(chunk_size=200)
batches for df in batches:
print(df.shape)
(200, 8)
(144, 8)
range
The range()
function can be used to create a monotonic sequence of integers:
= ibis.range(10)
s s
[0, 1, ... +8]
You can turn it into a table:
"index").as_table() s.unnest().name(
┏━━━━━━━┓ ┃ index ┃ ┡━━━━━━━┩ │ int8 │ ├───────┤ │ 0 │ │ 1 │ │ 2 │ │ … │ └───────┘
This can be useful for creating synthetic data and other use cases.
relocate
The .relocate()
method can be used to move columns to the beginning of a table, which is very useful for interactive data exploration with wide tables:
t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Then:
"sex", "year") t.relocate(
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ sex ┃ year ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ string │ int64 │ string │ string │ float64 │ float64 │ int64 │ int64 │ ├────────┼───────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┤ │ male │ 2007 │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ │ female │ 2007 │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ │ female │ 2007 │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └────────┴───────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┘
sample
The .sample()
method can be used to sample rows from a table:
Number of rows returned may vary by invocation.
t.count()
344
=0.1).count() t.sample(fraction
28
negative slicing
More Pythonic slicing is now supported:
3] t[:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
-3:] t[
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├───────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Chinstrap │ Dream │ 49.6 │ 18.2 │ 193 │ 3775 │ male │ 2009 │ │ Chinstrap │ Dream │ 50.8 │ 19.0 │ 210 │ 4100 │ male │ 2009 │ │ Chinstrap │ Dream │ 50.2 │ 18.7 │ 198 │ 3775 │ female │ 2009 │ └───────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
-6:-3] t[
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├───────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Chinstrap │ Dream │ 45.7 │ 17.0 │ 195 │ 3650 │ female │ 2009 │ │ Chinstrap │ Dream │ 55.8 │ 19.8 │ 207 │ 4000 │ male │ 2009 │ │ Chinstrap │ Dream │ 43.5 │ 18.1 │ 202 │ 3400 │ female │ 2009 │ └───────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
geospatial operations in DuckDB
Ibis supports over 50 geospatial operations, with many being recently added to DuckDB backend. While backend-specific, this is worth calling out because it brings a great local option for geospatial analytics to Ibis. Read the first geospatial blog or the second geospatial blog to learn more.
A new zones
example dataset with a geometric datatype has been added for a quick demonstration:
= ibis.examples.zones.fetch()
z = z.relocate("geom")
z z
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ geom ┃ OBJECTID ┃ Shape_Leng ┃ Shape_Area ┃ zone ┃ LocationID ┃ borough ┃ x_cent ┃ y_cent ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ geospatial:geometry │ int32 │ float64 │ float64 │ string │ int32 │ string │ float64 │ float64 │ ├──────────────────────────────────────────────────────────────────────────────────┼──────────┼────────────┼────────────┼─────────────────────────┼────────────┼─────────┼──────────────┼───────────────┤ │ <POLYGON ((933100.918 192536.086, 933091.011 192572.175, 933088.585 192604.9...> │ 1 │ 0.116357 │ 0.000782 │ Newark Airport │ 1 │ EWR │ 9.359968e+05 │ 191376.749531 │ │ <MULTIPOLYGON (((1033269.244 172126.008, 1033439.643 170883.946, 1033473.265...> │ 2 │ 0.433470 │ 0.004866 │ Jamaica Bay │ 2 │ Queens │ 1.031086e+06 │ 164018.754403 │ │ <POLYGON ((1026308.77 256767.698, 1026495.593 256638.616, 1026567.23 256589....> │ 3 │ 0.084341 │ 0.000314 │ Allerton/Pelham Gardens │ 3 │ Bronx │ 1.026453e+06 │ 254265.478659 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └──────────────────────────────────────────────────────────────────────────────────┴──────────┴────────────┴────────────┴─────────────────────────┴────────────┴─────────┴──────────────┴───────────────┘
We can use geospatial operations on that column:
= z.mutate(
z =z.geom.area(),
area=z.geom.centroid(),
centroid"area", "centroid")
).relocate( z
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ area ┃ centroid ┃ geom ┃ OBJECTID ┃ Shape_Leng ┃ Shape_Area ┃ zone ┃ LocationID ┃ borough ┃ x_cent ┃ y_cent ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ float64 │ point │ geospatial:geometry │ int32 │ float64 │ float64 │ string │ int32 │ string │ float64 │ float64 │ ├──────────────┼──────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────┼────────────┼────────────┼─────────────────────────┼────────────┼─────────┼──────────────┼───────────────┤ │ 7.903953e+07 │ <POINT (935996.821 191376.75)> │ <POLYGON ((933100.918 192536.086, 933091.011 192572.175, 933088.585 192604.9...> │ 1 │ 0.116357 │ 0.000782 │ Newark Airport │ 1 │ EWR │ 9.359968e+05 │ 191376.749531 │ │ 1.439095e+08 │ <POINT (1031085.719 164018.754)> │ <MULTIPOLYGON (((1033269.244 172126.008, 1033439.643 170883.946, 1033473.265...> │ 2 │ 0.433470 │ 0.004866 │ Jamaica Bay │ 2 │ Queens │ 1.031086e+06 │ 164018.754403 │ │ 3.168508e+07 │ <POINT (1026452.617 254265.479)> │ <POLYGON ((1026308.77 256767.698, 1026495.593 256638.616, 1026567.23 256589....> │ 3 │ 0.084341 │ 0.000314 │ Allerton/Pelham Gardens │ 3 │ Bronx │ 1.026453e+06 │ 254265.478659 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └──────────────┴──────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────┴────────────┴────────────┴─────────────────────────┴────────────┴─────────┴──────────────┴───────────────┘
Wrapping up
Ibis 8.0 brings exciting new features and the first streaming backends into Ibis! We hope you’re excited as we are about breaking down barriers between batch and streaming systems with a standard Python dataframe API.
As always, try Ibis by installing and getting started.
If you run into any issues or find support is lacking for your backend, open an issue or discussion and let us know!