墨水记忆

Getting Started

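This chapter helps you get started with Polars. It covers the library's basic features so that new users can get familiar with the fundamentals, from installation and setup through to core functionality.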

Installing Polars

Python

pip install polars

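Reading and Writing Data

Example

Polars can read and write common file formats (e.g. csv, json, parquet), cloud storage (e.g. S3, Azure Blob, BigQuery), and databases (e.g. postgres, mysql). Below, we create a DataFrame and show how to write it to disk and read it back.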

Python

import polars as pl
import datetime as dt

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

Example

Continuing the example above, we now write the data to a csv file named output.csv, read it back with read_csv, and print the result.

Python

path = "output.csv"
df.write_csv(path)
df_csv = pl.read_csv(path, try_parse_dates=True)
print(df_csv)

The try_parse_dates parameter controls whether Polars tries to parse date columns automatically while reading the file.

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

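Expressions and Contexts

Expressions are one of the main strengths of Polars because they provide a modular and flexible way of expressing data transformations.

Here is an example of a Polars expression:

Python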

pl.col("weight") / (pl.col("height") ** 2)

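This expression takes the column named weight and divides its values by the square of the values in the column height, which computes a person's BMI. Note that the code above only describes a computation; the expression is turned into a series of results only when it is evaluated inside a Polars context. By analogy, Python code can be written in any text file, but it only runs where a Python environment is available.

Below, we show examples of Polars expressions in different contexts: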

select
with_columns
filter
group_by

select

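The select context lets you select and manipulate columns from a DataFrame. In the simplest case, each expression you provide maps to one column in the resulting DataFrame: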

Python

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

result = df.select(
    pl.col("name"),
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)

shape: (4, 3)
┌────────────────┬────────────┬───────────┐
│ name           ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---       │
│ str            ┆ i32        ┆ f64       │
╞════════════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴───────────┘

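Polars also supports a feature called "expression expansion", in which one expression acts as shorthand for multiple expressions. In the following example, we use expression expansion to manipulate the columns weight and height with a single expression. When using expression expansion, you can use .name.prefix and .name.suffix to add a prefix or a suffix to the original column names and so form the new column names: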

Python

result = df.select(
    pl.col("name"),
    (pl.col("weight", "height") * 0.95).round(2).name.prefix("5%"),
    (pl.col("weight", "height") * 0.95).round(2).name.suffix("-5%"),
)
print(result)

shape: (4, 5)
┌────────────────┬──────────┬──────────┬───────────┬───────────┐
│ name           ┆ 5%weight ┆ 5%height ┆ weight-5% ┆ height-5% │
│ ---            ┆ ---      ┆ ---      ┆ ---       ┆ ---       │
│ str            ┆ f64      ┆ f64      ┆ f64       ┆ f64       │
╞════════════════╪══════════╪══════════╪═══════════╪═══════════╡
│ Alice Archer   ┆ 55.0     ┆ 1.48     ┆ 55.0      ┆ 1.48      │
│ Ben Brown      ┆ 68.88    ┆ 1.68     ┆ 68.88     ┆ 1.68      │
│ Chloe Cooper   ┆ 50.92    ┆ 1.57     ┆ 50.92     ┆ 1.57      │
│ Daniel Donovan ┆ 78.94    ┆ 1.66     ┆ 78.94     ┆ 1.66      │
└────────────────┴──────────┴──────────┴───────────┴───────────┘

with_columns

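The with_columns context is similar to select, but it adds columns to the DataFrame: the result contains the columns of the original DataFrame plus the new columns, whose values are computed from the expressions you provide.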

Python

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
result = df.with_columns(
    pl.col("birthdate").dt.year().alias("birth_year"),
    (pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)

print(result)

shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────────┬───────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ birth_year ┆ bmi       │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---        ┆ ---       │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ i32        ┆ f64       │
╞════════════════╪════════════╪════════╪════════╪════════════╪═══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 1997       ┆ 23.791913 │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 1985       ┆ 23.141498 │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 1983       ┆ 19.687787 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 1981       ┆ 27.134694 │
└────────────────┴────────────┴────────┴────────┴────────────┴───────────┘

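Naming columns

New columns can also be named with named expressions, that is, by passing each expression as a keyword argument whose name becomes the column name.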

Python

result = df.with_columns(
    birth_year=pl.col("birthdate").dt.year(),
    bmi=pl.col("weight") / (pl.col("height") ** 2),
)

filter

The filter context filters the rows of a DataFrame: rows that satisfy the condition are kept and returned as a new DataFrame.

Python

result = df.filter(pl.col("birthdate").dt.year() < 1990)
print(result)

shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘

Filtering on multiple conditions:

Python

result = df.filter(
    pl.col("birthdate").is_between(dt.date(1982, 12, 31), dt.date(1996, 1, 1))
    & (pl.col("height") > 1.7)
)
print(result)

shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘

You can also pass several predicate expressions as separate arguments; a row is kept only if it satisfies all of them:

Python

result = df.filter(
    pl.col("birthdate").is_between(dt.date(1982, 12, 31), dt.date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)

group_by

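The group_by context groups together the rows of a DataFrame that share the same value in one or more columns. The following example counts how many people were born in each decade: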

Python

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    maintain_order=True,
).len()
print(result)

shape: (2, 2)
┌────────┬─────┐
│ decade ┆ len │
│ ---    ┆ --- │
│ i32    ┆ u32 │
╞════════╪═════╡
│ 1990   ┆ 1   │
│ 1980   ┆ 3   │
└────────┴─────┘

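The maintain_order parameter

The maintain_order parameter forces Polars to present the resulting groups in the order in which they appear in the original DataFrame. This slows down the grouping operation.

After group_by, we can use agg to compute aggregations over the resulting groups: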

Python

result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    maintain_order=True,
).agg(
    pl.len().alias("sample_size"),
    pl.col("weight").mean().round(2).alias("avg_weight"),
    pl.col("height").max().alias("tallest"),
)
print(result)

shape: (2, 4)
┌────────┬─────────────┬────────────┬─────────┐
│ decade ┆ sample_size ┆ avg_weight ┆ tallest │
│ ---    ┆ ---         ┆ ---        ┆ ---     │
│ i32    ┆ u32         ┆ f64        ┆ f64     │
╞════════╪═════════════╪════════════╪═════════╡
│ 1990   ┆ 1           ┆ 57.9       ┆ 1.56    │
│ 1980   ┆ 3           ┆ 69.73      ┆ 1.77    │
└────────┴─────────────┴────────────┴─────────┘

Combined example

Python

result = (
    df.with_columns(
        (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
        pl.col("name").str.split(by=" ").list.first(),
    )
    .select(
        pl.all().exclude("birthdate"),
    )
    .group_by(
        pl.col("decade"),
        maintain_order=True,
    )
    .agg(
        pl.col("name"),
        pl.col("weight", "height").mean().round(2).name.prefix("avg_"),
    )
)
print(result)

shape: (2, 4)
┌────────┬────────────────────────────┬────────────┬────────────┐
│ decade ┆ name                       ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---                        ┆ ---        ┆ ---        │
│ i32    ┆ list[str]                  ┆ f64        ┆ f64        │
╞════════╪════════════════════════════╪════════════╪════════════╡
│ 1990   ┆ ["Alice"]                  ┆ 57.9       ┆ 1.56       │
│ 1980   ┆ ["Ben", "Chloe", "Daniel"] ┆ 69.73      ┆ 1.72       │
└────────┴────────────────────────────┴────────────┴────────────┘

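Combining DataFrames

Polars provides many tools for combining two DataFrames. In this section, we show an example of a join and an example of a concatenation.

Joining DataFrames

Polars provides many different join algorithms. The example below shows how to use a left outer join to combine two DataFrames when a column can serve as a unique identifier that matches rows across the DataFrames: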

Python

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
df2 = pl.DataFrame(
    {
        "name": ["Ben Brown", "Daniel Donovan", "Chloe Cooper", "Eve Davis"],
        "parent": [True, False, False, True],
        "siblings": [1, 2, 3, 4],
    }
)

print(df.join(df2, on="name", how="left"))

shape: (4, 6)
┌────────────────┬────────────┬────────┬────────┬────────┬──────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ parent ┆ siblings │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---    ┆ ---      │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ bool   ┆ i64      │
╞════════════════╪════════════╪════════╪════════╪════════╪══════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ null   ┆ null     │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ true   ┆ 1        │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ false  ┆ 3        │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ false  ┆ 2        │
└────────────────┴────────────┴────────┴────────┴────────┴──────────┘

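Concatenating DataFrames

Concatenating DataFrames creates a taller or wider DataFrame, depending on the method used. Assuming we have a second DataFrame with data from other people, we can use vertical concatenation to create a taller DataFrame: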

Python

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            dt.date(1997, 1, 10),
            dt.date(1985, 2, 15),
            dt.date(1983, 3, 22),
            dt.date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
df3 = pl.DataFrame(
    {
        "name": ["Ethan Edwards", "Fiona Foster", "Grace Gibson", "Henry Harris"],
        "birthdate": [
            dt.date(1977, 5, 10),
            dt.date(1975, 6, 23),
            dt.date(1973, 7, 22),
            dt.date(1971, 8, 3),
        ],
        "weight": [67.9, 72.5, 57.6, 93.1],  # (kg)
        "height": [1.76, 1.6, 1.66, 1.8],  # (m)
    }
)

print(pl.concat([df, df3], how="vertical"))

shape: (8, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
│ Ethan Edwards  ┆ 1977-05-10 ┆ 67.9   ┆ 1.76   │
│ Fiona Foster   ┆ 1975-06-23 ┆ 72.5   ┆ 1.6    │
│ Grace Gibson   ┆ 1973-07-22 ┆ 57.6   ┆ 1.66   │
│ Henry Harris   ┆ 1971-08-03 ┆ 93.1   ┆ 1.8    │
└────────────────┴────────────┴────────┴────────┘

Converting between Polars and pandas

Python

import polars as pl
import pandas as pd

pddf = pd.DataFrame()
pldf = pl.DataFrame(pddf)  # pandas to Polars
newpddf = pldf.to_pandas()  # Polars to pandas
