墨水记忆

Basic Usage

This chapter shows how to carry out common tasks with Polars.

Data Preparation

Python

from datetime import date

import numpy as np
import polars as pl

# Sample data with a duplicated id, several nulls (None) and one NaN
df = pl.DataFrame({
    'id': [1, 2, 2, 3, 4, 5, None, 6],
    'name': ['张三', '李四a', '李四b', '王五', '赵六', '钱七', '未知', '孙八'],
    'value': [10.5, 20.3, 20.3, np.nan, 40.1, 1000.0, 60.8, 70.2],
    'category': ['A', 'B', 'B', 'A', None, 'C', 'A', 'C'],
    'date': [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 2), date(2023, 1, 3),
             date(2023, 1, 4), date(2023, 1, 5), None, date(2023, 1, 7)],
    'flag': [True, False, False, True, True, None, False, True]
})

print(df.dtypes)

[Int64, String, Float64, String, Date, Boolean]

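Deduplication

Handling Duplicate Rows

DataFrame.unique(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False) → DataFrame

subset: the columns used to decide whether rows are duplicates. Defaults to all columns.

keep: which row to keep when duplicates exist. any: keep an arbitrary one (default); first: keep the first occurrence; last: keep the last occurrence; none: drop every row that has a duplicate.

maintain_order: whether to keep the rows in their original order; defaults to False. With False, the row order of the result is not guaranteed and may differ between runs, so the output of keep='first' or keep='last' can look inconsistent. Set it to True to get a reproducible result in the original order, at some cost in performance.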

Python

result = df.unique(subset=['id'], keep='last', maintain_order=True)
print(result)

shape: (7, 6)
┌──────┬───────┬────────┬──────────┬────────────┬───────┐
│ id   ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ ---  ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64  ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞══════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1    ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2    ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3    ┆ 王五  ┆ NaN    ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4    ┆ 赵六  ┆ 40.1   ┆ null     ┆ 2023-01-04 ┆ true  │
│ 5    ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ null ┆ 未知  ┆ 60.8   ┆ A        ┆ null       ┆ false │
│ 6    ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└──────┴───────┴────────┴──────────┴────────────┴───────┘

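Detecting Duplicate Rows

DataFrame.is_unique() → Series

Uniqueness is evaluated across all columns: a row is marked true only if no other row matches it in every column.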

Python

df2 = pl.DataFrame({
    'id': [1, 2, 2, 3, 4, 5, None, 6],
    'flag': [True, False, False, True, True, None, False, True]
})
result = df2.is_unique()
print(result)

shape: (8,)
Series: '' [bool]
[
	true
	false
	false
	true
	true
	true
	true
	true
]

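Counting Unique Values

DataFrame.n_unique(subset: str | Expr | Sequence[str | Expr] | None = None) → int

Returns the number of distinct rows, optionally restricted to the columns or expressions given in subset.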

Python

# Number of distinct values in the id column
result = df.n_unique(subset=['id'])
print(result)

# subset also accepts expressions: here value / 10 truncated to an integer
df2 = pl.DataFrame({
    'id': [1, 2, 2, 3, 4, 5, 6],
    'value': [20.5, 20.3, 30.3, 60.1, 10.0, 60.8, 50.2],
})
result = df2.n_unique(subset=[((pl.col("value") / 10).cast(pl.Int64))])
print(result)

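Handling Special Values

Handling null Values

Filling null Values

Fill null values with either a given value or a strategy; specify exactly one of the two.

DataFrame.fill_null(value: Any | Expr | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True) → DataFrame

value: the value used to replace nulls.

strategy: the fill strategy. forward: fill with the previous non-null value; backward: fill with the next non-null value; min: fill with the minimum; max: fill with the maximum; mean: fill with the mean; zero: fill with 0; one: fill with 1.

limit: the maximum number of consecutive nulls to fill when using the forward or backward strategy. Not every null is filled: within each run of consecutive nulls, only the first n (forward) or the last n (backward) are filled.

Filling with a Given Value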

Python

result = df.fill_null(value='fillv')
print(result)

shape: (8, 6)
┌──────┬───────┬────────┬──────────┬────────────┬───────┐
│ id   ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ ---  ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64  ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞══════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1    ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2    ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2    ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3    ┆ 王五  ┆ NaN    ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4    ┆ 赵六  ┆ 40.1   ┆ fillv    ┆ 2023-01-04 ┆ true  │
│ 5    ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ null ┆ 未知  ┆ 60.8   ┆ A        ┆ null       ┆ false │
│ 6    ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└──────┴───────┴────────┴──────────┴────────────┴───────┘

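Tip

As the example above shows, filling with a given value only affects nulls in columns whose data type is compatible with that value; nulls in columns of other types are left unfilled.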

Filling with a Strategy

Python

result = df.fill_null(strategy='forward')
print(result)

shape: (8, 6)
┌─────┬───────┬────────┬──────────┬────────────┬───────┐
│ id  ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ --- ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64 ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞═════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1   ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2   ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2   ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3   ┆ 王五  ┆ NaN    ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4   ┆ 赵六  ┆ 40.1   ┆ A        ┆ 2023-01-04 ┆ true  │
│ 5   ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ true  │
│ 5   ┆ 未知  ┆ 60.8   ┆ A        ┆ 2023-01-05 ┆ false │
│ 6   ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└─────┴───────┴────────┴──────────┴────────────┴───────┘

Restricting Fills with limit

Python

df2 = pl.DataFrame({
    'id': [1, 2, 3, None, None, None, None, None, 6],
})

result1 = df2.fill_null(strategy='forward')
print(result1)
result2 = df2.fill_null(strategy='forward', limit=3)
print(result2)

shape: (9, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 3   │
│ 3   │
│ 3   │
│ 3   │
│ 3   │
│ 6   │
└─────┘
shape: (9, 1)
┌──────┐
│ id   │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
│ 3    │
│ 3    │
│ 3    │
│ null │
│ null │
│ 6    │
└──────┘

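Counting null Values

DataFrame.null_count() → DataFrame

Returns a one-row DataFrame holding the number of nulls in each column. NaN is not counted as null.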

Python

result = df.null_count()
print(result)

shape: (1, 6)
┌─────┬──────┬───────┬──────────┬──────┬──────┐
│ id  ┆ name ┆ value ┆ category ┆ date ┆ flag │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---  ┆ ---  │
│ u32 ┆ u32  ┆ u32   ┆ u32      ┆ u32  ┆ u32  │
╞═════╪══════╪═══════╪══════════╪══════╪══════╡
│ 1   ┆ 0    ┆ 0     ┆ 1        ┆ 1    ┆ 1    │
└─────┴──────┴───────┴──────────┴──────┴──────┘

Dropping Rows with null Values

DataFrame.drop_nulls(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) → DataFrame

subset: the columns to check for nulls. Defaults to all columns.

Python

result = df.drop_nulls()
print(result)

shape: (5, 6)
┌─────┬───────┬───────┬──────────┬────────────┬───────┐
│ id  ┆ name  ┆ value ┆ category ┆ date       ┆ flag  │
│ --- ┆ ---   ┆ ---   ┆ ---      ┆ ---        ┆ ---   │
│ i64 ┆ str   ┆ f64   ┆ str      ┆ date       ┆ bool  │
╞═════╪═══════╪═══════╪══════════╪════════════╪═══════╡
│ 1   ┆ 张三  ┆ 10.5  ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2   ┆ 李四a ┆ 20.3  ┆ B        ┆ 2023-01-02 ┆ false │
│ 2   ┆ 李四b ┆ 20.3  ┆ B        ┆ 2023-01-02 ┆ false │
│ 3   ┆ 王五  ┆ NaN   ┆ A        ┆ 2023-01-03 ┆ true  │
│ 6   ┆ 孙八  ┆ 70.2  ┆ C        ┆ 2023-01-07 ┆ true  │
└─────┴───────┴───────┴──────────┴────────────┴───────┘

Python

result = df.drop_nulls(subset=['id', 'name'])
print(result)

shape: (7, 6)
┌─────┬───────┬────────┬──────────┬────────────┬───────┐
│ id  ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ --- ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64 ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞═════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1   ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2   ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2   ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3   ┆ 王五  ┆ NaN    ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4   ┆ 赵六  ┆ 40.1   ┆ null     ┆ 2023-01-04 ┆ true  │
│ 5   ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ 6   ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└─────┴───────┴────────┴──────────┴────────────┴───────┘

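Handling NaN Values

NaN values are handled much like null values. Keep in mind that NaN is a floating-point value rather than a missing value, which is why the null methods above leave it untouched.

Filling NaN Values

DataFrame.fill_nan(value: Expr | int | float | None) → DataFrame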

Python

result = df.fill_nan(777)
print(result)

shape: (8, 6)
┌──────┬───────┬────────┬──────────┬────────────┬───────┐
│ id   ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ ---  ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64  ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞══════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1    ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2    ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2    ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3    ┆ 王五  ┆ 777.0  ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4    ┆ 赵六  ┆ 40.1   ┆ null     ┆ 2023-01-04 ┆ true  │
│ 5    ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ null ┆ 未知  ┆ 60.8   ┆ A        ┆ null       ┆ false │
│ 6    ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└──────┴───────┴────────┴──────────┴────────────┴───────┘

Dropping Rows with NaN Values

DataFrame.drop_nans(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) → DataFrame

Python

result = df.drop_nans(subset=['id', 'value'])
print(result)

shape: (7, 6)
┌──────┬───────┬────────┬──────────┬────────────┬───────┐
│ id   ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ ---  ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64  ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞══════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1    ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2    ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2    ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 4    ┆ 赵六  ┆ 40.1   ┆ null     ┆ 2023-01-04 ┆ true  │
│ 5    ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ null ┆ 未知  ┆ 60.8   ┆ A        ┆ null       ┆ false │
│ 6    ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└──────┴───────┴────────┴──────────┴────────────┴───────┘

TIP

Fewer methods are available for NaN than for null, so a practical approach is to convert NaN to null first and then apply the null-handling methods, as follows:

Python

result = df.fill_nan(None)
print(result)

shape: (8, 6)
┌──────┬───────┬────────┬──────────┬────────────┬───────┐
│ id   ┆ name  ┆ value  ┆ category ┆ date       ┆ flag  │
│ ---  ┆ ---   ┆ ---    ┆ ---      ┆ ---        ┆ ---   │
│ i64  ┆ str   ┆ f64    ┆ str      ┆ date       ┆ bool  │
╞══════╪═══════╪════════╪══════════╪════════════╪═══════╡
│ 1    ┆ 张三  ┆ 10.5   ┆ A        ┆ 2023-01-01 ┆ true  │
│ 2    ┆ 李四a ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 2    ┆ 李四b ┆ 20.3   ┆ B        ┆ 2023-01-02 ┆ false │
│ 3    ┆ 王五  ┆ null   ┆ A        ┆ 2023-01-03 ┆ true  │
│ 4    ┆ 赵六  ┆ 40.1   ┆ null     ┆ 2023-01-04 ┆ true  │
│ 5    ┆ 钱七  ┆ 1000.0 ┆ C        ┆ 2023-01-05 ┆ null  │
│ null ┆ 未知  ┆ 60.8   ┆ A        ┆ null       ┆ false │
│ 6    ┆ 孙八  ┆ 70.2   ┆ C        ┆ 2023-01-07 ┆ true  │
└──────┴───────┴────────┴──────────┴────────────┴───────┘

The NaN in the value column has now become null and can be handled with any of the null methods described earlier.

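Concatenating Data

concat

polars.concat(items: Iterable[PolarsType], *, how: ConcatMethod = 'vertical', rechunk: bool = False, parallel: bool = True) → PolarsType

The two most important parameters are:

items: the objects to concatenate. Note that Series only support the vertical strategy, so in practice the items are usually DataFrames.

how: the concatenation strategy; defaults to vertical. Only the most common strategies are covered here; see the official API reference for the rest. vertical: stack rows; horizontal: place columns side by side; diagonal: take the union of the columns and stack rows.

Vertical Concatenation

Call concat with how="vertical" (the default). The frames must have the same column names; otherwise an error is raised, for example when one frame has 3 columns and the other 2, or when both have 2 columns but the names do not all match.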

Python

df1 = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['张三', '李四', '王五'],
    'category': ['A', 'B', 'A'],
    'flag': [True, False, True]
})

df2 = pl.DataFrame({
    'id': [4, 5, 6],
    'name': ['赵六', '钱七', '孙八'],
    'category': ['C', 'A', 'C'],
    'flag': [True, False, True]
})

result = pl.concat([df1, df2])
print(result)

shape: (6, 4)
┌─────┬──────┬──────────┬───────┐
│ id  ┆ name ┆ category ┆ flag  │
│ --- ┆ ---  ┆ ---      ┆ ---   │
│ i64 ┆ str  ┆ str      ┆ bool  │
╞═════╪══════╪══════════╪═══════╡
│ 1   ┆ 张三 ┆ A        ┆ true  │
│ 2   ┆ 李四 ┆ B        ┆ false │
│ 3   ┆ 王五 ┆ A        ┆ true  │
│ 4   ┆ 赵六 ┆ C        ┆ true  │
│ 5   ┆ 钱七 ┆ A        ┆ false │
│ 6   ┆ 孙八 ┆ C        ┆ true  │
└─────┴──────┴──────────┴───────┘

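Horizontal Concatenation

Horizontal concatenation raises an error if the frames share a column name. The frames may have different numbers of rows; the shorter ones are padded with null.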

Python

df1 = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['张三', '李四', '王五'],
    'category': ['A', 'B', 'A'],
    'flag': [True, False, True]
})

df3 = pl.DataFrame({
    'value': [10.5, 20.3, 20.3, 40.1, 1000.0, 60.8, 70.2],
    'date': [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 2), date(2023, 1, 3),
             date(2023, 1, 4), date(2023, 1, 5), date(2023, 1, 7)],
})

result = pl.concat([df1, df3], how='horizontal')
print(result)

shape: (7, 6)
┌──────┬──────┬──────────┬───────┬────────┬────────────┐
│ id   ┆ name ┆ category ┆ flag  ┆ value  ┆ date       │
│ ---  ┆ ---  ┆ ---      ┆ ---   ┆ ---    ┆ ---        │
│ i64  ┆ str  ┆ str      ┆ bool  ┆ f64    ┆ date       │
╞══════╪══════╪══════════╪═══════╪════════╪════════════╡
│ 1    ┆ 张三 ┆ A        ┆ true  ┆ 10.5   ┆ 2023-01-01 │
│ 2    ┆ 李四 ┆ B        ┆ false ┆ 20.3   ┆ 2023-01-02 │
│ 3    ┆ 王五 ┆ A        ┆ true  ┆ 20.3   ┆ 2023-01-02 │
│ null ┆ null ┆ null     ┆ null  ┆ 40.1   ┆ 2023-01-03 │
│ null ┆ null ┆ null     ┆ null  ┆ 1000.0 ┆ 2023-01-04 │
│ null ┆ null ┆ null     ┆ null  ┆ 60.8   ┆ 2023-01-05 │
│ null ┆ null ┆ null     ┆ null  ┆ 70.2   ┆ 2023-01-07 │
└──────┴──────┴──────────┴───────┴────────┴────────────┘

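Diagonal Concatenation

The resulting DataFrame can be longer and/or wider than the inputs. Columns that appear in more than one frame are stacked row-wise; columns missing from a frame are filled with null.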

Python

df1 = pl.DataFrame({
    'id': [1, 2, 3],
    'flag': [True, False, True]
})

df2 = pl.DataFrame({
    'id': [4, 5, 6, 1],
    'flag': [True, False, True, True],
    'value': [10.5, 20.3, 20.3, 70.2],
})

result = pl.concat([df1, df2], how='diagonal')
print(result)

shape: (7, 3)
┌─────┬───────┬───────┐
│ id  ┆ flag  ┆ value │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ bool  ┆ f64   │
╞═════╪═══════╪═══════╡
│ 1   ┆ true  ┆ null  │
│ 2   ┆ false ┆ null  │
│ 3   ┆ true  ┆ null  │
│ 4   ┆ true  ┆ 10.5  │
│ 5   ┆ false ┆ 20.3  │
│ 6   ┆ true  ┆ 20.3  │
│ 1   ┆ true  ┆ 70.2  │
└─────┴───────┴───────┘

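Joining Data

join

DataFrame.join(other: DataFrame, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', *, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, suffix: str = '_right', validate: JoinValidation = 'm:m', nulls_equal: bool = False, coalesce: bool | None = None, maintain_order: MaintainOrderJoin | None = None) → DataFrame

other: the DataFrame to join with.

on: the name(s) of the join column(s). Usable when the join columns have the same names in both frames; for a multi-column join, pass a list such as ["A", "B", "C", ...]. Cannot be combined with left_on and right_on.

how: the join strategy; defaults to inner. inner: inner join; left: left join; right: right join; full: full outer join; semi: semi join; anti: anti join; cross: Cartesian product.

left_on: the join column(s) of the left DataFrame; may be several. Cannot be combined with on.

right_on: the join column(s) of the right DataFrame; may be several. Cannot be combined with on.

suffix: appended to the name of any right-hand column that clashes with a column name on the left. Defaults to _right.

coalesce: whether to merge the join key columns of both sides into a single column.

maintain_order: which row order to preserve in the result. left: preserve the order of the left DataFrame; right: preserve the order of the right DataFrame; left_right: preserve the left order first, then the right; right_left: preserve the right order first, then the left.

Data Preparation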

Python

df1 = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6.0, 7.0, 8.0],
        "ham": ["a", "b", "c"],
    }
)

df2 = pl.DataFrame(
    {
        "apple": ["x", "y", "z"],
        "ham": ["a", "b", "d"],
    }
)

Inner Join

Python

result = df1.join(df2, how="inner", on="ham")
print(result)

shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘

Left Join

Python

result = df1.join(df2, on="ham", how="left")
print(result)

shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘

Right Join

Python

result = df1.join(df2, on="ham", how="right")
print(result)

shape: (3, 4)
┌──────┬──────┬───────┬─────┐
│ foo  ┆ bar  ┆ apple ┆ ham │
│ ---  ┆ ---  ┆ ---   ┆ --- │
│ i64  ┆ f64  ┆ str   ┆ str │
╞══════╪══════╪═══════╪═════╡
│ 1    ┆ 6.0  ┆ x     ┆ a   │
│ 2    ┆ 7.0  ┆ y     ┆ b   │
│ null ┆ null ┆ z     ┆ d   │
└──────┴──────┴───────┴─────┘

Full Join

Python

# Default: both key columns are kept (ham and ham_right)
result = df1.join(df2, on="ham", how="full")
print(result)
# coalesce=True merges the two key columns into a single ham column
result = df1.join(df2, on="ham", how="full", coalesce=True)
print(result)

shape: (4, 5)
┌──────┬──────┬──────┬───────┬───────────┐
│ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
│ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
│ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
╞══════╪══════╪══════╪═══════╪═══════════╡
│ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
│ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
│ null ┆ null ┆ null ┆ z     ┆ d         │
│ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
└──────┴──────┴──────┴───────┴───────────┘
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
│ 2    ┆ 7.0  ┆ b   ┆ y     │
│ null ┆ null ┆ d   ┆ z     │
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘

Semi Join

A semi join resembles an inner join, but the result contains only the columns of the left DataFrame. Compare the output with the inner join result above.

Python

result = df1.join(df2, on="ham", how="semi")
print(result)

shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘

Anti Join

An anti join is the counterpart of a semi join: a semi join returns the rows of the left DataFrame that have a match, whereas an anti join returns the rows of the left DataFrame that have no match.

Python

result = df1.join(df2, on="ham", how="anti")
print(result)

shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘

Cartesian Product

Python

result = df1.join(df2, how="cross")
print(result)

shape: (9, 5)
┌─────┬─────┬─────┬───────┬───────────┐
│ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---       │
│ i64 ┆ f64 ┆ str ┆ str   ┆ str       │
╞═════╪═════╪═════╪═══════╪═══════════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     ┆ a         │
│ 1   ┆ 6.0 ┆ a   ┆ y     ┆ b         │
│ 1   ┆ 6.0 ┆ a   ┆ z     ┆ d         │
│ 2   ┆ 7.0 ┆ b   ┆ x     ┆ a         │
│ 2   ┆ 7.0 ┆ b   ┆ y     ┆ b         │
│ 2   ┆ 7.0 ┆ b   ┆ z     ┆ d         │
│ 3   ┆ 8.0 ┆ c   ┆ x     ┆ a         │
│ 3   ┆ 8.0 ┆ c   ┆ y     ┆ b         │
│ 3   ┆ 8.0 ┆ c   ┆ z     ┆ d         │
└─────┴─────┴─────┴───────┴───────────┘

Worked Example

Python

# Employee table (left): user_id and dept_id form the composite key
df_employees = pl.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "dept_id": [10, 20, 20, 30, 40],  # department ID; part of the composite key with user_id
    "name": ["张三", "李四", "李四", "王五", "赵六"],
    "age": [25, 30, 30, 35, 40]
})

# Performance table (right): also keyed by user_id and dept_id
df_performance = pl.DataFrame({
    "user_id": [2, 2, 3, 5],
    "dept_id": [20, 20, 30, 50],  # must match the left table's dept_id as well
    "score": [85, 90, 88, 92],  # performance score
    "review": ["良好", "优秀", "良好", "优秀"]
})

inner_join = df_employees.join(
    df_performance,
    on=["user_id", "dept_id"],  # multi-column join: match on both user_id and dept_id
    how="inner"
)
print("Inner join (multi-column match) result:")
print(inner_join)

left_join = df_employees.join(
    df_performance,
    on=["user_id", "dept_id"],
    how="left"
)

print("Left join (multi-column match) result:")
print(left_join)

right_join = df_employees.join(
    df_performance,
    on=["user_id", "dept_id"],
    how="right"
)

print("Right join (multi-column match) result:")
print(right_join)

Inner join (multi-column match) result:
shape: (5, 6)
┌─────────┬─────────┬──────┬─────┬───────┬────────┐
│ user_id ┆ dept_id ┆ name ┆ age ┆ score ┆ review │
│ ---     ┆ ---     ┆ ---  ┆ --- ┆ ---   ┆ ---    │
│ i64     ┆ i64     ┆ str  ┆ i64 ┆ i64   ┆ str    │
╞═════════╪═════════╪══════╪═════╪═══════╪════════╡
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 85    ┆ 良好   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 90    ┆ 优秀   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 85    ┆ 良好   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 90    ┆ 优秀   │
│ 3       ┆ 30      ┆ 王五 ┆ 35  ┆ 88    ┆ 良好   │
└─────────┴─────────┴──────┴─────┴───────┴────────┘
Left join (multi-column match) result:
shape: (7, 6)
┌─────────┬─────────┬──────┬─────┬───────┬────────┐
│ user_id ┆ dept_id ┆ name ┆ age ┆ score ┆ review │
│ ---     ┆ ---     ┆ ---  ┆ --- ┆ ---   ┆ ---    │
│ i64     ┆ i64     ┆ str  ┆ i64 ┆ i64   ┆ str    │
╞═════════╪═════════╪══════╪═════╪═══════╪════════╡
│ 1       ┆ 10      ┆ 张三 ┆ 25  ┆ null  ┆ null   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 85    ┆ 良好   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 90    ┆ 优秀   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 85    ┆ 良好   │
│ 2       ┆ 20      ┆ 李四 ┆ 30  ┆ 90    ┆ 优秀   │
│ 3       ┆ 30      ┆ 王五 ┆ 35  ┆ 88    ┆ 良好   │
│ 4       ┆ 40      ┆ 赵六 ┆ 40  ┆ null  ┆ null   │
└─────────┴─────────┴──────┴─────┴───────┴────────┘
Right join (multi-column match) result:
shape: (6, 6)
┌──────┬──────┬─────────┬─────────┬───────┬────────┐
│ name ┆ age  ┆ user_id ┆ dept_id ┆ score ┆ review │
│ ---  ┆ ---  ┆ ---     ┆ ---     ┆ ---   ┆ ---    │
│ str  ┆ i64  ┆ i64     ┆ i64     ┆ i64   ┆ str    │
╞══════╪══════╪═════════╪═════════╪═══════╪════════╡
│ 李四 ┆ 30   ┆ 2       ┆ 20      ┆ 85    ┆ 良好   │
│ 李四 ┆ 30   ┆ 2       ┆ 20      ┆ 85    ┆ 良好   │
│ 李四 ┆ 30   ┆ 2       ┆ 20      ┆ 90    ┆ 优秀   │
│ 李四 ┆ 30   ┆ 2       ┆ 20      ┆ 90    ┆ 优秀   │
│ 王五 ┆ 35   ┆ 3       ┆ 30      ┆ 88    ┆ 良好   │
│ null ┆ null ┆ 5       ┆ 50      ┆ 92    ┆ 优秀   │
└──────┴──────┴─────────┴─────────┴───────┴────────┘

IO Operations

File Operations

Reading and writing files follow the same pattern for every format and are straightforward. Some commonly used formats:

Read CSV: pl.read_csv("data.csv")
Write CSV: df.write_csv("data.csv")
Read Excel: pl.read_excel("data.xlsx")
Write Excel: df.write_excel("data.xlsx")
Read JSON: pl.read_json("data.json")
Write JSON: df.write_json("data.json")
Read Parquet: pl.read_parquet("data.parquet")
Write Parquet: df.write_parquet("data.parquet")

WARNING

Reading or writing some formats requires additional dependencies. For Excel, for example, you need to install packages such as fastexcel, xlsx2csv, openpyxl, and xlsxwriter.

Database Operations

Reading from a Database

Python

import polars as pl
from sqlalchemy import create_engine

# Option 1: pass a connection URI and let Polars open the connection
uri = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_database_uri(query=query, uri=uri)

# Option 2: pass an existing connection, here one created with SQLAlchemy
conn = create_engine("sqlite:///test.db")
query = "SELECT * FROM foo"
pl.read_database(query=query, connection=conn.connect())

Writing to a Database

Python

uri = "postgresql://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_database(table_name="records", connection=uri)
