墨水记忆

Lists and Arrays

TIP

Polars has two container data types, List and Array, which offer many similar features.
Official API: List
Official API: Array

Lists (`List`)

List: a one-dimensional container whose elements all share the same data type; its length may vary from row to row.

Python

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "names": [
            ["Anne", "Averill", "Adams"],
            ["Brandon", "Brooke", "Borden", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "children_ages": [
            [5, 7],
            [],
            [],
            [8, 11, 18],
        ],
        "medical_appointments": [
            [],
            [],
            [],
            [datetime(2022, 5, 22, 16, 30)],
        ],
    }
)

print(df)

shape: (4, 3)
┌─────────────────────────────────┬───────────────┬───────────────────────┐
│ names                           ┆ children_ages ┆ medical_appointments  │
│ ---                             ┆ ---           ┆ ---                   │
│ list[str]                       ┆ list[i64]     ┆ list[datetime[μs]]    │
╞═════════════════════════════════╪═══════════════╪═══════════════════════╡
│ ["Anne", "Averill", "Adams"]    ┆ [5, 7]        ┆ []                    │
│ ["Brandon", "Brooke", … "Brans… ┆ []            ┆ []                    │
│ ["Camila", "Campbell"]          ┆ []            ┆ []                    │
│ ["Dennis", "Doyle"]             ┆ [8, 11, 18]   ┆ [2022-05-22 16:30:00] │
└─────────────────────────────────┴───────────────┴───────────────────────┘

NOTE

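The List data type is not the same as Python's list: every element of a List column shares one data type, whereas a Python list can mix types.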

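Arrays (`Array`)

Array: a container, possibly multidimensional, whose elements all share the same data type and whose shape is fixed. This makes it a good fit for data whose size and type are known in advance.

The code below defines two columns. Within each column, every row must have the same shape and the same element type.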

Python

df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    },
    schema={
        "bit_flags": pl.Array(pl.Boolean, 5),
        "tic_tac_toe": pl.Array(pl.String, (3, 3)),
    },
)

print(df)

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ array[bool, 5]        ┆ array[str, (3, 3)]              │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘

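In the example above, the schema parameter explicitly declares bit_flags as a one-dimensional Boolean array of length 5 and tic_tac_toe as a two-dimensional String array with three rows and three columns. The explicit schema is needed because, for performance reasons, Polars does not infer Array and falls back to List by default, as shown below: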

Python

df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    }
)

print(df)

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ list[bool]            ┆ list[list[str]]                 │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘

There is one exception, however, which is when a column is built from a NumPy array:

Python

import numpy as np

array = np.arange(0, 120).reshape((5, 2, 3, 4))  # 4D array

print(pl.Series(array).dtype)  # Column with the 3D subarrays

Array(Int64, shape=(2, 3, 4))

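Working with `List`

Data preparation

Polars provides many functions for working with the List data type; they live in the list namespace. The following is weather data from several stations: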

Python

weather = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)

print(weather)

shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ station   ┆ temperatures                    │
│ ---       ┆ ---                             │
│ str       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20        │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22          │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6   │
│ Station 5 ┆ 14 8 E0 16 22 24 E1             │
└───────────┴─────────────────────────────────┘

To analyze the temperatures captured by each station in the sample data above, first split each row of the temperatures column on spaces so that it becomes a list:

Python

weather = weather.with_columns(
    pl.col("temperatures").str.split(" "),
)
print(weather)

shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘

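Operating on lists

The list namespace offers many functions for manipulating lists, including slice, head, and tail, which behave much like their counterparts for strings seen earlier.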

Python

result = weather.with_columns(
    pl.col("temperatures").list.head(3).alias("head"),  # first 3 elements of each list
    pl.col("temperatures").list.tail(3).alias("tail"),  # last 3 elements of each list
    pl.col("temperatures").list.slice(-3, 2).alias("two_next_to_last"),  # 2 elements, starting at the third from the end
)
print(result)

shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬──────────────────┐
│ station   ┆ temperatures         ┆ head               ┆ tail               ┆ two_next_to_last │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ ---              │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ list[str]        │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ ["9", "6"]       │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ ["90", "70"]     │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ ["12", "10"]     │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ ["17", "13"]     │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ ["22", "24"]     │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴──────────────────┘

Iterating over lists

Analogy

Python: .apply(lambda x: x.rate + 1)
Java: .map(x -> x.rate + 1)

Combining eval from the list namespace with element makes it possible to apply an operation to every element of a list:

Python

result = weather.with_columns(
    (pl.col("temperatures").list.eval(pl.element().str.contains("E"))).alias("flag"),
)

print(result)

shape: (5, 3)
┌───────────┬──────────────────────┬─────────────────────────┐
│ station   ┆ temperatures         ┆ flag                    │
│ ---       ┆ ---                  ┆ ---                     │
│ str       ┆ list[str]            ┆ list[bool]              │
╞═══════════╪══════════════════════╪═════════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ [false, false, … false] │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ [false, false, … false] │
│ Station 3 ┆ ["19", "24", … "22"] ┆ [false, false, … false] │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ [true, true, … false]   │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ [false, false, … true]  │
└───────────┴──────────────────────┴─────────────────────────┘

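Counting within lists

Find the number of anomalous temperature readings at each station:

Convert the measurements to numbers; anomalous readings become null. (Alternatively, simply check for the letter E.)
Count the number of null values in each list.
Name the output column "errors".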

Python

result = weather.with_columns(
    pl.col("temperatures")  # select the temperatures column
    .list  # enter the list namespace
    .eval(  # eval applies an expression to every element of each list
        pl.element()  # refers to the elements of the list
        .cast(pl.Int64, strict=False)  # convert to a number; failures become null
        .is_null()  # flag the nulls
    )
    .list.sum()  # sum the flags in each list: True counts as 1, False as 0
    .alias("errors"),
)
print(result)

shape: (5, 3)
┌───────────┬──────────────────────┬────────┐
│ station   ┆ temperatures         ┆ errors │
│ ---       ┆ ---                  ┆ ---    │
│ str       ┆ list[str]            ┆ u32    │
╞═══════════╪══════════════════════╪════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ 1      │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ 4      │
│ Station 3 ┆ ["19", "24", … "22"] ┆ 1      │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ 3      │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ 2      │
└───────────┴──────────────────────┴────────┘

Row-wise computation

Create a new dataset:

Python

weather_by_day = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘

Now compute, for each station, the percentage rank of each day's temperature among that station's days:

Python

rank_pct = (pl.element().rank(descending=True) / pl.element().count()).round(2)

result = weather_by_day.with_columns(
    # concat_list merges the values of each row into one list, producing a temporary column "all_temps"
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # keep every column except the temporary all_temps
    pl.all().exclude("all_temps"),
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)

print(result)

shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank         │
│ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ i64   ┆ i64   ┆ list[f64]          │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    ┆ [0.33, 1.0, 0.67]  │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    ┆ [0.83, 0.83, 0.33] │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    ┆ [1.0, 0.67, 0.33]  │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    ┆ [0.67, 1.0, 0.33]  │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     ┆ [0.33, 1.0, 0.67]  │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    ┆ [0.67, 1.0, 0.33]  │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    ┆ [0.33, 1.0, 0.67]  │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    ┆ [1.0, 0.67, 0.33]  │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    ┆ [1.0, 0.67, 0.33]  │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    ┆ [0.33, 0.67, 1.0]  │
└────────────┴───────┴───────┴───────┴────────────────────┘

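Working with `Array`

Polars generally does not infer the Array data type, so you must either specify the data type explicitly when creating the Series/DataFrame or build the column from a NumPy array.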

Python

df = pl.DataFrame(
    {
        "first_last": [
            ["Anne", "Adams"],
            ["Brandon", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "fav_numbers": [
            [42, 0, 1],
            [2, 3, 5],
            [13, 21, 34],
            [73, 3, 7],
        ],
    },
    schema={
        "first_last": pl.Array(pl.String, 2),
        "fav_numbers": pl.Array(pl.Int32, 3),
    },
)

result = df.select(
    pl.col("first_last").arr.join(" ").alias("name"),
    pl.col("fav_numbers").arr.sort(),
    pl.col("fav_numbers").arr.max().alias("largest_fav"),
    pl.col("fav_numbers").arr.sum().alias("summed"),
    pl.col("fav_numbers").arr.contains(3).alias("likes_3"),
)
print(result)

shape: (4, 5)
┌─────────────────┬───────────────┬─────────────┬────────┬─────────┐
│ name            ┆ fav_numbers   ┆ largest_fav ┆ summed ┆ likes_3 │
│ ---             ┆ ---           ┆ ---         ┆ ---    ┆ ---     │
│ str             ┆ array[i32, 3] ┆ i32         ┆ i32    ┆ bool    │
╞═════════════════╪═══════════════╪═════════════╪════════╪═════════╡
│ Anne Adams      ┆ [0, 1, 42]    ┆ 42          ┆ 43     ┆ false   │
│ Brandon Branson ┆ [2, 3, 5]     ┆ 5           ┆ 10     ┆ true    │
│ Camila Campbell ┆ [13, 21, 34]  ┆ 34          ┆ 68     ┆ false   │
│ Dennis Doyle    ┆ [3, 7, 73]    ┆ 73          ┆ 83     ┆ true    │
└─────────────────┴───────────────┴─────────────┴────────┴─────────┘
