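List) and arrays (Array), which share many similar features.
List
List: a one-dimensional container holding values of a single data type; its length can vary from row to row.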
import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "names": [
            ["Anne", "Averill", "Adams"],
            ["Brandon", "Brooke", "Borden", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "children_ages": [
            [5, 7],
            [],
            [],
            [8, 11, 18],
        ],
        "medical_appointments": [
            [],
            [],
            [],
            [datetime(2022, 5, 22, 16, 30)],
        ],
    }
)

print(df)

shape: (4, 3)
┌─────────────────────────────────┬───────────────┬───────────────────────┐
│ names                           ┆ children_ages ┆ medical_appointments  │
│ ---                             ┆ ---           ┆ ---                   │
│ list[str]                       ┆ list[i64]     ┆ list[datetime[μs]]    │
╞═════════════════════════════════╪═══════════════╪═══════════════════════╡
│ ["Anne", "Averill", "Adams"]    ┆ [5, 7]        ┆ []                    │
│ ["Brandon", "Brooke", … "Brans… ┆ []            ┆ []                    │
│ ["Camila", "Campbell"]          ┆ []            ┆ []                    │
│ ["Dennis", "Doyle"]             ┆ [8, 11, 18]   ┆ [2022-05-22 16:30:00] │
└─────────────────────────────────┴───────────────┴───────────────────────┘
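The List data type is not the same as Python's list: every element in a List column must share one data type, whereas a Python list can mix types.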
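Array
Array: a container, possibly multi-dimensional, holding values of a single data type and having a fixed shape. This makes it a good fit for storing data whose size and data type are known in advance.
The code below defines two columns. Within each column, every row must have the same number of elements and the same element type.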
df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    },
    schema={
        "bit_flags": pl.Array(pl.Boolean, 5),
        "tic_tac_toe": pl.Array(pl.String, (3, 3)),
    },
)

print(df)

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ array[bool, 5]        ┆ array[str, (3, 3)]              │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘
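In the example above, the schema parameter explicitly declares bit_flags as a one-dimensional Boolean array and tic_tac_toe as a two-dimensional String array with three rows and three columns. The explicit schema is needed because, for performance reasons, Polars normally does not infer Array and defaults to List instead, as shown below: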
df = pl.DataFrame(
    {
        "bit_flags": [
            [True, True, True, True, False],
            [False, True, True, True, True],
        ],
        "tic_tac_toe": [
            [
                [" ", "x", "o"],
                [" ", "x", " "],
                ["o", "x", " "],
            ],
            [
                ["o", "x", "x"],
                [" ", "o", "x"],
                [" ", " ", "o"],
            ],
        ],
    }
)

print(df)

shape: (2, 2)
┌───────────────────────┬─────────────────────────────────┐
│ bit_flags             ┆ tic_tac_toe                     │
│ ---                   ┆ ---                             │
│ list[bool]            ┆ list[list[str]]                 │
╞═══════════════════════╪═════════════════════════════════╡
│ [true, true, … false] ┆ [[" ", "x", "o"], [" ", "x", "… │
│ [false, true, … true] ┆ [["o", "x", "x"], [" ", "o", "… │
└───────────────────────┴─────────────────────────────────┘
There is one exception, however: when a column is built from a NumPy array:
import numpy as np

array = np.arange(0, 120).reshape((5, 2, 3, 4))  # 4D array

print(pl.Series(array).dtype)  # Column with the 3D subarrays

Array(Int64, shape=(2, 3, 4))
List
Polars provides many functions for working with the List data type; they live in the list namespace. The following is weather data from several stations:
weather = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)

print(weather)

shape: (5, 2)
┌───────────┬─────────────────────────────────┐
│ station   ┆ temperatures                    │
│ ---       ┆ ---                             │
│ str       ┆ str                             │
╞═══════════╪═════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20        │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22          │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6   │
│ Station 5 ┆ 14 8 E0 16 22 24 E1             │
└───────────┴─────────────────────────────────┘
To analyze the temperatures captured by each station, first split every row of the temperatures column on spaces so that it becomes a list:
weather = weather.with_columns(
    pl.col("temperatures").str.split(" "),
)
print(weather)

shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘
The list namespace offers many functions for manipulating lists, including slice, head, and tail, which behave much like their string counterparts covered earlier.
result = weather.with_columns(
    pl.col("temperatures").list.head(3).alias("head"),  # first 3 elements of each list
    pl.col("temperatures").list.tail(3).alias("tail"),  # last 3 elements of each list
    pl.col("temperatures").list.slice(-3, 2).alias("two_next_to_last"),  # 2 elements, starting at the third from the end
)
print(result)

shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬──────────────────┐
│ station   ┆ temperatures         ┆ head               ┆ tail               ┆ two_next_to_last │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ ---              │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ list[str]        │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ ["9", "6"]       │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ ["90", "70"]     │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ ["12", "10"]     │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ ["17", "13"]     │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ ["22", "24"]     │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴──────────────────┘
Combining eval from the list namespace with element makes it possible to run an operation on every element of a list:
result = weather.with_columns(
    (pl.col("temperatures").list.eval(pl.element().str.contains("E"))).alias("flag"),
)

print(result)

shape: (5, 3)
┌───────────┬──────────────────────┬─────────────────────────┐
│ station   ┆ temperatures         ┆ flag                    │
│ ---       ┆ ---                  ┆ ---                     │
│ str       ┆ list[str]            ┆ list[bool]              │
╞═══════════╪══════════════════════╪═════════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ [false, false, … false] │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ [false, false, … false] │
│ Station 3 ┆ ["19", "24", … "22"] ┆ [false, false, … false] │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ [true, true, … false]   │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ [false, false, … true]  │
└───────────┴──────────────────────┴─────────────────────────┘
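To count the anomalous temperature readings at each station:
1. Convert the measurements to numbers; anomalous readings become null. (Alternatively, simply check whether the value contains the letter E.)
2. Count the nulls in each list.
3. Rename the resulting column to "errors".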
result = weather.with_columns(
    pl.col("temperatures")  # select the temperatures column
    .list  # enter the list namespace
    .eval(  # run an expression on every element of each list
        pl.element()  # refers to an individual list element
        .cast(pl.Int64, strict=False)  # convert to integer; failures become null
        .is_null()  # True where the conversion failed
    )
    .list.sum()  # sum each list, counting True as 1 and False as 0
    .alias("errors"),
)
print(result)

shape: (5, 3)
┌───────────┬──────────────────────┬────────┐
│ station   ┆ temperatures         ┆ errors │
│ ---       ┆ ---                  ┆ ---    │
│ str       ┆ list[str]            ┆ u32    │
╞═══════════╪══════════════════════╪════════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ 1      │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ 4      │
│ Station 3 ┆ ["19", "24", … "22"] ┆ 1      │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ 3      │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ 2      │
└───────────┴──────────────────────┴────────┘
Create a new dataset:
weather_by_day = pl.DataFrame(
    {
        "station": [f"Station {idx}" for idx in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘
Now compute, for every station, the percentage rank of each day's temperature:
rank_pct = (pl.element().rank(descending=True) / pl.element().count()).round(2)

result = weather_by_day.with_columns(
    # concat_list merges the values of each row into one list,
    # producing a temporary intermediate column "all_temps"
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # keep every column except all_temps
    pl.all().exclude("all_temps"),
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)

print(result)

shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank         │
│ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ i64   ┆ i64   ┆ list[f64]          │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    ┆ [0.33, 1.0, 0.67]  │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    ┆ [0.83, 0.83, 0.33] │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    ┆ [1.0, 0.67, 0.33]  │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    ┆ [0.67, 1.0, 0.33]  │
│ Station 5  ┆ 9     ┆ 7     ┆ 8     ┆ [0.33, 1.0, 0.67]  │
│ Station 6  ┆ 21    ┆ 14    ┆ 23    ┆ [0.67, 1.0, 0.33]  │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    ┆ [0.33, 1.0, 0.67]  │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    ┆ [1.0, 0.67, 0.33]  │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    ┆ [1.0, 0.67, 0.33]  │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    ┆ [0.33, 0.67, 1.0]  │
└────────────┴───────┴───────┴───────┴────────────────────┘
Array
Polars usually does not infer the Array type, so the data type must be specified explicitly when creating the Series/DataFrame, or the column must be created from a NumPy array.
df = pl.DataFrame(
    {
        "first_last": [
            ["Anne", "Adams"],
            ["Brandon", "Branson"],
            ["Camila", "Campbell"],
            ["Dennis", "Doyle"],
        ],
        "fav_numbers": [
            [42, 0, 1],
            [2, 3, 5],
            [13, 21, 34],
            [73, 3, 7],
        ],
    },
    schema={
        "first_last": pl.Array(pl.String, 2),
        "fav_numbers": pl.Array(pl.Int32, 3),
    },
)

result = df.select(
    pl.col("first_last").arr.join(" ").alias("name"),
    pl.col("fav_numbers").arr.sort(),
    pl.col("fav_numbers").arr.max().alias("largest_fav"),
    pl.col("fav_numbers").arr.sum().alias("summed"),
    pl.col("fav_numbers").arr.contains(3).alias("likes_3"),
)
print(result)

shape: (4, 5)
┌─────────────────┬───────────────┬─────────────┬────────┬─────────┐
│ name            ┆ fav_numbers   ┆ largest_fav ┆ summed ┆ likes_3 │
│ ---             ┆ ---           ┆ ---         ┆ ---    ┆ ---     │
│ str             ┆ array[i32, 3] ┆ i32         ┆ i32    ┆ bool    │
╞═════════════════╪═══════════════╪═════════════╪════════╪═════════╡
│ Anne Adams      ┆ [0, 1, 42]    ┆ 42          ┆ 43     ┆ false   │
│ Brandon Branson ┆ [2, 3, 5]     ┆ 5           ┆ 10     ┆ true    │
│ Camila Campbell ┆ [13, 21, 34]  ┆ 34          ┆ 68     ┆ false   │
│ Dennis Doyle    ┆ [3, 7, 73]    ┆ 73          ┆ 83     ┆ true    │
└─────────────────┴───────────────┴─────────────┴────────┴─────────┘