Enums

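An enum records data that can take only a limited set of possible values, for example gender, units (of time, distance, and so on), or operating systems. Polars provides two data types for this: Enum and Categorical. This chapter covers only Enum; if you are interested in Categorical, see the official Polars documentation.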

Creating and using an Enum

To use an Enum, you must declare its categories in advance, as shown below:

Python
import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])  # define an Enum type
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)  # create a Series whose elements have the Enum type
print(bears)

shape: (5,)
Series: '' [enum]
[
	"Polar"
	"Panda"
	"Brown"
	"Brown"
	"Polar"
]

Enums can also be used in a DataFrame:

Python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    # Store the "level" column as the Enum instead of a plain string
    schema_overrides={
        "level": log_levels,
    },
)
print(logs)

shape: (4, 2)
┌───────┬───────────────────────────┐
│ level ┆ message                   │
│ ---   ┆ ---                       │
│ enum  ┆ str                       │
╞═══════╪═══════════════════════════╡
│ debug ┆ process id: 525           │
│ info  ┆ Service started correctly │
│ debug ┆ startup time: 67ms        │
│ error ┆ Cannot connect to DB!     │
└───────┴───────────────────────────┘

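Invalid values

Once a column's data type is set to Enum, any value in that column that is not one of the Enum's categories raises an error, as in the following example: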

Python
import polars as pl

# Each of the two statements below raises on its own; run them separately.
bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar", "dms"], dtype=bears_enum)  # "dms" is not a category

log_levels = pl.Enum(["debug", "info", "warning", "error"])
logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error", "dms"],  # "dms" is not a category
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
            "unknown event",  # placeholder fifth message (not in the original) so both columns have five rows
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

Both examples above raise an error similar to the following:

polars.exceptions.InvalidOperationError: conversion from `str` to `enum` failed in column '' for 1 out of 6 values: ["dms"]

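Category ordering and comparison

An Enum is ordered, and the order is determined by the order in which the categories are listed when the Enum is created. For example: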

Python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])  # debug < info < warning < error

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# Keep only the rows whose level ranks above "info"
above_info_logs = logs.filter(
    pl.col("level") > "info",
)
print(above_info_logs)

shape: (1, 2)
┌───────┬───────────────────────┐
│ level ┆ message               │
│ ---   ┆ ---                   │
│ enum  ┆ str                   │
╞═══════╪═══════════════════════╡
│ error ┆ Cannot connect to DB! │
└───────┴───────────────────────┘

Comparing with < instead keeps only the rows whose level ranks below "info":

Python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs2 = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# Keep only the rows whose level ranks below "info"
below_info_logs = logs2.filter(
    pl.col("level") < "info",
)
print(below_info_logs)

shape: (2, 2)
┌───────┬────────────────────┐
│ level ┆ message            │
│ ---   ┆ ---                │
│ enum  ┆ str                │
╞═══════╪════════════════════╡
│ debug ┆ process id: 525    │
│ debug ┆ startup time: 67ms │
└───────┴────────────────────┘