墨水记忆

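Enums

An enum records data that can take only a finite set of possible values, such as gender, units (of time, distance, and so on), or operating system. Polars provides two such data types, Enum and Categorical. This chapter covers only Enum; if you are interested in Categorical, see the official Polars documentation.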

Creating and using `Enum`

To use an Enum, you must declare its categories in advance, as shown below:

Python

import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])  # define an Enum type
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)  # create a Series whose elements have that Enum type
print(bears)

shape: (5,)
Series: '' [enum]
[
	"Polar"
	"Panda"
	"Brown"
	"Brown"
	"Polar"
]

Enums can also be used in a DataFrame:

Python

import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    # Store the "level" column as the Enum type instead of plain strings
    schema_overrides={
        "level": log_levels,
    },
)
print(logs)

shape: (4, 2)
┌───────┬───────────────────────────┐
│ level ┆ message                   │
│ ---   ┆ ---                       │
│ enum  ┆ str                       │
╞═══════╪═══════════════════════════╡
│ debug ┆ process id: 525           │
│ info  ┆ Service started correctly │
│ debug ┆ startup time: 67ms        │
│ error ┆ Cannot connect to DB!     │
└───────┴───────────────────────────┘

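Invalid values

Once a column's data type is set to an Enum, any value in the column that is not one of the Enum's categories raises an error, as in the following examples: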

Python

import polars as pl

# Example 1: "dms" is not one of the Enum categories, so creating the Series raises
bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar", "dms"], dtype=bears_enum)

# Example 2 (run separately, because Example 1 already raises): "dms" is not a valid level
log_levels = pl.Enum(["debug", "info", "warning", "error"])
logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error", "dms"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
            "placeholder message",  # added so both columns have five values
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

Both examples above fail with an error similar to the following:

polars.exceptions.InvalidOperationError: conversion from `str` to `enum` failed in column '' for 1 out of 6 values: ["dms"]

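Category ordering and comparison

An Enum is ordered, and the order is the one in which the categories were listed when the Enum was created. For example: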

Python

import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])  # debug < info < warning < error

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# Keep only the rows whose level ranks above "info"
above_info_logs = logs.filter(
    pl.col("level") > "info",
)
print(above_info_logs)

shape: (1, 2)
┌───────┬───────────────────────┐
│ level ┆ message               │
│ ---   ┆ ---                   │
│ enum  ┆ str                   │
╞═══════╪═══════════════════════╡
│ error ┆ Cannot connect to DB! │
└───────┴───────────────────────┘

Python

import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs2 = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# The comparison works in the other direction too: keep rows ranking below "info"
below_info_logs = logs2.filter(
    pl.col("level") < "info",
)
print(below_info_logs)

shape: (2, 2)
┌───────┬────────────────────┐
│ level ┆ message            │
│ ---   ┆ ---                │
│ enum  ┆ str                │
╞═══════╪════════════════════╡
│ debug ┆ process id: 525    │
│ debug ┆ startup time: 67ms │
└───────┴────────────────────┘
