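An enum records data that can take only a finite set of possible values, such as gender, units (of time, distance, and so on), or operating systems. Polars provides two data types for this: Enum and Categorical. This chapter covers only Enum; if you are interested in Categorical, see the official Polars documentation.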
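Enum

When using Enum, you must declare the type and its allowed values up front, as shown below:

```python
import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])  # define an enum type
# create a Series whose element type is the enum type
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)
print(bears)
```

```
shape: (5,)
Series: '' [enum]
[
    "Polar"
    "Panda"
    "Brown"
    "Brown"
    "Polar"
]
```

An Enum can also be used as the type of a DataFrame column: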
```python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    # store the "level" column as the enum instead of plain strings
    schema_overrides={
        "level": log_levels,
    },
)
print(logs)
```

```
shape: (4, 2)
┌───────┬───────────────────────────┐
│ level ┆ message                   │
│ ---   ┆ ---                       │
│ enum  ┆ str                       │
╞═══════╪═══════════════════════════╡
│ debug ┆ process id: 525           │
│ info  ┆ Service started correctly │
│ debug ┆ startup time: 67ms        │
│ error ┆ Cannot connect to DB!     │
└───────┴───────────────────────────┘
```

Once a column's data type has been declared as an Enum, any value in that column that is not one of the Enum's values raises an error, as in the following examples:
```python
import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
# Raises: "dms" is not one of the enum's values.
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar", "dms"], dtype=bears_enum)

log_levels = pl.Enum(["debug", "info", "warning", "error"])
# Run separately, this also raises: "dms" is not a valid log level.
logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error", "dms"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
            "placeholder message",  # added so both columns have the same length
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)
```

Both examples fail with an error similar to the following:

```
polars.exceptions.InvalidOperationError: conversion from `str` to `enum` failed in column '' for 1 out of 6 values: ["dms"]
```

An Enum is ordered: its order is the order in which the values were listed when the Enum was created. For example:
```python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])  # debug < info < warning < error

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# keep only the rows whose level ranks above "info"
logs_above_info = logs.filter(
    pl.col("level") > "info",
)
print(logs_above_info)
```

```
shape: (1, 2)
┌───────┬───────────────────────┐
│ level ┆ message               │
│ ---   ┆ ---                   │
│ enum  ┆ str                   │
╞═══════╪═══════════════════════╡
│ error ┆ Cannot connect to DB! │
└───────┴───────────────────────┘
```

The comparison works the same way in the other direction:

```python
import polars as pl

log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs2 = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

# keep only the rows whose level ranks below "info"
logs_below_info = logs2.filter(
    pl.col("level") < "info",
)
print(logs_below_info)
```

```
shape: (2, 2)
┌───────┬────────────────────┐
│ level ┆ message            │
│ ---   ┆ ---                │
│ enum  ┆ str                │
╞═══════╪════════════════════╡
│ debug ┆ process id: 525    │
│ debug ┆ startup time: 67ms │
└───────┴────────────────────┘
```