Categorical data and enums

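If a column holds string values that can only take one of a finite number of possible values, that column contains categorical data. For example, a column of operating system names has only a handful of distinct values.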

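When working with categorical data, we can use Polars' dedicated types, Categorical and Enum, to improve query performance. Next, let's look at how the two data types differ and how to choose between them.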

Comparison

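In short, prefer Enum whenever possible. If the categories are fixed and known in advance, use Enum. If the categories are not fixed, you have to use Categorical. If your requirements change along the way, you can always cast from one type to the other.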

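The Enum data type

Creating an Enum

An Enum requires its categories to be specified up front. Let's look at a valid case and an error case.

Valid case

  • Define an Enum with three categories: Polar, Panda, Brown

  • Every value in the enum_1 column must be one of the categories defined in that Enum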

import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])

bears = pl.Series("enum_1", ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)
print(bears)

shape: (5,)
Series: 'enum_1' [enum]
[
	"Polar"
	"Panda"
	"Brown"
	"Brown"
	"Polar"
]

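Invalid values

If a value in the column is not one of the Enum's categories, Polars raises an error. In the code below, the first value, "Polar1", is not defined in bears_enum, so constructing the Series fails.

import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
bears = pl.Series("enum_1", [
            "Polar1",  # not a category of bears_enum
            "Panda", "Brown", "Brown", "Polar"
            ], dtype=bears_enum)
print(bears)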

InvalidOperationError                     Traceback (most recent call last)
Cell In[7], line 4
      1 import polars as pl
      3 bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
----> 4 bears = pl.Series("enum_1", [
      5             "Polar1",
      6             "Panda", "Brown", "Brown", "Polar"
      7             ], dtype=bears_enum)
      8 print(bears)

(some error output omitted)

InvalidOperationError: conversion from `str` to `enum` failed in column 'enum_1' for 1 out of 5 values: ["Polar1"]

Ensure that all values in the input column are present in the categories of the enum datatype.

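Category ordering and comparison

An Enum is ordered, and its order is determined by the order in which the categories are listed. The code below uses log levels to demonstrate this.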

import polars as pl
# debug < info < warning < error
log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

non_debug_logs = logs.filter(
    pl.col("level") > "debug",
)
print(non_debug_logs)

shape: (2, 2)
┌───────┬───────────────────────────┐
│ level ┆ message                   │
│ ---   ┆ ---                       │
│ enum  ┆ str                       │
╞═══════╪═══════════════════════════╡
│ info  ┆ Service started correctly │
│ error ┆ Cannot connect to DB!     │
└───────┴───────────────────────────┘

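The code above shows that Enum values can be compared with a string. Note, however, that the string must be one of the Enum's categories; otherwise Polars raises an error. To try it yourself, change the filter condition to pl.col("level") > "debug1".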

The Categorical data type

Categorical can be seen as a more flexible Enum.

Creating a Categorical

Passing dtype=pl.Categorical creates data of type Categorical.

import polars as pl

bears_cat = pl.Series(
    ["Polar","Panda","Brown","Brown","Polar"], dtype=pl.Categorical
)
print(bears_cat)

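Comparing with strings

When a Categorical column is compared with a string, Polars performs a string comparison. The code makes this clear: only the two values starting with "B" are less than "Cat".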

print(bears_cat < "Cat")

shape: (5,)
Series: '' [bool]
[
    false
    false
    true
    true
    false
]

A string column can also be compared with a Categorical column, but comparing two Categorical columns is usually more efficient.

bears_str = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"],
)
print(bears_cat == bears_str)

shape: (5,)
Series: '' [bool]
[
    false
    false
    true
    false
    true
]

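Comparing Categorical columns and the string cache

By default, Polars encodes the values of a Categorical column in the order in which they appear in that column, and it encodes each column independently of the others. As a result, Polars cannot efficiently compare two columns that were created independently.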

from polars.exceptions import StringCacheMismatchError
import polars as pl

bears_cat = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
bears_cat2 = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical,
)

try:
    print(bears_cat == bears_cat2)
except StringCacheMismatchError as e:
    print(str(e))          # the error message

The message below is quite helpful: it tells us what to do and warns that the fix comes with some performance cost.

cannot compare categoricals coming from different sources, consider setting a global StringCache.

Help: if you're using Python, this may look something like:

    with pl.StringCache():
        df1 = pl.DataFrame({'a': ['1', '2']}, schema={'a': pl.Categorical})
        df2 = pl.DataFrame({'a': ['1', '3']}, schema={'a': pl.Categorical})
        pl.concat([df1, df2])

Alternatively, if the performance cost is acceptable, you could just set:

    import polars as pl
    pl.enable_string_cache()

on startup.

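Following the hint, we create both columns inside a pl.StringCache() block. Alternatively, we could enable the global string cache.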

with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat == bears_cat2)

shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	false
	true
]

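Combining Categorical columns

The string cache is also useful when combining or mixing two columns. One example is vertically concatenating two DataFrames.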

import warnings

import polars as pl

male_bears = pl.DataFrame(
    {
        "species": ["Polar", "Brown", "Panda"],
        "weight": [450, 500, 110],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)
female_bears = pl.DataFrame(
    {
        "species": ["Brown", "Polar", "Panda"],
        "weight": [340, 200, 90],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)

# Suppress the warning Polars emits about re-encoding the categories.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    bears = pl.concat([male_bears, female_bears], how="vertical")

print(bears)

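Here we suppress the warning that Polars emits. Polars warns because re-encoding the categories at this point is expensive, and it recommends using a string cache or an Enum instead.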

shape: (6, 2)
┌─────────┬────────┐
│ species ┆ weight │
│ ---     ┆ ---    │
│ cat     ┆ i64    │
╞═════════╪════════╡
│ Polar   ┆ 450    │
│ Brown   ┆ 500    │
│ Panda   ┆ 110    │
│ Brown   ┆ 340    │
│ Polar   ┆ 200    │
│ Panda   ┆ 90     │
└─────────┴────────┘

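Comparison between Categorical columns is not lexical by default

By default, Polars does not compare two Categorical columns lexically. If you need a lexical comparison, you have to request it when creating the column.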

with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical(ordering="lexical"),
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)

shape: (5,)
Series: '' [bool]
[
    true
    true
    false
    false
    false
]

Otherwise, Polars uses the order inferred from the values, that is, the order in which they were first encoded.

with pl.StringCache():
    bears_cat = pl.Series(
        # Polar <  Panda <  Brown
        # 0       1         2
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical,
    )
    bears_cat2 = pl.Series(
        # 1       2
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)

shape: (5,)
Series: '' [bool]
[
    false
    false
    false
    true
    false
]

What the two orderings mean

  • dtype=pl.Categorical(ordering="lexical"): values are compared in dictionary (alphabetical) order
  • dtype=pl.Categorical(ordering="physical"): the default! Values are compared in the order in which they appear

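Performance considerations for categorical data

For details, see the section "Performance considerations on categorical data types" in the official Polars documentation. The short version: prefer Enum!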