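Categorical data and enums
If the string values in a column can only take one of a finite set of possible values, that column holds categorical data. A column of operating systems is one example: the number of distinct values is small.
When working with categorical data, you can use the dedicated Polars types Categorical and Enum to improve query performance. Below we look at the differences between the two data types and how to choose between them.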
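Comparison
In short, prefer Enum whenever possible. If the categories are fixed and known in advance, use Enum. If the categories are not fixed, you must use Categorical. If requirements change along the way, you can always cast from one type to the other.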
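The Enum data type
Creating an Enum
With Enum, the categories must be specified up front. Let's look at a valid case and an error case.
Valid case
We define an enum with three values: Polar, Panda, and Brown.
Every value in the enum_1 column must be one of the values defined in that enum.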
import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])

bears = pl.Series("enum_1", ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)
print(bears)
shape: (5,)
Series: 'enum_1' [enum]
[
	"Polar"
	"Panda"
	"Brown"
	"Brown"
	"Polar"
]
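Invalid values
If a value in the column is not one of the Enum categories, Polars raises an error. In the code below, the value "Polar1" is not defined in bears_enum, so creating the Series fails.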
import polars as pl

bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
bears = pl.Series("enum_1", [
    "Polar1",  # not one of the categories in bears_enum
    "Panda", "Brown", "Brown", "Polar"
], dtype=bears_enum)
print(bears)
InvalidOperationError                     Traceback (most recent call last)
Cell In[7], line 4
      1 import polars as pl
      3 bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
----> 4 bears = pl.Series("enum_1", [
      5     "Polar1",
      6     "Panda", "Brown", "Brown", "Polar"
      7 ], dtype=bears_enum)
      8 print(bears)

(some error output omitted)

InvalidOperationError: conversion from `str` to `enum` failed in column 'enum_1' for 1 out of 5 values: ["Polar1"]

Ensure that all values in the input column are present in the categories of the enum datatype.
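Category ordering and comparison
Enum is ordered, and the order is determined by the order in which the categories are specified. The code below uses log levels as an example.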
import polars as pl

# debug < info < warning < error
log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

non_debug_logs = logs.filter(
    pl.col("level") > "debug",
)
print(non_debug_logs)
shape: (2, 2)
┌───────┬───────────────────────────┐
│ level ┆ message                   │
│ ---   ┆ ---                       │
│ enum  ┆ str                       │
╞═══════╪═══════════════════════════╡
│ info  ┆ Service started correctly │
│ error ┆ Cannot connect to DB!     │
└───────┴───────────────────────────┘
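The code above shows that Enum values can be compared with a string.
Note, however, that the string must be one of the Enum categories, otherwise an error is raised. To try it yourself, change the filter condition to pl.col("level") > "debug1".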
The Categorical data type
Categorical can be thought of as a more flexible Enum.
Creating a Categorical
Pass dtype=pl.Categorical to create data of type Categorical.
import polars as pl

bears_cat = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
print(bears_cat)
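Comparing with strings
When a Categorical column is compared with a string, Polars performs a string comparison. The code makes this clear: only the two words starting with "B" are less than "Cat".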
print(bears_cat < "Cat")
shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	true
	false
]
You can also compare a string column with a Categorical column, although comparing two Categorical columns is usually more efficient.
bears_str = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"],
)
print(bears_cat == bears_str)
shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	false
	true
]
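Comparing Categorical columns and the string cache
By default, Polars encodes the values of a Categorical column in the order in which they appear in that column, independently of any other column. As a result, Polars cannot efficiently compare two Categorical columns that were created independently.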
from polars.exceptions import StringCacheMismatchError
import polars as pl

bears_cat = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
bears_cat2 = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical,
)

try:
    print(bears_cat == bears_cat2)
except StringCacheMismatchError as e:
    print(str(e))  # the error message
As the output below shows, the error message is quite helpful: it tells us what to do and warns that there is some performance cost.
cannot compare categoricals coming from different sources, consider setting a global StringCache.

Help: if you're using Python, this may look something like:

    with pl.StringCache():
        df1 = pl.DataFrame({'a': ['1', '2']}, schema={'a': pl.Categorical})
        df2 = pl.DataFrame({'a': ['1', '3']}, schema={'a': pl.Categorical})
        pl.concat([df1, df2])

Alternatively, if the performance cost is acceptable, you could just set:

    import polars as pl
    pl.enable_string_cache()

on startup.
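Following the hint, we enable the string cache with the pl.StringCache() context manager. Alternatively, the cache can be enabled globally with pl.enable_string_cache().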
with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat == bears_cat2)
shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	false
	true
]
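Combining categorical columns
The string cache is also useful when two columns are combined or mixed. One example is vertically concatenating two DataFrames.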
import warnings

male_bears = pl.DataFrame(
    {
        "species": ["Polar", "Brown", "Panda"],
        "weight": [450, 500, 110],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)
female_bears = pl.DataFrame(
    {
        "species": ["Brown", "Polar", "Panda"],
        "weight": [340, 200, 90],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    bears = pl.concat([male_bears, female_bears], how="vertical")

print(bears)
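Here we suppress the warning. Polars emits it because the two columns have to be re-encoded at this point, which is expensive, and it recommends using the string cache or Enum instead.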
shape: (6, 2)
┌─────────┬────────┐
│ species ┆ weight │
│ ---     ┆ ---    │
│ cat     ┆ i64    │
╞═════════╪════════╡
│ Polar   ┆ 450    │
│ Brown   ┆ 500    │
│ Panda   ┆ 110    │
│ Brown   ┆ 340    │
│ Polar   ┆ 200    │
│ Panda   ┆ 90     │
└─────────┴────────┘
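Comparisons between categorical columns are not lexical
By default, Polars does not compare two Categorical columns lexically. If you need a lexical comparison, you have to request it when creating the column.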
with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical(ordering="lexical"),
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)
shape: (5,)
Series: '' [bool]
[
	true
	true
	false
	false
	false
]
Otherwise, Polars infers the order from the values, that is, from the order in which they first appear.
with pl.StringCache():
    bears_cat = pl.Series(
        # Polar <  Panda <  Brown
        #   0        1        2
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical,
    )
    bears_cat2 = pl.Series(
        #   1        2
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)
shape: (5,)
Series: '' [bool]
[
	false
	false
	false
	true
	false
]
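What the ordering options mean
dtype=pl.Categorical(ordering="lexical"): values are compared as strings, in lexical (alphabetical) order
dtype=pl.Categorical(ordering="physical"): values are compared by their encoding, that is, by the order in which they first appear. This is the default!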
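Performance considerations for categorical data
For details, see the section "Performance considerations on categorical data types" in the official documentation. The summary is: prefer Enum!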