The Parquet documentation describes a few different encodings here.
Does the encoding change somehow inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation. I only found slides from a talk by Ryan Blue of the Netflix team. He sets Parquet configurations on the sqlContext:
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")

It looks like this is not about plain dictionary encoding in Parquet files.
Accepted answer:

So I found an answer to my question on the Twitter engineering blog.
Parquet enables automatic dictionary encoding when the number of unique values is < 10^5. Here is the post announcing Parquet 1.0 with self-tuning dictionary encoding.
UPD:
Dictionary encoding can be switched in the SparkSession config:

SparkSession.builder
  .appName("name")
  .config("parquet.enable.dictionary", "false") // or "true"
Regarding per-column encoding, there is an open improvement issue in Parquet's Jira, created on 14 July 2017. Since dictionary encoding is the default and applies only to the whole table, enabling it turns off delta encoding (there is a Jira issue for this bug), which is the only suitable encoding for data such as timestamps, where almost every value is unique.
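As a minimal sketch of how the config above fits into a full write path (assuming Spark 2.x+ with the Scala API; the app name, local master, and output path are placeholders of mine, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object EncodingDemo {
  def main(args: Array[String]): Unit = {
    // Build a session with Parquet dictionary encoding disabled;
    // "parquet.enable.dictionary" is passed through to the Parquet writer.
    val spark = SparkSession.builder()
      .appName("encoding-demo")                    // placeholder app name
      .master("local[*]")                          // assumption: local test run
      .config("parquet.enable.dictionary", "false")
      .getOrCreate()

    import spark.implicits._

    // Any small DataFrame will do; the chosen encoding shows up
    // in the Parquet file footer, not in the DataFrame itself.
    val df = Seq(("a", 1L), ("b", 2L)).toDF("key", "ts")
    df.write.mode("overwrite").parquet("/tmp/encoding-demo") // placeholder path

    spark.stop()
  }
}
```

After writing, inspect the output with parquet-tools (as shown further below) to confirm which encodings were actually used; PLAIN_DICTIONARY should no longer appear in the ENC column when dictionary encoding is off.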
UPD2
How can we tell which encoding was used for an output file?
I used parquet-tools for it:

-> brew install parquet-tools (for macOS)
-> parquet-tools meta your_parquet_file.snappy.parquet
Output:

.column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
.column_2: BINARY SNAPPY DO:0 FPO:25526 SZ:119245/711487/1.32 VC:26900 ENC:PLAIN,RLE,BIT_PACKED

Here PLAIN and PLAIN_DICTIONARY are the encodings that were used for those columns.