The Parquet documentation describes a few different encodings here.
Does the encoding change somehow inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation. I only found slides from a talk by Ryan Blue of the Netflix team. He sets Parquet configurations on the sqlContext:
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")

It looks like this is not about plain dictionary encoding in Parquet files.
Accepted answer:

So I found an answer to my question on the Twitter engineering blog.
Parquet enables automatic dictionary encoding when the number of unique values is < 10^5. Here is the post announcing Parquet 1.0 with self-tuning dictionary encoding.
UPD:
Dictionary encoding can be switched in the SparkSession config:

SparkSession.builder
  .appName("name")
  .config("parquet.enable.dictionary", "false") // or "true"
Regarding per-column encoding, there is an open improvement issue in Parquet's Jira, created on 14 July 2017. Since dictionary encoding is the default and applies only to the whole table, enabling it turns off delta encoding (there is a Jira issue for this bug), which is the only suitable encoding for data such as timestamps, where almost every value is unique.
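As a minimal sketch of how the config above fits into a full write path (assuming Spark 2.x+ with the Scala API; the app name, local master, and output path are placeholders of mine, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object EncodingDemo {
  def main(args: Array[String]): Unit = {
    // Build a session with Parquet dictionary encoding disabled;
    // "parquet.enable.dictionary" is passed through to the Parquet writer.
    val spark = SparkSession.builder()
      .appName("encoding-demo")                    // placeholder app name
      .master("local[*]")                          // assumption: local test run
      .config("parquet.enable.dictionary", "false")
      .getOrCreate()

    import spark.implicits._

    // Any small DataFrame will do; the chosen encoding shows up
    // in the Parquet file footer, not in the DataFrame itself.
    val df = Seq(("a", 1L), ("b", 2L)).toDF("key", "ts")
    df.write.mode("overwrite").parquet("/tmp/encoding-demo") // placeholder path

    spark.stop()
  }
}
```

After writing, inspect the output with parquet-tools (as shown further below) to confirm which encodings were actually used; PLAIN_DICTIONARY should no longer appear in the ENC column when dictionary encoding is off.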
UPD2
How can we tell which encoding was used for an output file?
I used parquet-tools for it:

-> brew install parquet-tools (for macOS)
-> parquet-tools meta your_parquet_file.snappy.parquet
Output:

.column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
.column_2: BINARY SNAPPY DO:0 FPO:25526 SZ:119245/711487/1.32 VC:26900 ENC:PLAIN,RLE,BIT_PACKED

Here PLAIN and PLAIN_DICTIONARY are the encodings that were used for those columns.