Are gzipped Parquet files splittable in HDFS for Spark?

Updated: 2024-10-23 05:35:07
Question

I get conflicting messages when searching and reading answers on the internet about this subject. Can anyone share their experience? I know for a fact that a gzipped CSV file is not splittable, but maybe the internal structure of a Parquet file makes it a totally different case from CSV?

Recommended answer

Parquet files with GZIP compression are in fact splittable. This follows from the internal layout of Parquet files: they are always splittable, regardless of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

  • Each Parquet file consists of several RowGroups; these should be the same size as your HDFS block size.
  • Each RowGroup consists of one ColumnChunk per column. Each ColumnChunk in a RowGroup has the same number of rows.
  • ColumnChunks are split into Pages, which are probably between 64 KiB and 16 MiB in size. Compression is done on a per-page basis, so a page is the lowest level of parallelisation a job can work on.
You can find a more detailed explanation here: github/apache/parquet-format#file-format
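
To make the layout argument concrete, here is a minimal sketch in plain Python of why compressing small units independently keeps a file splittable. It is a toy container, not the real Parquet format: the function names (write_toy_file, read_footer, read_group), the JSON footer and the row-based (rather than columnar) groups are all assumptions made for illustration. The only thing it borrows from Parquet is the general idea of independently compressed units located through metadata stored at the end of the file.

```python
import gzip
import json
import struct


def write_toy_file(rows, rows_per_group):
    """Pack rows into independently gzip-compressed groups plus a footer index.

    Hypothetical toy layout, NOT the Parquet format.
    """
    body = bytearray()
    index = []  # one entry per group: byte offset, compressed length, row count
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        compressed = gzip.compress("\n".join(group).encode("utf-8"))
        index.append(
            {"offset": len(body), "length": len(compressed), "rows": len(group)}
        )
        body += compressed
    footer = json.dumps(index).encode("utf-8")
    # The footer goes last, followed by its 4-byte little-endian length,
    # so a reader can locate it from the end of the file.
    return bytes(body) + footer + struct.pack("<I", len(footer))


def read_footer(blob):
    """Recover the group index using only the tail of the file."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    return json.loads(blob[-4 - footer_len:-4].decode("utf-8"))


def read_group(blob, entry):
    """Decompress one group without touching any other group's bytes."""
    start = entry["offset"]
    chunk = blob[start:start + entry["length"]]
    return gzip.decompress(chunk).decode("utf-8").split("\n")


rows = [f"{i},value_{i}" for i in range(1000)]
blob = write_toy_file(rows, rows_per_group=250)
index = read_footer(blob)

# A worker assigned only the third group can decode it on its own.
third_group = read_group(blob, index[2])
print(len(index), third_group[0], third_group[-1])
# prints: 4 500,value_500 749,value_749
```

Each group here is a complete gzip stream of its own, so any worker that knows a group's offset and length can decode it in isolation. A CSV file gzipped as a whole is a single stream, so a reader has to start decoding from the first byte.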

Published: 2023-11-22 06:51:25
Source: https://www.elefans.com/category/jswz/34/1616389.html