Spark: incomprehensible behavior when writing to a Parquet file

Problem description

I have a CSV record like this:

---------------------------
name | age | entranceDate |
---------------------------
Tom  | 12  | 2019-10-01   |
---------------------------
Mary | 15  | 2019-10-01   |
---------------------------

I read it from the CSV and convert it to a DataFrame, using a custom schema:

public static StructType createSchema() {
    final StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, false),
        DataTypes.createStructField("age", DataTypes.StringType, false),
        DataTypes.createStructField("entranceDate", DataTypes.StringType, false)
    ));
    return schema;
}

sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "false")
    .option("delimiter", FIELD_DELIMITER)
    .option("header", "false")
    .schema(schema)
    .load(pathToMyCsvFile);

Now I want to write this DataFrame to Parquet on my HDFS:

String[] partitions = new String[] { "name", "entranceDate" };

df.write()
    .partitionBy(partitions)
    .mode(SaveMode.Append)
    .parquet(parquetPath);
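For context, .partitionBy does not store the partition columns inside the Parquet files themselves; it encodes them in the directory layout under parquetPath, along these lines (the /test/parquet base path is taken from the read example below; the part-* file names are illustrative):

/test/parquet/name=Tom/entranceDate=2019-10-01/part-*.parquet
/test/parquet/name=Mary/entranceDate=2019-10-01/part-*.parquet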

But when I check the schema of the Parquet output in spark-shell:

sqlContext.read.parquet("/test/parquet/name=Tom/entranceDate=2019-10-01/").printSchema()

It shows that entranceDate is of type Date. I wonder how that is? I already specified that this field should be String, so how can it be automatically converted to Date?

--------------

Edit: I did some tests and found that it converts to Date only if I do .partitionBy(partitions) when writing. If I remove this line and print the schema, it shows the type of entranceDate as String.

Accepted answer

I would say this happens because of the automatic schema inference mechanism. The Spark documentation page says:

Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types, date, timestamp and string type are supported.

Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true.
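So to keep entranceDate as a plain String when reading the partitioned data back, you can turn that inference off before reading. A minimal sketch (sqlContext is the same SQLContext used above, the path is the example path from the question, and Dataset<Row> assumes a Spark 2.x API; on 1.x the result type would be DataFrame):

// Disable automatic type inference for partition columns; with inference
// off, partition columns are read back as strings.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false");

Dataset<Row> readBack = sqlContext.read().parquet("/test/parquet/");
readBack.printSchema(); // entranceDate is now reported as string

Note that nothing changes on disk: the partition column values still live in the directory names, and this setting only controls how their types are reconstructed at read time.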
