Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)



I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.

Scenario:

We have some parquet files that we are reading in that have column1 but not column2.

We load the data in from these parquet files into a Dataset, and cast it as MyType.

case class MyType(column1: Option[String], column2: Option[Seq[String]])

sqlContext.read.parquet("dataSource.parquet").as[MyType]

org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: [column1];

Is there a way to create the Dataset with column2 data as None?

Accepted answer


In simple cases you can provide an initial schema that is a superset of the expected schemas. For example, in your case:

import spark.implicits._

// Derive the full expected schema from the case class encoder
val schema = Seq[MyType]().toDF.schema

// Write a sample parquet file that contains only column1
Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")

// Read back with the superset schema; the missing column2 comes out as null
val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show
 
+-------+-------+
|column1|column2|
+-------+-------+
|      a|   null|
|      b|   null|
|      c|   null|
+-------+-------+
 
df.first
 
MyType = MyType(Some(a),None)
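
For reference, you can inspect the encoder-derived schema to confirm it already covers both columns. A minimal check (the printed tree below is what I'd expect from the encoder for MyType, shown as an assumption rather than captured output):

schema.printTreeString()

root
 |-- column1: string (nullable = true)
 |-- column2: array (nullable = true)
 |    |-- element: string (containsNull = true)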
 

The superset-schema approach can be a little fragile, so in general you should instead use SQL literals to fill in the blanks:

import org.apache.spark.sql.functions.lit

spark.read.parquet("/tmp/column1only")
  // or, equivalently, .cast(ArrayType(StringType))
  .withColumn("column2", lit(null).cast("array<string>"))
  .as[MyType]
  .first
 
MyType = MyType(Some(a),None)
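
If more than one column can be missing, the same idea generalizes: walk the fields of the case class's encoder schema and add a typed null literal for each field the DataFrame lacks. A minimal sketch under that assumption; withMissingColumns is a hypothetical helper, not part of Spark's API:

import org.apache.spark.sql.{DataFrame, Encoder}
import org.apache.spark.sql.functions.lit

// For every field the encoder expects but the frame lacks,
// add lit(null) cast to the declared type; leave the rest untouched.
def withMissingColumns[T](df: DataFrame)(implicit enc: Encoder[T]): DataFrame =
  enc.schema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }

// Usage: fills column2 with nulls before the cast to MyType
val ds = withMissingColumns[MyType](spark.read.parquet("/tmp/column1only")).as[MyType]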
