I'm having some trouble encoding data when some columns that are of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be filled with None.
Scenario:
Some of the parquet files we are reading in have column1 but not column2.
We load the data from these parquet files into a Dataset and cast it to MyType.
case class MyType(column1: Option[String], column2: Option[Seq[String]])

sqlContext.read.parquet("dataSource.parquet").as[MyType]

org.apache.spark.sql.AnalysisException: cannot resolve 'column2' given input columns: [column1];
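To confirm what the file actually contains before the encoder is applied, you can print the schema Spark infers from the parquet footer (a minimal check, assuming the same sqlContext and path as above; the output shown is illustrative):

// The missing column is simply absent from the file's schema,
// rather than present with null values.
sqlContext.read.parquet("dataSource.parquet").printSchema()
// root
//  |-- column1: string (nullable = true)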
Is there a way to create the Dataset with column2 data as None?
Accepted answer
spark.read.parquet("/tmp/column1only") // or ArrayType(StringType) .withColumn("column2", lit(null).cast("array<string>")) .as[MyType] .firstMyType = MyType(Some(a),None)In simple cases you can provide an initial schema which is a superset of expected schemas. For example in your case:
import spark.implicits._

val schema = Seq[MyType]().toDF.schema

Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")

val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show
// +-------+-------+
// |column1|column2|
// +-------+-------+
// |      a|   null|
// |      b|   null|
// |      c|   null|
// +-------+-------+

df.first
// MyType = MyType(Some(a),None)

This approach can be a little fragile, so in general you should rather use SQL literals to fill in the blanks:
import org.apache.spark.sql.functions.lit

spark.read.parquet("/tmp/column1only")
  .withColumn("column2", lit(null).cast("array<string>"))  // or .cast(ArrayType(StringType))
  .as[MyType]
  .first
// MyType = MyType(Some(a),None)
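If several columns may be missing, the second approach generalizes naturally: compare the columns Spark actually read against the schema implied by the case class encoder, and add a typed null literal for each absent field. A minimal sketch along those lines (the helper name withMissingColumns is illustrative, not part of the original answer):

import org.apache.spark.sql.{DataFrame, Encoders}
import org.apache.spark.sql.functions.lit

// Add a typed null column for every field of MyType that the
// source data does not provide; existing columns are untouched.
def withMissingColumns(df: DataFrame): DataFrame = {
  val expected = Encoders.product[MyType].schema
  val present  = df.columns.toSet
  expected.fields
    .filterNot(f => present.contains(f.name))
    .foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))
}

val ds = withMissingColumns(spark.read.parquet("/tmp/column1only")).as[MyType]
ds.first
// MyType = MyType(Some(a),None)

Deriving the expected schema from Encoders.product keeps the column types in sync with the case class, so the casts stay correct if MyType changes.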