Parquet schema and Spark

Updated: 2024-10-24 16:32:30


I am trying to convert CSV files to Parquet, and I am using Spark to accomplish this.

SparkSession spark = SparkSession.builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();
Dataset<Row> logFile = spark.read().csv("log_file.csv");
logFile.write().parquet("log_file.parquet");

Now the problem is that I don't have a schema defined, so the columns look like this (output displayed using printSchema() in Spark):

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 ...

The CSV has the column names on the first row, but I guess they're being ignored. The problem is that only a few columns are strings; I also have ints and dates.

I am only using Spark, no Avro or anything else basically (I've never used Avro).

What are my options for defining a schema, and how? If I need to write the Parquet file in another way, that's no problem, as long as it's a quick and easy solution.

(I am using Spark standalone for tests / I don't know Scala.)

Accepted answer


Try using .option("inferSchema", "true"), available in the spark-csv package. This will automatically infer the schema from the data.
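Since the question already uses the Spark 2.x SparkSession API, the same options work on Spark's built-in CSV reader, so the external spark-csv package is not even required there. A minimal Java sketch under that assumption, reusing the file names from the question:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-parquet")
                .master("local[*]") // local/standalone testing, as in the question
                .getOrCreate();

        // "header" uses the first row as column names instead of data;
        // "inferSchema" samples the data to pick int/double/timestamp/string types.
        Dataset<Row> logFile = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("log_file.csv");

        logFile.printSchema(); // columns now carry real names and inferred types
        logFile.write().parquet("log_file.parquet");

        spark.stop();
    }
}
```

Note that inferSchema makes an extra pass over the data to sample types, so for large files an explicit schema avoids that cost.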

You can also define a custom schema for your data using StructType, and pass it via .schema(customSchema) to read on the basis of that custom schema.

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")
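Since the asker doesn't know Scala, the same explicit-schema approach can be translated to Java using the SparkSession API from the question. A sketch assuming the same five example columns as the Scala snippet (swap in your own names and types):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvWithSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-with-schema")
                .master("local[*]")
                .getOrCreate();

        // Explicit schema: column name, type, nullable
        StructType customSchema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("year", DataTypes.IntegerType, true),
                DataTypes.createStructField("make", DataTypes.StringType, true),
                DataTypes.createStructField("model", DataTypes.StringType, true),
                DataTypes.createStructField("comment", DataTypes.StringType, true),
                DataTypes.createStructField("blank", DataTypes.StringType, true)
        });

        Dataset<Row> df = spark.read()
                .option("header", "true") // skip the header row instead of reading it as data
                .schema(customSchema)     // explicit schema: no inference pass over the data
                .csv("cars.csv");

        df.write().parquet("cars.parquet");
        spark.stop();
    }
}
```

For date columns, DataTypes.DateType can be used in the schema, combined with .option("dateFormat", "yyyy-MM-dd") (or your actual pattern) on the reader.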


Published: 2023-07-24 10:54:00
Source: https://www.elefans.com/category/jswz/34/1245019.html