Parquet schema and Spark

Updated: 2024-10-24 16:32:30


I am trying to convert CSV files to Parquet, and I am using Spark to accomplish this.

SparkSession spark = SparkSession.builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();
Dataset<Row> logFile = spark.read().csv("log_file.csv");
logFile.write().parquet("log_file.parquet");

Now the problem is that I don't have a schema defined, so the columns look like this (output displayed using printSchema() in Spark):

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 ...

The CSV has the column names on the first row, but I guess they're being ignored. The problem is that only a few columns are strings; I also have ints and dates.

I am only using Spark, no Avro or anything else basically (I've never used Avro).

What are my options for defining a schema, and how? If I need to write the Parquet file in another way, that's no problem, as long as it's a quick and easy solution.

(I am using Spark standalone for tests / I don't know Scala.)

Accepted answer


Try using .option("inferSchema", "true"), available in the spark-csv package. This will automatically infer the schema from the data.
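Since the question already uses the Spark 2.x SparkSession API, the same options work on Spark's built-in CSV reader, so the external spark-csv package is not even required there. A minimal Java sketch under that assumption, reusing the file names from the question:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-parquet")
                .master("local[*]") // local/standalone testing, as in the question
                .getOrCreate();

        // "header" uses the first row as column names instead of data;
        // "inferSchema" samples the data to pick int/double/timestamp/string types.
        Dataset<Row> logFile = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("log_file.csv");

        logFile.printSchema(); // columns now carry real names and inferred types
        logFile.write().parquet("log_file.parquet");

        spark.stop();
    }
}
```

Note that inferSchema makes an extra pass over the data to sample types, so for large files an explicit schema avoids that cost.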

You can also define a custom schema for your data using StructType, and pass it via .schema(customSchema) to read on the basis of that custom schema.

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")
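Since the asker doesn't know Scala, the same explicit-schema approach can be translated to Java using the SparkSession API from the question. A sketch assuming the same five example columns as the Scala snippet (swap in your own names and types):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvWithSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-with-schema")
                .master("local[*]")
                .getOrCreate();

        // Explicit schema: column name, type, nullable
        StructType customSchema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("year", DataTypes.IntegerType, true),
                DataTypes.createStructField("make", DataTypes.StringType, true),
                DataTypes.createStructField("model", DataTypes.StringType, true),
                DataTypes.createStructField("comment", DataTypes.StringType, true),
                DataTypes.createStructField("blank", DataTypes.StringType, true)
        });

        Dataset<Row> df = spark.read()
                .option("header", "true") // skip the header row instead of reading it as data
                .schema(customSchema)     // explicit schema: no inference pass over the data
                .csv("cars.csv");

        df.write().parquet("cars.parquet");
        spark.stop();
    }
}
```

For date columns, DataTypes.DateType can be used in the schema, combined with .option("dateFormat", "yyyy-MM-dd") (or your actual pattern) on the reader.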


Published: 2023-07-24 10:54:00
Source: https://www.elefans.com/category/jswz/34/1245019.html