I am trying to convert CSV files to Parquet, and I am using Spark to accomplish this.
SparkSession spark = SparkSession
    .builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();

Dataset<Row> logFile = spark.read().csv("log_file.csv");
logFile.write().parquet("log_file.parquet");

Now the problem is that I don't have a schema defined, and the columns look like this (output displayed using printSchema() in Spark):
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 ...

The CSV has the column names in the first row, but I guess they're being ignored. The problem is that only a few columns are strings; I also have ints and dates.
I am only using Spark, no Avro or anything else basically (I've never used Avro).
What are my options to define a schema, and how? If I need to write the Parquet file in another way, that's no problem, as long as it's a quick and easy solution.
(I am using Spark standalone for tests / I don't know Scala.)
Accepted answer
Try using .option("inferSchema", "true"), available in the spark-csv package. This will automatically infer the schema from the data.
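Since your code uses the Java API, here is a minimal sketch of the inference approach in Java (assuming Spark 2.x, where the CSV reader and the inferSchema/header options are built in; the class name and appName are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-to-parquet")
            .getOrCreate();

        // "header" treats the first row as column names instead of data;
        // "inferSchema" makes Spark sample the file and guess each column's type.
        Dataset<Row> logFile = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("log_file.csv");

        logFile.printSchema(); // should now show typed columns, not all strings
        logFile.write().parquet("log_file.parquet");
    }
}

Note that inferSchema requires an extra pass over the data to sample the column types, so it can be slow on large files.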
You can also define a custom schema for your data using StructType and pass it via .schema(customSchema) so the file is read according to that schema, for example:
val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")
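And since you mention not knowing Scala, here is the same custom-schema idea sketched in Java (the column names and types below are hypothetical examples; replace them with the actual columns of your log file):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvToParquetWithSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("csv-to-parquet-with-schema")
            .getOrCreate();

        // One StructField per CSV column: name, type, nullable.
        // These names and types are examples only; match them to your file.
        StructType customSchema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("id", DataTypes.IntegerType, true),
            DataTypes.createStructField("event_date", DataTypes.DateType, true),
            DataTypes.createStructField("message", DataTypes.StringType, true)
        });

        Dataset<Row> logFile = spark.read()
            .option("header", "true")           // consume the header row
            .option("dateFormat", "yyyy-MM-dd") // pattern for DateType columns
            .schema(customSchema)               // skips inference entirely
            .csv("log_file.csv");

        logFile.write().parquet("log_file.parquet");
    }
}

An explicit schema avoids the extra pass that inferSchema needs, and it guarantees the column types that end up in the Parquet file.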