Problem description
I am new to pyspark. I am wondering what rdd means in a pyspark dataframe.
weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)
These two lines of code have the same output. I am wondering what the effect of having rdd is:
weatherData.collect()
weatherData.rdd.collect()
Answer
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its .rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the .toDF() method.
In general, it is recommended to use a DataFrame where possible due to the built-in query optimization.