Question
I have a dataframe as below:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as sequence so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
+-----+--------------------+
As a result I want to apply this sample code as given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried to convert df to RDD and map it. And then I could not manage to re-convert it to a df.
Thanks in advance.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.collect_list

val conf = new SparkConf().setAppName("terms-to-sequences")
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
import sqlContext.implicits._  // enables the $"col" column syntax

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "table"))
df.show(20)
df.groupBy($"label").agg(collect_list($"term").alias("term"))
Answer
You can use the collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
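Since the goal is to feed the grouped terms into Word2vec, here is a sketch of how the aggregated DataFrame could be passed to Spark ML's Word2Vec. The column names (LABEL, TERM), the output column name "result", and the vectorSize/minCount settings are illustrative choices, not taken from the question:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions.collect_list

// Group the terms of each label into an array column, as above.
val grouped = df.groupBy($"LABEL")
  .agg(collect_list($"TERM").alias("TERM"))

// Word2Vec expects an input column of type array<string>,
// which is exactly what collect_list produces.
val word2Vec = new Word2Vec()
  .setInputCol("TERM")
  .setOutputCol("result")
  .setVectorSize(100)  // illustrative hyperparameters
  .setMinCount(0)

val model = word2Vec.fit(grouped)
val vectors = model.transform(grouped)  // adds a "result" vector column
```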
In Spark < 2.0 it requires a HiveContext, and in Spark 2.0+ you have to enable Hive support on the SparkSession builder. See Use collect_list and collect_set in Spark SQL.
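For Spark 2.0+, a minimal sketch of creating a Hive-enabled session as the answer describes (the app name is an arbitrary placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("collect_list-example")
  .enableHiveSupport()  // Hive support, required here per the answer above
  .getOrCreate()

import spark.implicits._  // enables the $"col" column syntax
```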