Problem description
I have the following types in a dataframe:
root
 |-- id: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: string (containsNull = true)
Input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData).toDF("id", "items") // name the columns to match the schema above
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in the filter_list, similar to how array_contains would function, but array_contains does not work with a provided list of strings, only with a single value.
So the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData).toDF("id", "items")
I tried solving this with the following UDF, but I am probably mixing up Scala and Spark types:
def filterItems(flist: List[String]) = udf {
  (recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
Thanks!
Recommended answer
Your code is almost right. All you have to do is replace List with Seq, because Spark passes an array column to a Scala UDF as a Seq rather than a List:
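import org.apache.spark.sql.functions.udf

def filterItems(flist: List[String]) = udf {
  // The array column arrives as a Seq, so the parameter must be typed as Seq[String]
  (recs: Seq[String]) => recs.filter(item => flist.contains(item))
}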
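To see why the List version fails at runtime, the type mismatch can be reproduced without Spark. The sketch below imitates what a UDF receives for an array column by wrapping a plain Scala array as a Seq. The object name SeqVsListDemo and the values recs, isList and kept are made up for illustration and are not part of the original question or answer.

```scala
object SeqVsListDemo {
  val filterList: List[String] = List("item1", "item2")

  // Stand-in for the value a UDF receives for an array column:
  // a Seq backed by an array, not a List.
  val recs: Seq[String] = Array("item1", "item2", "item3", "item4").toSeq

  // The runtime value is a Seq but not a List, which is why a UDF
  // declared with a List[String] parameter cannot accept it.
  val isList: Boolean = recs.isInstanceOf[List[_]]

  // Typed as Seq, the filter logic from the answer works unchanged.
  val kept: Seq[String] = recs.filter(item => filterList.contains(item))

  def main(args: Array[String]): Unit = {
    println(s"recs is a List: $isList")
    println(s"kept: ${kept.mkString(", ")}")
  }
}
```

Declaring the parameter as Seq[String] accepts whatever concrete sequence type arrives, so the UDF does not depend on the exact runtime class.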
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
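Putting the pieces together, a minimal sketch of the UDF applied to the sample data from the question might look like the following. It assumes a local SparkSession created just for experimentation. The object name FilterItemsExample, the app name and the explicit column names passed to toDF are illustrative choices, not something the original post specifies.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object FilterItemsExample {
  // Seq on both sides of the signature, as the answer suggests.
  def filterItems(flist: Seq[String]): UserDefinedFunction = udf {
    (recs: Seq[String]) => recs.filter(item => flist.contains(item))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the example out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-items-example")
      .getOrCreate()

    val rawData = Seq(
      ("id1", Array("item1", "item2", "item3", "item4")),
      ("id2", Array("item1", "item2", "item3"))
    )
    // Name the columns so they match the schema shown in the question.
    val data = spark.createDataFrame(rawData).toDF("id", "items")
    val filter_list = List("item1", "item2")

    // Replace the items column with only the entries found in filter_list.
    val filtered = data.withColumn("items", filterItems(filter_list)(col("items")))
    filtered.show(false)

    spark.stop()
  }
}
```

Because the schema marks items as nullable, a row with a null array would reach the UDF as null and fail inside recs.filter in this sketch. Guarding with Option(recs) is one way to handle that case if it can occur in the data.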
Reference: SQL Programming Guide - Data Types.