I am working on a dataframe df, for instance the following dataframe:
df.show() output:
+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
from pyspark.sql.functions import collect_set

df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is as follows:
df_new.show() output:
+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+
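Spark aside, the difference between collect_list and collect_set can be illustrated with plain Python on the same toy data (this is just an analogy for what the aggregation produces per key, not Spark code):

```python
from collections import defaultdict

# Toy reproduction of the example data as (key, value) pairs.
rows = [
    ("aa", "apple"), ("bb", "orange"), ("bb", "desk"), ("bb", "orange"),
    ("bb", "desk"), ("aa", "pen"), ("bb", "pencil"), ("aa", "chair"),
]

# collect_list keeps duplicates, in order of arrival.
collected_list = defaultdict(list)
for key, value in rows:
    collected_list[key].append(value)

# collect_set eliminates duplicates (and, like Spark, guarantees no order).
collected_set = {key: set(values) for key, values in collected_list.items()}

print(collected_list["bb"])  # five entries, 'orange' and 'desk' repeated
print(collected_set["bb"])   # three distinct values
```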
I am struggling to find a way to see if a specific keyword (like 'chair') is in the resulting set of objects (in the column collectedSet_values). I do not want to go with a udf solution.
Please comment your solutions/ideas.
Kind regards.
Answer:
Actually there is a nice function array_contains which does that for us. The way we use it on a set of objects is the same as shown here. To know whether the word 'chair' exists in each set of objects, we can simply do the following:
from pyspark.sql.functions import array_contains

df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()

output:
+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+

The same works for the result of collect_list.
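For intuition, the per-row check that array_contains performs is the same as Python's `in` operator on each collected array. A minimal sketch on the collected result above (plain Python, not Spark; the dict literal is assumed to mirror the table):

```python
# The collected result from the example, as an ordinary Python dict.
collected = {
    "bb": ["orange", "pencil", "desk"],
    "aa": ["apple", "pen", "chair"],
}

# Python's `in` plays the role of array_contains: one boolean per row.
contains_chair = {key: "chair" in values for key, values in collected.items()}

# Keys whose collected values include 'chair'.
matching_keys = [key for key, flag in contains_chair.items() if flag]
```

If you only want the rows where the keyword is present, you can also use array_contains directly as a filter predicate, e.g. df_new.filter(array_contains(df_new.collectedSet_values, 'chair')).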