Count values by condition in PySpark

Question

I have a DataFrame; here is a snippet:

[['u1', 1], ['u2', 0]]

Basically there is a string field named f, and the second element (is_fav) is either a 1 or a 0.
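
For concreteness, here is a minimal sketch of how such a DataFrame could be built (assuming an existing SparkSession and the column names f and is_fav implied by the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two sample rows matching the snippet above: a string key and a 0/1 flag.
df = spark.createDataFrame([("u1", 1), ("u2", 0)], ["f", "is_fav"])
df.show()
# +--+------+
# | f|is_fav|
# +--+------+
# |u1|     1|
# |u2|     0|
# +--+------+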

What I need to do is group on the first field and count the occurrences of 1s and 0s. I was hoping to do something like:

from pyspark.sql.functions import col, count

num_fav = count((col("is_fav") == 1)).alias("num_fav")
num_nonfav = count((col("is_fav") == 0)).alias("num_nonfav")
df.groupBy("f").agg(num_fav, num_nonfav)

It does not work properly: in both cases I get the same result, which amounts to the count of the items in the group, so the filter (whether it is a 1 or a 0) seems to be ignored. Does this depend on how count works?

Answer

There is no filter here. Both col("is_fav") == 1 and col("is_fav") == 0 are just Boolean expressions, and count doesn't really care about their value as long as it is defined.
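
The key fact is that count counts non-null values: a Boolean False is still a value, so it is counted, and only NULL is skipped. A quick sketch illustrating this (a toy example, not part of the original answer):

from pyspark.sql.functions import count, lit, when

# Three rows; the literal False boolean is counted, the NULL is not.
spark.range(3).select(
    count(lit(False)).alias("non_null_booleans"),       # -> 3
    count(when(lit(False), True)).alias("nulls_only"),  # -> 0
).show()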

There are many ways you can solve this, for example by using a simple sum:

from pyspark.sql.functions import count, sum

# Summing a 0/1 column counts the 1s; the total count minus that counts the 0s.
gpd = df.groupBy("f")
gpd.agg(
    sum("is_fav").alias("fv"),
    (count("is_fav") - sum("is_fav")).alias("nfv")
)
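
With the two sample rows above, each group holds a single row, so the expected result (row order may vary) looks like:

+--+--+---+
| f|fv|nfv|
+--+--+---+
|u1| 1|  0|
|u2| 0|  1|
+--+--+---+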

or by making the values you want ignored undefined (a.k.a. NULL):

from pyspark.sql.functions import col, count, when

# when() without otherwise() yields NULL for non-matching rows,
# and count() skips NULLs, so each alias counts only the matching rows.
exprs = [
    count(when(col("is_fav") == x, True)).alias(c)
    for (x, c) in [(1, "fv"), (0, "nfv")]
]
gpd.agg(*exprs)
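
An equivalent spelling (my own variant, not from the original answer) keeps everything as a sum by casting the Boolean comparison to an integer; it produces the same fv/nfv columns:

from pyspark.sql.functions import col, sum

# Casting the comparison to int turns True/False into 1/0, so sum counts matches.
gpd.agg(
    sum((col("is_fav") == 1).cast("int")).alias("fv"),
    sum((col("is_fav") == 0).cast("int")).alias("nfv"),
)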
