Calculating the percentage of the total count per group with groupBy in PySpark

Updated: 2024-10-09 00:40:37
Question

I have the following code in pyspark, which produces a table showing the distinct values of a column and their counts. I want to add another column showing what percentage of the total count each row represents. How do I do that?

    from pyspark.sql.functions import desc

    difrgns = (
        df1.groupBy("column_name")
        .count()
        .sort(desc("count"))
        .show()
    )

Thanks in advance!

Recommended answer

Here is an alternative for anyone not comfortable with Windowing, which, as the comment alludes to, is the better way to go:

    # Running in Databricks, not all stuff required
    # (sc and sqlContext are predefined in a Databricks notebook)
    from pyspark.sql import Row
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import *
    #from pyspark.sql.functions import col

    data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
            ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]
    rdd = sc.parallelize(data)
    someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
    df = sqlContext.createDataFrame(someschema)

    # Total number of rows, computed once on the driver
    tot = df.count()

    # Count per group, then divide each count by the total
    df.groupBy("c1") \
      .count() \
      .withColumnRenamed('count', 'cnt_per_group') \
      .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100 ) \
      .show()

Returns:

    +---+-------------+-------------------+
    | c1|cnt_per_group|perc_of_count_total|
    +---+-------------+-------------------+
    |  E|            1| 16.666666666666664|
    |  B|            1| 16.666666666666664|
    |  D|            1| 16.666666666666664|
    |  C|            1| 16.666666666666664|
    |  A|            2|  33.33333333333333|
    +---+-------------+-------------------+
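As a sanity check on the table above, the same arithmetic can be reproduced without Spark. The following is only an illustrative sketch in plain Python: the use of collections.Counter and the variable names are assumptions made for this example, not part of the original answer.

```python
from collections import Counter

# Same sample rows as in the answer; only the first field (c1) matters here.
data = [("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)]

tot = len(data)  # plays the role of df.count()

# Plays the role of groupBy("c1").count()
cnt_per_group = Counter(row[0] for row in data)

# Each group's share of the total row count, as a percentage
perc_of_count_total = {key: cnt / tot * 100 for key, cnt in cnt_per_group.items()}

for key in sorted(perc_of_count_total):
    print(key, cnt_per_group[key], perc_of_count_total[key])
```

"A" appears in 2 of the 6 rows, so it accounts for roughly a third of the total, while every other key appears once and accounts for roughly a sixth. The shares add up to 100.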

I mostly work in Scala, where this seems easier. That said, the solution suggested in the comments uses Window, which is what I would do in Scala with over().

Published: 2023-10-30 18:39:03
Link: https://www.elefans.com/category/jswz/34/1543681.html