Aggregate a Dask dataframe and produce a dataframe of aggregates

Updated: 2024-10-19 04:23:27
Problem description

I have a Dask dataframe that looks like this:

    url   referrer  session_id  ts                   customer
    url1  ref1      xxx         2017-09-15 00:00:00  a
    url2  ref2      yyy         2017-09-15 00:00:00  a
    url2  ref3      yyy         2017-09-15 00:00:00  a
    url1  ref1      xxx         2017-09-15 01:00:00  a
    url2  ref2      yyy         2017-09-15 01:00:00  a

I want to group the data on url and timestamp, aggregate column values and produce a dataframe that would look like this instead:

    customer  url   ts                   page_views  visitors  referrers
    a         url1  2017-09-15 00:00:00  1           1         [ref1]
    a         url2  2017-09-15 00:00:00  2           2         [ref2, ref3]

In Spark SQL, I can do this as follows:

    select
        customer,
        url,
        ts,
        count(*) as page_views,
        count(distinct(session_id)) as visitors,
        collect_list(referrer) as referrers
    from df
    group by customer, url, ts

Is there any way I can do it with Dask dataframes? I tried, but I can only calculate the aggregated columns separately, as follows:

    # group on timestamp (rounded) and url
    grouped = df.groupby(['ts', 'url'])
    # calculate page views (count rows in each group)
    page_views = grouped.size()
    # collect a list of referrer strings per group (lists are held as object dtype)
    referrers = grouped['referrer'].apply(list, meta=('referrers', 'object'))
    # count unique visitors (distinct session ids)
    visitors = grouped['session_id'].nunique()

But I can't seem to find a good way to produce a combined dataframe that I need.

Recommended answer

The following does indeed work:

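    import pandas as pd

    gb = df.groupby(['customer', 'url', 'ts'])
    # build a one-row frame of aggregates for each group
    gb.apply(lambda d: pd.DataFrame({
        'views': len(d),
        'visitors': d.session_id.nunique(),    # distinct session ids
        'referrers': [d.referrer.tolist()],    # one list per group
    })).reset_index()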

(This assumes visitors should be counted as unique, as in the SQL above.) You may wish to define the meta of the output.
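One way to arrive at a suitable meta is to prototype the per-group function on a small pandas sample and take an empty slice of its output. The sketch below does that in plain pandas, using the sample rows from the question; the names `pdf`, `keys`, `summarize` and `ddf` are illustrative assumptions, and the Dask call itself appears only as a comment rather than being executed.

```python
import pandas as pd

# Sample rows copied from the question; `pdf` is an illustrative name.
pdf = pd.DataFrame({
    "url": ["url1", "url2", "url2", "url1", "url2"],
    "referrer": ["ref1", "ref2", "ref3", "ref1", "ref2"],
    "session_id": ["xxx", "yyy", "yyy", "xxx", "yyy"],
    "ts": pd.to_datetime(
        ["2017-09-15 00:00:00"] * 3 + ["2017-09-15 01:00:00"] * 2
    ),
    "customer": ["a"] * 5,
})

keys = ["customer", "url", "ts"]


def summarize(group):
    """Return a one-row frame of aggregates for one group (hypothetical helper)."""
    return pd.DataFrame({
        "page_views": [len(group)],
        "visitors": [group["session_id"].nunique()],
        "referrers": [group["referrer"].tolist()],
    })


# Empty frame carrying the output's column names and dtypes, usable as `meta`.
meta = summarize(pdf).iloc[:0]

# pandas prototype of the per-group apply; the inner one-row index level
# produced by `summarize` is dropped before the keys become columns.
result = pdf.groupby(keys).apply(summarize).droplevel(-1).reset_index()

# With a Dask dataframe `ddf` of the same layout (assumed, not run here):
#   ddf.groupby(keys).apply(summarize, meta=meta)
```

On this sample, the distinct count gives one visitor for url2 at 00:00:00, because both of its rows share session yyy.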

Published: 2023-06-13 17:39:29
Article link: https://www.elefans.com/category/jswz/34/686669.html