Dask dataframe apply meta

Updated: 2024-10-09 14:23:56

Question

I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta, I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do it for future reference.

Dummy dataframe and column frequencies

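    import pandas as pd
    from dask import dataframe as dd

    # Build the frame row-wise, then transpose so each list becomes a column.
    df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                       ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                       [12, 10, 15, 23, 18, 20, 26]],
                      index=['Column A', 'Column B', 'Column C']).T
    # dask 0.14.3 requires npartitions or chunksize; 2 is an arbitrary choice.
    dask_df = dd.from_pandas(df, npartitions=2)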

    In [39]: dask_df.head()
    Out[39]:
      Column A Column B Column C
    0      Sam      Sam       12
    1     Alex    David       10
    2    David    David       15
    3    Sarah    Alice       23
    4    Alice      Sam       18

    (dask_df.groupby('Column B')
            .apply(lambda group: len(group))
    ).compute()

    UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
      Before: .apply(func)
      After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
      or:     .apply(func, meta=('x', 'f8'))            for series result
      warnings.warn(msg)

    Out[60]:
    Column B
    Alice    2
    David    2
    Sam      3
    dtype: int64

Trying to define meta produces the AttributeError

    (dask_df.groupby('Column B')
            .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()

The same happens with this

    (dask_df.groupby('Column B')
            .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()

The same happens if I make the dtype int instead of "int", or for that matter 'f8' or np.float64, so the dtype does not seem to be what causes the problem.

The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (dask.pydata/en/latest/dataframe-design.html#metadata).

What is meta, and how am I supposed to define it?

Using Python 3.6, dask 0.14.3, and pandas 0.20.2.

Recommended answer

meta is the prescription of the names/types of the output from the computation. This is required because apply() is flexible enough to produce just about anything from a dataframe. As you can see, if you don't provide a meta, dask actually computes part of the data to see what the types should be. That is fine, but you should know it is happening. When you know what the output should look like, you can avoid this pre-computation (which can be expensive) and be more explicit by providing a zero-row version of the output (dataframe or series), or just the types.
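As a minimal sketch of what a "zero-row version of the output" can look like, the snippet below builds empty pandas objects that carry only names and dtypes. The names `sample`, `series_meta`, and `frame_meta`, and the sample data, are illustrative assumptions and do not come from the question.

```python
import pandas as pd

# Hypothetical sample frame, used only to show how a zero-row template is derived.
sample = pd.DataFrame({"name": ["Sam", "Alice"], "count": [3, 2]})

# Zero-row series: carries a name and a dtype but no data.
series_meta = pd.Series(dtype="int64", name="count")

# Zero-row dataframe: slicing off all rows keeps column names and dtypes.
frame_meta = sample.iloc[:0]

print(len(series_meta), series_meta.dtype)
print(len(frame_meta), list(frame_meta.columns))
```

Either object describes the shape of a result without holding any rows, which is the kind of template the meta argument accepts.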

The output of your computation is actually a series, not a dataframe, whereas a dict-style meta describes a dataframe result (as the warning text itself indicates). The following is the simplest that works

    (dask_df.groupby('Column B')
            .apply(len, meta=('int'))).compute()

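Here ('int') is simply the string 'int', not a tuple, so it supplies only a dtype. More accurate is a zero-row series that also carries a name: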

    (dask_df.groupby('Column B')
            .apply(len, meta=pd.Series(dtype='int', name='Column B')))

Published: 2023-11-22 08:01:19
Article link: https://www.elefans.com/category/jswz/34/1616598.html