BigQuery COUNT DISTINCT估计错误(BigQuery COUNT DISTINCT estimation error)

编程入门 行业动态 更新时间:2024-10-20 00:39:19
BigQuery COUNT DISTINCT估计错误(BigQuery COUNT DISTINCT estimation error)

我知道BigQuery正在提供COUNT DISTINCT的估计,但有没有关于错误有多大以及它依赖什么样的参数的信息?

谢谢

I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?

Thanks

最满意答案

COUNT DISTINCT估计的准确性取决于干扰值的实际数量。 如果它很小 - 算法非常准确(对于较小的值,它通常会返回精确的值),但是不同值的数量越多 - 它就越不准确。 注意,COUNT(DISTINCT)采用第二个参数,它将内存换成准确性,即它将使用更多内存,但更准确。 例如:

SELECT COUNT(DISTINCT x, 100000) FROM T

如果distict值的总数小于100,000,将返回相当准确的结果。

COUNT个不同估计的精确算法有所不同,但不同的变化具有相似的误差估计 - 约1 / SQRT(N),其中N是第二个参数。 默认值为1000,相当于约3%的误差。 如果碰到10000,则误差约为1%。

The accuracy of COUNT DISTINCT estimation depends on real number of distict values. If it is small - the algorithm is pretty accurate (for small values it usually returns the exact value), but the bigger number of distinct values is - the less accurate it can become. Note, that COUNT(DISTINCT) takes second argument, which trades memory for accuracy, i.e. it will use more memory, but be more accurate. For example:

SELECT COUNT(DISTINCT x, 100000) FROM T

will return fairly accurate results if total number of distict values is less than 100,000.

The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.

更多推荐

本文发布于:2023-08-06 18:50:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1454572.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:错误   COUNT   BigQuery   DISTINCT   error

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!