我知道BigQuery正在提供COUNT DISTINCT的估计,但有没有关于错误有多大以及它依赖什么样的参数的信息?
谢谢
I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?
Thanks
最满意答案
COUNT DISTINCT估计的准确性取决于干扰值的实际数量。 如果它很小 - 算法非常准确(对于较小的值,它通常会返回精确的值),但是不同值的数量越多 - 它就越不准确。 注意,COUNT(DISTINCT)采用第二个参数,它将内存换成准确性,即它将使用更多内存,但更准确。 例如:
SELECT COUNT(DISTINCT x, 100000) FROM T如果distict值的总数小于100,000,将返回相当准确的结果。
COUNT个不同估计的精确算法有所不同,但不同的变化具有相似的误差估计 - 约1 / SQRT(N),其中N是第二个参数。 默认值为1000,相当于约3%的误差。 如果碰到10000,则误差约为1%。
The accuracy of COUNT DISTINCT estimation depends on real number of distict values. If it is small - the algorithm is pretty accurate (for small values it usually returns the exact value), but the bigger number of distinct values is - the less accurate it can become. Note, that COUNT(DISTINCT) takes second argument, which trades memory for accuracy, i.e. it will use more memory, but be more accurate. For example:
SELECT COUNT(DISTINCT x, 100000) FROM Twill return fairly accurate results if total number of distict values is less than 100,000.
The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.
更多推荐
发布评论