BigQuery SQL:平均值,几何平均值,移除离群值,中位数

编程入门 行业动态 更新时间:2024-10-28 02:32:42
本文介绍了BigQuery SQL:平均值,几何平均值,移除离群值,中位数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我正在计算平均时间以得到关于Stack Overflow的答复,结果毫无意义.

I'm calculating the average time to get a reply on Stack Overflow, and the results make no sense.

#standardSQL WITH question_answers AS ( SELECT * , timestamp_diff(answers.first, creation_date, minute) minutes FROM ( SELECT creation_date , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c FROM `bigquery-public-data.stackoverflow.posts_answers` b WHERE a.id=b.parent_id ) answers , SPLIT(tags, '|') tags FROM `bigquery-public-data.stackoverflow.posts_questions` a WHERE EXTRACT(year FROM creation_date) > 2015 ), UNNEST(tags) tag WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php') AND answers.c > 0 ) SELECT tag , COUNT(*) questions , ROUND(AVG(minutes), 2) first_reply_avg_minutes FROM question_answers GROUP BY tag

我应该如何计算平均时间?

How should I calculate the average time?

推荐答案

更新2019:如何共享一些坚持使用公共UDF ?

第一个,中位数:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) 3.0

实际上-在平均超过100小时(> 6000分钟)的堆栈溢出问题上获得答案的方法似乎是错误的-很大程度上是由异常值驱动的.

Indeed - getting an average time to get answers on Stack Overflow of above 100 hours (>6000 minutes) seems wrong - and is largely driven by outliers.

代替简单的AVG(),您可以获得:

Instead of doing a simple AVG() you could get:

  • 几何平均值:EXP(AVG(LOG(GREATEST(minutes,1))))
  • 除去异常值后的平均值:AVG(q) FROM (SELECT q FROM QUANTILES(q, 100) LIMIT 80 OFFSET 2)).
  • 中位数:all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))]
  • The geometric mean: EXP(AVG(LOG(GREATEST(minutes,1))))
  • The mean after removing outliers: AVG(q) FROM (SELECT q FROM QUANTILES(q, 100) LIMIT 80 OFFSET 2)).
  • The median: all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))]

如果使用以下任何一种选择,结果将更有意义:

The results make way more sense if you use any of those alternatives:

正如您在此处看到的,在这种情况下,除去异常值可以得出类似于几何均值的结果-而中位数报告的数字甚至更低.使用哪一个?您的选择.

As you can see here, in this case removing outliers give us results similar to the geometric mean - while the median reports even lower numbers. Which one to use? Your choice.

WITH question_answers AS ( SELECT * , timestamp_diff(answers.first, creation_date, minute) minutes FROM ( SELECT creation_date , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c FROM `bigquery-public-data.stackoverflow.posts_answers` b WHERE a.id=b.parent_id ) answers , SPLIT(tags, '|') tags FROM `bigquery-public-data.stackoverflow.posts_questions` a WHERE EXTRACT(year FROM creation_date) > 2015 ), UNNEST(tags) tag WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php', 'sql', 'elasticsearch', 'apache-kafka', 'tensorflow') AND answers.c > 0 ) SELECT * EXCEPT(qs, all_minutes) , (SELECT ROUND(AVG(q),2) FROM (SELECT q FROM UNNEST(qs) q ORDER BY q LIMIT 80 OFFSET 2)) avg_no_outliers , all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64) )] median_minutes FROM ( SELECT tag , COUNT(*) questions , ROUND(AVG(minutes), 2) avg_minutes , ROUND(EXP(AVG(LOG(GREATEST(minutes,1)))),2) first_reply_avg_minutes_geom , APPROX_QUANTILES(minutes, 100) qs , ARRAY_AGG(minutes IGNORE NULLS ORDER BY minutes) all_minutes FROM question_answers GROUP BY tag ) ORDER BY 2 DESC

奖金MEDIAN() 来自Elliott的UDF函数.

CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS (( SELECT IF( MOD(ARRAY_LENGTH(arr), 2) = 0, (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2, arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))] ) FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x) ));

更多推荐

BigQuery SQL:平均值,几何平均值,移除离群值,中位数

本文发布于:2023-10-19 18:27:46,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1508378.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:平均值   中位数   几何   移除   BigQuery

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!