How to use the average function in Neo4j with a collection

Updated: 2024-10-28 10:29:33

I want to calculate the covariance of two vectors stored as collections, A = [1, 2, 3, 4] and B = [5, 6, 7, 8]:

    Cov(A, B) = Sigma[(a_i - AVG_a) * (b_i - AVG_b)] / (n - 1)

My problems with the covariance computation are:

1) I cannot nest aggregate functions, so this does not work:

    SUM((ai - avg(a)) * (bi - avg(b)))

2) Alternatively, how can I iterate over two collections in a single REDUCE, such as:

    REDUCE(x = 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai - avg(a)) * (bi - avg(b)))

3) If it is not possible to iterate over two collections in one REDUCE, how can I relate their values to calculate the covariance when they are reduced separately?

    REDUCE(x = 0.0, ai IN COLLECT(a) | x + (ai - avg(a)))
    REDUCE(y = 0.0, bi IN COLLECT(b) | y + (bi - avg(b)))

In other words, can I write nested REDUCEs?

4) Is there any way to do this with UNWIND or EXTRACT?

Thanks in advance for any help.

Best answer

cybersam's answer is totally fine, but if you want to avoid the n^2 Cartesian product that results from the double UNWIND, you can do this instead:

    WITH [1,2,3,4] AS a, [5,6,7,8] AS b
    WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
         REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
         SIZE(a) AS n, a, b
    RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
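As a sanity check outside Neo4j, the same two-pass computation can be sketched in Python. This is only an illustration of the formula from the question, not part of the original answer, and the function name `sample_covariance` is made up for this example.

```python
def sample_covariance(a, b):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    if len(a) != len(b):
        raise ValueError("both sequences must have the same length")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two elements")
    # First pass: the two means, like the REDUCE(...) / SIZE(...) expressions.
    e_a = sum(a) / n
    e_b = sum(b) / n
    # Second pass: pair elements by index, like RANGE(0, n - 1) with a[i], b[i].
    total = sum((a[i] - e_a) * (b[i] - e_b) for i in range(n))
    return total / (n - 1)


cov = sample_covariance([1, 2, 3, 4], [5, 6, 7, 8])
print(cov)
```

For the vectors from the question the means are 2.5 and 6.5, the products of the deviations sum to 5.0, and dividing by n - 1 = 3 gives roughly 1.667.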

Edit:

Not calling anyone out, but let me elaborate on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. As I said below, UNWINDing k collections of length n in Cypher produces n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.

    > WITH [1,2,3] AS a, [4,5,6] AS b
      UNWIND a AS aa
      UNWIND b AS bb
      RETURN aa, bb;

       | aa | bb
    ---+----+----
     1 | 1  | 4
     2 | 1  | 5
     3 | 1  | 6
     4 | 2  | 4
     5 | 2  | 5
     6 | 2  | 6
     7 | 3  | 4
     8 | 3  | 5
     9 | 3  | 6

Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.

    > WITH [1,2,3] AS a, [4,5,6] AS b
      UNWIND a AS aa
      UNWIND b AS bb
      RETURN AVG(aa), AVG(bb);

       | AVG(aa) | AVG(bb)
    ---+---------+---------
     1 | 2.0     | 5.0

Also, as I said below, this doesn't affect the answer, because repeating every element of a vector the same number of times leaves its average unchanged. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. The extra rows are likely inconsequential for small values of n, but as n grows you'll start seeing a performance decrease.
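The row blow-up and the unchanged averages can be reproduced outside Neo4j with a small Python sketch. It treats the two chained UNWINDs as a Cartesian product, which is an analogy for illustration rather than a description of how Neo4j executes the query; the variable names are made up.

```python
from itertools import product
from statistics import mean

a = [1, 2, 3]
b = [4, 5, 6]

# Two chained UNWINDs behave like a Cartesian product: n^k = 3^2 = 9 rows.
rows = list(product(a, b))
aa_column = [aa for aa, _ in rows]  # 1, 1, 1, 2, 2, 2, 3, 3, 3
bb_column = [bb for _, bb in rows]  # 4, 5, 6, 4, 5, 6, 4, 5, 6

print(len(rows))
# Each value is repeated the same number of times, so the means do not move.
print(mean(aa_column) == mean(a))
print(mean(bb_column) == mean(b))
```

Every element of `a` shows up once per element of `b` (and vice versa), which is why the averages survive while the row count is squared.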

Let's say you have two vectors of roughly 1,000 elements each (RANGE is inclusive, so the lists below hold 1,001 values). Calculating the average of each with a double UNWIND:

    > WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
      UNWIND a AS aa
      UNWIND b AS bb
      RETURN AVG(aa), AVG(bb);

       | AVG(aa) | AVG(bb)
    ---+---------+---------
     1 | 500.0   | 1500.0

714 ms

is significantly slower than using REDUCE:

    > WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
      RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
             REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;

       | e_a   | e_b
    ---+-------+--------
     1 | 500.0 | 1500.0

4 ms
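One way to see where the gap comes from is to count how many values each approach has to touch. The Python sketch below mimics both strategies on the same inputs; it is not a benchmark of Neo4j, and the loop over `product` is only a stand-in for the rows a double UNWIND generates.

```python
from itertools import product

a = list(range(0, 1001))     # Cypher's RANGE(0, 1000) includes both ends
b = list(range(1000, 2001))  # likewise RANGE(1000, 2000)

# Double UNWIND analogue: aggregate over every (aa, bb) pair.
row_count = 0
sum_aa = 0
sum_bb = 0
for aa, bb in product(a, b):
    row_count += 1
    sum_aa += aa
    sum_bb += bb
avg_aa = sum_aa / row_count
avg_bb = sum_bb / row_count

# REDUCE analogue: a single pass over each list.
e_a = sum(a) / len(a)
e_b = sum(b) / len(b)

print(row_count, len(a) + len(b))
print(avg_aa == e_a, avg_bb == e_b)
```

The pairwise loop visits 1,001^2 = 1,002,001 rows while the single passes visit 2,002 values, and both arrive at the same averages of 500.0 and 1500.0.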

To bring it all together, I'll compare the two queries in full on the same 1,001-element vectors:

    > WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
      UNWIND aa AS a
      UNWIND bb AS b
      WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
      RETURN REDUCE(s = 0, i IN RANGE(0, n - 1) | s + ((aa[i] - avgA) * (bb[i] - avgB))) / (n - 1) AS covariance;

       | covariance
    ---+------------
     1 | 83583.5

9105 ms

    > WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
      WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
           REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
           SIZE(a) AS n, a, b
      RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;

       | cov
    ---+---------
     1 | 83583.5

33 ms
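The reported covariance of 83583.5 can be cross-checked independently of Neo4j. The following Python sketch pairs the two lists in lockstep with `zip` instead of indexing; the helper name `sample_covariance` is hypothetical, and the timings above are not reproduced here.

```python
def sample_covariance(a, b):
    """Sample covariance with an n - 1 denominator; assumes len(a) == len(b) >= 2."""
    n = len(a)
    e_a = sum(a) / n
    e_b = sum(b) / n
    # zip walks both lists together, one (x, y) pair per position.
    return sum((x - e_a) * (y - e_b) for x, y in zip(a, b)) / (n - 1)


a = list(range(0, 1001))     # same values as Cypher's RANGE(0, 1000)
b = list(range(1000, 2001))  # same values as Cypher's RANGE(1000, 2000)
cov = sample_covariance(a, b)
print(cov)
```

Because `b` is just `a` shifted by 1000, the result equals the sample variance of 0..1000, which matches the 83583.5 returned by both queries.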

Published: 2023-08-02 18:44:00
Source: https://www.elefans.com/category/jswz/34/1380044.html