I have two data frames in Spark Scala where the second column of each data frame is an array of numbers
val data22 = Seq((1, List(0.693147, 0.6931471)), (2, List(0.69314, 0.0)), (3, List(0.0, 0.693147))).toDF("ID", "tf_idf")
data22.show(truncate = false)

+---+---------------------+
|ID |tf_idf               |
+---+---------------------+
|1  |[0.693, 0.702]       |
|2  |[0.69314, 0.0]       |
|3  |[0.0, 0.693147]      |
+---+---------------------+

val data12 = Seq((1, List(0.69314, 0.6931471))).toDF("ID", "tf_idf")
data12.show(truncate = false)

+---+--------------------+
|ID |tf_idf              |
+---+--------------------+
|1  |[0.693, 0.805]      |
+---+--------------------+

I need to perform the dot product between the rows of these two data frames. That is, I need to multiply the tf_idf array in data12 with each row of tf_idf in data22.
(Ex: the first row of the dot product should be: 0.693*0.693 + 0.702*0.805
Second row: 0.69314*0.693 + 0.0*0.805
Third row: 0.0*0.693 + 0.693147*0.805)
Basically I want something like a matrix multiplication, data22 * transpose(data12). I would be grateful if someone could suggest a method to do this in Spark Scala.
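(For reference, the per-row dot product described above can be sketched in plain Scala, outside Spark. `dot` is a hypothetical helper, not part of any Spark API, and the numbers are the displayed example values:)

```scala
// Hypothetical helper: element-wise multiply, then sum.
def dot(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

val query = Seq(0.693, 0.805)   // data12's single tf_idf row (as displayed)
val rows = Seq(
  Seq(0.693, 0.702),            // data22 row 1 (as displayed)
  Seq(0.69314, 0.0),            // data22 row 2
  Seq(0.0, 0.693147)            // data22 row 3
)
val products = rows.map(r => dot(r, query))   // one dot product per data22 row
```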
Thank you
The solution is shown below:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val data22 = Seq((1, List(0.693147, 0.6931471)), (2, List(0.69314, 0.0)), (3, List(0.0, 0.693147))).toDF("ID", "tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val data12 = Seq((1, List(0.69314, 0.6931471))).toDF("ID", "tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))

scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)

scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map (z => z._1 * z._2) reduce (_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))

scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID|              tf_idf|         dotProduct|
+---+--------------------+-------------------+
|  1|[0.693147, 0.6931...|   0.96090081381841|
|  2|      [0.69314, 0.0]|0.48044305959999994|
|  3|     [0.0, 0.693147]|    0.4804528329237|
+---+--------------------+-------------------+

Note that this multiplies the tf_idf array in data12 with each row of tf_idf in data22.
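(As a sanity check, the `zip`/`map`/`reduce` expression inside the UDF can be evaluated in plain Scala, without Spark, on the actual row values; it reproduces the `dotProduct` column shown above to within floating-point tolerance. `rowDot` is just a local name for the UDF's body:)

```scala
// The same expression the UDF applies, evaluated directly on each data22 row.
val dotColumn = Seq(0.69314, 0.6931471)   // data12's tf_idf row
def rowDot(y: Seq[Double]): Double =
  y.zip(dotColumn).map(z => z._1 * z._2).reduce(_ + _)

val r1 = rowDot(Seq(0.693147, 0.6931471))  // row 1
val r2 = rowDot(Seq(0.69314, 0.0))         // row 2
val r3 = rowDot(Seq(0.0, 0.693147))        // row 3
```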
Let me know if it helps!