I have two data frames in Spark Scala where the second column of each data frame is an array of numbers
val data22 = Seq((1, List(0.693147, 0.6931471)), (2, List(0.69314, 0.0)), (3, List(0.0, 0.693147))).toDF("ID", "tf_idf")
data22.show(truncate = false)

+---+---------------------+
|ID |tf_idf               |
+---+---------------------+
|1  |[0.693, 0.702]       |
|2  |[0.69314, 0.0]       |
|3  |[0.0, 0.693147]      |
+---+---------------------+

val data12 = Seq((1, List(0.69314, 0.6931471))).toDF("ID", "tf_idf")
data12.show(truncate = false)

+---+--------------------+
|ID |tf_idf              |
+---+--------------------+
|1  |[0.693, 0.805]      |
+---+--------------------+

I need to perform the dot product between the rows of these two data frames. That is, I need to multiply the tf_idf array in data12 with each row of tf_idf in data22.
(Ex: the first row of the dot product should be: 0.693*0.693 + 0.702*0.805
Second row: 0.69314*0.693 + 0.0*0.805
Third row: 0.0*0.693 + 0.693147*0.805)
Basically I want something like a matrix multiplication, data22 * transpose(data12). I would be grateful if someone could suggest a method to do this in Spark Scala.
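(For reference, the per-row dot product described above can be sketched in plain Scala, outside Spark. `dot` is a hypothetical helper, not part of any Spark API, and the numbers are the displayed example values:)

```scala
// Hypothetical helper: element-wise multiply, then sum.
def dot(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

val query = Seq(0.693, 0.805)   // data12's single tf_idf row (as displayed)
val rows = Seq(
  Seq(0.693, 0.702),            // data22 row 1 (as displayed)
  Seq(0.69314, 0.0),            // data22 row 2
  Seq(0.0, 0.693147)            // data22 row 3
)
val products = rows.map(r => dot(r, query))   // one dot product per data22 row
```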
Thank you
The solution is shown below:
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val data22 = Seq((1, List(0.693147, 0.6931471)), (2, List(0.69314, 0.0)), (3, List(0.0, 0.693147))).toDF("ID", "tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val data12 = Seq((1, List(0.69314, 0.6931471))).toDF("ID", "tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))

scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)

scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map (z => z._1 * z._2) reduce (_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))

scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID|              tf_idf|         dotProduct|
+---+--------------------+-------------------+
|  1|[0.693147, 0.6931...|   0.96090081381841|
|  2|      [0.69314, 0.0]|0.48044305959999994|
|  3|     [0.0, 0.693147]|    0.4804528329237|
+---+--------------------+-------------------+

Note that this multiplies the tf_idf array in data12 with each row of tf_idf in data22.
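(As a sanity check, the `zip`/`map`/`reduce` expression inside the UDF can be evaluated in plain Scala, without Spark, on the actual row values; it reproduces the `dotProduct` column shown above to within floating-point tolerance. `rowDot` is just a local name for the UDF's body:)

```scala
// The same expression the UDF applies, evaluated directly on each data22 row.
val dotColumn = Seq(0.69314, 0.6931471)   // data12's tf_idf row
def rowDot(y: Seq[Double]): Double =
  y.zip(dotColumn).map(z => z._1 * z._2).reduce(_ + _)

val r1 = rowDot(Seq(0.693147, 0.6931471))  // row 1
val r2 = rowDot(Seq(0.69314, 0.0))         // row 2
val r3 = rowDot(Seq(0.0, 0.693147))        // row 3
```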
Let me know if it helps!