I need to match an RDD against the runtime type of its elements.
Given the following class hierarchy:

```scala
trait Fruit
case class Apple(price: Int) extends Fruit
case class Mango(price: Int) extends Fruit
```

Now a stream of type DStream[Fruit] is coming in. Each element is either an Apple or a Mango.
How can I perform an operation based on the subclass? Something like the following (which doesn't work):
```scala
dStream.foreachRDD { rdd: RDD[Fruit] =>
  rdd match {
    case rdd: RDD[Apple] => // do something
    case rdd: RDD[Mango] => // do something
    case _ => println(rdd.count() + "<<<< not matched anything")
  }
}
```

Answer
Since we have an RDD[Fruit], any row can be either an Apple or a Mango. When using foreachRDD, each RDD will contain a mix of these (and possibly other) types.
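This is also why the rdd match in the question cannot work: generic type parameters such as the element type of an RDD are erased at runtime, so a pattern like case rdd: RDD[Apple] matches every RDD. A minimal plain-Scala sketch of the same erasure problem (no Spark needed; ErasureDemo and describe are illustrative names, not part of any API):

```scala
object ErasureDemo {
  // Tries to distinguish lists by their element type at runtime.
  // The element type is erased, so the first case matches every List.
  def describe(xs: Any): String = xs match {
    case _: List[String @unchecked] => "List[String]"
    case _                          => "something else"
  }

  def main(args: Array[String]): Unit = {
    println(describe(List("a", "b"))) // List[String]
    println(describe(List(1, 2, 3)))  // also List[String] -- erasure!
  }
}
```

The @unchecked annotation only suppresses the compiler's erasure warning; it does not make the check work.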
To differentiate between the types, we can use collect[U](f: PartialFunction[T, U]): RDD[U] (not to be confused with collect(): Array[T], which returns an array with the RDD's elements). This method returns an RDD containing all values matched by the partial function f (in this case, a pattern match).
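RDD.collect with a partial function behaves just like collect on ordinary Scala collections: elements the partial function is not defined for are silently dropped. A small Spark-free sketch of the same idea (CollectDemo and tag are illustrative names chosen here, not part of the answer):

```scala
sealed trait Fruit
case class Apple(price: Int) extends Fruit
case class Mango(price: Int) extends Fruit
case class Orange(price: Int) extends Fruit

object CollectDemo {
  // Keep only apples and mangoes, tagging each with a label;
  // oranges fall outside the partial function and are dropped.
  def tag(fruits: List[Fruit]): List[(String, Int)] = fruits.collect {
    case Apple(p) => ("APPLE", p)
    case Mango(p) => ("MANGO", p)
  }

  def main(args: Array[String]): Unit =
    println(tag(List(Apple(5), Mango(11), Orange(1)))) // List((APPLE,5), (MANGO,11))
}
```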
Below is a small illustrative example (adding an Orange to the fruits as well).
Setup:
```scala
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val inputData: Queue[RDD[Fruit]] = Queue()
val dStream: InputDStream[Fruit] = ssc.queueStream(inputData)

inputData += spark.sparkContext.parallelize(Seq(Apple(5), Apple(5), Mango(11)))
inputData += spark.sparkContext.parallelize(Seq(Mango(10), Orange(1), Orange(3)))
```

This creates a stream of RDD[Fruit] with two separate RDDs.
```scala
dStream.foreachRDD { rdd: RDD[Fruit] =>
  val mix = rdd.collect {
    case row: Apple => ("APPLE", row.price) // do any computation on apple rows
    case row: Mango => ("MANGO", row.price) // do any computation on mango rows
    //case row => ... // do something with other rows (removed by default)
  }
  mix foreach println
}
```

In the collect above, we change each row slightly (replacing the class with a tag) and then print the resulting RDD. Result:
```
// First RDD
(MANGO,11)
(APPLE,5)
(APPLE,5)

// Second RDD
(MANGO,10)
```

As can be seen, the pattern match has kept and transformed the rows containing Apple and Mango while removing all Orange instances.
Separate RDDs
If wanted, it is also possible to separate the two subclasses into their own RDDs as follows. Any computation can then be performed on these two RDDs.
```scala
val apple = rdd.collect { case row: Apple => row }
val mango = rdd.collect { case row: Mango => row }
```
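A nice property of this split is that the results are statically typed as the subclasses, so subclass-specific fields can be used directly afterwards. A Spark-free sketch with plain lists (SplitDemo and split are illustrative names introduced here):

```scala
sealed trait Fruit
case class Apple(price: Int) extends Fruit
case class Mango(price: Int) extends Fruit

object SplitDemo {
  // Split a mixed list into two precisely typed sub-lists,
  // mirroring the two rdd.collect calls above.
  def split(fruits: List[Fruit]): (List[Apple], List[Mango]) = {
    val apples: List[Apple] = fruits.collect { case a: Apple => a }
    val mangoes: List[Mango] = fruits.collect { case m: Mango => m }
    (apples, mangoes)
  }

  def main(args: Array[String]): Unit =
    println(split(List(Apple(5), Mango(11), Apple(2))))
}
```

Note that this traverses the data twice, once per subclass; for two classes on an RDD that is usually acceptable, but with many subclasses a single pass producing tagged tuples (as in the collect example earlier) avoids repeated scans.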
Full example code
```scala
import scala.collection.mutable.Queue

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

trait Fruit
case class Apple(price: Int) extends Fruit
case class Mango(price: Int) extends Fruit
case class Orange(price: Int) extends Fruit

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    val inputData: Queue[RDD[Fruit]] = Queue()
    val inputStream: InputDStream[Fruit] = ssc.queueStream(inputData)

    inputData += spark.sparkContext.parallelize(Seq(Apple(5), Apple(5), Mango(11)))
    inputData += spark.sparkContext.parallelize(Seq(Mango(10), Orange(1), Orange(3)))

    inputStream.foreachRDD { rdd: RDD[Fruit] =>
      val mix = rdd.collect {
        case row: Apple => ("APPLE", row.price) // do any computation on apple rows
        case row: Mango => ("MANGO", row.price) // do any computation on mango rows
        //case row => ... // do something with other rows (removed by default)
      }
      mix foreach println
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```