我有一个像这样的SparkSQL DataFrame:
I have a SparkSQL DataFrame like this:
name gender age isActive points ------------------------------- Bob M 12 true 100 Hal M 16 false 80 Pat F 21 true 70 Lin F 17 false 40 Zac M 18 true 20 Mei F 19 true 10 Sal M 13 false 10我有一些类似的功能:
def isEligible(prog: String) (name: String, gender: String, age: Int, isActive: Boolean, points: Int): Boolean确定某人是否有资格参加特定计划.对于实例,以下函数调用将告诉我Bob是否符合Program1的条件:
that determines whether someone is eligible for a particular program. For Instance, the following function call would tell me whether Bob is eligible for Program1:
isEligible("Program1", "Bob", "M", 12, true, 100)一个人可能有资格参加多个课程.我想编写一个使用此DataFrame并输出摘要DataFrame的函数,如下所示:
A person may be eligible for more than one program. I want to write a function that takes this DataFrame, and outputs a summary DataFrame like so:
prog1 prog2 prog3 prog4 ----------------------- 7 3 2 5显示符合每个计划条件的人数.在Spark中执行此操作的最佳方法是什么?我知道我可以使用 struct 和 agg 函数,但是我不知道如何将我的 isEligible 函数合并到SparkSQL查询中.
which shows the number of people who are eligible for each program. What is the best way to do this in Spark? I know I can use struct and agg functions, but I don't know how to incorporate my isEligible function into the SparkSQL query.
推荐答案定义程序列表:
val progs = Seq("prog1", "prog2", "prog3", "prog4")定义表达式
@transient val exprs = progs.map(p => { val f = udf(isEligible(p) _) sum(f( $"name", $"gender", $"age", $"isActive", $"points" ).cast("long")).alias(p) }) df.select(exprs: _*)您还可以使用强类型数据集:
You could also use strongly typed dataset:
import org.apache.spark.sql.Row case class Record(name: String, gender: String, age: Int, isActive: Boolean, points: Int) df.as[Record].flatMap { case Record(name, gender, age, isActive, points) => progs.filter(p => isEligible(p)(name, gender, age, isActive, points)) }.groupBy().pivot("value", progs).count()更多推荐
如果存在多个条件,则SparkSQL总和
发布评论