Spark 1.6
If I have a dataset and I want to identify the features with the greatest predictive power using Pearson correlation, which tools should I use?
The naive approach I used was:
```scala
val columns = x.columns.toList.filterNot(List("id", "maxcykle", "rul") contains)
val corrVithRul = columns.map(c => (c, x.stat.corr("rul", c, "pearson")))
```

Output:

```
columns: List[String] = List(cykle, setting1, setting2, setting3, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15, s16, s17, s18, s19, s20, s21, label1, label2, a1, sd1, a2, sd2, a3, sd3, a4, sd4, a5, sd5, a6, sd6, a7, sd7, a8, sd8, a9, sd9, a10, sd10, a11, sd11, a12, sd12, a13, sd13, a14, sd14, a15, sd15, a16, sd16, a17, sd17, a18, sd18, a19, sd19, a20, sd20, a21, sd21)
corrVithRul: List[(String, Double)] = List((cykle,-0.7362405993070199), (setting1,-0.0031984575547410617), (setting2,-0.001947628351500473), (setting3,NaN), (s1,-0.011460304217886725), (s2,-0.6064839743782909), (s3,-0.5845203909175897), (s4,-0.6789482333860454), (s5,-0.011121400898477964), (s6,-0.1283484484732187), (s7,0.6572226620548292), (s8,-0.5639684065744165), (s9,-0.3901015749180319), (s10,-0.04924720421765515), (s11,-0.6962281014554186), (s12,0.6719831036132922), (s13,-0.5625688251505582), (s14,-0.30676887025759053), (s15,-0.6426670441973734), (s16,-0.09716223410021836), (s17,-0.6061535537829589), (s18,NaN), (s19,NaN), (s20,0.6294284994377392), (s21,0.6356620421802835), (label1,-0.5665958821050425), (label2,-0.548191636440298), (a1,0.040592887198906136), (sd1,NaN), (a2,-0.7364292...
```
This of course submits one job per map iteration. Statistics.corr might be what I am looking for?
Accepted answer:
Statistics.corr looks like the correct choice here. Other options you may consider are RowMatrix.columnSimilarities (cosine similarities between columns, optionally an optimized version that uses sampling with a threshold) and RowMatrix.computeCovariance. One way or another, you'll have to assemble your data into Vectors first. Assuming the columns are already of DoubleType, you can use VectorAssembler:
```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector

val df: DataFrame = ???

val assembler = new VectorAssembler()
  .setInputCols(df.columns.diff(Seq("id", "maxcykle", "rul")))
  .setOutputCol("features")

val rows = assembler.transform(df)
  .select($"features")
  .rdd
  .map(_.getAs[Vector]("features"))
```

Next, you can use Statistics.corr:
```scala
import org.apache.spark.mllib.stat.Statistics

Statistics.corr(rows)
```

or convert to a RowMatrix:
```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
mat.columnSimilarities(0.75)
```
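Whichever route you take, the original goal was to rank features by predictive power, so the last step is sorting the per-column correlations by absolute value. A minimal plain-Scala sketch of that ranking, reusing a few (column, correlation) pairs sampled from the output above (the values are illustrative, not the full list):

```scala
// Hypothetical subset of the (column, Pearson r with "rul") pairs computed earlier.
val corrWithRul: List[(String, Double)] = List(
  ("cykle", -0.7362), ("setting1", -0.0032), ("setting3", Double.NaN),
  ("s7", 0.6572), ("s12", 0.6720)
)

// Drop NaN entries (typically constant columns, e.g. setting3, s18, s19)
// and sort by |r| descending, since strong negative correlation is
// just as predictive as strong positive correlation.
val ranked = corrWithRul
  .filterNot { case (_, r) => r.isNaN }
  .sortBy { case (_, r) => -math.abs(r) }

ranked.foreach { case (c, r) => println(f"$c%-10s $r%+.4f") }
```

Note that if you use Statistics.corr(rows), you get a local Matrix rather than named pairs, so you would index its rows/columns with the same column list you passed to the VectorAssembler.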