PySpark & MLlib: Random Forest Feature Importances

Updated: 2024-10-11 07:26:42

Problem description

I'm trying to extract the feature importances of a random forest model I trained using PySpark. However, I don't see an example of this anywhere in the documentation, nor is there a corresponding method on RandomForestModel.

How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?

Here's the sample code from the documentation to get us started; it makes no mention of feature importances.

    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils

    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

I don't see a model.__featureImportances_ attribute available -- where can I find this?

Recommended answer

UPDATE for version >= 2.0.0

From version 2.0.0 onward, featureImportances is available for Random Forest.

In fact, the Spark documentation states:

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.

Users can find more information about ensemble algorithms in the MLlib Ensemble guide. In this section, we demonstrate the DataFrame API for ensembles.

The main differences between this API and the original MLlib ensembles API are:

  • support for DataFrames and ML Pipelines
  • separation of classification vs. regression
  • use of DataFrame metadata to distinguish continuous and categorical features
  • more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.
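
To make "estimates of feature importance" a bit more concrete, here is a rough pure-Python sketch of the calculation as I understand it from the Spark API docs: within each tree, a feature's importance is the sum of the split gains on that feature, each scaled by the number of instances reaching the node; each tree's vector is normalized to sum to 1; the per-tree vectors are then averaged and normalized again. The input representation (a list of `(feature_index, gain, n_instances)` tuples per tree) and both function names are assumptions made for illustration, not Spark's internal data structures.

```python
def tree_importances(splits, num_features):
    """Importance vector for one tree.

    `splits` is a hypothetical representation: one
    (feature_index, gain, n_instances) tuple per internal node.
    """
    raw = [0.0] * num_features
    for feature, gain, n_instances in splits:
        # Gain is weighted by how many training instances pass through the node.
        raw[feature] += gain * n_instances
    total = sum(raw)
    # Normalize so the tree's importances sum to 1 (if it has any splits).
    return [v / total for v in raw] if total > 0 else raw


def forest_importances(trees, num_features):
    """Combine per-tree importances and normalize the result to sum to 1."""
    summed = [0.0] * num_features
    for splits in trees:
        for j, value in enumerate(tree_importances(splits, num_features)):
            summed[j] += value
    total = sum(summed)
    return [v / total for v in summed] if total > 0 else summed


# Two tiny made-up trees over two features.
example_trees = [
    [(0, 0.5, 10), (1, 0.25, 4)],  # tree 1 splits on feature 0, then feature 1
    [(1, 0.3, 10)],                # tree 2 splits only on feature 1
]
example_importances = forest_importances(example_trees, num_features=2)
```

Summing the normalized per-tree vectors and normalizing once more gives the same result as averaging them first, which is why the sketch skips the explicit division by the number of trees.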

If you want feature importance values, you have to work with the ml package, not mllib, and use DataFrames.
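
If your data is already an RDD of mllib LabeledPoint (as in the question), one possible way to move it over to the ml package is sketched below. It assumes Spark 2.0 or later, where mllib vectors have an `asML()` method, and an active SparkSession so that `toDF` is available on RDDs. The helper name `labeled_points_to_ml_df` is made up for this example.

```python
def labeled_points_to_ml_df(labeled_points_rdd):
    """Turn an RDD of mllib LabeledPoint into a DataFrame usable by pyspark.ml.

    Assumes each element has `.label` and `.features`, and that `.features`
    is a pyspark.mllib.linalg vector (which provides asML() in Spark >= 2.0).
    """
    # asML() converts the mllib vector into its pyspark.ml.linalg counterpart.
    rows = labeled_points_rdd.map(
        lambda lp: (float(lp.label), lp.features.asML())
    )
    # "label" and "features" are the default column names expected by ml estimators.
    return rows.toDF(["label", "features"])


# Usage sketch (needs a running SparkSession; not executed here):
#   from pyspark.ml.classification import RandomForestClassifier
#   df = labeled_points_to_ml_df(trainingData)
#   model = RandomForestClassifier(numTrees=3, maxDepth=4, seed=42).fit(df)
#   print(model.featureImportances)
```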

Below is an example, adapted from the PySpark documentation:

    # IMPORT
    >>> import numpy
    >>> from numpy import allclose
    >>> from pyspark.ml.linalg import Vectors
    >>> from pyspark.ml.feature import StringIndexer
    >>> from pyspark.ml.classification import RandomForestClassifier

    # PREPARE DATA
    >>> df = spark.createDataFrame([
    ...     (1.0, Vectors.dense(1.0)),
    ...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
    >>> si_model = stringIndexer.fit(df)
    >>> td = si_model.transform(df)

    # BUILD THE MODEL
    >>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
    >>> model = rf.fit(td)

    # FEATURE IMPORTANCES
    >>> model.featureImportances
    SparseVector(1, {0: 1.0})
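
The vector returned by featureImportances is indexed by position, not by column name. If you know the ordered list of columns that went into the features vector (for example, the inputCols you passed to a VectorAssembler), a small helper like the one below can pair names with values and sort them. This is only a sketch: `rank_importances` is a made-up name, and it assumes one vector slot per named column, which will not hold if a column was expanded into several slots (e.g. by one-hot encoding).

```python
def rank_importances(feature_names, importances):
    """Pair feature names with importance values, largest first.

    `importances` can be any sequence of numbers, for example
    model.featureImportances.toArray() from a fitted pyspark.ml model.
    """
    names = list(feature_names)
    values = [float(v) for v in importances]
    if len(names) != len(values):
        raise ValueError(
            "expected %d importance values, got %d" % (len(names), len(values))
        )
    # sorted() is stable, so features with equal importance keep their input order.
    return sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)


# Plain-Python illustration with made-up names and numbers.
ranked = rank_importances(["age", "income", "height"], [0.2, 0.7, 0.1])
for name, value in ranked:
    print("%-8s %.3f" % (name, value))
```

With a real model you would call it as `rank_importances(assembler.getInputCols(), model.featureImportances.toArray())`, where `assembler` stands for whatever VectorAssembler built your features column.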

Published: 2023-07-18 14:24:13
Article link: https://www.elefans.com/category/jswz/34/1145681.html