Mapping PySpark random forest feature importances back to column names after column transformations

Updated: 2024-10-11 17:23:13

Problem description

I am trying to plot the feature importances of some tree-based models together with their column names. I am using PySpark.

Since I have textual categorical variables as well as numeric ones, I had to use a pipeline, which looks something like this:

  • use a StringIndexer to index the string columns
  • use a one-hot encoder on all the indexed columns
  • use a VectorAssembler to create the feature column containing the feature vector

    Some sample code from the docs for steps 1, 2, and 3:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                          "relationship", "race", "sex", "native_country"]
    stages = []  # stages in our Pipeline
    for categoricalCol in categoricalColumns:
        # Category Indexing with StringIndexer
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
        # Use OneHotEncoder to convert categorical variables into binary SparseVectors
        # (note: OneHotEncoderEstimator was renamed to OneHotEncoder in Spark 3.0)
        # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                         outputCols=[categoricalCol + "classVec"])
        # Add stages. These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
    assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler]

    # Create a Pipeline.
    pipeline = Pipeline(stages=stages)
    # Run the feature transformations.
    #  - fit() computes feature statistics as needed.
    #  - transform() actually transforms the features.
    pipelineModel = pipeline.fit(dataset)
    dataset = pipelineModel.transform(dataset)

  • finally, train the model

    After training and evaluation, I can use "model.featureImportances" to get the feature rankings. However, I don't get the feature/column names, only the feature indices, something like this:

    print(dtModel_1.featureImportances)
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
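For readers unfamiliar with the printed form: assuming it follows Spark's sparse-vector notation `(size, [indices], [values])`, the output above says that only 7 of the 38895 feature slots have a non-zero importance. The plain-Python sketch below simply copies those numbers by hand and ranks them. It does not call Spark, and the variable names are illustrative only.

```python
# Values copied by hand from the printed sparse vector above:
# total vector size, indices of the non-zero slots, and their importances.
size = 38895
indices = [38708, 38714, 38719, 38720, 38737, 38870, 38894]
values = [0.0742343395738, 0.169404823667, 0.100485791055, 0.0105823115814,
          0.0134236162982, 0.194124862158, 0.437744255667]

# Pair each slot index with its importance and rank from most to least important.
ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)

for idx, importance in ranked:
    print(f"feature slot {idx}: {importance:.4f}")
```

The indices are positions in the assembled "features" vector, not positions in the list of original columns, which is why they cannot be read off directly as column names.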

    How do I map these back to the initial column names and their values so that I can plot them?

    Recommended answer

    Extract the metadata, as shown by user6910411:

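    from itertools import chain

    # Collect (index, name) pairs for every slot of the assembled "features" vector
    attrs = sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
    )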

    and combine it with the feature importances:

    [(name, dtModel_1.featureImportances[idx])
     for idx, name in attrs
     if dtModel_1.featureImportances[idx]]
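The two snippets above can be wrapped into small helpers and tried out without a Spark session. This is a minimal sketch, not part of the original answer: `name_importances` and `sum_by_column` are hypothetical helper names, and the `metadata` dict and `importances` list are made-up stand-ins for `dataset.schema["features"].metadata` and `dtModel_1.featureImportances`. The second helper assumes that one-hot slot names have the form `<assembler input column>_<category>`, which should be checked against your own metadata.

```python
from collections import defaultdict
from itertools import chain


def name_importances(metadata, importances):
    """Pair each feature name found in the ml_attr metadata with its non-zero importance."""
    attrs = sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*metadata["ml_attr"]["attrs"].values())
    )
    return [(name, importances[idx]) for idx, name in attrs if importances[idx]]


def sum_by_column(named, input_cols):
    """Add up the importances of slots that belong to the same assembler input column."""
    totals = defaultdict(float)
    # Try longer column names first so that one name being a prefix of another is harmless.
    candidates = sorted(input_cols, key=len, reverse=True)
    for name, value in named:
        # ASSUMPTION: slot names look like "<input column>_<category>" or "<input column>".
        source = next(
            (c for c in candidates if name == c or name.startswith(c + "_")),
            name,
        )
        totals[source] += value
    return dict(totals)


# Made-up example data, shaped like the metadata the answer reads from the schema.
metadata = {"ml_attr": {"attrs": {
    "binary": [
        {"idx": 0, "name": "workclassclassVec_Private"},
        {"idx": 1, "name": "workclassclassVec_Self-emp-not-inc"},
        {"idx": 2, "name": "sexclassVec_Male"},
    ],
    "numeric": [{"idx": 3, "name": "age"}],
}, "num_attrs": 4}}
importances = [0.25, 0.125, 0.0, 0.625]

named = name_importances(metadata, importances)
by_column = sum_by_column(named, ["workclassclassVec", "sexclassVec", "age"])
print(named)
print(by_column)
```

Either list can then be passed to a plotting library as labels and bar heights. Summing per input column is one possible way to report a single value per original column; the per-category breakdown in `named` is what the answer itself produces.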


    Published: 2023-07-18 14:24:49
    Article link: https://www.elefans.com/category/jswz/34/1145686.html