Loading persisted CrossValidatorModel throws "Param numTrees does not exist" error

I am creating a RandomForestClassifier using Spark 2.0 to solve a multiclass classification problem. I am able to train the model successfully and save it to an S3 bucket using the model.save() method. However, when loading this model with load(), I get the following error.

```
Exception in thread "main" java.util.NoSuchElementException: Param numTrees does not exist.
    at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:609)
    at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:609)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getParam(params.scala:608)
    at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
    at org.apache.spark.ml.util.DefaultParamsReader$$anonfun$getAndSetParams$1.apply(ReadWrite.scala:430)
    at org.apache.spark.ml.util.DefaultParamsReader$$anonfun$getAndSetParams$1.apply(ReadWrite.scala:429)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.ml.util.DefaultParamsReader$.getAndSetParams(ReadWrite.scala:429)
    at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:310)
    at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:284)
    at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:447)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:267)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:265)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:265)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:341)
    at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:335)
    at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:447)
    at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelReader.load(CrossValidator.scala:269)
    at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelReader.load(CrossValidator.scala:256)
    at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
    at org.apache.spark.ml.tuning.CrossValidatorModel$.load(CrossValidator.scala:240)
    at org.apache.spark.ml.tuning.CrossValidatorModel.load(CrossValidator.scala)
```

Below is the code snippet that I use to train and save the model:

```scala
val assembler = new VectorAssembler()
assembler.setInputCols(inputColumnNames)
assembler.setOutputCol("Inputs_Indexed")

// split into 70:30 training and test data
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))

// train using a RandomForest model
val rf = new RandomForestClassifier()
  .setLabelCol("Facing_Indexed")
  .setFeaturesCol("Inputs_Indexed")
  .setNumTrees(500)

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val stageList = new ArrayList[PipelineStage]
stageList.addAll(categoricalInputModels)
stageList.add(labelIndexer)
stageList.add(assembler)
stageList.add(rf)
stageList.add(labelConverter)

// convert the stages list to an array
val stages = new Array[PipelineStage](stageList.size)
stageList.toArray(stages)

val pipeline = new Pipeline().setStages(stages)

val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(3, 5, 8))
  .build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("Facing_Indexed")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)

val model = cv.fit(trainingData)

val predictions = model.transform(testData)
predictions.select("predictedLabel", "Facing", "Inputs_Indexed").show(5)

val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

model.save("s3n://xyz_path/au.model")
```

After the trained model is saved, I use CrossValidatorModel.load("s3n://xyz_path/au.model") to load it in a separate Java program, which throws the above-mentioned error. In my S3 bucket I can see the serialized model. I am not sure where it is going wrong. Any help with this error is appreciated.
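For context, here is a minimal sketch of what that separate loading program looks like. Only the model path comes from the code above; the SparkSession configuration and the new_data.parquet scoring input are hypothetical placeholders.

```java
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ModelLoader {
    public static void main(String[] args) {
        // Master URL and S3 credentials are supplied externally (e.g. via spark-submit)
        SparkSession spark = SparkSession.builder()
                .appName("LoadPersistedModel")
                .getOrCreate();

        // This is the call that throws "Param numTrees does not exist"
        CrossValidatorModel model = CrossValidatorModel.load("s3n://xyz_path/au.model");

        // Score unseen data with the best model found during cross-validation
        Dataset<Row> input = spark.read().parquet("s3n://xyz_path/new_data.parquet"); // placeholder path
        Dataset<Row> predictions = model.transform(input);
        predictions.select("predictedLabel").show(5);
    }
}
```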

Accepted Answer

I figured out what the problem was. The AWS EMR cluster was running Spark 2.1.0, which I was using to train my model and save it to the S3 bucket. However, in my Java program I was pointing to version 2.0.0 of Spark MLlib. There was a breaking change related to the "numTrees" Param in RandomForestClassificationModel, reported in the 2.0-to-2.1 migration guide: http://spark.apache.org/docs/latest/ml-guide.html#from-20-to-21
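A quick way to catch this kind of mismatch is to print the Spark version the loading program actually runs against and compare it with the cluster that saved the model. A minimal sketch (the app name is arbitrary):

```java
import org.apache.spark.sql.SparkSession;

public class VersionCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("VersionCheck")
                .getOrCreate();

        // Should print 2.1.0 here; an older MLlib on the classpath than the
        // version that wrote the model can fail on params added in newer releases.
        System.out.println("Runtime Spark version: " + spark.version());
    }
}
```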

So I updated the Spark MLlib Maven dependency in my Java project to point to version 2.1.0:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
```

It then complained about another missing class:

```
java.lang.NoClassDefFoundError: org/codehaus/commons/compiler/UncheckedCompileException
```

This was fixed by adding the commons-compiler dependency:

```xml
<dependency>
    <groupId>org.codehaus.janino</groupId>
    <artifactId>commons-compiler</artifactId>
    <version>2.7.8</version>
</dependency>
```

And that's how my persisted model was finally loaded successfully!
