Initializing logistic regression coefficients when using Spark's dataset-based ML API?


Problem Description



By default, logistic regression training initializes the coefficients to be all-zero. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients.


Is this possible with any of the dataset/dataframe-based APIs, preferably Scala?


Looking at the Spark source code, it seems that there is a method setInitialModel to initialize the model and its coefficients, but it's unfortunately marked as private.


The RDD-based API seems to allow initializing coefficients: one of the overloads of LogisticRegressionWithSGD.run(...) accepts an initialWeights vector. However, I would like to use the dataset-based API instead of the RDD-based API because (1) the former supports elastic net regularization (I couldn't figure out how to do elastic net with the RDD-based logistic regression) and (2) the RDD-based API is in maintenance mode.
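For reference, resuming from saved weights with that RDD-based overload looks roughly like this. This is only a sketch: the helper name and the hyperparameter values are illustrative, and `LogisticRegressionWithSGD` is deprecated as of Spark 2.0 (which is exactly the maintenance-mode concern above).

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Restart training from the last known coefficients instead of all zeros.
def resumeTraining(training: RDD[LabeledPoint],
                   lastKnownWeights: Array[Double]): LogisticRegressionModel = {
  val lr = new LogisticRegressionWithSGD()  // deprecated since Spark 2.0, but available
  lr.optimizer
    .setNumIterations(100)  // illustrative values
    .setStepSize(1.0)
  // This overload of run(...) seeds the optimizer with the given vector
  // rather than the all-zero default.
  lr.run(training, Vectors.dense(lastKnownWeights))
}
```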


I could always try using reflection to call that private setInitialModel method, but I would like to avoid this if possible (and maybe that wouldn't even work... I also can't tell if setInitialModel is marked private for a good reason).
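For what it's worth, the reflection route would mechanically look like the sketch below. `Estimator` here is a hypothetical stand-in, not Spark's `LogisticRegression`; it only demonstrates how `getDeclaredMethod` plus `setAccessible(true)` reaches a private setter (and, as noted above, Spark may have good reasons for keeping the real `setInitialModel` private).

```scala
// Hypothetical stand-in with a private setter, mimicking the shape of a
// private setInitialModel. For illustrating the reflection mechanics only.
class Estimator {
  private var initial: Array[Double] = Array(0.0, 0.0, 0.0)
  private def setInitialModel(weights: Array[Double]): this.type = {
    initial = weights
    this
  }
  def initialWeights: Array[Double] = initial
}

object ReflectionDemo {
  def main(args: Array[String]): Unit = {
    val est = new Estimator
    // Look up the private method by name and parameter types,
    // then bypass the access check.
    val m = classOf[Estimator].getDeclaredMethod(
      "setInitialModel", classOf[Array[Double]])
    m.setAccessible(true)
    m.invoke(est, Array(0.5, -1.2, 3.0).asInstanceOf[AnyRef])
    println(est.initialWeights.mkString(","))  // prints 0.5,-1.2,3.0
  }
}
```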

Answer


Feel free to override the method. Yes, you will need to copy that class into your own source tree. That's fine: do not fear.


When you build your project, whether via Maven or sbt, your local copy of the class will "win" and shadow the one in MLlib. Fortunately, the other classes in that same package will not be shadowed.
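Concretely, the copy has to sit at the same fully-qualified package path so it takes precedence on the classpath. A sketch of what that means in a standard sbt/Maven layout (the file path and the visibility change are the only points; the exact original signature and modifier should be checked against your Spark version, and everything else stays a verbatim copy):

```scala
// File: src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
// A verbatim copy of Spark's LogisticRegression.scala, except that the
// private setInitialModel is given public visibility, e.g.:
//
//   private[spark] def setInitialModel(model: LogisticRegressionModel): this.type  // original
//   def setInitialModel(model: LogisticRegressionModel): this.type                 // your copy
package org.apache.spark.ml.classification

// ... rest of the copied file unchanged ...
```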


I have used this approach many times to override Spark classes; your build times should stay short as well.
