Serving real-time predictions with a trained Spark ML model

Last updated: 2024-10-24 06:29:07
Question

We are currently testing a prediction engine based on Spark's implementation of LDA in Python (we are using the pyspark.ml package, not pyspark.mllib):

  • spark.apache/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda
  • spark.apache/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA

We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. a Flask application).

What would be the best approach to achieve this?

Our main pain point is that we seem to need the whole Spark environment just to load the trained model and run the transform. So far we have tried running Spark in local mode for each received request, but this approach gave us:

  • poor performance (time to start the SparkSession, load the models, run the transform, ...)
  • poor scalability (inability to process concurrent requests)

The whole approach seems quite heavy. Would there be a simpler alternative, or even one that does not need to involve Spark at all?
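As a partial illustration of what "not involving Spark" could look like, the count-vectorization step can be reproduced in plain Python once the vocabulary has been exported from the fitted CountVectorizerModel (its `vocabulary` attribute). The sketch below is only that: it assumes the default CountVectorizer settings (`minTF=1.0`, `binary=False`), assumes the query is already tokenized the same way as the training data, and uses a made-up three-word vocabulary. The function names are hypothetical. It does not cover the LDA topic inference step, which would still need an LDA implementation outside Spark.

```python
from collections import Counter
from typing import Dict, Sequence


def build_index(vocabulary: Sequence[str]) -> Dict[str, int]:
    """Map each term to its column index, following vocabulary order."""
    return {term: i for i, term in enumerate(vocabulary)}


def count_vector(tokens: Sequence[str], index: Dict[str, int]) -> Dict[int, float]:
    """Return sparse term counts as {column index: count}.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(index[t] for t in tokens if t in index)
    return {i: float(c) for i, c in sorted(counts.items())}


# Hypothetical vocabulary, standing in for an exported vectorizer_model.vocabulary
vocabulary = ["spark", "model", "topic"]
index = build_index(vocabulary)

print(count_vector(["spark", "lda", "spark", "topic"], index))
# prints {0: 2.0, 2: 1.0}
```

The dictionary form mirrors a sparse vector: only the columns with non-zero counts are stored, and the vector size is the length of the vocabulary.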

Below is simplified code for the training and prediction steps.

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.clustering import LDA

    def train(input_dataset):
        conf = pyspark.SparkConf().setAppName("lda-train")
        spark = SparkSession.builder.config(conf=conf).getOrCreate()

        # Generate count vectors
        count_vectorizer = CountVectorizer(...)
        vectorizer_model = count_vectorizer.fit(input_dataset)
        vectorized_dataset = vectorizer_model.transform(input_dataset)

        # Instantiate LDA model
        lda = LDA(k=100, maxIter=100, optimizer="em", ...)

        # Train LDA model
        lda_model = lda.fit(vectorized_dataset)

        # Save models to external storage
        vectorizer_model.write().overwrite().save("gs://...")
        lda_model.write().overwrite().save("gs://...")

Prediction code:

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizerModel
    from pyspark.ml.clustering import DistributedLDAModel

    def predict(input_query):
        # A new local-mode SparkSession is created on every call
        conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local")
        spark = SparkSession.builder.config(conf=conf).getOrCreate()

        # Load models from external storage
        vectorizer_model = CountVectorizerModel.load("gs://...")
        lda_model = DistributedLDAModel.load("gs://...")

        # Run prediction on the input data using the loaded models
        vectorized_query = vectorizer_model.transform(input_query)
        transformed_query = lda_model.transform(vectorized_query)

        ...

        spark.stop()

        return transformed_query
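Much of the per-request cost in the prediction code comes from starting the SparkSession and loading both models on every call. One possible mitigation, sketched here under assumptions and not taken from the original post, is to load them once per server process and reuse them across requests. `ModelCache` and `load_spark_models` are hypothetical names, and the storage paths are left as parameters because the original elides them. The pyspark imports sit inside the loader so that the caching logic itself can be exercised without Spark installed.

```python
import threading
from typing import Any, Callable


class ModelCache:
    """Run an expensive loader at most once and reuse its result."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._value = None

    def get(self) -> Any:
        # Double-checked locking: concurrent first requests trigger one load only
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


def load_spark_models(vectorizer_path: str, lda_path: str):
    """Hypothetical loader: start one local SparkSession and load both models."""
    # Imported lazily so this module can be imported without pyspark
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizerModel
    from pyspark.ml.clustering import DistributedLDAModel

    spark = SparkSession.builder.appName("lda-predict").master("local").getOrCreate()
    vectorizer_model = CountVectorizerModel.load(vectorizer_path)
    lda_model = DistributedLDAModel.load(lda_path)
    return spark, vectorizer_model, lda_model


# Demonstration with a stand-in loader, so the example runs without Spark
load_calls = []


def fake_loader():
    load_calls.append(1)
    return "models"


cache = ModelCache(fake_loader)
first = cache.get()
second = cache.get()
print(first, second, len(load_calls))
# prints: models models 1
```

In a real service the cache would be built with something like `ModelCache(lambda: load_spark_models(vectorizer_path, lda_path))`, and each request handler would call `cache.get()` without ever calling `spark.stop()`. This removes the repeated startup and load time but still keeps a Spark runtime inside the serving process.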

Recommended answer

If you already have a trained machine learning model in Spark, you can use Hydrosphere Mist to serve it (for testing or prediction) through a REST API without creating a Spark context. This saves you from recreating the Spark environment, so that prediction relies only on web services.

References:

    • github/Hydrospheredata/mist
    • github/Hydrospheredata/spark-ml-serving
    • github/Hydrospheredata/hydro-serving

Published: 2023-11-10 09:47:51
Article link: https://www.elefans.com/category/jswz/34/1575046.html