使用训练有素的Spark ML模型提供实时预测

编程入门行业动态更新时间:2024-10-24 06:29:07

本文介绍了使用训练有素的Spark ML模型提供实时预测的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我们目前正在测试基于Spark在Python中LDA实现的预测引擎: spark.apache/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda spark .apache/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA (我们使用的是pyspark.ml软件包，而不是pyspark.mllib)

We are currently testing a prediction engine based on Spark's implementation of LDA in Python: spark.apache/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda spark.apache/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA (we are using the pyspark.ml package, not pyspark.mllib)

我们能够成功地在Spark集群上训练模型(使用Google Cloud Dataproc).现在，我们正在尝试使用该模型作为API(例如flask应用程序)提供实时预测.

We were able to succesfuly train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. flask application).

实现这一目标的最佳方法是什么?

What would be the best approach to achieve so?

我们的主要痛点是，似乎我们需要带回整个Spark环境，以加载经过训练的模型并运行转换. 到目前为止，我们已经尝试为每个收到的请求在本地模式下运行Spark，但是这种方法为我们提供了

Our main pain point is that it seems we need to bring back the whole Spark environnement in order to load the trained model and run the transform. So far we've tried running Spark in local mode for each received request but this approach gave us:

表现不佳(启动SparkSession，加载模型，运行转换...的时间)

可伸缩性差(无法处理并发请求)

整个方法似乎很繁琐，是否会有更简单的替代方法，甚至根本不需要暗示Spark?

The whole approach seems quite heavy, would there be a simpler alternative, or even one that would not need to imply Spark at all?

下面是训练和预测步骤的简化代码.

Bellow are simplified code of the training and prediction steps.

def train(input_dataset): conf = pyspark.SparkConf().setAppName("lda-train") spark = SparkSession.builder.config(conf=conf).getOrCreate() # Generate count vectors count_vectorizer = CountVectorizer(...) vectorizer_model = count_vectorizer.fit(input_dataset) vectorized_dataset = vectorizer_model.transform(input_dataset) # Instantiate LDA model lda = LDA(k=100, maxIter=100, optimizer="em", ...) # Train LDA model lda_model = lda.fit(vectorized_dataset) # Save models to external storage vectorizer_model.write().overwrite().save("gs://...") lda_model.write().overwrite().save("gs://...")

预测代码

def predict(input_query): conf = pyspark.SparkConf().setAppName("lda-predict").setMaster("local") spark = SparkSession.builder.config(conf=conf).getOrCreate() # Load models from external storage vectorizer_model = CountVectorizerModel.load("gs://...") lda_model = DistributedLDAModel.load("gs://...") # Run prediction on the input data using the loaded models vectorized_query = vectorizer_model.transform(input_query) transformed_query = lda_model.transform(vectorized_query) ... spark.stop() return transformed_query

推荐答案

如果您已经在Spark中拥有训练有素的机器学习模型，则可以使用rest api 使用Hydroshpere Mist为模型(测试或预测)提供服务而无需创建Spark Context.这将使您不必重新创建Spark环境，而仅依靠web services进行预测

If you already have a trained Machine Learning model in spark, You can use Hydroshpere Mist to serve the models(testing or prediction) using rest api without creating a Spark Context. This will save you from recreating the spark environment and rely only on web services for prediction

引用: