Read data to memory from S3 before creating Spark RDDs



I have a Python application that uses two related sets of data. One set is a bunch of matrices stored in a bunch of files, with a single matrix per file. Each matrix file also has an associated file containing the labels of the rows of that matrix. Both datasets are stored in S3. I would like to concatenate the matrices chunked across the different files and then use Spark.mllib to perform KMeans clustering over the rows of the combined matrix.

As a small example, this is the concatenated matrix: matrix = [[2, 7, 6], [3, 6, 1], [8, 0, 1], [6, 2, 3], [1, 9, 0]]

We want to cluster the rows of this matrix into two groups, so the desired Spark output is: out = [1, 0, 0, 1, 0]
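
For reference, a minimal sketch of this toy example with pyspark.mllib might look like the following (the 0/1 cluster ids are arbitrary, so the output labels could come out flipped):

    # Minimal sketch: KMeans over the toy matrix with pyspark.mllib.
    # Assumes a local SparkContext; cluster numbering is arbitrary and
    # may be swapped between runs.
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local", "toy-kmeans")

    matrix = [[2, 7, 6], [3, 6, 1], [8, 0, 1], [6, 2, 3], [1, 9, 0]]
    rows = sc.parallelize([array(r, dtype=float) for r in matrix])

    model = KMeans.train(rows, k=2, maxIterations=10)
    out = [model.predict(array(r, dtype=float)) for r in matrix]
    print(out)  # e.g. [1, 0, 0, 1, 0] (or with the two labels swapped)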

But here is the problem: I have a huge matrix of image data, and every row in the matrix refers to an image. I would like to keep track of which image each row refers to. So what I need to do (at least I think I need to do) is read each matrix file and its associated tag file from S3 into memory sequentially, so that I do not lose track of the tags for the rows of the matrix, and then create RDDs from the matrix files in memory. I'd appreciate any suggestions on how to do this.

By the way, I am using PySpark, boto, and boto3.
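
For the "read into memory first" idea itself, a hypothetical boto3 sketch might look like this; the bucket name, key names, and comma-separated file format are assumptions, not details from the question:

    # Hypothetical sketch: pull one matrix file and its label file from S3
    # into memory with boto3. Bucket, keys, and file format are assumed.
    import boto3
    import numpy as np

    s3 = boto3.client("s3")
    bucket = "my-image-data"  # hypothetical bucket name

    def load_pair(matrix_key, labels_key):
        matrix_text = s3.get_object(Bucket=bucket, Key=matrix_key)["Body"].read().decode("utf-8")
        labels_text = s3.get_object(Bucket=bucket, Key=labels_key)["Body"].read().decode("utf-8")
        matrix = np.array([[float(x) for x in line.split(",")]
                           for line in matrix_text.splitlines() if line.strip()])
        labels = [line.strip() for line in labels_text.splitlines() if line.strip()]
        return labels, matrix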

Best answer


After looking around for a way to do this and finding no direct solution, I tried the approach described below:

1: Put the row labels and the matrices together in a Pandas DataFrame, associating each row of the matrix with its corresponding label (a minimal sketch of steps 1–3 follows this list).

2: Used the Pandas to_csv method to dump the DataFrame to text files with the pipe ("|") character as the separator.

3: Pushed the text files to S3.

4: Then used rdd = sc.textFile("s3n://user:pass@bucketname/*.csv") to create an RDD from all of the CSV files in the S3 bucket.
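
A minimal sketch of steps 1–3, assuming each matrix chunk is a NumPy array and labels is a matching list of image names (the file names, bucket, and key below are illustrative):

    # Sketch of steps 1-3: label + matrix into a DataFrame, pipe-separated
    # dump, then upload to S3. Names below are illustrative assumptions.
    import boto3
    import numpy as np
    import pandas as pd

    labels = ["img_001", "img_002", "img_003"]            # hypothetical image labels
    matrix = np.array([[2, 7, 6], [3, 6, 1], [8, 0, 1]])  # one matrix chunk

    # 1: one row per image, with the label in the first column.
    df = pd.DataFrame(matrix)
    df.insert(0, "label", labels)

    # 2: dump with "|" as the separator; no header or index so Spark sees raw rows.
    df.to_csv("chunk_0.csv", sep="|", header=False, index=False)

    # 3: push the text file to S3.
    boto3.client("s3").upload_file("chunk_0.csv", "my-bucket", "chunks/chunk_0.csv")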

Used data = rdd.map(lambda line: array([float(x) for x in line.split('|')[1:]])) (where array is NumPy's array) to create an RDD of the actual data.

Used labels = rdd.map(lambda line: line.split('|')[0]) to extract the labels.

Then performed KMeans on the actual data.
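
Put together, the Spark side of steps 4 onward might look roughly like this; the S3 URI, credentials, and k=2 are placeholders rather than values from the answer:

    # Sketch of steps 4+: read the pipe-separated files back from S3,
    # split labels from features, and run KMeans. URI and k are assumptions.
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="s3-kmeans")

    # 4: one RDD over all pipe-separated files in the bucket.
    rdd = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@bucketname/chunks/*.csv")

    # Features (everything after the first field) and labels (the first field).
    data = rdd.map(lambda line: array([float(x) for x in line.split("|")[1:]]))
    labels = rdd.map(lambda line: line.split("|")[0])

    model = KMeans.train(data, k=2, maxIterations=20)
    clusters = model.predict(data)  # RDD of cluster ids, in the same order as data

    # Pair each image label with its cluster id; zip keeps the pairing because
    # both RDDs are simple maps over the same source RDD.
    print(labels.zip(clusters).take(5))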

Hope this helps.
