Read data to memory from S3 before creating Spark RDDs



I have a Python application that uses two related sets of data. One set is a bunch of matrices stored in a bunch of files, with a single matrix per file. Each matrix file also has an associated file containing the labels of the rows of that matrix. Both datasets are stored in S3. I would like to concatenate the matrices chunked across the different files and then use Spark.mllib to perform KMeans clustering over the rows of the combined matrix.

As a small example, this is the concatenated matrix: matrix = [[2, 7, 6], [3, 6, 1], [8, 0, 1], [6, 2, 3], [1, 9, 0]]

We want to cluster the rows of this matrix into two groups, so the desired Spark output is: out = [1, 0, 0, 1, 0]
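
For reference, a minimal sketch of this toy example with pyspark.mllib might look like the following (the 0/1 cluster ids are arbitrary, so the output labels could come out flipped):

    # Minimal sketch: KMeans over the toy matrix with pyspark.mllib.
    # Assumes a local SparkContext; cluster numbering is arbitrary and
    # may be swapped between runs.
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local", "toy-kmeans")

    matrix = [[2, 7, 6], [3, 6, 1], [8, 0, 1], [6, 2, 3], [1, 9, 0]]
    rows = sc.parallelize([array(r, dtype=float) for r in matrix])

    model = KMeans.train(rows, k=2, maxIterations=10)
    out = [model.predict(array(r, dtype=float)) for r in matrix]
    print(out)  # e.g. [1, 0, 0, 1, 0] (or with the two labels swapped)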

But here is the problem: I have a huge matrix of image data, and every row in the matrix refers to an image. I would like to keep track of which image each row refers to. So what I need to do (at least I think I need to do) is read each matrix file and its associated tag file from S3 into memory sequentially, so that I do not lose track of the tags for the rows of the matrix, and then create RDDs from the matrix files in memory. I'd appreciate any suggestions on how to do this.

By the way, I am using PySpark, boto, and boto3.
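
For the "read into memory first" idea itself, a hypothetical boto3 sketch might look like this; the bucket name, key names, and comma-separated file format are assumptions, not details from the question:

    # Hypothetical sketch: pull one matrix file and its label file from S3
    # into memory with boto3. Bucket, keys, and file format are assumed.
    import boto3
    import numpy as np

    s3 = boto3.client("s3")
    bucket = "my-image-data"  # hypothetical bucket name

    def load_pair(matrix_key, labels_key):
        matrix_text = s3.get_object(Bucket=bucket, Key=matrix_key)["Body"].read().decode("utf-8")
        labels_text = s3.get_object(Bucket=bucket, Key=labels_key)["Body"].read().decode("utf-8")
        matrix = np.array([[float(x) for x in line.split(",")]
                           for line in matrix_text.splitlines() if line.strip()])
        labels = [line.strip() for line in labels_text.splitlines() if line.strip()]
        return labels, matrix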

Best answer


After looking around for a way to do this and finding no direct solution, I tried the approach described below:

1: Put the row labels and the matrices together in a Pandas DataFrame, associating each row of the matrix with its corresponding label (a minimal sketch of steps 1–3 follows this list).

2: Used the Pandas to_csv method to dump the DataFrame to text files with the pipe ("|") character as the separator.

3: Pushed the text files to S3.

4: Then used rdd = sc.textFile("s3n://user:pass@bucketname/*.csv") to create an RDD from all of the CSV files in the S3 bucket.
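
A minimal sketch of steps 1–3, assuming each matrix chunk is a NumPy array and labels is a matching list of image names (the file names, bucket, and key below are illustrative):

    # Sketch of steps 1-3: label + matrix into a DataFrame, pipe-separated
    # dump, then upload to S3. Names below are illustrative assumptions.
    import boto3
    import numpy as np
    import pandas as pd

    labels = ["img_001", "img_002", "img_003"]            # hypothetical image labels
    matrix = np.array([[2, 7, 6], [3, 6, 1], [8, 0, 1]])  # one matrix chunk

    # 1: one row per image, with the label in the first column.
    df = pd.DataFrame(matrix)
    df.insert(0, "label", labels)

    # 2: dump with "|" as the separator; no header or index so Spark sees raw rows.
    df.to_csv("chunk_0.csv", sep="|", header=False, index=False)

    # 3: push the text file to S3.
    boto3.client("s3").upload_file("chunk_0.csv", "my-bucket", "chunks/chunk_0.csv")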

Used data = rdd.map(lambda line: array([float(x) for x in line.split('|')[1:]])) (where array is NumPy's array) to create an RDD of the actual data.

Used labels = rdd.map(lambda line: line.split('|')[0]) to extract the labels.

Then performed KMeans on the actual data.
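
Put together, the Spark side of steps 4 onward might look roughly like this; the S3 URI, credentials, and k=2 are placeholders rather than values from the answer:

    # Sketch of steps 4+: read the pipe-separated files back from S3,
    # split labels from features, and run KMeans. URI and k are assumptions.
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="s3-kmeans")

    # 4: one RDD over all pipe-separated files in the bucket.
    rdd = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@bucketname/chunks/*.csv")

    # Features (everything after the first field) and labels (the first field).
    data = rdd.map(lambda line: array([float(x) for x in line.split("|")[1:]]))
    labels = rdd.map(lambda line: line.split("|")[0])

    model = KMeans.train(data, k=2, maxIterations=20)
    clusters = model.predict(data)  # RDD of cluster ids, in the same order as data

    # Pair each image label with its cluster id; zip keeps the pairing because
    # both RDDs are simple maps over the same source RDD.
    print(labels.zip(clusters).take(5))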

Hope this helps.
