Python PCA on a matrix too large to fit into memory

Updated: 2024-10-22 18:45:06

Problem description

I have a CSV with 100,000 rows x 27,000 columns, and I am trying to run PCA on it to produce a 100,000 rows x 300 columns matrix. The CSV is 9 GB. Here is what I'm currently doing:

    from sklearn.decomposition import PCA as RandomizedPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    X = pd.DataFrame.from_csv(dataset)
    Y = X.pop("Y_Level")
    X = (X - X.mean()) / (X.max() - X.min())
    Y = list(Y)
    dimensions = 300
    sklearn_pca = RandomizedPCA(n_components=dimensions)
    X_final = sklearn_pca.fit_transform(X)

When I run the above code, my program is killed during the .from_csv step. I've been able to get around that by splitting the CSV into sets of 10,000 rows, reading them in one by one, and then calling pd.concat. This lets me reach the normalization step (X - X.mean()) before the program is killed. Is my data just too big for my MacBook Air, or is there a better way to do this? I would really love to use all the data I have for my machine learning application.
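A rough back-of-the-envelope estimate suggests why the load fails. The sketch below assumes every value is parsed as a 64-bit float (the usual pandas choice for numeric columns with decimals) and ignores temporary copies, so the figures are estimates rather than measurements.

```python
# Estimate the in-memory size of the dense matrix, assuming float64 storage.
n_rows = 100_000
n_cols = 27_000
bytes_per_value = 8  # float64

full_gb = n_rows * n_cols * bytes_per_value / 1e9
chunk_gb = 10_000 * n_cols * bytes_per_value / 1e9  # one 10,000-row chunk
reduced_gb = n_rows * 300 * bytes_per_value / 1e9   # the 100,000 x 300 result

print(f"full matrix: {full_gb:.2f} GB")
print(f"one chunk:   {chunk_gb:.2f} GB")
print(f"PCA output:  {reduced_gb:.2f} GB")
```

Under these assumptions the full matrix needs about 21.6 GB, and an expression such as (X - X.mean()) allocates another temporary of the same size. A single 10,000-row chunk (about 2.16 GB) and the final 300-column result (about 0.24 GB) are far smaller, which is why processing the file chunk by chunk is attractive.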

If I wanted to use incremental PCA as suggested by the answer below, is this how I would do it?

    from sklearn.decomposition import IncrementalPCA
    import csv
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    chunksize_ = 10000  # total_size is 100000
    dimensions = 300

    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)
    Y = []
    for chunk in reader:
        y = chunk.pop("virginica")
        Y = Y + list(y)
        sklearn_pca.partial_fit(chunk)
    X = ???  # This is where I'm stuck: how do I take my final PCA and output it to X?
    # The normal transform method takes in an X, which I don't have because I
    # couldn't fit it into memory.

I can't find any good examples online.

Recommended answer

Try to divide your data, or load it into the script in batches, and fit your PCA with IncrementalPCA by calling its partial_fit method on every batch.

    from sklearn.decomposition import IncrementalPCA
    import sys
    import numpy as np
    import pandas as pd

    dataset = sys.argv[1]
    # Rows per chunk. Each chunk must hold at least `dimensions` rows and still
    # fit in memory; lower this value if it does not.
    chunksize_ = 5 * 25000
    dimensions = 300

    # First pass: fit the PCA one chunk at a time.
    reader = pd.read_csv(dataset, sep=',', chunksize=chunksize_)
    sklearn_pca = IncrementalPCA(n_components=dimensions)
    for chunk in reader:
        y = chunk.pop("Y")
        sklearn_pca.partial_fit(chunk)

    # Computed mean per feature
    mean = sklearn_pca.mean_
    # and stddev
    stddev = np.sqrt(sklearn_pca.var_)

    # Second pass: re-read the file and project each chunk.
    Xtransformed = None
    for chunk in pd.read_csv(dataset, sep=',', chunksize=chunksize_):
        y = chunk.pop("Y")
        Xchunk = sklearn_pca.transform(chunk)
        if Xtransformed is None:  # "== None" compares element-wise on NumPy arrays
            Xtransformed = Xchunk
        else:
            Xtransformed = np.vstack((Xtransformed, Xchunk))
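The two-pass pattern can be tried end to end on a small synthetic file before running it on the real data. The following is a minimal sketch rather than the answerer's code: the file name demo.csv, the toy sizes, the feature column names f0 to f19 and the label column Y are all made up for illustration. The transformed chunks are collected in a list and stacked once at the end.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.decomposition import IncrementalPCA

# Hypothetical toy data: 1,000 rows x 20 features plus a label column "Y".
rng = np.random.default_rng(0)
n_rows, n_features, n_components, chunk_rows = 1000, 20, 5, 100
frame = pd.DataFrame(rng.normal(size=(n_rows, n_features)),
                     columns=[f"f{i}" for i in range(n_features)])
frame["Y"] = rng.integers(0, 3, size=n_rows)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
frame.to_csv(path, index=False)

ipca = IncrementalPCA(n_components=n_components)

# Pass 1: fit incrementally; each chunk must have at least n_components rows.
for chunk in pd.read_csv(path, chunksize=chunk_rows):
    chunk.pop("Y")
    ipca.partial_fit(chunk.to_numpy())

# Pass 2: transform chunk by chunk, keeping only the reduced rows and labels.
parts, labels = [], []
for chunk in pd.read_csv(path, chunksize=chunk_rows):
    labels.extend(chunk.pop("Y"))
    parts.append(ipca.transform(chunk.to_numpy()))

X_reduced = np.vstack(parts)
print(X_reduced.shape)
```

Only one chunk of the wide matrix is held in memory at any time; what accumulates is the reduced output, which has n_components columns instead of the original feature count. The file is read twice because the projection can only be applied once the fit over all chunks has finished.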

Published: 2023-11-29 16:38:12

Article link: https://www.elefans.com/category/jswz/34/1646964.html