Performing PCA on a large dataset

Problem Description

I've got a document classification problem with only 2 classes, and after CountVectorizer my training dataset matrix size becomes (40,845 X 218,904) (unigrams). If trigrams are considered, it can reach up to (40,845 X 3,931,789). Is there a way to perform PCA on such a dataset without running into memory or sparse-dataset errors? I'm using Python scikit-learn on a 6GB machine.
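
For context, a minimal sketch of the kind of pipeline being described; the document list and the ngram_range values are placeholders, not taken from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; the real one is ~40,845 documents.
docs = ["first example document", "second example document"]

# Unigram counts. Switching to ngram_range=(1, 3) adds bigrams and trigrams
# and inflates the vocabulary to millions of columns, as described above.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(docs)  # scipy.sparse CSR matrix, shape (n_docs, n_terms)
print(X.shape, X.nnz)
```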

Solution

If you've got 6GB of RAM you've got a 64-bit machine, so the easiest solution is probably just to add more RAM.

Otherwise, this is a crosspost of: scicomp.stackexchange/questions/1681/what-is-the-fastest-way-to-calculate-the-largest-eigenvalue-of-a-general-matrix/7487#7487

There has been some good research on this recently. The new approaches use "randomized algorithms" which only require a few reads of your matrix to get good accuracy on the largest eigenvalues. This is in contrast to power iterations which require several matrix-vector multiplications to reach high accuracy.
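
Since the question already uses scikit-learn, it's worth noting that its TruncatedSVD (and the underlying sklearn.utils.extmath.randomized_svd helper) implements this family of randomized methods and accepts the sparse CountVectorizer output directly, so the matrix never has to be densified. A minimal sketch, assuming X is the sparse term matrix and that 100 components are wanted (both assumptions):

```python
from sklearn.decomposition import TruncatedSVD

# X: scipy.sparse matrix, e.g. shape (40845, 218904) from CountVectorizer (assumed)
svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_docs, 100)
print(svd.explained_variance_ratio_.sum())
```

Note that, unlike full PCA, TruncatedSVD does not center the data first (centering would destroy the sparsity), which is usually acceptable for term-count matrices.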

You can read more about the new research here:

math.berkeley.edu/~strain/273.F10/martinsson.tygert.rokhlin.randomized.decomposition.pdf

arxiv/abs/0909.4061

This code will do it for you:

cims.nyu.edu/~tygert/software.html

bitbucket/rcompton/pca_hgdp/raw/be45a1d9a7077b60219f7017af0130c7f43d7b52/pca.m

code.google/p/redsvd/

cwiki.apache/MAHOUT/stochastic-singular-value-decomposition.html

If your language of choice isn't in there, you can roll your own randomized SVD pretty easily; it only requires a matrix-vector multiplication followed by a call to an off-the-shelf SVD.
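
A minimal sketch of that roll-your-own approach in Python; the function name, oversampling amount, and iteration count are my own choices rather than anything from the answer, and it follows the random-projection-then-small-SVD recipe described above:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD using a randomized range finder.

    A may be a dense array or a scipy.sparse matrix; only products of the
    form A @ dense and A.T @ dense are used, so A is never densified.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]

    # 1. Sample the range of A with a random Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega

    # 2. Optional power iterations sharpen the estimate when the singular
    #    values decay slowly (re-orthonormalizing here improves stability).
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)

    # 3. Orthonormal basis for the sampled range.
    Q, _ = np.linalg.qr(Y)

    # 4. Project A onto the basis and run an off-the-shelf SVD on the small
    #    (k + oversample) x n matrix.
    B = (A.T @ Q).T  # equals Q.T @ A, but keeps the sparse matrix on the left of @
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical usage on the sparse CountVectorizer output X:
# U, s, Vt = randomized_svd(X, k=100)
# X_reduced = U * s   # (n_docs, 100) component scores, PCA-like up to centering
```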
