I've got a document classification problem with only 2 classes. After CountVectorizer, my training dataset matrix has size (40,845 X 218,904) with unigrams. If trigrams are considered, it can reach up to (40,845 X 3,931,789). Is there a way to perform PCA on such a dataset without getting memory or sparse dataset errors? I'm using Python sklearn on a 6GB machine.
Solution: If you've got 6GB of RAM you've got a 64-bit machine, so the easiest solution is probably to just add more RAM.
Otherwise, here is a crosspost of this answer: scicomp.stackexchange/questions/1681/what-is-the-fastest-way-to-calculate-the-largest-eigenvalue-of-a-general-matrix/7487#7487
There has been some good research on this recently. The new approaches use "randomized algorithms" which only require a few reads of your matrix to get good accuracy on the largest eigenvalues. This is in contrast to power iterations which require several matrix-vector multiplications to reach high accuracy.
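Since the question mentions sklearn, here is a minimal sketch of how a randomized solver can be applied to a sparse matrix there. The answer itself does not name this estimator; the sketch assumes scikit-learn's TruncatedSVD with its randomized solver, and the matrix below is small synthetic data standing in for the CountVectorizer output. Note that TruncatedSVD does not center the data, so strictly speaking this is a truncated SVD (LSA) rather than PCA; centering would make the sparse matrix dense.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for the CountVectorizer output
# (assumption: the real matrix is a scipy.sparse matrix of term counts).
X = sp.random(2000, 5000, density=0.001, format="csr", random_state=0)

# Randomized truncated SVD; the input stays sparse throughout.
# n_components=50 is an arbitrary choice for illustration.
svd = TruncatedSVD(n_components=50, algorithm="randomized",
                   n_iter=5, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

The reduced matrix is dense but only has n_components columns, so it should be far smaller in memory than a dense copy of the original.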
You can read more about the new research here:
math.berkeley.edu/~strain/273.F10/martinsson.tygert.rokhlin.randomized.decomposition.pdf
arxiv/abs/0909.4061
This code will do it for you:
cims.nyu.edu/~tygert/software.html
bitbucket/rcompton/pca_hgdp/raw/be45a1d9a7077b60219f7017af0130c7f43d7b52/pca.m
code.google/p/redsvd/
cwiki.apache/MAHOUT/stochastic-singular-value-decomposition.html
If your language of choice isn't in there, you can roll your own randomized SVD pretty easily; it only requires a matrix-vector multiplication followed by a call to an off-the-shelf SVD.
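As a rough illustration of rolling your own, the following is a sketch in the spirit of the randomized range-finder approach described in the papers linked above, not a copy of any of the linked implementations. The function name `randomized_svd` and the parameters `n_oversamples` and `n_power_iter` are names chosen here for illustration. The input only needs to support `A @ X` and `A.T @ X`, so a scipy.sparse matrix should work as well as the dense NumPy array used in the demo.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_power_iter=2, seed=0):
    """Approximate the top-k singular triplets of A (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    ell = min(k + n_oversamples, n_rows, n_cols)

    # Multiply A by a random Gaussian test matrix to sample its range.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n_cols, ell)))

    # Optional power iterations sharpen the estimate when the singular
    # values decay slowly; re-orthonormalizing keeps them stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Project A onto the small subspace and call an off-the-shelf SVD.
    B = (A.T @ Q).T                      # shape (ell, n_cols)
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]

# Demo on a small matrix of exact rank 8 (synthetic data).
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 200))
U, s, Vt = randomized_svd(A, k=8)
exact = np.linalg.svd(A, compute_uv=False)[:8]
print(np.allclose(s, exact))
```

Only the products with A touch the large matrix; the QR and SVD calls operate on matrices with about k columns or rows, which is what keeps memory use low.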