How to read a one-row .txt dataset in R?

My .txt dataset looks like the following:

perms ['AC', 'AT', 'AG', 'AN', 'CA', 'CT', 'CG', 'CN', 'TA', 'TC', 'TG', 'TN', 'GA', 'GC', 'GT', 'GN', 'NA', 'NC', 'NT', 'NG', 'AA', 'CC', 'TT', 'GG', 'NN'] link [11413851, 16930583, 16197703, 1085, 16533859, 16218116, 2309941, 572, 14414084, 13609414, 16552907, 1015, 13594224, 10038778, 11427660, 480, 1055, 445, 1061, 591, 15557040, 9822185, 15583349, 9815249, 11653456]

There are two variables in this dataset: 'perms' and 'link'. How can I read this dataset in R? I cannot do it by brute force, because my samples are too large (some of them have n > 100 000), but the structure is always exactly the same. Thank you in advance!

Best answer

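We read the dataset with readLines, then split the line on either a '[' preceded by zero or more whitespace characters or a ']' followed by zero or more whitespace characters. The resulting pieces alternate between variable names and value strings, so a logical index ('ind') separates the two. We then loop through the value strings, use scan to get the individual elements of each, and convert the result to a 'data.frame'.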

# Read the file (a single line of text)
lines <- readLines("file.txt")
# Split on "[" and "]" so that names and value strings alternate
lines1 <- strsplit(lines, "\\s*\\[|\\]\\s*")[[1]]
# Recycled logical index: TRUE picks the names, FALSE the value strings
ind <- c(TRUE, FALSE)
data.frame(setNames(lapply(lines1[!ind], function(x)
  trimws(scan(text = x, what = "", sep = ",", quiet = TRUE))), lines1[ind]))
#   perms     link
#1     AC 11413851
#2     AT 16930583
#3     AG 16197703
#4     AN     1085
#5     CA 16533859
#6     CT 16218116
#7     CG  2309941
#8     CN      572
#9     TA 14414084
#10    TC 13609414
#11    TG 16552907
#12    TN     1015
#13    GA 13594224
#14    GC 10038778
#15    GT 11427660
#16    GN      480
#17    NA     1055
#18    NC      445
#19    NT     1061
#20    NG      591
#21    AA 15557040
#22    CC  9822185
#23    TT 15583349
#24    GG  9815249
#25    NN 11653456
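
Because scan is called with what = "", the code above returns 'link' as text rather than as numbers. The following is a minimal sketch of a variant, not part of the original answer, that may be useful when numeric values are needed. It assumes the same "name [values] name [values]" layout on a single line, uses only base R, and treats 'NA' in 'perms' as an ordinary string. The helper name parse_bracket_line is hypothetical and introduced only for illustration.

```r
# Hypothetical helper (an assumption, not from the original answer):
# parse one line of the form "name1 [v1, v2, ...] name2 [w1, w2, ...]".
parse_bracket_line <- function(line) {
  # Split on "[" and "]" (with optional surrounding whitespace);
  # names and value strings alternate in the result.
  parts <- strsplit(line, "\\s*\\[\\s*|\\s*\\]\\s*")[[1]]
  is_name <- rep(c(TRUE, FALSE), length.out = length(parts))

  values <- lapply(parts[!is_name], function(x) {
    # Split on commas, trim whitespace, drop surrounding single quotes.
    items <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
    items <- gsub("^'|'$", "", items)
    # Convert to numeric only if every item looks like a plain number,
    # so a string such as "NA" is left untouched.
    if (all(grepl("^-?[0-9]+(\\.[0-9]+)?$", items))) as.numeric(items) else items
  })

  data.frame(setNames(values, parts[is_name]), stringsAsFactors = FALSE)
}

# Usage on a shortened inline example; for a real file, something like
# parse_bracket_line(readLines("file.txt")[1]) would apply.
example <- "perms ['AC', 'NA', 'GG'] link [11413851, 1055, 9815249]"
df <- parse_bracket_line(example)
str(df)
```

The numeric check is deliberately strict (digits with an optional sign and decimal part), so any column containing other text stays character. Files with several variables per line should work the same way as long as every bracketed list has the same length.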
