How to exclude certain lines when parsing tab-delimited data into an R data.table/data.frame?

This question is related to the following question:

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I have a malformed text file in which the tab-delimited format is the following:

A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...

However, there are several long lines in the text file which are technically tab-delimited, but are long strings. e.g. the rows 'Z' and 'Y' here

Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV ...

The Y row of this text file (Y 23434M,23434M,...) is possibly several GB long.

These lines are exceptionally rare, and they are only identified by a leading Z or Y. So far I have opened the file in a text editor and deleted these lines by hand.

However, this is not algorithmically reasonable. Is there a way to parse this file such that either (1) only rows A and B are used or (2) rows Z and Y are explicitly not used?

EDIT: To clarify, Z is not a long string. Only 'Y' is a long string here. The Z row is a string of the format X XX:X:0.0, where X is a character and 0 an integer.

Best answer

You can make a system call to an external tool, for example sed, to fix the file in place by pattern. To remove all the rows that begin with Z or Y, pass a regular expression followed by the d (delete) command:

system("sed -i '/^[ZY]/d' test.tab")

The command above removes all the rows that begin with Z or Y from your file. Note that -i overwrites test.tab, so keep a copy if you still need the original. Then you can run the same code I posted in your previous question:

library(data.table)
fread("sed '$!N;s/\\n/ /' test.tab")
#    V1   V2 V3   V4 V5   V6   V7         V8
# 1:  A 1092  - 1093  + 1X B 1093 HRDCPMRFYT
# 2:  A 1093  + 1094  - 1X B 1094 BSZSDFJRVF
# 3:  A 1094  + 1095  + 1X B 1095 SSTFCLEPVV
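If you would rather not modify the file on disk, the two sed steps can probably be chained in a pipe and read directly. The following is only a sketch in base R under a few assumptions: sed is available on the system, the temporary file and its shortened sample rows are made up for illustration, and every A row is immediately followed by its B row.

```r
# Build a small tab-separated sample file in a temporary location.
# The path and the shortened Y row are illustrative, not the real data.
path <- tempfile(fileext = ".tab")
writeLines(c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF"
), path)

# The first sed drops rows starting with Z or Y; the second sed joins each
# remaining row with the row after it (A followed by B), using a space.
# There is no -i flag, so the file itself is left untouched.
cmd <- paste0("sed '/^[ZY]/d' ", shQuote(path), " | sed '$!N;s/\\n/ /'")

# read.table() opens the pipe, reads the filtered stream, and closes it.
res <- read.table(pipe(cmd), sep = "\t", stringsAsFactors = FALSE)
print(res)
```

Because the filtering happens outside R, the very long Y row is never loaded into the R session. As in the fread() output above, the join uses a space rather than a tab, so the last field of the A row and the first field of the B row end up together in one column (V6).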

Data

text <- "Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV" # Saving it as tab separated file on disk write(gsub(" +", "\t", text), file = "test.tab")
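For option (1) in the question, keeping only rows A and B, a pure-R variant without sed might look like the sketch below. The helper name read_ab and the column names A1..A6 / B1..B3 are hypothetical, and it assumes A and B rows strictly alternate. It also reads the whole file with readLines(), so it is only suitable when the file, including any Y row, fits in memory.

```r
# Hypothetical helper: keep only the A and B rows and pair them side by side.
read_ab <- function(path) {
  lines <- readLines(path)

  # Whitelist by row label; Z, Y and anything else is ignored.
  a_rows <- lines[startsWith(lines, "A\t")]
  b_rows <- lines[startsWith(lines, "B\t")]

  # Assumption: every A row has exactly one matching B row, in order.
  stopifnot(length(a_rows) == length(b_rows))

  a <- read.table(text = a_rows, sep = "\t", stringsAsFactors = FALSE)
  b <- read.table(text = b_rows, sep = "\t", stringsAsFactors = FALSE)

  # Distinct column names so the combined data.frame has no duplicates.
  names(a) <- paste0("A", seq_along(a))
  names(b) <- paste0("B", seq_along(b))
  cbind(a, b)
}

# Usage on a small illustrative file (shortened Y row, temporary path).
demo_path <- tempfile(fileext = ".tab")
writeLines(c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF",
  "A\t1094\t+\t1095\t+\t1X",
  "B\t1095\tSSTFCLEPVV"
), demo_path)

paired <- read_ab(demo_path)
print(paired)
```

Unlike the sed join, this keeps the last A field and the B label in separate columns, at the cost of holding all lines in memory.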
