This question is related to the following question:
How to parse tab-delimited data (of different formats) into a data.table/data.frame?
I have a malformed text file whose tab-delimited format is the following:

A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...

However, there are several long lines in the text file which are technically tab-delimited but consist of long strings, e.g. the rows 'Z' and 'Y' here:
Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV ...There is a section of this text file whereby Y 23434M,23434M,... is possibly several GB long.
These lines are exceptionally rare and are identified only by a leading Z or Y. So far I have opened the file in a text editor and deleted these lines.
However, this is not a reasonable approach algorithmically. Is there a way to parse this file such that either (1) only rows A and B are used or (2) rows Z and Y are explicitly not used?
EDIT: To clarify, Z is not a long string. Only 'Y' is a long string here. The Z row is a string of the format X XX:X:0.0, whereby X is a character and 0 an integer.
Best answer
You can make a system call to fix the file in place by matching a pattern with, say, sed. If you want to remove all the rows that begin with Z or Y, you can simply pass a regular expression followed by /d:

system("sed -i '/^[ZY]/d' test.tab")

The command above removes all the rows that begin with Z or Y from your file. Then you can run the same code I posted in your previous question:

library(data.table)
fread("sed '$!N;s/\\n/ /' test.tab")
# V1 V2 V3 V4 V5 V6 V7 V8
# 1: A 1092 - 1093 + 1X B 1093 HRDCPMRFYT
# 2: A 1093 + 1094 - 1X B 1094 BSZSDFJRVF
# 3: A 1094 + 1095 + 1X B 1095 SSTFCLEPVV

Data
text <- "Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV" # Saving it as tab separated file on disk write(gsub(" +", "\t", text), file = "test.tab")更多推荐
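
Since `sed -i` rewrites the file in place, a non-destructive variant may be preferable when the original must be kept. The sketch below is not part of the answer above: it illustrates option (1) from the question by keeping only the rows that start with A or B and writing them to a new file. The file names `clean.tab` and `joined.tab` are placeholders, the sample file is a shortened stand-in for the real data, and `paste - -` is swapped in for the answer's `sed '$!N;s/\n/ /'` to pair each A row with the B row after it (joined by a tab rather than a space).

```shell
#!/bin/sh
# Build a small tab-separated sample file that mirrors the question's layout.
# The Y row is shortened here; in the real file it may be very long.
printf 'Z\tFX:E:4.2\n' > test.tab
printf 'Y\t23434M,23434M,23434M\n' >> test.tab
printf 'A\t1092\t-\t1093\t+\t1X\n' >> test.tab
printf 'B\t1093\tHRDCPMRFYT\n' >> test.tab
printf 'A\t1093\t+\t1094\t-\t1X\n' >> test.tab
printf 'B\t1094\tBSZSDFJRVF\n' >> test.tab

# Keep only rows starting with A or B. -n suppresses sed's default output
# and the p command prints matching lines, so test.tab itself is untouched.
sed -n '/^[AB]/p' test.tab > clean.tab

# Pair every two consecutive lines (an A row and its B row) into one
# tab-separated line.
paste - - < clean.tab > joined.tab

cat joined.tab
```

The resulting `joined.tab` should then be readable directly with `fread("joined.tab")`, assuming every A row is immediately followed by its B row.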