How to exclude certain lines when parsing tab-delimited data into an R data.table/data.frame?

Updated: 2024-06-14 16:59:47

This question is related to the following question:

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I have a malformed text file in which the tab-delimited format is the following:

    A 1092 - 1093 + 1X
    B 1093 HRDCPMRFYT
    A 1093 + 1094 - 1X
    B 1094 BSZSDFJRVF
    A 1094 + 1095 + 1X
    B 1095 SSTFCLEPVV
    ...

However, the file also contains a few rows that are technically tab-delimited but consist of long strings, e.g. the rows 'Z' and 'Y' here:

Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV ...

The Y row in this file (Y 23434M,23434M,...) is possibly several GB long.

These lines are exceptionally rare, and are identified only by a leading Z or Y. So far I have opened the file in a text editor and deleted these lines by hand.

However, this is not algorithmically reasonable. Is there a way to parse this file such that either (1) only rows A and B are used or (2) rows Z and Y are explicitly not used?

EDIT: To clarify, Z is not a long string; only 'Y' is a long string here. The Z row is a string of the format X XX:X:0.0, where X is a character and 0 is an integer.

Best answer

You can make a system call to fix the file in place with a tool such as sed. To remove all the rows that begin with Z or Y, pass a regular expression followed by /d (the delete command):

    system("sed -i '/^[ZY]/d' test.tab")

The command above removes all the rows that begin with Z or Y from your file (this is GNU sed syntax; BSD/macOS sed expects an argument after -i, e.g. -i ''). Then you can run the same code I posted in your previous question:

    library(data.table)
    fread("sed '$!N;s/\\n/ /' test.tab")
    #    V1   V2 V3   V4 V5   V6   V7         V8
    # 1:  A 1092  - 1093  + 1X B 1093 HRDCPMRFYT
    # 2:  A 1093  + 1094  - 1X B 1094 BSZSDFJRVF
    # 3:  A 1094  + 1095  + 1X B 1095 SSTFCLEPVV
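To make the `sed '$!N;s/\n/ /'` step easier to follow, here is a minimal base R sketch of the same pairing. It assumes the A and B rows strictly alternate and that the number of rows is even; the vector `rows` and the other variable names are illustrative only and not part of the original answer.

```r
# Illustrative input: tab-separated A and B rows that strictly alternate
rows <- c("A\t1092\t-\t1093\t+\t1X", "B\t1093\tHRDCPMRFYT",
          "A\t1093\t+\t1094\t-\t1X", "B\t1094\tBSZSDFJRVF",
          "A\t1094\t+\t1095\t+\t1X", "B\t1095\tSSTFCLEPVV")

# Join each odd-numbered row with the row after it, separated by a single
# space (the same effect as sed's N followed by s/\n/ /)
a_rows <- rows[c(TRUE, FALSE)]
b_rows <- rows[c(FALSE, TRUE)]
joined <- paste(a_rows, b_rows)

# Parse the joined rows as tab-separated text
paired <- read.delim(text = joined, header = FALSE, sep = "\t",
                     stringsAsFactors = FALSE)
```

Because the rows are joined with a space rather than a tab, the last field of each A row and the label of the following B row end up in the same column (for example `1X B`), so the result has eight columns.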

Data

text <- "Z FX:E:4.2 Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M A 1092 - 1093 + 1X B 1093 HRDCPMRFYT A 1093 + 1094 - 1X B 1094 BSZSDFJRVF A 1094 + 1095 + 1X B 1095 SSTFCLEPVV" # Saving it as tab separated file on disk write(gsub(" +", "\t", text), file = "test.tab")
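If you would rather leave the file on disk untouched, the filtering can also be done on the fly instead of with `sed -i`. The sketch below shows one possible way in base R, assuming a Unix-like system with `grep` available on the PATH; the file name `example.tab` and the variable names are hypothetical. It keeps only the rows that start with A or B (option 1 in the question).

```r
# Write a small illustrative file (hypothetical name, not the real data)
example <- c("Z\tFX:E:4.2",
             "Y\t23434M,23434M,23434M",
             "A\t1092\t-\t1093\t+\t1X", "B\t1093\tHRDCPMRFYT",
             "A\t1093\t+\t1094\t-\t1X", "B\t1094\tBSZSDFJRVF")
writeLines(example, "example.tab")

# Stream the file through grep so that only rows starting with A or B
# reach R; the file on disk is not modified
kept <- readLines(pipe("grep '^[AB]' example.tab"))

# Parse the A rows and the B rows separately, since they have different widths
a_tab <- read.delim(text = kept[startsWith(kept, "A")], header = FALSE,
                    sep = "\t", stringsAsFactors = FALSE)
b_tab <- read.delim(text = kept[startsWith(kept, "B")], header = FALSE,
                    sep = "\t", stringsAsFactors = FALSE)
```

To exclude rows instead (option 2 in the question), `grep -v '^[ZY]'` would be the analogous filter.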

Published: 2023-04-17 09:05:00