I am fairly new to Linux and feel this should be a fairly simple task, but I cannot quite figure it out. I have a large data file with millions of rows, and I want to break the file into smaller files based on date. I have a time column that contains YYMMDDHH data, and I want to create subfiles based on the DD. For each new DD, I want a new file created with all entries for that day. The file is a CSV and is already sorted by time.
From what I have read it looks like I should be able to use cat, awk and possibly grep to perform what I want.
To elaborate further, there are 14 columns per row. One column has data in YYMMDDHH form (i.e. 14071000, 14071000, ..., 14071022, 14071022, ..., 14071100, ..., 14071200, ...).
I can manually subset with

cat trial | awk 'NR>=1 && NR<=100 {print}' > output.txt

This gives me rows 1 through 100. I was wondering if there is a command that allows me to extract based on the YYMMDDHH column, so that all data points from 140710 could be put in a single file. Hope that helps explain my problem a little better.
Best answer
You should be able to use something like this:

awk -F, '{ d = int($1 / 100); print > ("out_" d ".txt") }' trial

The int() call truncates the trailing HH digits, so 14071022 becomes 140710; without it, 14071022 / 100 yields 140710.22 and rows from the same day get scattered into files like out_140710.22.txt. Adjust -F and the field number ($1 here) to wherever the timestamp column sits in your CSV. By the way, you might want to avoid a "useless use of cat" by not piping but using awk directly on your file.
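To see the split end to end, here is a minimal sketch. It assumes the timestamp is the first comma-separated field and fabricates a tiny trial file; the filenames trial, out_140710.txt, etc. are just for illustration:

```shell
# Build a small sample "trial" file: three rows, two distinct days.
printf '%s\n' \
  '14071000,a' \
  '14071022,b' \
  '14071100,c' > trial

# Split by day: int($1/100) drops the HH digits, so each distinct
# YYMMDD value gets its own output file.
awk -F, '{ d = int($1 / 100); print > ("out_" d ".txt") }' trial

# out_140710.txt now holds the first two rows,
# out_140711.txt holds the third.
```

Because the input is already sorted by time, awk writes each output file in one contiguous run, so this stays fast even on millions of rows.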