I have a comma-separated text file (no commas in the example below, for readability) containing several columns.
id           date
xyz_1567.n28 2017-08-09T18:36:38.000000Z
abc_2791.b87 2015-04-07T12:04:06.000000Z
xyz_1567.n28 2019-10-09T10:34:38.000000Z
Whenever there is a duplicate in the 'id' column, we need to compare the 'date' column of the duplicate rows and remove the row with the earlier date. In the example above, the first and third rows share the same 'id' value. The date of row three is later than that of row one, so row three would be kept. Output:
id           date
abc_2791.b87 2015-04-07T12:04:06.000000Z
xyz_1567.n28 2019-10-09T10:34:38.000000Z
Finding duplicates can be achieved fairly easily with awk or sort, and comparing dates isn't hard either. The hard part is combining the two, at least for me.
Answer:

sort -rk2 file | awk '!seen[$1]++'
Sort the file in reverse by date (the second column), then remove the duplicates: awk prints only the first row it sees for each id, which after the reverse sort is the row with the most recent date.
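A minimal sketch of the pipeline on the sample data above (data.txt is an assumed file name; for an actual comma-separated file you would also pass -t, to sort and -F, to awk):

```shell
# Recreate the sample input (space-separated, as in the example above).
cat > data.txt <<'EOF'
id date
xyz_1567.n28 2017-08-09T18:36:38.000000Z
abc_2791.b87 2015-04-07T12:04:06.000000Z
xyz_1567.n28 2019-10-09T10:34:38.000000Z
EOF

# Reverse-sort on field 2 onward (ISO-8601 timestamps compare correctly
# as plain strings), then keep only the first row seen for each id.
# The header happens to stay on top because letters sort after digits,
# so in reverse order "date" precedes the timestamps.
sort -rk2 data.txt | awk '!seen[$1]++'
```

This prints the header, then the 2019 xyz_1567.n28 row and the abc_2791.b87 row; the earlier 2017 xyz_1567.n28 row is dropped.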
或者使用一个awk脚本
Or with a single awk script:
awk 'NR==1{print;next} $2>a[$1] {a[$1]=$2} END {for (i in a) print i,a[i]}' file
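The awk-only version can be checked the same way (data.txt is an assumed file name). One caveat: awk's for (i in a) iterates in an unspecified order, so the surviving rows may come out in a different order than in the input, though the header is still printed first:

```shell
# Same sample input as in the question.
cat > data.txt <<'EOF'
id date
xyz_1567.n28 2017-08-09T18:36:38.000000Z
abc_2791.b87 2015-04-07T12:04:06.000000Z
xyz_1567.n28 2019-10-09T10:34:38.000000Z
EOF

# Pass the header through unchanged, remember the latest date per id
# (string comparison works for ISO-8601 timestamps), then print the
# surviving id/date pairs at the end.
awk 'NR==1{print;next} $2>a[$1] {a[$1]=$2} END {for (i in a) print i,a[i]}' data.txt
```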