Filtering out consecutive rows from dataset

Updated: 2024-10-26 19:39:11

I have a dataset with the index values and one datetime variable as follows:

    1     2017-01-03 09:30:01.958
    46    2017-01-03 09:30:47.879
    99    2017-01-03 09:33:48.121
    117   2017-01-03 09:47:06.215
    139   2017-01-03 09:51:06.054
    1567  2017-01-03 14:17:18.949
    2480  2017-01-03 15:57:13.442
    2481  2017-01-03 15:57:14.333
    2486  2017-01-03 15:57:37.500
    2487  2017-01-03 15:57:38.677
    2489  2017-01-03 15:57:41.053
    2491  2017-01-03 15:57:54.870
    2498  2017-01-03 15:59:24.210

I want to remove rows whose index values are consecutive, keeping only the first observation of each run. In this case the code should drop the rows with index 2481 and 2487. I tried using

df[df.index.diff() == 0].drop()

but it only returned

AttributeError: 'Int64Index' object has no attribute 'diff'

Best answer

You can use boolean indexing. Because the index does not implement diff here, convert it to a Series with to_series first:

    df = df[df.index.to_series().diff() != 1]
    print(df)

                            date
    1    2017-01-03 09:30:01.958
    46   2017-01-03 09:30:47.879
    99   2017-01-03 09:33:48.121
    117  2017-01-03 09:47:06.215
    139  2017-01-03 09:51:06.054
    1567 2017-01-03 14:17:18.949
    2480 2017-01-03 15:57:13.442
    2486 2017-01-03 15:57:37.500
    2489 2017-01-03 15:57:41.053
    2491 2017-01-03 15:57:54.870
    2498 2017-01-03 15:59:24.210
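For anyone who wants to run this end to end, here is a minimal self-contained sketch of the to_series approach. It rebuilds the sample data from the question; the column name "date" is taken from the printed output, and the variable names gaps and filtered are only illustrative. On recent pandas versions the index may offer its own diff method, but going through to_series should work either way.

```python
import pandas as pd

# Sample data copied from the question; "date" as the column name is an assumption
# based on the printed output in the answer.
idx = [1, 46, 99, 117, 139, 1567, 2480, 2481, 2486, 2487, 2489, 2491, 2498]
dates = pd.to_datetime([
    "2017-01-03 09:30:01.958", "2017-01-03 09:30:47.879",
    "2017-01-03 09:33:48.121", "2017-01-03 09:47:06.215",
    "2017-01-03 09:51:06.054", "2017-01-03 14:17:18.949",
    "2017-01-03 15:57:13.442", "2017-01-03 15:57:14.333",
    "2017-01-03 15:57:37.500", "2017-01-03 15:57:38.677",
    "2017-01-03 15:57:41.053", "2017-01-03 15:57:54.870",
    "2017-01-03 15:59:24.210",
])
df = pd.DataFrame({"date": dates}, index=idx)

# Gap between each index label and the previous one (NaN for the first row).
gaps = df.index.to_series().diff()

# Keep every row whose gap is not exactly 1. NaN != 1 evaluates to True,
# so the first row is always kept.
filtered = df[gaps != 1]

print(filtered.index.tolist())
# [1, 46, 99, 117, 139, 1567, 2480, 2486, 2489, 2491, 2498]
```

Note that in a longer run such as 5, 6, 7 this mask keeps only 5, which matches the goal of retaining the first observation of each segment.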

Thanks to piRSquared for the numpy alternative:

df[np.append(0, np.diff(df.index.values)) != 1]
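The numpy one-liner can be unpacked as follows. This is only a sketch on the question's index labels; the placeholder column "value" and the names gaps, mask and filtered_np are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Index labels from the question; the "value" column is a hypothetical placeholder.
idx = [1, 46, 99, 117, 139, 1567, 2480, 2481, 2486, 2487, 2489, 2491, 2498]
df = pd.DataFrame({"value": [0] * len(idx)}, index=idx)

# np.diff returns len(idx) - 1 gaps, so a leading 0 is prepended to line the
# gaps up with the rows; 0 != 1, which keeps the first row.
gaps = np.append(0, np.diff(df.index.values))

# Boolean numpy array, True where the row does not directly follow its predecessor.
mask = gaps != 1
filtered_np = df[mask]

print(filtered_np.index.tolist())
# [1, 46, 99, 117, 139, 1567, 2480, 2486, 2489, 2491, 2498]
```

The mask is a plain numpy array, so no index alignment takes place when it is used to select rows, which is likely part of why this variant is faster.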

Timings:

    #[11000 rows x 1 columns]
    df = pd.concat([df]*1000)

    In [60]: %timeit [True] + [(i[0]+1) != i[1] for i in zip(df.index.tolist(), df.index.tolist()[1:])]
    100 loops, best of 3: 4.19 ms per loop

    In [61]: %timeit np.append(0, np.diff(df.index.values)) != 1
    The slowest run took 4.72 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 33.1 µs per loop

    In [62]: %timeit df.index.to_series().diff() != 1
    1000 loops, best of 3: 260 µs per loop
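The %timeit runs above need IPython. A rough equivalent with the standard-library timeit module might look like the sketch below. The function names are hypothetical, the frame is built from the 13 original index labels rather than the filtered 11 so that the masks are not trivially all True, and absolute numbers will vary with machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# 13 index labels from the question, repeated 1000 times (13000 rows);
# the "value" column is a hypothetical placeholder.
idx = [1, 46, 99, 117, 139, 1567, 2480, 2481, 2486, 2487, 2489, 2491, 2498]
df = pd.concat([pd.DataFrame({"value": [0] * len(idx)}, index=idx)] * 1000)


def mask_list():
    # Pure-Python comparison of each label with its predecessor.
    labels = df.index.tolist()
    return [True] + [(a + 1) != b for a, b in zip(labels, labels[1:])]


def mask_numpy():
    # numpy variant from the answer.
    return np.append(0, np.diff(df.index.values)) != 1


def mask_series():
    # pandas variant from the answer.
    return df.index.to_series().diff() != 1


# Report the average time per call for each variant.
for fn in (mask_list, mask_numpy, mask_series):
    seconds = timeit.timeit(fn, number=100) / 100
    print(f"{fn.__name__}: {seconds * 1e6:.1f} µs per call")
```

All three functions should produce the same mask, so the choice between them comes down to speed and readability.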

Published: 2023-07-31 10:16:00
Link: https://www.elefans.com/category/jswz/34/1341913.html