TL;DR: I have a water flow time series that needs cleaning, and I can't figure out a way to remove outlier peaks.
I'm currently working on a project where I receive a .csv dataset containing two columns:
- date, a datetime timestamp
- value, a water flow value
This dataset is usually one year of measurements from a water flow sensor belonging to a management entity with automatic irrigation systems, and contains around 402 000 raw values. Sometimes it has peaks that don't correspond to a watering period: an isolated value sitting between normal values, like in the image.
So far I've tried calculating the percentage differences between two points and their spacing, and calculating the median absolute deviation (MAD), but both catch false positives.
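For context, here is a minimal sketch of what those two checks might look like, assuming a pandas Series of flow values. The function names, the 300% change threshold and the 3.5 cutoff are illustrative assumptions and not necessarily what the repo uses.

```python
import pandas as pd

def pct_change_flags(flow: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag points whose relative change from the previous point exceeds
    `threshold` (3.0 means a 300% change). Assumed threshold."""
    change = (flow.diff() / flow.shift()).abs()
    return change > threshold  # the first point (NaN change) is never flagged

def mad_flags(flow: pd.Series, cutoff: float = 3.5) -> pd.Series:
    """Flag points whose modified z-score, based on the global median and
    MAD, exceeds `cutoff`. Assumed cutoff."""
    median = flow.median()
    mad = (flow - median).abs().median()
    if mad == 0:
        # Degenerate case: more than half the values are identical
        return pd.Series(False, index=flow.index)
    modified_z = 0.6745 * (flow - median) / mad
    return modified_z.abs() > cutoff

# Toy data (not from the real dataset): one isolated spike at position 3
flow = pd.Series([10.0, 12.0, 11.0, 50.0, 9.0, 11.0, 10.0])
pct_result = pct_change_flags(flow)
mad_result = mad_flags(flow)
print(pct_result.tolist())
print(mad_result.tolist())
```

Both checks catch the spike in this toy series, but neither one knows how long an excursion lasts, so the first sample of a genuine watering period can look the same to them as a spike.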
The issue here is that I need an algorithm that identifies a spontaneous peak lasting 1 or 2 measurements, because it's physically impossible to have a 300% increase in flow for 2 minutes.
The other issue is in the coding. The detection of these peaks needs to be dynamic because, looking at the whole dataset, the flow more than doubles in the summer, which makes a fixed threshold such as the .95 percentile unusable.
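To make the two requirements concrete, here is a rough sketch of one possible approach: compare each point to a local baseline (a centered rolling median, so the threshold follows the seasonal level) and keep only excursions that last at most 2 consecutive samples. The function name `short_spike_mask`, the ratio of 3, the window of 11 samples and the toy data are all assumptions made for illustration and have not been tuned on the real dataset.

```python
import pandas as pd

def short_spike_mask(flow: pd.Series, ratio: float = 3.0,
                     window: int = 11, max_len: int = 2) -> pd.Series:
    """Return True where a value exceeds `ratio` times the local rolling
    median AND the excursion lasts at most `max_len` consecutive samples."""
    # Local baseline: adapts to the seasonal level, unlike a global percentile
    baseline = flow.rolling(window, center=True, min_periods=1).median()
    above = flow > ratio * baseline
    # Label consecutive runs of identical True/False values
    run_id = above.astype(int).diff().ne(0).cumsum()
    run_len = above.groupby(run_id).transform("size")
    return above & (run_len <= max_len)

# Toy data: a "winter" level of 10 and a "summer" level of 25
values = (
    [10.0] * 10 + [45.0] + [10.0] * 10        # 1-sample spike at position 10
    + [25.0] * 10 + [110.0, 110.0]            # 2-sample spike at positions 31-32
    + [25.0] * 10 + [80.0] * 4 + [25.0] * 10  # 4-sample excursion, kept
)
flow = pd.Series(values)
mask = short_spike_mask(flow)
flagged = mask[mask].index.tolist()
print(flagged)
```

In this toy series the two short spikes are flagged at both flow levels, while the 4-sample excursion is left alone because it is longer than `max_len`. The window has to be clearly longer than the spikes for the median to ignore them.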
I've prepared a GitHub repo with the techniques stated above and 1 day of the dataset, the one I'm currently using (around 1000 values).
Solution
Not a real answer but too long for a comment:
Maybe you could use the prominence of the peaks. You can use scipy.signal.find_peaks with the prominence and width parameters, and try tweaking other parameters such as the window size used for the prominence calculation (wlen).
The following quick example only illustrates the usage. It just finds peaks with a minimum prominence of 3 times the median (an arbitrary choice) and a width between 1 and 2 samples:
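    import pandas as pd
    from scipy.signal import find_peaks

    # Load the one-day sample dataset (columns: date, value)
    df = pd.read_csv('raw.githubusercontent/MigasTigas/peak_removal/master/dataset_simple_example.csv')

    # Keep only peaks with a prominence of at least 3x the median and a width of 1 to 2 samples
    peaks, _ = find_peaks(df.value, prominence=df.value.median()*3, width=(1, 2))

    # Plot the series and mark the detected peaks with an 'x'
    ax = df.plot()
    df.iloc[peaks.tolist()].plot(style=['x'], ax=ax)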
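Going one step further, a possible way to actually remove the detected peaks is sketched below: use the `left_ips` and `right_ips` properties that `find_peaks` returns when `width` is given, blank out the samples inside each narrow peak, and interpolate across the gap. The helper name `remove_narrow_peaks`, the default `max_width` and `wlen` values, and the toy data are assumptions for illustration only and would need tuning on the real series.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

def remove_narrow_peaks(values: pd.Series, min_prominence: float,
                        max_width: float = 2.5, wlen: int = 31) -> pd.Series:
    """Replace narrow, prominent peaks with linearly interpolated values."""
    x = values.to_numpy(dtype=float)
    # wlen limits the prominence calculation to a local window (in samples)
    _, props = find_peaks(x, prominence=min_prominence,
                          width=(None, max_width), wlen=wlen)
    cleaned = x.copy()
    for left, right in zip(props["left_ips"], props["right_ips"]):
        # Samples strictly between the half-prominence crossings form the peak
        start = int(np.floor(left)) + 1
        stop = int(np.ceil(right))  # exclusive upper bound
        cleaned[start:stop] = np.nan
    series = pd.Series(cleaned, index=values.index)
    return series.interpolate(limit_direction="both")

# Toy data: a 1-sample spike at position 5 and a 6-sample watering period
values = pd.Series(
    [10, 11, 10, 11, 10, 50, 10, 11, 10, 11,
     40, 40, 40, 40, 40, 40, 10, 11, 10, 11, 10],
    dtype=float,
)
cleaned = remove_narrow_peaks(values, min_prominence=20)
print(cleaned.tolist())
```

In this toy series only the 1-sample spike is replaced, because the watering period is prominent but too wide to pass the `max_width` filter.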