Multi-indexing - accessing the last time in every day


New to multiindexing in Pandas. I have data that looks like this

    Date        Time      value
    2014-01-14  12:00:04  .424
                12:01:12  .342
                12:01:19  .341
                ...
                12:05:49  .23
    2014-05-12  ...
                1:02:42   .23
    ....

For now, I want to access the last time for every single date and store the corresponding value in an array. I've made a MultiIndex like this:

    df = pd.read_csv("df.csv", index_col=0)
    df.index = pd.to_datetime(df.index, infer_datetime_format=True)
    df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.time], names=['Date', 'Time'])
    df = df[~df.index.duplicated(keep='first')]
    dates = df.index.get_level_values(0)

So I have the dates saved as an array. I want to iterate through the dates, but either I can't get the syntax right or I am accessing the values incorrectly. I've tried a for loop (for date in dates) but can't get it to run, and direct access (df.loc[dates[i]] or something like that) doesn't work either. Also, the number of time entries varies from date to date. Is there any way to fix this?

Best answer


This sounds like a groupby/max operation. More specifically, you want to group by the Date and aggregate the Times by taking the max. Since aggregation can only be done over column values, we'll need to change the Time index level into a column (by using reset_index):

    import pandas as pd

    df = pd.DataFrame({
        'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14',
                 '2014-05-12', '2014-05-12'],
        'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49',
                 '01:01:59', '01:02:42'],
        'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.set_index(['Date', 'Time'])

    # Move the Time level back into a column so it can be aggregated.
    df = df.reset_index('Time', drop=False)
    max_times = df.groupby(level=0)['Time'].max()
    print(max_times)

yields

    Date
    2014-01-14    12:05:49
    2014-05-12    01:02:42
    Name: Time, dtype: object
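One thing to keep in mind with the snippet above: the Time values are plain strings, so max compares them lexicographically. That only matches chronological order when every time is zero-padded to HH:MM:SS. The following is a minimal sketch, using made-up sample values that are not from the question, of how an unpadded time can win the string comparison and how converting with pd.to_timedelta avoids that:

```python
import pandas as pd

# Hypothetical sample values: '9:30:00' is deliberately not zero-padded.
times = pd.Series(['10:15:00', '9:30:00'])

# String comparison is lexicographic: '9' sorts after '1',
# so the earlier time is reported as the maximum.
string_max = times.max()

# Converting to timedeltas compares the actual elapsed time since midnight.
timedelta_max = pd.to_timedelta(times).max()

print(string_max)
print(timedelta_max)
```

If the times in the real file are not consistently padded, converting them before aggregating is probably the safer choice.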

If you wish to select the entire row, then you could use idxmax -- but there is a caveat. idxmax returns index labels. Therefore, the index must be unique for the labels to signify unique rows. Since the Date level is not by itself unique, to use idxmax we'll need to reset_index completely (to make an index of unique integers):

    df = pd.DataFrame({
        'Date': ['2014-01-14', '2014-01-14', '2014-01-14', '2014-01-14',
                 '2014-05-12', '2014-05-12'],
        'Time': ['12:00:04', '12:01:12', '12:01:19', '12:05:49',
                 '01:01:59', '1:02:42'],
        'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
    df['Date'] = pd.to_datetime(df['Date'])
    df['Time'] = pd.to_timedelta(df['Time'])
    df = df.set_index(['Date', 'Time'])

    # Reset the index completely so every row has a unique integer label.
    df = df.reset_index()
    idx = df.groupby(['Date'])['Time'].idxmax()
    print(df.loc[idx])

yields

            Date     Time  value
    3 2014-01-14 12:05:49   0.23
    5 2014-05-12 01:02:42   0.23
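To illustrate the uniqueness caveat, here is a small sketch with a cut-down, made-up frame (not the data from the question). With Date as the index, one label matches several rows, so a label cannot identify a single row; after reset_index each row has its own integer label:

```python
import pandas as pd

# Hypothetical cut-down frame with a repeated Date label in the index.
df = pd.DataFrame({
    'Date': ['2014-01-14', '2014-01-14', '2014-05-12'],
    'value': [0.424, 0.23, 0.23],
}).set_index('Date')

# The label '2014-01-14' is not unique, so .loc returns every matching row.
rows_for_label = df.loc['2014-01-14']
print(len(rows_for_label))

# After a full reset_index the labels are unique integers,
# so a label returned by idxmax points at exactly one row.
flat = df.reset_index()
label = flat['value'].idxmax()
print(flat.index.is_unique)
print(flat.loc[label])
```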

I don't see a good way to do this while keeping the MultiIndex. It is easier to perform the groupby operation before setting the MultiIndex. Moreover, it is probably preferable to keep each datetime as a single value instead of splitting it into two parts. Note that given a datetime/period-like Series, the .dt accessor gives you easy access to the date and the time as needed. Thus you can group by the date without making a Date column:

    df = pd.DataFrame({
        'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:01:12',
                     '2014-01-14 12:01:19', '2014-01-14 12:05:49',
                     '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
        'value': [0.424, 0.342, 0.341, 0.23, 0.0, 0.23]})
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    # df = pd.read_csv('df.csv', parse_dates=[0])

    # Group by the calendar date and find the row label of the latest timestamp.
    idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()
    result = df.loc[idx]
    print(result)

yields

                 DateTime  value
    3 2014-01-14 12:05:49   0.23
    5 2014-05-12 01:02:42   0.23
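Since the question asks for the values to end up in an array, the selected rows can be reduced to a NumPy array with to_numpy. The sketch below uses a shortened, made-up version of the sample data. It also shows an alternative that is not part of the answer above: sorting by timestamp and keeping the last row of each group with tail(1), which should select the same rows.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    'DateTime': ['2014-01-14 12:00:04', '2014-01-14 12:05:49',
                 '2014-05-12 01:01:59', '2014-05-12 01:02:42'],
    'value': [0.424, 0.23, 0.0, 0.23],
})
df['DateTime'] = pd.to_datetime(df['DateTime'])

# Row label of the latest timestamp within each calendar date.
idx = df.groupby(df['DateTime'].dt.date)['DateTime'].idxmax()

# Pull the matching values out as a plain NumPy array.
last_values = df.loc[idx, 'value'].to_numpy()
print(last_values)

# Alternative (not from the answer): sort by time, then keep the
# final row of each date group.
sorted_df = df.sort_values('DateTime')
last_rows = sorted_df.groupby(sorted_df['DateTime'].dt.date).tail(1)
print(last_rows)
```

Neither variant needs an explicit loop over the dates, and both cope with a different number of rows per date.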
