如何旋转数据框?

编程入门行业动态更新时间:2024-10-11 19:24:19

本文介绍了如何旋转数据框?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

什么是枢轴?
我如何旋转?
这是一个支点吗?
长格式到宽格式?

我看过很多关于数据透视表的问题.即使他们不知道他们在询问数据透视表，他们通常也是.几乎不可能写出一个规范的问题和答案来涵盖旋转的所有方面......

I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting...

...但我要试一试.

现有问题和答案的问题在于，问题通常集中在 OP 难以概括的细微差别上，以便使用许多现有的好答案.但是，没有一个答案试图给出全面的解释(因为这是一项艰巨的任务)

The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)

看看我的Google 搜索

如何在 Pandas 中透视数据框?

很好的问答.但答案只回答了具体问题，几乎没有解释.

pandas 数据透视表到数据框

在这个问题中，OP 与枢轴的输出有关.即列的外观.OP 希望它看起来像 R.这对 Pandas 用户不是很有帮助.

pandas 旋转数据框，重复行

另一个不错的问题，但答案集中在一种方法上，即 pd.DataFrame.pivot

因此，每当有人搜索 pivot 时，他们都会得到零星的结果，而这些结果可能不会回答他们的具体问题.

So whenever someone searches for pivot they get sporadic results that are likely not going to answer their specific question.

您可能会注意到，我明显地命名了我的列和相关列值，以与我将在下面的答案中旋转的方式相对应.

You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

问题

为什么会出现ValueError:索引包含重复条目，无法重塑
我如何旋转 df 使得 col 值为列，row 值为索引，val0 是值吗?

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?
col col0 col1 col2 col3 col4 row row0 0.77 0.605 NaN 0.860 0.65 row2 0.13 NaN 0.395 0.500 0.25 row3 NaN 0.310 NaN 0.545 NaN row4 NaN 0.100 0.395 0.760 0.24
我如何旋转 df 使得 col 值为列，row 值为索引，val0 是值，缺失值是 0?
How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?
col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
我能得到除 mean 以外的其他东西，比如 sum 吗?
Can I get something other than mean, like maybe sum?
col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
我可以一次做多个聚合吗?
Can I do more that one aggregation at a time?
sum mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
我可以聚合多个值列吗?
Can I aggregate over multiple value columns?
val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
可以按多列细分吗?
Can Subdivide by multiple columns?
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
或
Or
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
我可以汇总列和行一起出现的频率，也就是交叉表"吗?
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
如何通过仅以两列为中心将 DataFrame 从长转换为宽?鉴于，
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,
np.random.seed([3, 1415]) df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)}) df2 A B 0 a 0 1 a 11 2 a 2 3 a 11 4 b 10 5 b 10 6 b 14 7 c 7
预期应该看起来像
a b c 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
如何在 pivot 之后将多个索引展平为单个索引?
How do I flatten the multiple index to single index after pivot?

来自
1 2 1 1 2 a 2 1 1 b 2 1 0 c 1 0 0
到
1|1 2|1 2|2 a 2 1 1 b 2 1 0 c 1 0 0

我们先回答第一个问题:

We start by answering the first question:

为什么我得到 ValueError: Index 包含重复的条目，无法重塑

发生这种情况是因为 Pandas 试图用重复的条目重新索引 columns 或 index 对象.可以使用不同的方法来执行数据透视.其中一些不太适合当要求它在其中旋转的键有重复时.例如.考虑 pd.DataFrame.pivot.我知道有重复的条目共享 row 和 col 值:

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

所以当我 pivot 使用

我收到上面提到的错误.事实上，当我尝试执行相同的任务时，我遇到了同样的错误:

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

这是我们可以用来旋转的习语列表

Here is a list of idioms we can use to pivot

pd.DataFrame.groupby + pd.DataFrame.unstack
- 适用于几乎任何类型的数据透视的良好通用方法
- 您指定将构成一组中的透视行级别和列级别的所有列.您可以通过选择要聚合的剩余列和要执行聚合的函数来遵循此操作.最后，您unstack您希望在列索引中的级别.
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- 具有更直观 API 的 groupby 美化版.对于许多人来说，这是首选方法.并且是开发人员的预期方法.
- 指定行级别、列级别、要聚合的值以及执行聚合的函数.
- A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
- Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
pd.DataFrame.set_index + pd.DataFrame.unstack
- 对某些人(包括我自己)来说方便且直观.无法处理重复的分组键.
- 与groupby 范式类似，我们指定最终将成为行级或列级的所有列，并将它们设置为索引.然后我们unstack 列中我们想要的级别.如果剩余的索引级别或列级别不是唯一的，则此方法将失败.
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- 与 set_index 非常相似，因为它共享重复键限制.API 也非常有限.它只需要 index、columns、values 的标量值.
- 类似于 pivot_table 方法，我们选择要旋转的行、列和值.但是，我们无法聚合，如果行或列不是唯一的，则此方法将失败.
- Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
- Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
pd.crosstab
- 这是pivot_table 的专用版本，其最纯粹的形式是执行多项任务的最直观方式.
- This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
pd.factorize + np.bincount
- 这是一种非常先进的技术，虽然晦涩但速度非常快.它不是在所有情况下都可以使用的，但是当它可以使用并且您使用起来很舒服时，您将获得绩效奖励.
pd.get_dummies + pd.DataFrame.dot
- 我用它来巧妙地执行交叉制表.
示例
对于每个后续的答案和问题，我要做的是使用 pd.DataFrame.pivot_table.然后我将提供执行相同任务的替代方案.

Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

我如何旋转 df 使得 col 值为列，row 值为索引，val0 是值，缺失值是 0?

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?
- pd.DataFrame.pivot_table
  - fill_value 默认未设置.我倾向于适当地设置它.在这种情况下，我将其设置为 0.请注意，我跳过了问题 2，因为它与没有 fill_value
  - fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value
  aggfunc='mean' 是默认值，我不需要设置它.我包括它是为了明确.
  
  aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.
  df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
- pd.crosstab
  pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='mean').fillna(0)