


Raise your hands if you’re an enthusiastic data analyst and you’ve heard this trite but very true line:


Data science is 80% data wrangling and 20% model building


Enough has been said about how 80% of the work is data wrangling. But not much is said about how to get it right. Data wrangling is mostly thought of as an elusive art rather than something that needs structure.


Time and again I have been humbled by all the bad estimates I have made in terms of time required, effort required and difficulty of the seemingly soft, furry and harmless data pull requests that come my way every now and then.


After having made numerous mistakes, I now operate with self-made guidelines for data wrangling. These have improved my estimates and reduced the number of mistakes.


I want to put these self-made guidelines out there, hoping to receive suggestions on how to make this process better for myself and for beginners who struggle with data wrangling.


Now let me get straight to the point:


1. Picture the final dataframe

This one’s a no-brainer. I usually start by visualising what my final result should look like. If I have time on my hands I simply do this in Microsoft Excel. At this point I start to think about whether the final dataframe has all the information required. Being conscious of why I am pulling this data is helpful, especially when the runtimes are long.


2. Have cold business estimates

During the early stages of a project, my discussions with stakeholders revolve around what the business expects the results to look like. I try to guess what range my results are most likely to lie in. I could get these estimates from well-informed colleagues, similar projects or even the internet, depending on the nature of the project. If I have some time on my hands I even indulge in guesstimate puzzles. It’s easy to get too caught up in this. But I think of it as a guiding north star, and it is not worth spending more than 20 minutes on this step.


3. Understand Level of Data (LoD)

For every table that I am going to use, I try to explain in plain English what every row represents. In other words, which combination of columns uniquely defines a row.
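A minimal sketch of such a check in Python, assuming the table fits in memory as a list of dicts. The column names (`order_id`, `product_id`, `qty`) and the helper `defines_row` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical order-line rows; the column names are assumptions for illustration.
rows = [
    {"order_id": 1, "product_id": "A", "qty": 2},
    {"order_id": 1, "product_id": "B", "qty": 1},
    {"order_id": 2, "product_id": "A", "qty": 5},
]

def defines_row(rows, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return all(n == 1 for n in counts.values())

print(defines_row(rows, ["order_id"]))                # False: order 1 appears twice
print(defines_row(rows, ["order_id", "product_id"]))  # True: one row per order line
```

Here the LoD would read as "one row per product within an order", which is the kind of sentence worth writing down before any join.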


4. Join/Merge etiquette

Understanding the mechanism of merging is super important. This involves understanding the LoD of all the tables involved and also having clarity on the LoD of the merged dataframe.


A general warning around merges:


If the tables involved are not being joined on columns that uniquely define them, expect duplicates!!


These duplicates should then be handled appropriately, especially before aggregating.
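To see why this matters before aggregating, here is a small sketch using Python's built-in `sqlite3` module. The `orders` and `addresses` tables and their values are made up for the illustration; the point is only that joining on a column that is not unique in one table inflates the totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE addresses (customer_id INTEGER, city TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 50.0);
    -- customer 10 has two addresses, so customer_id is not unique in this table
    INSERT INTO addresses VALUES (10, 'Pune'), (10, 'Delhi'), (11, 'Mumbai');
""")

# Total computed on the orders table alone.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Same total computed after joining to addresses on a non-unique column.
joined_rows, joined_total = conn.execute("""
    SELECT COUNT(*), SUM(o.amount)
    FROM orders o
    JOIN addresses a ON a.customer_id = o.customer_id
""").fetchone()

print(true_total)    # 150.0
print(joined_rows)   # 3 rows, although there are only 2 orders
print(joined_total)  # 250.0: order 1 is counted twice
```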


Also, a SQL specific warning around joins:


If table A is left joined with table B, and there are table B specific WHERE conditions, then it is equivalent to performing an inner join!


If I must have filters for table B then they can be included as join conditions.
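The difference between the two placements of the filter can be sketched with Python's built-in `sqlite3` module. The tables `a` and `b` and the `status` column are hypothetical; the behaviour shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER, status TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 'active'), (2, 'inactive');
""")

# Filter on table b in the WHERE clause: unmatched rows of a are dropped.
filter_in_where = conn.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON b.id = a.id
    WHERE b.status = 'active'
    ORDER BY a.id
""").fetchall()

# Same filter as a join condition: every row of a is kept.
filter_in_join = conn.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON b.id = a.id AND b.status = 'active'
    ORDER BY a.id
""").fetchall()

print(filter_in_where)  # [(1, 'active')]
print(filter_in_join)   # [(1, 'active'), (2, None), (3, None)]
```

With the filter in the WHERE clause, the rows of `a` that have no active match in `b` carry a NULL `status` and are filtered out, which is exactly what an inner join would return.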


5. Use subset of data for initial runs

When I start coding / querying I run the codes on a subset of data, squeezing in as many filters as possible.


An illustration of how I get “creative” with the filters:


  • If the analysis is about understanding which category is performing better then I run the codes on a day’s data including all the categories


  • If the analysis is about how categories are performing over time then I run the codes for a week for a single category


By doing this I am not only able to catch errors without using up a lot of resources, I am also getting a sense of the final results.
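One way to make this convenient is to keep the filters as query parameters, so the trial run and the full run share the same code. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, its columns and its values are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_date TEXT, category TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('2024-01-01', 'books', 10.0), ('2024-01-01', 'toys', 20.0),
        ('2024-01-02', 'books', 15.0), ('2024-01-02', 'toys', 5.0);
""")

# The date range is a parameter, so only the arguments change between runs.
QUERY = """
    SELECT category, SUM(revenue)
    FROM sales
    WHERE sale_date BETWEEN ? AND ?
    GROUP BY category
    ORDER BY category
"""

# Trial run: a single day, all categories.
trial = conn.execute(QUERY, ("2024-01-01", "2024-01-01")).fetchall()
# Full run: same query, wider date range.
full = conn.execute(QUERY, ("2024-01-01", "2024-01-02")).fetchall()

print(trial)  # [('books', 10.0), ('toys', 20.0)]
print(full)   # [('books', 25.0), ('toys', 25.0)]
```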


6. Pause after every intermediate code block

A data pull sometimes involves writing multiple intermediate tables before getting to the result. I try to get comfortable with the numbers before adding in more layers to my codes. Sometimes I still end up making mistakes but doing this gives me a starting point to reexamine my codes and catch errors.


7. Reconcile results with different data tables

Once I have completed writing codes, I start thinking about how best I can validate my results. If there’s scope I write simple codes using tables I have not used in my main codes to cross check results.


8. Reconcile with aggregates

Reconciling with other tables may not always be possible. In such cases I write a different set of codes to get high level aggregates and cross check results. I always keep the codes I use for these checks painfully simple. ’Cause y’know, I can’t be making any mistakes here!
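A sketch of what a painfully simple check could look like in Python. The raw rows and the `main_result` dictionary are hypothetical stand-ins for the source data and for the output of the more complex main code.

```python
import math

# Hypothetical raw rows and the per-category output of a more complex main query.
raw_sales = [("books", 10.0), ("toys", 20.0), ("books", 15.0), ("toys", 5.0)]
main_result = {"books": 25.0, "toys": 25.0}

# The check is deliberately simple: one grand total, no joins, no grouping.
grand_total = sum(revenue for _, revenue in raw_sales)
result_total = sum(main_result.values())

# Compare with a tolerance, since floating point sums may differ in the last digits.
reconciles = math.isclose(grand_total, result_total, rel_tol=1e-9)
print(reconciles)  # True
```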


9. What if you are not able to reconcile

Even after mindfully pulling the data and making the checks, sometimes I am disheartened when my results don’t match. For exactly 17 seconds. I then determine if the differences are global (across all columns) or local (just one column or only a few rows). Based on this I dive deep into the most granular data to find the source of the difference.
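A rough sketch of the global-versus-local triage in Python, comparing two results key by key. The helper `diff_by_key`, the sample values and the 50% cut-off are all assumptions for illustration, not a rule.

```python
def diff_by_key(expected, actual):
    """Return {key: (expected_value, actual_value)} for keys whose values differ."""
    keys = set(expected) | set(actual)
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }

# Hypothetical reference totals and the totals produced by the main code.
expected = {"books": 25.0, "toys": 25.0, "games": 40.0}
actual = {"books": 25.0, "toys": 30.0, "games": 40.0}

mismatches = diff_by_key(expected, actual)
share = len(mismatches) / len(set(expected) | set(actual))

# Arbitrary cut-off: if most keys disagree, treat the difference as global.
scope = "local" if share < 0.5 else "global"

print(mismatches)  # {'toys': (25.0, 30.0)}
print(scope)       # local
```

A local mismatch like this one points at the granular rows for a single key, while a global one more often points at a join or a filter.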


10. Get to know your final dataframe

Once the results match I am almost there. But not quite. I take no (remaining) column for granted. I make sure that my final dataframe has no obvious logical inconsistencies. A simple illustration:


  • If you want to know how much time a user spends on a website in a day, the values cannot exceed 24 hours.


Some additional questions to answer:


  • Unexpected Nulls?


  • Unexpected negative values?


  • Max and min of a column lying in the expected range?
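The questions above can be sketched as a small report in Python. The rows, the column name `hours_on_site` and the helper `sanity_report` are hypothetical; the 0 to 24 range comes from the time-on-website illustration.

```python
# Hypothetical final rows: hours a user spent on a website in one day.
final_rows = [
    {"user_id": 1, "hours_on_site": 1.5},
    {"user_id": 2, "hours_on_site": None},
    {"user_id": 3, "hours_on_site": -0.5},
    {"user_id": 4, "hours_on_site": 30.0},
]

def sanity_report(rows, column, low, high):
    """Count nulls, negatives and out-of-range values, and report min and max."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "negatives": sum(1 for v in present if v < 0),
        "out_of_range": sum(1 for v in present if not low <= v <= high),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

report = sanity_report(final_rows, "hours_on_site", low=0, high=24)
print(report)
# {'nulls': 1, 'negatives': 1, 'out_of_range': 2, 'min': -0.5, 'max': 30.0}
```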


At the end of this exercise I aim to have clarity on why my final dataframe looks the way it does.


Bonus tip:


If your final dataframe is small, dump it in Excel and make pivots to look at it from different perspectives.


11. Observations and insights

Another important part of writing correct codes is to look at my work through a commercial lens. Is there a story? Are all my initial hypotheses being validated? There are several other project-specific questions too. This step is crucial in evaluating my codes and possibly the methodology too.


In the end I might not have all the answers but I can be confident of the fact that if I present my results to my stakeholders, no question can possibly throw me off.


Translated from: https://medium/swlh/mindful-data-wrangling-1029df0a2dd1


