Mindful Data Wrangling

Raise your hands if you’re an enthusiastic data analyst and you’ve heard this trite but very true line:

Data science is 80% data wrangling and 20% model building

Enough has been said about how 80% of the work is data wrangling. But not much is said about how to get it right. Data wrangling is mostly thought of as an elusive art, not as something that needs some structure.

Time and again I have been humbled by my bad estimates of the time, effort and difficulty required by the seemingly soft, furry and harmless data pull requests that come my way every now and then.

After making numerous mistakes, I now operate with self-made guidelines for data wrangling. These have improved my estimates and reduced the number of mistakes.

I want to put these self-made guidelines out there, hoping to receive suggestions on how to make this process better for myself and for beginners who struggle with data wrangling.

Now let me get straight to the point:

1. Picture the final dataframe

This one’s a no-brainer. I usually start by visualising what my final result should look like. If I have time on my hands I simply do this in Microsoft Excel. At this point I start to think about whether the final dataframe has all the information required. Being conscious of why I am pulling this data is helpful, especially when the runtimes are long.

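As an illustration, here is a minimal pandas sketch of what picturing the final dataframe can look like; the column names are hypothetical:

```python
import pandas as pd

# Hypothetical mock-up of the target: one row per date and category,
# with every column the analysis will need. Filling in a dummy row
# makes it concrete what each field should contain.
target = pd.DataFrame(
    [["2021-01-01", "A", 120, 1500.0]],
    columns=["date", "category", "orders", "revenue"],
)
print(target)
```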

2. Have cold business estimates

During the early stages of a project my discussions with stakeholders revolve around what the business expects the results to look like. I try to guess what range my results are most likely to lie in. I could get these from well-informed colleagues, similar projects or even the internet, depending on the nature of the project. If I have some time on my hands I even indulge in guesstimate puzzles. It’s easy to get too caught up in this. But I think of it as a guiding north star, and it is not worth spending more than 20 minutes on this step.

3. Understand Level of Data (LoD)

For every table that I am going to use, I try to explain in English what every row represents. In other words, what combination of columns uniquely defines a row.

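A quick way to test a guess about a table’s LoD, sketched in pandas on a made-up table:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

# Guess: every row is uniquely defined by (date, category)
candidate_key = ["date", "category"]
assert not sales.duplicated(subset=candidate_key).any(), \
    "LoD guess is wrong: these columns do not uniquely define a row"
```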

4. Join/Merge etiquette

Understanding the mechanism of merging is super important. This involves understanding the LoD of all the tables involved and also having clarity on the LoD of the merged dataframe.

A general warning around merges:

If the tables involved are not being joined on columns that uniquely define them, expect duplicates!!

These duplicates should then be handled appropriately, especially before aggregating.

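Here is a minimal pandas sketch of how a non-unique join key silently inflates row counts (the tables are made up); merge’s validate argument makes the expected LoD explicit and fails fast when it is violated:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 20]})
# user_id does NOT uniquely define a row in 'events'
events = pd.DataFrame({"user_id": [10, 10, 20],
                       "event": ["click", "view", "click"]})

merged = orders.merge(events, on="user_id", how="left")
print(len(merged))  # 3 rows, not 2 -- order 1 is now duplicated

# Declaring the expected LoD up front surfaces the problem immediately
try:
    orders.merge(events, on="user_id", how="left", validate="one_to_one")
except pd.errors.MergeError as err:
    print(err)
```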

Also, a SQL-specific warning around joins:

If table A is left joined with table B, and there are table-B-specific WHERE conditions, then it is equivalent to performing an inner join!

If I must have filters for table B then they can be included as join conditions.

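A self-contained demonstration of both behaviours, run through Python’s sqlite3 on two toy tables (the names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, status TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'active');
""")

# WHERE filter on b: unmatched rows have b.status NULL, so they are
# filtered out -- the left join silently becomes an inner join.
print(con.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.status = 'active'
""").fetchall())  # [(1, 'active')] -- id 2 is gone

# The same filter as a join condition keeps every row of a.
print(con.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON a.id = b.id AND b.status = 'active'
""").fetchall())  # [(1, 'active'), (2, None)]
```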

5. Use a subset of data for initial runs

When I start coding / querying, I run the code on a subset of the data, squeezing in as many filters as possible.

An illustration of how I get “creative” with the filters (a code sketch follows the list):

  • If the analysis is about understanding which category is performing better, then I run the code on a single day’s data, including all the categories

  • If the analysis is about how categories are performing over time, then I run the code on a week’s data for a single category

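A sketch of both filters in pandas, on a made-up events table:

```python
import pandas as pd

events = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

# Comparing categories? One day's data, all categories.
trial_by_category = events[events["date"] == "2021-01-01"]

# Trends over time? One category, a short date range.
trial_over_time = events[(events["category"] == "A") &
                         events["date"].between("2021-01-01", "2021-01-07")]
```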

By doing this I am not only able to catch errors without using up a lot of resources, but I also get a sense of the final results.

6. Pause after every intermediate code block

A data pull sometimes involves writing multiple intermediate tables before getting to the result. I try to get comfortable with the numbers before adding more layers to my code. Sometimes I still end up making mistakes, but doing this gives me a starting point to reexamine my code and catch errors.

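The pause itself can be as simple as a few throwaway lines; a sketch for a hypothetical intermediate table:

```python
import pandas as pd

# Hypothetical intermediate table produced by an earlier step
daily_sales = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "revenue": [250.0, 120.0],
})

# Get comfortable with the numbers before building the next layer
print(daily_sales.shape)
print(daily_sales.isna().sum())
print(daily_sales.describe())
```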

7. Reconcile results with different data tables

Once I have finished writing the code, I start thinking about how best I can validate my results. If there’s scope, I write simple code using tables I have not used in my main code to cross-check results.

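A sketch of such a cross-check, assuming there is, say, a finance ledger that the main code never touched (both tables here are made up):

```python
import pandas as pd

# Main result: revenue per day, computed from event-level data
result = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                       "revenue": [250.0, 120.0]})

# Independent source: a ledger not used in the main code
ledger = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                       "booked_revenue": [250.0, 119.0]})

check = result.merge(ledger, on="date", how="outer")
check["diff"] = check["revenue"] - check["booked_revenue"]
print(check[check["diff"].abs() > 1e-6])  # rows that fail to reconcile
```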

8. Reconcile with aggregates

Reconciling with other tables may not always be possible. In such cases I write a different set of code to get high-level aggregates and cross-check results. I always keep the code I use for these checks painfully simple. Cause y’know I can’t be making any mistakes here!

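A painfully simple version of such a check, with hypothetical numbers:

```python
import pandas as pd

# Final dataframe at date x category level (hypothetical)
final = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01"],
    "category": ["A", "B"],
    "revenue": [100.0, 150.0],
})

# One-line aggregate pulled straight from the source, e.g.
# SELECT SUM(revenue) FROM source_table (value assumed here)
source_total = 250.0

assert abs(final["revenue"].sum() - source_total) < 1e-6
```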

9. What if you are not able to reconcile

Even after mindfully pulling the data and making the checks, sometimes I am disheartened when my results don’t match. For exactly 17 seconds. I then determine whether the differences are global (across all columns) or local (just one column or only a few rows). Based on this, I dive deep into the most granular data to find the source of the difference.

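One way to classify a mismatch as global or local, sketched on two toy versions of the same result:

```python
import pandas as pd

mine = pd.DataFrame({"date": ["d1", "d2"],
                     "orders": [10, 20],
                     "revenue": [100.0, 180.0]})
theirs = pd.DataFrame({"date": ["d1", "d2"],
                       "orders": [10, 20],
                       "revenue": [100.0, 200.0]})

# Count, per column, on how many rows the two results disagree
diff = mine.set_index("date") != theirs.set_index("date")
print(diff.sum())  # orders: 0, revenue: 1 -> a local difference
```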

10. Get to know your final dataframe

Once the results match I am almost there. But not quite. I take no (remaining) column for granted. I make sure that my final dataframe has no obvious logical inconsistencies. A simple illustration:

  • If you want to know how long a user spends on a website in a day, the values cannot exceed 24 hours.

Some additional questions to answer (a sketch of these checks follows the list):

  • Unexpected Nulls?

  • Unexpected negative values?

  • Max and min of a column lying in the expected range?

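A sketch of these checks, using the time-spent-on-site example from above:

```python
import pandas as pd

# Hypothetical final dataframe: time spent on site per user per day
final = pd.DataFrame({
    "user_id": [1, 2],
    "date": ["2021-01-01", "2021-01-01"],
    "hours_on_site": [2.5, 7.0],
})

assert final["hours_on_site"].notna().all()   # no unexpected nulls
assert (final["hours_on_site"] >= 0).all()    # no negative values
assert final["hours_on_site"].max() <= 24     # logically possible max
```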

At the end of this exercise I aim to have clarity on why my final dataframe looks the way it does.

Bonus tip:

If your final dataframe is small, dump it into Excel and make pivots to look at it from different perspectives.

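The same idea sketched in pandas: to_excel assumes openpyxl is installed, and pivot_table gives a similar view without leaving Python:

```python
import pandas as pd

final = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

final.to_excel("final.xlsx", index=False)  # pivot away in Excel

# ...or pivot directly in pandas
print(final.pivot_table(index="date", columns="category",
                        values="revenue", aggfunc="sum"))
```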

11. Observations and insights

Another important part of writing correct code is to look at my work through a commercial lens. Is there a story? Are all my initial hypotheses being validated? And several other project-specific questions. This step is crucial in evaluating my code and possibly the methodology too.

In the end I might not have all the answers, but I can be confident that if I present my results to my stakeholders, no question can possibly throw me off.

Translated from: https://medium/swlh/mindful-data-wrangling-1029df0a2dd1