Mindful Data Wrangling

Raise your hands if you’re an enthusiastic data analyst and you’ve heard this trite but very true line:

Data science is 80% data wrangling and 20% model building

Enough has been said about how 80% of the work is data wrangling. But not much is said about how to get it right. Data wrangling is mostly thought of as an elusive art, not as something that needs some structure.

Time and again I have been humbled by my bad estimates of the time, effort and difficulty required by the seemingly soft, furry and harmless data pull requests that come my way every now and then.

After making numerous mistakes, I now operate with self-made guidelines for data wrangling. These have improved my estimates and reduced the number of mistakes.

I want to put these self-made guidelines out there, hoping to receive suggestions on how to make this process better for myself and for beginners who struggle with data wrangling.

Now let me get straight to the point:

1. Picture the final dataframe

This one’s a no-brainer. I usually start by visualising what my final result should look like. If I have time on my hands I simply do this in Microsoft Excel. At this point I start to think about whether the final dataframe has all the information required. Being conscious of why I am pulling this data is helpful, especially when the runtimes are long.

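As an illustration, here is a minimal pandas sketch of what picturing the final dataframe can look like; the column names are hypothetical:

```python
import pandas as pd

# Hypothetical mock-up of the target: one row per date and category,
# with every column the analysis will need. Filling in a dummy row
# makes it concrete what each field should contain.
target = pd.DataFrame(
    [["2021-01-01", "A", 120, 1500.0]],
    columns=["date", "category", "orders", "revenue"],
)
print(target)
```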

2. Have cold business estimates

During the early stages of a project my discussions with stakeholders revolve around what the business expects the results to look like. I try to guess what range my results are most likely to lie in. I could get these from well-informed colleagues, similar projects or even the internet, depending on the nature of the project. If I have some time on my hands I even indulge in guesstimate puzzles. It’s easy to get too caught up in this. But I think of it as a guiding north star, and it is not worth spending more than 20 minutes on this step.

3. Understand Level of Data (LoD)

For every table that I am going to use, I try to explain in English what every row represents. In other words, what combination of columns uniquely defines a row.

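A quick way to test a guess about a table’s LoD, sketched in pandas on a made-up table:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

# Guess: every row is uniquely defined by (date, category)
candidate_key = ["date", "category"]
assert not sales.duplicated(subset=candidate_key).any(), \
    "LoD guess is wrong: these columns do not uniquely define a row"
```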

4. Join/Merge etiquette

Understanding the mechanism of merging is super important. This involves understanding the LoD of all the tables involved and also having clarity on the LoD of the merged dataframe.

A general warning around merges:

If the tables involved are not being joined on columns that uniquely define them, expect duplicates!!

These duplicates should then be handled appropriately, especially before aggregating.

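Here is a minimal pandas sketch of how a non-unique join key silently inflates row counts (the tables are made up); merge’s validate argument makes the expected LoD explicit and fails fast when it is violated:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 20]})
# user_id does NOT uniquely define a row in 'events'
events = pd.DataFrame({"user_id": [10, 10, 20],
                       "event": ["click", "view", "click"]})

merged = orders.merge(events, on="user_id", how="left")
print(len(merged))  # 3 rows, not 2 -- order 1 is now duplicated

# Declaring the expected LoD up front surfaces the problem immediately
try:
    orders.merge(events, on="user_id", how="left", validate="one_to_one")
except pd.errors.MergeError as err:
    print(err)
```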

Also, a SQL-specific warning around joins:

If table A is left joined with table B, and there are table-B-specific WHERE conditions, then it is equivalent to performing an inner join!

If I must have filters for table B then they can be included as join conditions.

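A self-contained demonstration of both behaviours, run through Python’s sqlite3 on two toy tables (the names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, status TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'active');
""")

# WHERE filter on b: unmatched rows have b.status NULL, so they are
# filtered out -- the left join silently becomes an inner join.
print(con.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.status = 'active'
""").fetchall())  # [(1, 'active')] -- id 2 is gone

# The same filter as a join condition keeps every row of a.
print(con.execute("""
    SELECT a.id, b.status FROM a
    LEFT JOIN b ON a.id = b.id AND b.status = 'active'
""").fetchall())  # [(1, 'active'), (2, None)]
```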

5. Use a subset of data for initial runs

When I start coding / querying, I run the code on a subset of the data, squeezing in as many filters as possible.

An illustration of how I get “creative” with the filters (a code sketch follows the list):

  • If the analysis is about understanding which category is performing better, then I run the code on a single day’s data, including all the categories

  • If the analysis is about how categories are performing over time, then I run the code on a week’s data for a single category

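A sketch of both filters in pandas, on a made-up events table:

```python
import pandas as pd

events = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

# Comparing categories? One day's data, all categories.
trial_by_category = events[events["date"] == "2021-01-01"]

# Trends over time? One category, a short date range.
trial_over_time = events[(events["category"] == "A") &
                         events["date"].between("2021-01-01", "2021-01-07")]
```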

By doing this I am not only able to catch errors without using up a lot of resources, but I also get a sense of the final results.

6. Pause after every intermediate code block

A data pull sometimes involves writing multiple intermediate tables before getting to the result. I try to get comfortable with the numbers before adding more layers to my code. Sometimes I still end up making mistakes, but doing this gives me a starting point to reexamine my code and catch errors.

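The pause itself can be as simple as a few throwaway lines; a sketch for a hypothetical intermediate table:

```python
import pandas as pd

# Hypothetical intermediate table produced by an earlier step
daily_sales = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "revenue": [250.0, 120.0],
})

# Get comfortable with the numbers before building the next layer
print(daily_sales.shape)
print(daily_sales.isna().sum())
print(daily_sales.describe())
```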

7. Reconcile results with different data tables

Once I have finished writing the code, I start thinking about how best I can validate my results. If there’s scope, I write simple code using tables I have not used in my main code to cross-check results.

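A sketch of such a cross-check, assuming there is, say, a finance ledger that the main code never touched (both tables here are made up):

```python
import pandas as pd

# Main result: revenue per day, computed from event-level data
result = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                       "revenue": [250.0, 120.0]})

# Independent source: a ledger not used in the main code
ledger = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                       "booked_revenue": [250.0, 119.0]})

check = result.merge(ledger, on="date", how="outer")
check["diff"] = check["revenue"] - check["booked_revenue"]
print(check[check["diff"].abs() > 1e-6])  # rows that fail to reconcile
```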

8. Reconcile with aggregates

Reconciling with other tables may not always be possible. In such cases I write a different set of code to get high-level aggregates and cross-check results. I always keep the code I use for these checks painfully simple. Cause y’know I can’t be making any mistakes here!

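A painfully simple version of such a check, with hypothetical numbers:

```python
import pandas as pd

# Final dataframe at date x category level (hypothetical)
final = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01"],
    "category": ["A", "B"],
    "revenue": [100.0, 150.0],
})

# One-line aggregate pulled straight from the source, e.g.
# SELECT SUM(revenue) FROM source_table (value assumed here)
source_total = 250.0

assert abs(final["revenue"].sum() - source_total) < 1e-6
```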

9. What if you are not able to reconcile

Even after mindfully pulling the data and making the checks, sometimes I am disheartened when my results don’t match. For exactly 17 seconds. I then determine whether the differences are global (across all columns) or local (just one column or only a few rows). Based on this, I dive deep into the most granular data to find the source of the difference.

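One way to classify a mismatch as global or local, sketched on two toy versions of the same result:

```python
import pandas as pd

mine = pd.DataFrame({"date": ["d1", "d2"],
                     "orders": [10, 20],
                     "revenue": [100.0, 180.0]})
theirs = pd.DataFrame({"date": ["d1", "d2"],
                       "orders": [10, 20],
                       "revenue": [100.0, 200.0]})

# Count, per column, on how many rows the two results disagree
diff = mine.set_index("date") != theirs.set_index("date")
print(diff.sum())  # orders: 0, revenue: 1 -> a local difference
```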

10. Get to know your final dataframe

Once the results match I am almost there. But not quite. I take no (remaining) column for granted. I make sure that my final dataframe has no obvious logical inconsistencies. A simple illustration:

  • If you want to know how long a user spends on a website in a day, the values cannot exceed 24 hours.

Some additional questions to answer (a sketch of these checks follows the list):

  • Unexpected Nulls?

  • Unexpected negative values?

  • Max and min of a column lying in the expected range?

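A sketch of these checks, using the time-spent-on-site example from above:

```python
import pandas as pd

# Hypothetical final dataframe: time spent on site per user per day
final = pd.DataFrame({
    "user_id": [1, 2],
    "date": ["2021-01-01", "2021-01-01"],
    "hours_on_site": [2.5, 7.0],
})

assert final["hours_on_site"].notna().all()   # no unexpected nulls
assert (final["hours_on_site"] >= 0).all()    # no negative values
assert final["hours_on_site"].max() <= 24     # logically possible max
```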

At the end of this exercise I aim to have clarity on why my final dataframe looks the way it does.

Bonus tip:

If your final dataframe is small, dump it into Excel and make pivots to look at it from different perspectives.

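The same idea sketched in pandas: to_excel assumes openpyxl is installed, and pivot_table gives a similar view without leaving Python:

```python
import pandas as pd

final = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "category": ["A", "B", "A"],
    "revenue": [100.0, 150.0, 120.0],
})

final.to_excel("final.xlsx", index=False)  # pivot away in Excel

# ...or pivot directly in pandas
print(final.pivot_table(index="date", columns="category",
                        values="revenue", aggfunc="sum"))
```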

11. Observations and insights

Another important part of writing correct code is to look at my work through a commercial lens. Is there a story? Are all my initial hypotheses being validated? And several other project-specific questions. This step is crucial in evaluating my code and possibly the methodology too.

In the end I might not have all the answers, but I can be confident that if I present my results to my stakeholders, no question can possibly throw me off.

Translated from: https://medium/swlh/mindful-data-wrangling-1029df0a2dd1