How to Break Into Data Science and Tech

Updated: 2024-10-20 11:41:35

Data Science is one of the most fascinating breakout stories in recent history. The profession was first thrown into the spotlight back in 2012 with Harvard Business Review’s landmark article [1] —

Data Scientist: The Sexiest Job of the 21st Century

As inherently humble beings, everyone and their cousin was suddenly a data scientist — myself included, despite being a few years late to the party.

The hype in data science is real — but this doesn’t detract from the fact that being a data scientist comes with a fantastic opportunity.

I am aware of no other profession that gives those who are willing to dig deep and grit their teeth the opportunity to transform their working lives completely.

It is a well paid, intellectually stimulating, and respected profession with a vast number of openings that keep on growing.

Glassdoor.

This article is for aspiring data scientists: those who have committed to mastering a craft and are treading the same road that many self-taught data scientists have walked themselves.

We will cover three main areas of focus that I believe are key for any data scientist.

> Programming — CompSci, Python, and SQL

> Statistics — understanding our data and how to communicate results

> Data Science / Machine Learning — the discipline itself

We will use ‘data scientist’ as a blanket term covering data scientists, machine learning engineers, AI engineers, and many other ‘data specialist’ roles.

Programming

Many of you will likely have already begun this part of the journey. If what you’re doing works, keep rolling with it.

Python is undoubtedly the language of choice in data science. But alongside that, a good grasp of SQL is essential for the vast majority of data science roles.

Learn the essentials of both Python and SQL — with the basics of computer science — and you’re good to go.
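
To give a feel for how the two languages work together, here is a minimal sketch that runs a SQL query from Python using the standard library's sqlite3 module and an in-memory database. The table name, column names, and rows are hypothetical, invented purely for this illustration.

```python
import sqlite3


def average_hours_by_topic(rows):
    """Load (topic, hours) rows into an in-memory SQLite table and
    return the average hours per topic, computed with SQL."""
    conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
    try:
        conn.execute("CREATE TABLE study_log (topic TEXT, hours REAL)")
        conn.executemany("INSERT INTO study_log VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT topic, AVG(hours) FROM study_log GROUP BY topic ORDER BY topic"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical study log, invented for this example.
sample_rows = [
    ("python", 1.5),
    ("python", 2.5),
    ("sql", 1.0),
]

if __name__ == "__main__":
    for topic, avg_hours in average_hours_by_topic(sample_rows):
        print(topic, avg_hours)
```

Python handles loading the data and working with the result, while the grouping and averaging are expressed in SQL. That split is a common pattern, which is part of why the two are worth learning side by side.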

CS50 at Harvard

Simply the best introduction to programming and computer science available. CS50: Introduction to Computer Science is Harvard’s introductory computer science course, the one its undergraduates typically begin with.

The best part about it? It is all available — for free — online.

The full course is offered through edX and can be audited for free, or taken with a certificate for $90.

If you’re not sure about taking the course, try watching the first lecture — it is incredible.

Codecademy

When I first started, Codecademy was my savior — offering both Python and SQL courses. There is even a Data Science learning path that covers all of the essentials.

Screenshot showing ‘Skill Paths’ on Codecademy.

Codecademy is a brilliant place to learn Python/SQL. There is a free version, but the paid version — Codecademy Pro — is needed for most material.

Sentdex

Another fantastic resource is Harrison Kinsley’s (better known as Sentdex) YouTube channel.

An old ‘Introduction to Python’ series of his alongside Codecademy was all I needed to get started. A newer version is available on his channel — Learning to program with Python 3.

If there is just one place you go to learn Python, it should be here!

Projects

What is essential in this first step is curiosity. That insatiable curiosity and drive to make something are worth the world.

To keep the fire burning, simply follow your curiosity and work on a few personal projects in whatever area interests you, even if that diverges a little from the final objective. Some quick examples:

> Time/productivity tracker

> Game development (this looks good)

> Python for finance (from Sentdex)

> Raspberry Pi development (Sentdex again)

If you don’t feel up to developing a project from scratch, YouTube is a brilliant resource for finding video series that can be used as ‘guided’ projects.

Other

GitHub is an online software development platform and, eventually, a portfolio for your work. A paid Codecademy subscription covers GitHub; alternatively, follow a tutorial on YouTube.

Whenever you have an issue with your code — which will be almost always — find the solution on Stack Overflow. If you cannot find an answer to a problem, you can ask!

Summary

CS50 is the only CompSci primer you need. Python and SQL are important (but more Python). The best resources for learning both are —

> Codecademy — interactive UI, teaches Python and SQL

> Sentdex — free Python and data science tutorials on YouTube

> Projects — original ideas, or follow along with a project online

Keep your projects on GitHub, and Stack Overflow will be your new home.

Statistics

Statistics is the engine that drives every aspect of data science. It is incredibly important to have a good grasp of statistics, but you don’t need to be a statistician.

There are two critical parts to this:

  1. Learning how to perform analyses and interpret results
  2. Learning how to communicate results through visualization

Of course, there is much more to statistics than this — but these two points are the basics that data scientists will be required to understand very well.
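
As a small taste of the first point, performing an analysis and interpreting the result, here is a minimal sketch using only Python's standard-library statistics module. The sample values are hypothetical, and the confidence interval uses a normal approximation, which is only a rough guide for a sample this small.

```python
import statistics


def summarize(sample):
    """Return basic descriptive statistics plus an approximate 95%
    confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    std_error = stdev / n ** 0.5
    z = statistics.NormalDist().inv_cdf(0.975)  # roughly 1.96
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(sample),
        "stdev": stdev,
        "ci95": (mean - z * std_error, mean + z * std_error),
    }


# Hypothetical measurements, invented for this example.
measurements = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0]
summary = summarize(measurements)

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, value)
```

For these numbers the mean is 13 and the sample standard deviation is 2, so the interval runs from roughly 11.6 to 14.4. Computing those figures is the easy part. The interpretation is the skill worth practicing: the interval describes uncertainty about the average, not the range of individual values.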

Statistics with R (Coursera)

Offered by Duke University, this is by far the most comprehensive course on statistics I have seen.

Not only does it cover all you need to know about statistics, but on completion, you will receive a Coursera certification for the course — which can be added to LinkedIn or a CV to prove you understand the concepts covered by the course.

The specialization is long, with a recommended learning time of seven months from start to completion. For most, it is easily doable in significantly less time, even with a full-time job — but will still require a significant commitment.

But the course is ‘Statistics with R’ — not ‘Statistics with Python’…

R is the second most popular language among data scientists. However, given Python’s dominance in the field, it is not a language I would recommend an aspiring data scientist focus on.

Despite the title, this course has a minimal focus on R. The majority of the course is simply focused on statistics, with brief interludes to demonstrate the application of statistical methods with R.

It is a statistics course with a flavor of R. Our focus is statistics, but R is still relevant: familiarity with the language adds another tool to our data science toolkit.

A brilliant complementary guide alongside the course — which also focuses on statistics and R (+ Python) — is Practical Statistics for Data Scientists. It is certainly not required for the course, but a nice-to-have.

Visualization Libraries and Software

Typically, we pick up data visualization during the process of learning other things — during the Statistics with R course or while learning Python. But there are a few libraries and software packages to keep a note of:

  • matplotlib: the founding father of data visualization in Python
  • seaborn: built on top of matplotlib, with more attractive default styles
  • plotly: for more advanced, interactive visualizations with Python
  • Tableau and Power BI: the leaders in Business Intelligence data visualization, and both ‘no-code’ alternatives for data visualization

Both Codecademy and Sentdex — as already mentioned — are also great resources for learning the essentials of data visualization with Python.
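
For a sense of what the Python route looks like, here is a minimal sketch of a histogram drawn with matplotlib. It assumes matplotlib is installed, the data is made up, and the helper name plot_histogram is hypothetical rather than part of any library.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt


def plot_histogram(values, bins=5, path=None):
    """Draw a histogram of `values` and optionally save it to `path`.
    Returns the per-bin counts and the figure object."""
    fig, ax = plt.subplots()
    counts, _edges, _patches = ax.hist(values, bins=bins)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    ax.set_title("Distribution of values")
    if path is not None:
        fig.savefig(path)
    return counts, fig


# Hypothetical data, invented for this example.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts, fig = plot_histogram(values)
```

Passing a file name such as "histogram.png" as path writes the chart to disk. In a notebook, the Agg line can be dropped so the figure displays inline.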

Data Science / Machine Learning

This section comes last simply because a solid foundation in programming and statistics is incredibly important, and should be the focus before moving on to more specific data science learning.

Different approaches work for different people — but these are the absolute best resources I have ever used for learning data science.

Start With

(1) The Hundred-Page Machine Learning Book by Andriy Burkov truly beats all others, for data science beginners and more seasoned professionals alike. If there is only one book that you ever use, make it this one.

(2) Machine Learning on Coursera, taught by Andrew Ng, is the best known of the massive open online courses (MOOCs), and with good reason: it manages to simplify incredibly complex ML algorithms into intuitive and easy-to-learn concepts.
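
To make "intuitive and easy-to-learn" a little more concrete: linear regression fitted by gradient descent is a typical first algorithm in introductory machine learning material. The sketch below is an illustrative pure-Python version, not code taken from any course or book, and the toy data, learning rate, and step count are assumptions chosen so that the example converges.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, invented for this example.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)

if __name__ == "__main__":
    print(f"w = {w:.4f}, b = {b:.4f}")
```

The fitted values land very close to the slope of 2 and intercept of 1 used to generate the data. In practice a library would do this work, but writing the loop once by hand makes the idea of "learning" parameters much less mysterious.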

Then Work on These

(3) Kaggle is an online platform for data science competitions. We won’t be using it to compete against teams of cutting-edge ML researchers, but it does provide a friendly environment in which to practice our skills on data science problems that have already been scoped out. Additionally, we can learn from the pros.

(4) Advanced Data Science Specialization with IBM on Coursera is another MOOC — introducing ML on the cloud with Spark, several ML frameworks, and further exploration of the algorithms behind the magic.

(5) Deep Learning Specialization on Coursera, again from Andrew Ng — this covers the foundations of deep learning and is a perfect introduction to the TensorFlow framework in Python.

Finally

Apply what you have learned. Don’t just consume knowledge, but apply and share it.

Think of exciting project ideas — even if they’re useless, go for it! Enjoy the process.

Work on your projects, build something cool, and share it with the world.

Write about the projects and what you’re learning. Speaking about your work and teaching others will massively improve your ability.

Consistency is Key

Learning a new discipline is a long and arduous process. More of a marathon than a sprint — but entirely possible for those willing to put in the work.

Staying consistent throughout the process is essential. As in Aesop’s fable about the tortoise and the hare, if we race ahead like the hare only to become distracted or lazy, we will lose. Instead, we must stay consistent and focused.

If you have any questions or suggestions, get in touch!

Thanks for reading!

Source: https://towardsdatascience/how-to-break-into-data-science-and-tech-24a34a5e6aff

Published: 2023-06-14 08:59:00
Link: https://www.elefans.com/category/jswz/34/1459015.html