
One of my earliest exposures to data science research, or really any research at all, when I was growing up was SETI@Home, which when I was 11 promised a tantalizing possibility: what if you could be the person to (have your computer) discover a signal from alien life? SETI@Home, many people’s first exposure to the kind of distributed processing that today powers, among other things, cryptocurrency mining, is a pattern recognition project, and when it launched in 1999 and I enthusiastically installed it on my 233 MHz Windows 98 PC, it was my first exposure to the kind of pattern searching in data that I now do as a data researcher. Spoiler alert: my HP Pavilion never found an alien signal. Disappointing, but not remotely surprising. (This is an article about data science, but it’s also about alien first contact narratives — they’re not as unrelated as you think, and I ask that you bear with my layers of metaphors here. The Python code will come eventually!)

For anyone who ever wanted to meet aliens, Fermi’s paradox is a grim problem: shouldn’t we have, by now? Humans have been searching the stars with SETI for decades and have turned up no credible evidence of alien life. As suggested in Sky & Telescope by SETI astronomer Seth Shostak back in 2006, one reason SETI’s quest for extraterrestrial intelligence may seem, and may ultimately prove, futile is the progress of technology. That is to say: if a hypothetical alien intelligence has a history that mimics ours, only a very short window of its lifespan will be detectable by radio telescope scans of the kind traditionally conducted by SETI.

For millions of years before the radio, human beings had no appreciable radio signal impact upon the universe. At the time SETI was conceived, we had a lot — leading to the famous scene in the 1997 Jodie Foster SETI drama Contact where one of our collective history’s least admirable members became the representative of all humanity by being (allegedly, in the story) featured in one of the earliest high power radio signals. What the ultimately optimistic Contact counts on, and highlights in the movie’s opening credits, which feature the light-speed stellar journey of a signal sent back from the Vega system in response to the Führer, is that we would continue transmitting — so even if Adolf Hitler’s opening of the 1936 Olympics were the first transmission an alien civilization ever saw of us, it wouldn’t be the last, because they would begin to receive more and more information in the form of high powered television and radio signals, possibly allowing them to understand us.

But according to Shostak, the era in which our civilization is visible in this way may actually end within a few decades of when Contact is supposedly set (1997) — and as Shostak’s article came out in 2006, we may already be past it. We publish reams and reams of data — as I know well, being a scholarly researcher and data scientist — some of which we would probably like aliens to see, and a lot of which we very much would not. But the chance that aliens will see our civilization today, in its greatness (the music of Evanescence), its horribleness (the speeches of so many present-day authoritarian leaders), or even its mediocrity (Avengers: Endgame), is minuscule, because we move all that data over fiber optic cables, or at most over terrestrial-directed satellites whose data links don’t bounce off into space.

I am not an alien, but I am a data researcher working on creating a portfolio of web-data-based projects with actionable findings to answer questions in which I am interested — questions which may relate to that great, mediocre, or horrible mass media of our time, or to the subject of my doctoral dissertation, mass online harassment related to “nerd” culture. And in that sense, today’s web is increasingly hostile and, yes, alien — not necessarily or only in terms of its denizens and what they put there, but in terms of its architecture and structure. This is a complex issue with positive and negative aspects, and I want to dig into these problems a bit, using the concept of alien contact as a metaphor, while discussing some of the challenges I’ve faced in data science research, how I overcame them when I was a university-based academic researcher, and how I’m overcoming them as an independent researcher now. In essence, however, I am concerned that in terms of the distribution of information on the web, we are approaching the end of the “broadcast era” of easily available research data, moving to a number of new communication standards that carry great advantages for performance, privacy, and other concerns, but which stand to make a huge amount of human knowledge that probably should be public effectively invisible to researchers — just as our civilization may now be to any alien SETI programs that are out there.

Web X.0 and Data Harvesting

When SETI@Home was getting started and I was an enthusiastic preteen messing around with whatever programming tools I could find on my aforementioned 233 MHz Intel Celeron desktop with a CRT (capable of up to 1024x768 display resolution!), I, of course, had a GeoCities. And an Angelfire, and a FortuneCity, and at least half a dozen other sites on free web hosting providers long since lost to the churn of history (but probably archived somewhere on the Internet Archive, which I’ll talk about more extensively in a little while). HTML 3.0 and then 4.0 were the languages in which I created my first public facing web content, even though I had used BASIC for Mac and Microsoft QBASIC to write rudimentary games before that, as seems to be the programming “origin story” for most of my coder generation. Most of these free website tools offered a WYSIWYG (What You See Is What You Get) design interface, but it was limited by the lack of what we would now call responsive features in browsers, so those of us who were enthusiastic about our Star Wars fan pages learned HTML. This was Web 1.0, and at its most “mature,” Web 1.0 offered a quite organized, dignified, and structured insight into content that once would have been hosted on non-web-based mainframes and servers.

Academic and government institutions shared scientific findings — one of my favorites at the time was of course NASA’s Jet Propulsion Laboratory, which shared, and under newer web paradigms continues to share, images of deep space collected by Hubble and other observatories. Accessing JPL to get a new space-y desktop background in those days was pretty simple: you’d navigate to the website, probably from a bookmark (we still have those, but does anyone really use them? Be honest…), and then select the kind of image you were looking for (pictures of planets in our solar system, extrasolar imaging, etc.), and eventually you’d either get to a page that embedded the image you were looking for with a standard <img> tag, or you’d get a description with a link to an FTP directory that you could access in your browser (technically leaving the “web” altogether) to download various resolutions and formats of the picture.

Although the HTML4 standard was technically released in 1997, just a year after I started using the Internet and about a year before I learned to write in HTML, most amateur web designers and many professional web designers would continue for years — some to this day, for nostalgia or ideological reasons — to more or less use HTML to describe visually what the user was going to see, within the heavy set of constraints that Netscape Navigator and Internet Explorer offered at the time. While undoubtedly many government and academic sites adopted the growing standard of using <div> tags to make readable markup code, my review of the Internet Archive’s copy of JPL’s front page from September of 2000 finds that very little useful information is contained in what we now call the DOM (Document Object Model) of the page. The front page has a blurb about an article related to the discovery of a new “dog bone” asteroid, and if this were the modern JPL website, <div> tags would set apart the title, text, link, author, and other content and meta-content of this article in such a way that I could write an automated script to, say, check JPL daily and save new articles to a local database. In 2000, all I would have been able to do was save the page — it’s unlikely that even studying the exact structure of the page would have allowed me to make a reliable “scraper” that persisted across multiple iterations of the page itself.
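
To make the contrast concrete, here is a minimal sketch of what that kind of daily check could look like against a modern, div-structured news page. The URL and class names below are hypothetical placeholders rather than the actual markup of today’s JPL site; the point is only that once content and meta-content live in labeled elements, a few lines of Python can pull them out reliably.

# hypothetical sketch: pull article titles and links from a div-structured news page
# (the URL and class names are placeholders, not JPL's real markup)
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.example.gov/news')  # stand-in for a modern news index
soup = BeautifulSoup(response.text, 'html.parser')

# once each article sits in its own labeled container, extraction is trivial
for article in soup.find_all('div', class_='news-item'):
    title = article.find('h2').get_text(strip=True)
    link = article.find('a').get('href')
    print(title, link)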

Ah, Web 1.0 design. So crisp and clean, and yet so inscrutable to algorithms!

(I know “scraping” is something people have ethical concerns about — I do too, and I’ll discuss them subsequently, but in my opinion it’s essential to the practice of many kinds of data research, and JPL, as a government website meant to disseminate information, is a great example of a page about which there should be minimal ethical concerns around scraping.) The point I’m making here is that even websites that were designed to disseminate data “back in the day” were still mostly accessible only to manual traversers of the web, which wouldn’t have been great for someone back then trying to do the sort of research I do now. One of the things that incentivized web developers to adopt more computer-comprehensible HTML structure was search engine optimization (SEO): the same technology that boosted your page rank by making it friendly to Google’s web crawlers also made it useful for more niche automated web programs, like those used by data scientists to scrape pages.

If “Web 1.0,” insofar as that’s a thing we can delineate, was the web before its “broadcast era” in which researchers could find and collate data, like alien intelligences watching transmissions from a vast distance, “Web 2.0” was/is that era itself, for better and for worse. Because of my research interests, I’m focused on the better, but I intend to acknowledge the worse too. As dynamically generated server-side webpages became more ubiquitous, it became easier to predict where exactly in the webpage’s code — the DOM — particular information would fall. The proliferation of content management systems (CMSs) means that you can look at a blog written in WordPress and find tell-tale markup that code can identify algorithmically, reflecting all of the blog post’s content and metadata. By the 2010s, managing entire sites and even businesses in WordPress became a fairly normal and popular thing, and this was just one example. Because computers were generating the structure of the page, it was very easy to reverse engineer, and it was this environment that beloved scraping tools like Beautiful Soup and Scrapy were developed for. RESTful APIs, which existed in a sort of undefined raw form in Web 1.0 in the sense that you could often reverse engineer CGI queries that used URLs to communicate with the server, were formalized in the ’00s and meant that data queries sent to a server by a human couldn’t be easily distinguished from queries sent by a bot (or an alien!).
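
To see what that formalization buys a researcher (or a bot, or an alien), here is a hedged sketch of querying a RESTful endpoint from Python. The endpoint and parameter names are invented for illustration; the point is that the query is just a URL plus parameters, and the request a script sends is the same kind of request a browser sends.

# hedged sketch: querying a RESTful endpoint the same way a browser-driven client would
# (the endpoint and parameter names are invented for illustration)
import requests

params = {'q': 'dog bone asteroid', 'page': 1, 'per_page': 20}
response = requests.get('https://api.example.com/v1/articles', params=params)

# a RESTful API answers a bot exactly as it would answer a human
for item in response.json().get('results', []):
    print(item.get('title'))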

This era is not gone, and it probably never will entirely be gone. But something I found rather confusing as I was doing research for this article (after hitting a wall with my data collection) was the number of articles about the idea of Web 3.0 allegedly being a “more open” web. As a data scientist who needs data I simply cannot collect manually, I cannot agree with this, and it’s not the direction I see things going. Right now, there are pages and sites on the Internet from which I can get useful information, and pages and sites from which I cannot. An example of a website that, for the moment, still works according to the machine-generated, machine-readable paradigm is the mass media review site Metacritic, which covers movies and television but largely deals with video games. Right now I am working on a number of projects in which I try to answer research questions about both public opinion on and the financial success of media properties that include minority characters (especially LGBT people and women) in leading roles, and accessing the raw opinion of the online public is an important factor in this. Because Metacritic is the dominant location in gaming both for collating formal critic reviews and for users to post their own opinions (often in famously toxic ways), I have access to this kind of raw opinion when it comes to games — but not for movies, where Rotten Tomatoes, which I’ll discuss in a moment, fills much the same role. (People post reviews of movies to Metacritic, and its indexing of professional reviews is just as good as Rotten Tomatoes’, but it doesn’t reflect the raw public sentiment about movies the way it does about games.) To gain access to critics’ reviews of The Last of Us Part II, a controversial game that produced a vast diversity of opinions, I wrote a Python script to obtain the URLs of all those reviews, whose text and metadata I was later able to obtain very simply with the Newspaper3k library, which specifically targets news and news-like sites to find the contents and authors/titles of articles.

# request data from Metacritic's site *via* scraping
# created using assistance and code from Towards Data Science

# networking libraries
import requests
from bs4 import BeautifulSoup

# database libraries
import pandas as pd

# set the URL to be extracted for this script
url = 'https://www.metacritic.com/game/playstation-4/the-last-of-us-part-ii/critic-reviews'

# identify self as a standard browser
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)  # get the page into memory

soup = BeautifulSoup(response.text, 'html.parser')  # standard Soup parsing

# each critic review on this page links out to the full text through an anchor
# tag with the 'external' class, so collect those hrefs
for review in soup.find_all('a', class_='external'):
    print(review.get('href'))
    # the user-review variant pulls the review text itself instead, e.g.:
    # review_dict['review'].append(review.find('span', class_='blurb blurb_expanded').text)
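
Once that loop has collected the outbound URLs, the Newspaper3k step I mentioned above is about this simple. This is a hedged sketch rather than my exact production script: Article.download() and Article.parse() are the real Newspaper3k calls, but the URL list here is a placeholder for the hrefs printed by the loop.

# hedged sketch: fetch the text and metadata of each collected critic review with Newspaper3k
from newspaper import Article

critic_review_urls = ['https://example.com/review-1']  # placeholder for the hrefs collected above

for review_url in critic_review_urls:
    article = Article(review_url)
    article.download()  # retrieve the page
    article.parse()     # extract title, authors, and body text
    print(article.title, article.authors)
    # article.text now holds the review body for storage or analysis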

This is a really simple solution, and by adding some database code and changing the specific elements I was looking for, I was able to extract over 25,000 (very angry!) user reviews of the same title from the user reviews section and place them in a local PostgreSQL database. (All of this was done in keeping with the site’s robots.txt, for folks concerned with ethics.)
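
For the curious, a minimal sketch of what that “database code” variant might look like is below. The user-reviews URL, the blurb selector (taken from the commented-out line in the script above), and the table name are all assumptions, and the pagination of Metacritic’s user-review pages is omitted for brevity.

# hedged sketch: store user-review text in a local PostgreSQL table
# (URL, selector, table name, and credentials are assumptions/placeholders)
import psycopg2
import requests
from bs4 import BeautifulSoup

user_url = 'https://www.metacritic.com/game/playstation-4/the-last-of-us-part-ii/user-reviews'
response = requests.get(user_url, headers={'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

conn = psycopg2.connect(dbname='reviews', user='postgres')  # local database, placeholder credentials
cur = conn.cursor()

# only the review text is kept, per the selector hinted at in the script above
for blurb in soup.find_all('span', class_='blurb blurb_expanded'):
    cur.execute('INSERT INTO user_reviews (review_text) VALUES (%s)', (blurb.get_text(strip=True),))

conn.commit()
conn.close()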

You can’t do this for RottenTomatoes. You can’t do it reliably with gaming review site Polygon either. RottenTomatoes is especially egregious with its user reviews, and its “Web 3.0” method of doing things is why. (Polygon is crawlable to an extent; it just doesn’t support deep dives into the archives, even for humans, which is sort of a separate issue.) RottenTomatoes loads 8–12 user reviews of a given film at a time, and you can only access the next set by clicking, at which point a JavaScript-driven API request is sent to the server (I’m working on reverse-engineering this right now to figure out exactly what kind of request it is) and you have to keep clicking, manually. The testing library Selenium can simulate this sort of thing by trying to fool the server into thinking it’s a human (this is the sort of thing that Captchas are designed to prevent, of course, but as far as I know RottenTomatoes doesn’t use those, nor does it actually forbid scraping — it just makes it really hard), but Selenium is not really for scraping. The point of the library is automated testing of JavaScript sites, and its ability to help with scraping is sort of a bonus. It’s not reliable, it’s not what it’s designed for, and changes to the site that break scraping scripts are going to be much harder to fix reliably when Selenium has to be put in the mix. The data isn’t inaccessible, but it’s exponentially harder to get to — and this is becoming a trend.
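
For completeness, the Selenium approach looks roughly like the sketch below. The URL pattern and the CSS selectors are guesses rather than RottenTomatoes’ actual markup, which is exactly the problem: every one of these assumptions is a thing that can silently break.

# hedged sketch: use Selenium to keep clicking a "load more" control and harvest what appears
# (the URL pattern and CSS selectors are guesses, not RottenTomatoes' real markup)
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a local browser driver setup
driver.get('https://www.rottentomatoes.com/m/atomic_blonde/reviews')  # assumed URL pattern

for _ in range(10):  # click "load more" a fixed number of times
    try:
        driver.find_element(By.CSS_SELECTOR, 'button.load-more').click()
        time.sleep(2)  # let the JavaScript-driven request complete
    except Exception:
        break  # the button disappeared, or never existed under this selector

reviews = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.review-text')]
driver.quit()
print(len(reviews))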

I’m also studying web development, and something I read in my JavaScript textbook made me feel very depressed and inspired this post: it was simply a basic description of the origin and purpose of GraphQL, which is designed in part to supplant RESTful APIs. GraphQL is a way of sending queries directly to the server’s backend, developed by Facebook to allow for the very complex kinds of queries the Facebook client has to make, many of which would be difficult to execute with a RESTful process. (The “Graph” in the name may seem confusing until you realize it refers to Facebook’s Social Graph.) Medium, incidentally, uses GraphQL.
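
To illustrate the shift from the client’s side, here is a hedged sketch of a GraphQL request sent from Python. The endpoint and the schema (the field names in the query) are invented; the point is that the query describes the shape of the data it wants, and nothing about it can be guessed from a URL the way a RESTful route often can.

# hedged sketch: a GraphQL query POSTed from Python (endpoint and schema are invented)
import requests

query = """
query RecentReviews($limit: Int!) {
  reviews(limit: $limit) {
    score
    body
    author { displayName }
  }
}
"""

response = requests.post(
    'https://api.example.com/graphql',
    json={'query': query, 'variables': {'limit': 20}},
)
print(response.json())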

The thing that has me broken up about GraphQL is that it takes us back in time, almost. Back when I had my 233 MHz Celeron, there was the web and there were applications, or, you know, “programs,” as we used to call them. These days video games and specialized software are close to the only “programs” we run locally — even things like the Atom Text Editor are essentially JavaScript webpages running inside a light virtualization layer. The weird thing is that running programs locally didn’t use to make them necessarily understandable to the user — they were usually compiled code in a low-level language like C, so unless you could read binary Matrix-style or were one of those super-haxxorz who could figure out which memory locations were being referenced, like people used to do to cheat at games, you pretty much knew only what the program’s UI told you. The web was somewhat unique in that you could always “view source” and could often decipher API requests being sent from traditional types of web pages and even some “web applications.”

Today, I believe the dominance of the web application is taking us back to that era when data is obscured and we only see what developers want us to see. This is quite justifiable in a lot of cases — obviously, if your doctor’s office or your university maintains a site that contains information about your health or your grades, you probably don’t want that to be easily reverse engineered. Infosec is good! But as design principles for public-facing websites — like RottenTomatoes — become entangled with “responsive” web design that depends heavily on queries that are difficult or even impossible to emulate, massive amounts of public information, such as the sentiments people have about a movie, vanish from a research point of view. We aliens, we researchers, can’t access it — and honestly, how useful is it to the average viewer either? Twelve random user reviews of a movie are unlikely to reliably tell you whether you want to see it, not without some way to sort and filter them.

The Researcher’s First Contact: Ethics and Destructive Probes versus the Quest for Knowledge

I mentioned I was going to talk about the ethics of scraping. As I’ve noted, the two websites whose scraping I refer to above, RottenTomatoes and Metacritic, don’t forbid scraping, and Metacritic had no problem with me downloading 25,000 reviews of a game — twice, even: once to raw text files and once into an SQL table. It’s clear RottenTomatoes is scraping-hostile, but they don’t say you can’t do it. And this brings me to the question of what should be scraped, and to two other first contact stories about actual aliens, both of which take a much darker look at the prospect of encountering sentient life. When we consider the potential and existing harms of web scraping, but also the consequences of silencing the exchange of web data for research, I think these provide worthwhile food for thought.

In Peter Watts’ novel Blindsight (hosted in a very Web 1.0 HTML format), the “broadcast era” explanation for the failure of SETI intersects with another popular explanation: the deadly or destructive probes scenario, probably known most famously today as the inspiration for the Reapers in BioWare’s Mass Effect video game series, which were ultimately inspired by Fred Saberhagen’s Berserker cycle. In Blindsight, an alien intelligence that the narrator refers to as Rorschach, a sentient starship, arrives in our solar system and begins plotting our doom, because it views sentient, conscious life as a threat. The story takes place in our future, and Siri, the narrator, infers that Rorschach began traveling to destroy us as a result of our broadcast era — it detects signals indicating sentience and consciousness, and it seeks them out to destroy them. Blindsight, a sort of science fiction horror novel and an inversion of many of the positive first-contact tropes found in Contact or Interstellar, posits that making information visible and intelligible can be an existential threat.

Saberhagen’s Berserkers, like Rorschach, are a terrifying idea of an artificial intelligence that does NOT want to be “like us.”

And that premise is absolutely correct, if we look at how human communication has been reshaped since the founding of Facebook in 2004. (Not to lay all of this at Facebook’s feet, to be clear; I suspect this trend would have occurred regardless, and Facebook isn’t even currently the driving force behind many of the dangerous elements of open communication.) Although when I was a kid there was a lot of strange paranoia, paranoia that, to be honest, wasn’t really founded in anything real, about how you shouldn’t share your real name online because you might be kidnapped by people-kidnappers, reality has since caught up to the paranoid fantasy. Social media has exposed so much of our lives, and even people who think they are careful, think they don’t reveal themselves, slip up.

We know marketing companies have tons of data on us (and I’m not against this, to be clear! Marketers want to sell you things or persuade you — not harm you!) Unfortunately, sites like the Internet Archive — which I use, and whose archive of JPL I link in this article itself — also make it essentially impossible to erase a digital footprint. The Internet Archive relies on web scraping, incidentally, and it’s highly likely that its crawler is thwarted by the same kinds of responsive design that thwart my research crawlers — which means that if you’re building a personal website with data you don’t want to persist forever, maybe you should be building with complex React components and GraphQL and all the rest! But of course, we don’t do personal websites anymore, except as essentially digital business cards that we really have zero problem with being archived forever. All the stuff we might wish would vanish, the stuff that could be used to destroy us the way Rorschach wants to destroy humanity, lives in social media — in things like the Internet Archive’s copies of deleted tweets, for instance. A true example of a double-edged sword, archiving tweets allows us to hold political leaders accountable, but it also, as I document extensively in my doctoral dissertation, foments harassment against marginalized groups and people. If someone’s personal information or secrets are exposed on the social web, erasing that data from amoral archives that, like the Rorschach spacecraft, function without true awareness or consciousness of what they are doing, is nigh-impossible.

So I understand why web crawling, “web bots,” and so on have the reputation they do — of alien invaders, here to destroy. Because sometimes they are. And I am not an “information wants to be free” person — if anything, I frequently feel we could take a step backwards, back toward pseudonymity, back toward privacy, even as I recognize that that’s likely impossible. But research is research, and the twin cases of Metacritic and RottenTomatoes provide an example of information, and the answers to questions, being rendered utterly inaccessible — “annihilated,” if you will — by modern web design paradigms. One research question I want to answer as part of my data science research portfolio is how audience reactions to the video game The Last of Us Part II compare with audience reactions to the film Atomic Blonde. This is both because I’m a fan of both of these and because they are similarly rare entities — media about a lesbian or bisexual woman who engages in a lot of fighting and combat, something typically reserved for heterosexual men in media. I want to compare how people react to them. But I can’t, because only for The Last of Us Part II can I get user reviews from a meaningful source. I need that RottenTomatoes data, and I’ve got hard work ahead to access it, if I’m even able to at all. No one’s privacy is being protected by this — people post these reviews publicly, they do show up in public search results under whatever user name they entered, and in any case I have no need to identify specific users, so in the case of my Metacritic user review scraping script, I simply scraped only the review text, because I’m interested in natural language processing and sentiment analysis, and I only need the text.
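
Since the review text is all I keep, the downstream analysis can start as simply as this hedged sketch, which runs NLTK’s VADER sentiment scorer over scraped reviews (the list here is a placeholder for rows pulled back out of the PostgreSQL table).

# hedged sketch: score scraped review text with NLTK's VADER sentiment analyzer
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

reviews = ['Placeholder review text pulled from the database.']  # stand-in for the PostgreSQL rows

for text in reviews:
    scores = sia.polarity_scores(text)  # compound score runs from -1 (negative) to 1 (positive)
    print(scores['compound'], text[:60])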

In the 2018 first-contact thriller Annihilation, an alien intelligence finds itself in the American South seemingly by accident, and Natalie Portman’s character is sent in to investigate after her husband, played by Oscar Isaac, disappears on an earlier expedition. Like Blindsight, Annihilation finds horror in first contact with alien life, rather than the wonder of Contact.

The characters in Annihilation have good reason to fear the unknown — and yet that unknown simply wants to gain knowledge about them.

The meteor that impacted a Gulf Coast lighthouse is apparently an existential threat to life on Earth, growing an increasingly wide zone where its DNA blends with the DNA of terrestrial life. It’s a menace. It’s also trying to understand — which is why it sends back what appears to be a duplicated version of Portman’s husband, and the ending implies that Portman herself has been replaced with a clone of some kind, despite her successful destruction of the lighthouse in the film’s climax. What I see as the real tragedy in Annihilation is not that characters die, but that the hostility between the humans and the life/intelligence carried by the meteor might well result from the meteor’s inability to research humans and their culture. “Broadcast era” aside, it crashed into Earth (apparently) by mistake; it’s just here, and it probably wouldn’t have been listening for transmissions until it got here in any case. And yet it’s here now, and the only way it can understand is through apparent hostility. Thus, it creates simulacra of humans in order to enter our culture and understand.

Web scraping puts us somewhere between the aliens in Contact and a less homicidal version of the “aliens” from Annihilation. We are forced to use deception — spoofed browser user agents, headless browsers driven by tools such as Selenium, and so on — to gain access to a vast volume of data. Without this data, we will not understand, and we will make mistakes like the beings in Contact, who sent back images of Adolf Hitler without understanding that humans would find this quite threatening. They did not understand the context. Sites that reflect the common ideas and desires and emotions of humans need to be accessible, and certainly shouldn’t be made inaccessible to scraping simply because of design trends (rather than an intentional decision to shield data that should be shielded). The protection of privacy from Rorschach-like threats must be distinguished from the hiding of data from legitimate data scientists. The “annihilation” of public information from analytical use is a tide I fear deeply as a researcher.

Translated from: https://medium.com/out-of-the-midwest-with-software-data/the-great-silence-of-2020-web-design-the-annihilation-of-algorithmic-comprehensibility-6d12ad9925c4
