ABSTRACT

NLP (Natural Language Processing) is a branch of artificial intelligence geared towards allowing computers to interact with humans through an understanding of human (natural) languages. This study focuses on training an NLP model for a sentiment analysis of Big Tech policy by scraping and analyzing reactions to Big Tech articles linked on Reddit, using PRAW, a Reddit-specific web scraping API. Posts were scraped from the r/politics subreddit, a forum dedicated to the discussion of American politics. I found a somewhat substantial skew towards support for policies intended to inhibit Big Tech power.

MOTIVATION

In the wake of Congress's Big Tech Hearing [1], many social media activists began to release anti-Big Tech posts and graphics, as well as tangentially related blasts against billionaires and their role in wealth inequality. However, every post would host a contentious comments section split between those against sweeping antitrust moves and those supportive of them. This prompted me to wonder what the true sentiment toward Big Tech's market power really is.

METHODS

By web scraping Reddit with the PRAW API, a list of this year's top 100 articles about Big Tech was compiled from the r/politics subreddit. Since these articles all loosely involved policies intended to inhibit FAANG market power, using NLP to analyze the top-level comments on each post could provide an adequate representation of sentiment towards Big Tech. Because the subreddit is dedicated to discussing American politics, each comment could reasonably be inferred to carry a noticeable negative or positive sentiment toward a Big Tech policy.
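
A minimal sketch of how this scraping step might look with PRAW is shown below; the credential placeholders and the "Big Tech" search query are illustrative assumptions, not copied from the Colab.

```python
import praw

# Hypothetical credentials; real values come from registering a Reddit app.
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    user_agent="big-tech-sentiment-study",
)

posts = []
# Assumed query: compile this year's top Big Tech articles from r/politics.
for submission in reddit.subreddit("politics").search(
    "Big Tech", sort="top", time_filter="year", limit=100
):
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    top_level = [(c.body, c.score) for c in submission.comments]  # (text, net upvotes)
    posts.append(
        {"title": submission.title, "score": submission.score, "comments": top_level}
    )
```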

Both the Reddit scraping and the machine learning model were coded in a single file in this Google Colab (the code can be run in that environment). More in-depth annotations are provided alongside the code as well. All machine learning code was written using the TensorFlow library.

The model was trained on the TensorFlow IMDb dataset, an open-source dataset of 50,000 movie reviews split into 25,000 reviews for training and 25,000 for validation. These were automatically randomized upon initialization. This dataset was inferred to be applicable to the Reddit comment data because reviews and political discussion often draw on a similar pool of opinionated words. To validate this assumption, I manually cluster-sampled and rated 20% of the Reddit posts and their comments to compare with the algorithm's predictions later on.
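
One plausible way to load and batch the dataset, assuming the tensorflow_datasets build of imdb_reviews rather than the pre-tokenized Keras version:

```python
import tensorflow_datasets as tfds

# The IMDb reviews dataset ships with a 25,000/25,000 train/test split.
train_ds, val_ds = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,   # yields (text, label) pairs
    shuffle_files=True,
)
train_ds = train_ds.shuffle(25_000).batch(32).prefetch(1)
val_ds = val_ds.batch(32).prefetch(1)
```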

After experimenting with training the neural network under a supervised learning setup and comparing it against the validation data, it was determined that a good model would consist of a sequential model with the layers listed in Fig. 1.

This model would ultimately consist of 1,356,841 trainable parameters (see Fig. 1.). The binary_crossentropy loss function was used because the intention was to categorize the comments as either a negative or a positive reaction to each article. The adam optimizer was used because it works particularly well with NLP models. The metric was left as accuracy.
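
Since the exact layer stack lives in Fig. 1, the sketch below is only an illustrative sequential model consistent with the loss, optimizer, and metric named above; the vocabulary size and layer widths are assumptions, so its parameter count will not match 1,356,841 exactly. It continues from the loading sketch above.

```python
import tensorflow as tf

VOCAB_SIZE = 10_000   # assumed vocabulary size
EMBED_DIM = 128       # assumed embedding width

# Maps raw strings to integer token ids; adapted on the IMDb training text.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
vectorizer.adapt(train_ds.map(lambda text, label: text))

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 0 = negative, 1 = positive
])

model.compile(
    loss="binary_crossentropy",  # two-class sentiment, as described above
    optimizer="adam",
    metrics=["accuracy"],
)
```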

During testing, the model was run for 10 epochs, recording accuracy, val_accuracy, loss, and val_loss at each epoch. The changes were graphed using the matplotlib library:
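
Roughly, the training run and the plots could be produced as follows (a sketch assuming the model and datasets from the earlier snippets):

```python
import matplotlib.pyplot as plt

history = model.fit(train_ds, validation_data=val_ds, epochs=10)

# Plot training vs. validation accuracy and loss across epochs (cf. Fig. 2).
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="accuracy")
ax_acc.plot(history.history["val_accuracy"], label="val_accuracy")
ax_acc.set_xlabel("epoch")
ax_acc.legend()
ax_loss.plot(history.history["loss"], label="loss")
ax_loss.plot(history.history["val_loss"], label="val_loss")
ax_loss.set_xlabel("epoch")
ax_loss.legend()
plt.show()
```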

By comparing the accuracy and loss versus epochs graphs (see Fig. 2.), it was evident that maximizing val_accuracy while minimizing val_loss would require writing a callback that stops training abruptly. Since val_accuracy had roughly plateaued at 0.93 and val_loss had hit its minimum there, training should be halted once the 0.93 accuracy mark was reached. Beyond 0.93, val_loss would increase, indicating a risk of overfitting, with little to no gain in val_accuracy and possibly even a decrease. Thus, the callback was written to stop training once the 0.93 accuracy benchmark was hit. Depending on the training run, the benchmark could be hit anywhere from 4 epochs (see Fig. 2. (a)) to 10 epochs (see Fig. 2. (b)).
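
One way such a callback could be written, sketched here with a custom Keras Callback and the 0.93 threshold applied to val_accuracy (the exact monitored quantity is an assumption):

```python
import tensorflow as tf

class StopAtBenchmark(tf.keras.callbacks.Callback):
    """Halts training once validation accuracy reaches the 0.93 benchmark."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get("val_accuracy", 0.0) >= 0.93:
            print(f"\nReached 0.93 val_accuracy at epoch {epoch + 1}; stopping.")
            self.model.stop_training = True

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[StopAtBenchmark()],
)
```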

DATA

After running each Reddit comment through the trained model, each comment was assigned a predicted rating. They can be seen in the Colab code. The machine learning algorithm returned a sentiment rating on a scale of 0 to 1, with 0 being a fully negative sentiment, 0.5 neutral, and 1 a fully positive sentiment. The floats for the individual Reddit comment ratings are somewhat difficult to pick out of the raw output; however, I consolidated a list of overall post ratings, derived from the weighted average of the comment ratings within each post, that could be printed out with relative readability (see Table 1.).
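
Continuing from the scraping sketch, scoring the comments could look roughly like this; since the sketched model begins with a TextVectorization layer, raw strings can be passed straight in:

```python
import numpy as np

# Flatten the scraped comments (from the hypothetical `posts` list above).
comment_texts = [body for post in posts for body, _votes in post["comments"]]

# Each prediction is a float between 0 (negative) and 1 (positive).
ratings = model.predict(np.array(comment_texts)).flatten()
```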

DATA ANALYSIS

After each comment was individually rated, the comment ratings within a post were weighted-averaged to create a post rating. The weight was the scraped net upvotes of each comment, a useful feature on Reddit where users either downvote (-1) or upvote (+1) a comment. There is no floor on net upvotes, so a comment's vote total can go negative. This weighting punishes less-echoed sentiments and lets the more-respected comments benefit.
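
A hypothetical helper for this step; the guard against a zero vote total is an addition for safety rather than something taken from the Colab:

```python
def post_rating(comment_ratings, comment_votes):
    """Average of comment sentiment ratings, weighted by each comment's net upvotes."""
    total_votes = sum(comment_votes)
    if total_votes == 0:
        return None  # no usable votes to weight by
    return sum(r * v for r, v in zip(comment_ratings, comment_votes)) / total_votes
```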

After adding each of the post ratings into a post_rating list, the list was iterated through again to create another weighted average, this time based on the votes on each post. Since Reddit upvotes on news posts often correlate with article exposure, this may help adjust for the differences in sample size between more popular and less popular posts.

As the post_rating list was passed through the iterations, a few data points were deleted: those that contained only a 0 rather than a nested list (see Table 1.). The nested list served as an identifier for posts that could be rated, while the zeros identified posts that had no comments to scrape and would therefore be left out of the final calculations.

The negative post ratings also had to be filtered out (see Fig. 3. (a)). These occurred when a post's comments were controversial enough to collect the downvotes needed for a negative net vote score, which, through the multiplication in the weighting process, turned the post sentiment into a negative value. Since the negative values could not be accurately adjusted to a specific sentiment intensity supporting the opposite position, these three posts were removed from the pool.

Although the algorithm output a weighted sentiment rating per post, the value's meaning was still unclear: it only revealed the predicted sentiment towards the article, not towards Big Tech itself. I then went back through each of the 100 articles and manually skimmed the headlines to confirm which side of the issue each article took. Since each article generally presented anti-Big Tech policies, the returned values were subtracted from 1. The subtraction swapped the sides, so that support for anti-Big Tech policies is represented on the negative (0) end of the scale and opposition to them on the positive (1) end (see Fig. 3. (b)).
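
Pulling the last few steps together, the final aggregation might look like the sketch below; rated_posts is a hypothetical list pairing each post rating with that post's net upvotes, and the filtering and flipping follow the description above.

```python
# Hypothetical consolidation of the filtering, flipping, and post-level weighting.
# `rated_posts` pairs each post's weighted comment rating with the post's net upvotes.
usable = [
    (rating, votes)
    for rating, votes in rated_posts
    if rating is not None and rating >= 0   # drop comment-less and negative-rated posts
]

# Subtract from 1 so 0 = supports anti-Big Tech policy, 1 = opposes it,
# then weight each post by its upvotes as a proxy for exposure.
overall = sum((1 - r) * v for r, v in usable) / sum(v for _, v in usable)
print(round(overall, 4))  # the study reports approximately 0.4032
```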

After this data manipulation, the algorithm arrived at a sentiment score of approximately 0.4032, a substantially but not overwhelmingly negative sentiment towards Big Tech market power.

CONCLUSION

From this study, it can be concluded that there is currently a substantial anti-Big Tech sentiment. However, if there were a way to effectively convert the aggressively downvoted data points into a scaled opposite sentiment, the study could more accurately reflect the true Big Tech sentiment in America.

A limitation to note is Reddit's young, left-leaning political skew. In 2016, Barthel [2] found that 47% of Reddit users identify as liberal, compared with an estimated 24% of US adults. Scraping other forums to compare the sentiment may be an interesting next step.

As a sanity check for the model, I manually cluster-sampled 20% of the posts before running the algorithm, picking every fifth post to score, and averaged the sentiment score out to approximately 0.3031. Since this is reasonably within range of the model's 0.4032 prediction, the model was concluded to have a reasonably accurate, albeit still imperfect, fit.

Provided that the necessary logistics are not an issue, a further step in the study could entail either compiling a separate, subreddit-specific labeled dataset to improve the model or developing a model trained in an unsupervised learning setting.

REFERENCES

[1]: Kang, Cecilia, and David McCabe. “Lawmakers, United in Their Ire, Lash Out at Big Tech’s Leaders.” The New York Times, 29 July 2020, www.nytimes.com/2020/07/29/technology/big-tech-hearing-apple-amazon-facebook-google.html.

[2]: Barthel, Michael. “How the 2016 Presidential Campaign Is Being Discussed on Reddit.” Pew Research Center, 26 May 2016, www.pewresearch.org/fact-tank/2016/05/26/how-the-2016-presidential-campaign-is-being-discussed-on-reddit/.

ACKNOWLEDGEMENTS

I would like to thank Kevin Trickey for his help refining the study design, Fang Wang for her aid in the model validation process and data science expertise, and Kevin Chen for his research insight.
