Thematic Analysis of The Great Gatsby with Topic Modeling

Updated: 2024-10-09 01:22:45

I’ve always been interested in both data analysis and literary criticism. They might seem like two vastly different fields of study, but to me, thinking critically about analytics and about classic novels are quite similar activities: both help me gain new insights into social, cultural, and political issues. Given these interests, I had the idea of applying natural language processing techniques to literary and philosophical texts, so I decided to give it a go.

Introduction

In this article, I will share my experience of applying NLP to unveil the central themes of one of the greatest American classics, The Great Gatsby. (For those who prefer live presentations to essays, I’ve also made a video that explains my methodology in detail. Check out the link above.)

The Great Gatsby film adaptation (2013)

Superficially, The Great Gatsby seems like a typical romance novel, since the plot mainly revolves around the millionaire Jay Gatsby’s quest to win back the heart of his long-lost love, Daisy Buchanan. However, viewed in the historical context of the hedonistic Jazz Age of the 1920s, it is evident that the novel’s purpose is to criticize the decay of the American Dream in an era of material excess. I was curious whether data science, rather than subjective reading, could help clarify the main idea of the book. My hypothesis was that if I fed the text into a topic modeling algorithm, I could automatically extract its literary themes.

Text Preprocessing

First, the original text of The Great Gatsby needs cleaning. The book’s .txt file can be downloaded from Project Gutenberg, which hosts digital texts of many literary masterpieces. Topic modeling usually requires more than one document, so I split the novel into paragraphs, kept only those longer than 500 characters, and treated each one as a document.

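from textblob import TextBlob
from textblob import Word
from nltk.corpus import stopwords

# Read the Project Gutenberg text and split it into paragraphs on blank lines.
with open("D:/nemo/dataset/literature_dataset/the_great_gatsby.txt", "r", encoding='UTF8') as primary_file:
    primary = primary_file.read().replace('"', ' ').replace('’', '\'').split('\n\n')

# Keep only paragraphs longer than 500 characters, lowercased and joined onto one line.
primary = [x.lower().replace('\n', ' ') for x in primary if len(x) > 500]
print(primary[0])

primary_cleaned = []
stop_words = set(stopwords.words('english'))

for sent in primary:
    # Keep only the words that appear in noun phrases.
    tokens = ' '.join(TextBlob(sent).noun_phrases).split()
    cleaned_sent = []
    for w in tokens:
        w = Word(w).lemmatize()
        # Drop stopwords and words of four characters or fewer.
        if w not in stop_words and len(w) > 4:
            cleaned_sent.append(w)
    # Skip paragraphs that end up empty after cleaning.
    if len(cleaned_sent) != 0:
        primary_cleaned.append(' '.join(cleaned_sent))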

Then, using the NLTK and TextBlob packages, I kept only the words that occur in noun phrases, removed all stopwords (words that occur frequently but carry little contextual meaning), lemmatized each token (converted it to its base form, for example by changing a plural into a singular), and dropped tokens of four characters or fewer.
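To make the cleaning steps concrete, here is a minimal, dependency-free sketch. The tiny stopword set and the `naive_lemma` helper are hypothetical stand-ins for NLTK's English stopword list and TextBlob's lemmatizer, and the noun-phrase filter is left out, so the output only approximates what the real pipeline would produce.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
TOY_STOPWORDS = {"the", "a", "of", "and", "in", "his", "were", "at"}


def naive_lemma(word):
    """Very rough singularization; a stand-in for TextBlob's Word.lemmatize()."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def clean(paragraph, min_len=5):
    """Lowercase, tokenize, lemmatize, then drop stopwords and short words."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    lemmas = [naive_lemma(t) for t in tokens]
    return [w for w in lemmas if w not in TOY_STOPWORDS and len(w) >= min_len]


sample = "The parties of the millionaires were glittering in the gardens"
cleaned = clean(sample)
print(cleaned)  # ['party', 'millionaire', 'glittering', 'garden']
```

The length threshold mirrors the `len(w) > 4` check in the code above: short function words that slip past the stopword list are discarded along with genuinely short nouns, which is a trade-off worth keeping in mind when reading the topics.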

Topic Modeling with the NMF Algorithm

It’s time to apply topic modeling to the preprocessed corpus. For those unfamiliar with the technique, topic modeling is an unsupervised process for automatically extracting semantic topics from natural language data. In this case, the topics should correspond to the novel’s main themes.
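One way to picture what an NMF topic model does: it approximates a non-negative document-term matrix X as the product of two smaller non-negative matrices, W (documents by topics) and H (topics by terms). The sketch below implements the classic multiplicative-update rule in plain Python on a made-up toy matrix. It is only an illustration of the idea; scikit-learn's `NMF` uses its own solvers, so its numbers will differ, and the helper names here are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(x, w, h):
    """Squared reconstruction error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor X (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(x), len(x[0])
    w = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


terms = ["gatsby", "light", "green", "daisy", "white", "dress"]
# Made-up document-term weights: two documents for each of two "themes".
X = [
    [1.0, 0.8, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.8, 1.0, 0.9],
]
W, H = nmf(X, n_topics=2)
for k, row in enumerate(H):
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print("Topic %d:" % (k + 1), [(t, round(v, 2)) for t, v in top])
```

Each row of H is a topic expressed as term weights, which is what gets read off as a "theme", while each row of W says how strongly a document expresses each topic.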

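from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF


def get_topics(components, feature_names, n=50):
    # Print the n highest-weighted terms of each topic, with their weights.
    for idx, topic in enumerate(components):
        print("\nTopic %d: " % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])


# Turn each cleaned paragraph into a TF-IDF vector over the 1,000 most frequent terms.
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(primary_cleaned)

# Factor the paragraph-term matrix into 10 topics.
nmf_model = NMF(n_components=10, init='random', random_state=0)
nmf_top = nmf_model.fit_transform(X)

# get_feature_names() was removed in scikit-learn 1.2; older versions need it instead.
terms = vectorizer.get_feature_names_out()
get_topics(nmf_model.components_, terms)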

I converted the cleaned text into TF-IDF vectors; TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. Using non-negative matrix factorization (NMF) as the topic model, I extracted ten topics from the corpus, and five of them (Topic 5: Gatsby, Topic 2: Gatsby and the Green Light, Topic 1: Daisy, Topic 4: Daisy and her White Dress, Topic 10: Doctor T. J. Eckleburg’s Yellow Spectacles) were quite meaningful. Here are some insights that I gained by scrutinizing each topic.
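To show what a TF-IDF weight actually measures, here is a small hand-rolled sketch on three made-up token lists. It assumes the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector. The `tfidf` helper itself is hypothetical and is not part of scikit-learn.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights for tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Made-up "paragraphs" after cleaning.
docs = [
    "gatsby green light shore".split(),
    "daisy white dress gatsby".split(),
    "gatsby party dollar automobile".split(),
]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print("gatsby:", round(first["gatsby"], 3), "green:", round(first["green"], 3))
```

Because "gatsby" appears in every toy document, it gets a lower weight than "green", which appears in only one. That is the property that lets distinctive words, rather than ubiquitous ones, dominate the topics.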

Gatsby and the American Dream

from wordcloud import WordCloud
from wordcloud import STOPWORDS
import matplotlib.pyplot as plt

# idx is zero-based, so 0 selects Topic 1; change it to plot another topic.
idx = 0
topic = nmf_model.components_[idx]
# Map each of the (up to 1,000) terms to its rounded weight in the chosen topic.
topic_x = {terms[i]: topic[i].round(2) for i in topic.argsort()[:-1000 - 1:-1]}

wordcloud = WordCloud(width=3000, height=3000, stopwords=STOPWORDS, background_color="white", min_font_size=30)
wordcloud = wordcloud.generate_from_frequencies(topic_x)

# Create the figure first so that axis("off") applies to it.
plt.figure(figsize=(50, 50))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

As I’ve mentioned before, the core concept dominating Fitzgerald’s masterpiece is the disillusionment with the American Dream. It could be argued that Gatsby, a hopeful, self-made millionaire originally from a poor social background, is a living embodiment of the American ideal.

Wordcloud Generated from Keywords of Topic 5 : Gatsby

The wordcloud generated from ‘Topic 5 : Gatsby’ displays words describing Jay Gatsby’s astounding accomplishments and material possessions, such as ‘dollar’, ‘automobile’, ‘beauty’, and ‘aesthetic’. The features that most caught my eye were ‘platonic’ and ‘father’, presumably derived from Nick Carraway’s description of how Gatsby created his own identity.

The truth was that Jay Gatsby, of West Egg, Long Island, sprang from his Platonic conception of himself. He was a son of God — a phrase which, if it means anything, means just that — and he must be about His Father’s business, the service of a vast, vulgar, and meretricious beauty. So he invented just the sort of Jay Gatsby that a seventeen year old boy would be likely to invent, and to this conception he was faithful to the end.

Jay Gatsby fabricates his past when introducing himself to others, to conceal his insignificant origins. He believes that a man could become what he wills to be no matter his background. Put simply, Gatsby strongly believes in the American Dream.

Wordcloud Generated from Keywords of Topic 2: Gatsby and the Green Light

Gatsby’s belief is illustrated quite explicitly by The Green Light. Throughout the novel, Gatsby is infatuated with The Green Light across the bay. As Carraway describes his first encounter with Gatsby,

He stretched out his arms toward the dark water in a curious way, and, far as I was from him, I could have sworn he was trembling. Involuntarily I glanced seaward — and distinguished nothing except a single green light, minute and far away, that might have been at the end of a dock.

the millionaire seems to long for The Green Light, as if it represents the dream he desires most. ‘Topic 2 : Gatsby and the Green Light’, with keywords such as ‘green’ and ‘shore’, effectively portrays this infatuation. I believe that The Green Light is a symbol of the goal that Gatsby, and the American Dream itself, strive toward and ultimately fail to reach.

Daisy’s Hollowness

Wordcloud Generated from the Keywords of Topic 4 : Daisy and her White Dress

The colour that best describes Daisy Buchanan is white. She wears a white dress when she meets Gatsby for the first time, as well as when Nick visits her in East Egg. Her house is full of white too, for example: “The windows were ajar and gleaming white against the fresh grass outside” and “A breeze blew through the room, blew curtains in at one end and out the other like pale flags.” White is commonly associated with beauty and innocence, but throughout the novel it also represents hollowness.

Wordcloud Generated from the Keywords of Topic 1 : Daisy

Daisy is fundamentally a hollow character who is incapable of caring for others. She is indifferent even to her infant daughter, and in the end, doesn’t even attend Gatsby’s funeral. As Carraway once said,

They were careless people, Tom and Daisy — they smashed up things and then retreated back into their money, and let other people clean up the mess they had made.

‘Topic 1 : Daisy’ contains keywords related to Daisy’s wealth, such as ‘power’, ‘party’, ‘drink’, and ‘perfect’. In light of this, Gatsby’s obsession with Daisy can be read as an analogy for an American dreamer chasing prosperity. The fact that this prosperity is ultimately devoid of meaning makes Gatsby’s demise all the more tragic.

Valley of Ashes

Wordcloud Generated from Keywords of Topic 9: Doctor T. J. Eckleburg’s Yellow Spectacles

The eyes of Doctor T. J. Eckleburg, painted on an old advertising billboard, stare down on the Valley of Ashes, a desolate stretch of land covered in industrial ash where poor people like George Wilson live.

Standing behind him, Michaelis saw with a shock that he was looking at the eyes of Doctor T. J. Eckleburg, which had just emerged, pale and enormous, from the dissolving night. “God sees everything,” repeated Wilson. “That’s an advertisement,” Michaelis assured him.

They may represent the eyes of God judging the moral wasteland that is industrial America, a reading reflected in keywords such as ‘waste’, ‘bleak’, and ‘stare’. I believe that these eyes depict the decay of the American Dream. The industrial wasteland over which the Eckleburg billboard presides is a result of excessive spending by the rich, such as the Buchanans. While the poor have no choice but to live in such a desolate place, the elites surround themselves with aesthetically pleasing material excess. This example of class inequality aggravated by capitalism shows that the American ideal is far from reality.

My Thoughts

So this was my attempt at literary analysis by computational means, and I’m quite pleased with the results. I feel that data-driven literary criticism could broaden our perspectives and help confirm our readings. I look forward to applying data science to other literary and philosophical texts soon.

Here’s the GitHub code

P.S. If you liked this article, I recommend checking out my YouTube video on the subject. (I’m thinking about analyzing Shakespeare or 1984 next time.)

Translated from: https://towardsdatascience/thematic-analysis-of-the-great-gatsby-with-topic-modeling-1f27baae55f1

Published: 2023-06-13 08:17:00
Article link: https://www.elefans.com/category/jswz/34/1364654.html