I am working on a text classification problem: I am trying to classify a collection of words into a category. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using them.
Let me explain what I want to implement (take this as an example).
List of Words:
List of Categories:
Here we will train the set, as:
Now we have the phrase "The best java programming book". From this phrase, the following words match our "List of Words":
"programming" is mapped to two categories, "java" and "c-sharp", so it is a common word.
"java" is mapped to category "java" only.
So our matching category for the phrase is "java".
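The lookup scheme described above could be sketched roughly as follows, using only the standard library. This is a minimal sketch under assumptions: the mapping contains only the two entries stated in the example (the full word and category lists are not shown), and the names `WORD_CATEGORIES` and `classify`, as well as the rule of splitting one vote evenly across a common word's categories, are hypothetical choices for illustration.

```python
from collections import Counter

# Word-to-category mapping. Only these two entries come from the example above;
# the complete lists are not shown there, so nothing else is filled in.
WORD_CATEGORIES = {
    "java": {"java"},
    "programming": {"java", "c-sharp"},
}


def classify(phrase, word_categories=WORD_CATEGORIES):
    """Return the best-matching category for a phrase, or None if no word is known."""
    votes = Counter()
    for word in phrase.lower().split():
        categories = word_categories.get(word, set())
        # Assumption: each known word casts one vote, split evenly across its
        # categories, so a common word counts for less than a unique word.
        for category in categories:
            votes[category] += 1.0 / len(categories)
    if not votes:
        return None
    # Ties are broken arbitrarily in this sketch.
    return votes.most_common(1)[0][0]


print(classify("The best java programming book"))
```

With this weighting, "java" contributes a full vote to the "java" category while "programming" contributes half a vote to each category, so the phrase resolves to "java". A phrase that contains only common words would end in a tie, which is one gap in the scheme worth handling explicitly.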
This is what came to my mind. Is this solution fine? Can it be implemented? What are your suggestions, and is there anything I am missing, any flaws, etc.?
Solution
Of course this can be implemented. If you train a Naive Bayes classifier or a linear SVM on the right dataset (titles of Java and C# programming books, I guess), it should learn to associate the term "Java" with Java, "C#" and ".NET" with C#, and "programming" with both. That is, if the dataset is divided evenly between the two classes, a Naive Bayes classifier would likely learn roughly equal probabilities of Java and C# for common terms like "programming".
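Since the question rules out libraries, here is a minimal sketch of a multinomial Naive Bayes classifier written from scratch. It is meant as an illustration, not a definitive implementation: the four training titles are made up, the function names (`train`, `log_scores`, `predict`) are hypothetical, and add-one (Laplace) smoothing and whitespace tokenization are assumptions chosen for simplicity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real titles need better tokenization.
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs. Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()            # number of training titles per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def log_scores(text, model):
    """Return the unnormalized log-probability of each class for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores


def predict(text, model):
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


# Hypothetical toy dataset, evenly divided between the two classes.
TRAINING = [
    ("java programming basics", "java"),
    ("advanced java programming", "java"),
    ("c# programming basics", "c-sharp"),
    ("advanced .net programming", "c-sharp"),
]

model = train(TRAINING)
print(predict("The best java programming book", model))
print(log_scores("programming", model))
```

On this toy data, "programming" appears equally often in both classes, so it receives the same score for each and contributes nothing to the decision, while "java" tips the phrase toward the "java" class.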