文字分类NaiveBayes

编程入门 行业动态 更新时间:2024-10-11 05:22:56
本文介绍了文字分类NaiveBayes的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我正在尝试按类别对一系列文本示例新闻进行分类。我在数据库中有庞大的新闻文本数据集,其中包含类别。应该训练机器并确定新闻类别。

I am trying to classify a series of text example News by category. I have huge dataset of news text with category in database. Machine should be trained and decide the news category.

public static string[] Tokenize(string text) { StringBuilder sb = new StringBuilder(text); char[] invalid = "!-;':'\",.?\n\r\t".ToCharArray(); for (int i = 0; i < invalid.Length; i++) sb.Replace(invalid[i], ' '); return sb.ToString().Split(new[] { ' ' }, System.StringSplitOptions.RemoveEmptyEntries); } private void Form1_Load(object sender, EventArgs e) { string strDSN = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source = c:\\users\\158820\\Documents\\Database4.accdb"; string strSQL = "SELECT * FROM NewsRepository"; // create Objects of ADOConnection and ADOCommand OleDbConnection myConn = new OleDbConnection(strDSN); OleDbDataAdapter myCmd = new OleDbDataAdapter(strSQL, myConn); myConn.Open(); DataSet dtSet = new DataSet(); myCmd.Fill(dtSet, "NewsRepository"); DataTable dTable = dtSet.Tables[0]; myConn.Close(); StringBuilder sWords = new StringBuilder(); string[][] swords = new string[dTable.Rows.Count][]; int i = 0; foreach (DataRowView dr in dTable.DefaultView) { swords[i] = Tokenize(dr[1].ToString()); i++; } Codification codebook = new Codification(dTable, new string[] { "NewsTitle", "Category" }); DataTable symbols = codebook.Apply(dTable); int[][] inputs = symbols.ToJagged<int>(new string[] { "NewsTitle" }); int[] outputs = symbols.ToArray<int>("Category"); bagOfWords(inputs, outputs); } private static void bagOfWords(int[][] inputs, int[] outputs) { var bow = new BagOfWords<int>(); var quantizer = bow.Learn(inputs); string filenamebow = Path.Combine(Application.StartupPath, "News_BOW.accord"); Serializer.Save(obj: bow, path: filenamebow); double[][] histograms = quantizer.Transform(inputs); // One way to perform sequence classification with an SVM is to use // a kernel defined over sequences, such as DynamicTimeWarping. // Create the multi-class learning algorithm as one-vs-one with DTW: var teacher = new MulticlassSupportVectorLearning<ChiSquare, double[]>() { Learner = (p) => new SequentialMinimalOptimization<ChiSquare, double[]>() { // Complexity = 100 // Create a hard SVM } }; // Learn a multi-label SVM using the teacher var svm = teacher.Learn(histograms, outputs); // Get the predictions for the inputs int[] predicted = svm.Decide(histograms); // Create a confusion matrix to check the quality of the predictions: var cm = new GeneralConfusionMatrix(predicted: predicted, expected: outputs); // Check the accuracy measure: double accuracy = cm.Accuracy; string filename = Path.Combine(Application.StartupPath, "News_SVM.accord"); Serializer.Save(obj: svm, path: filename); }

我对如何训练Accord对象有点困惑。我能够序列化经过训练的模型(9个类别中的3600个独特新闻大约需要106 MB)

I am bit confused on how to train accord objects. I am able to serialize the trained model (which is approx 106 MB for 3600 unique news within in 9 categories)

我如何使用该模型来预测新的新闻文本集?

How do I use the model to predict the category of a new set of news text?

推荐答案

对不在训练集中的数据使用模型就像调用svm一样简单另一个决定:

Using your model on data not in your training set is as simple as calling your svm to make another decision:

svm.Decide(outofSampleData)

由于已经序列化了训练好的模型,因此可以使用 Serializer.Load< T> 实例化svm对象,该文档记录了此处。

Since you have serialized your trained model you can instantiate the svm object using Serializer.Load<T> which is documented here.

更多推荐

文字分类NaiveBayes

本文发布于:2023-11-30 12:29:03,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1649956.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:文字   NaiveBayes

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!