I have a dataset consisting of pairs of a string and the class it belongs to. Each string is a sentence, and the class is either 'male' or 'female'. An example:
'Hi! My name is Jack', male
I am using this as a training set so that, given a different set of strings, a model can classify whether each statement came from a male or a female. I am using WEKA's StringToWordVector to convert each string into a vector of word counts. From the resulting ARFF file I want to build a predictive model (decision trees?) that I can use on an unclassified dataset. How do I go about it? Which classifier should I use? And which other preprocessing techniques would help in this scenario?
Best answer
A good place to start would be the Simple Message Classifier example (code and wiki) on the Weka homepage, or the Text Categorization wiki page.
Pretty much any linear classifier would be a reasonable first choice. I'd suggest trying either Logistic Regression or Support Vector Machines first.
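To make the idea concrete, here is a minimal sketch of the same pipeline (word-count vectors fed into a logistic regression) written in plain Python with only the standard library. It is not Weka code and does not use Weka's API; it only illustrates what a bag-of-words filter followed by a linear classifier does. The toy training sentences, the function names, and the hyperparameters (`epochs`, `lr`) are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(sentences):
    """Map every distinct training word to a column index."""
    words = sorted({w for s in sentences for w in tokenize(s)})
    return {w: i for i, w in enumerate(words)}


def vectorize(sentence, vocab):
    """Turn a sentence into a vector of word counts; unseen words are ignored."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(sentence)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, epochs=500, lr=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(w * v for w, v in zip(weights, xi)) + bias
            err = sigmoid(score) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(sentence, vocab, weights, bias):
    """Return 'male' if the estimated probability is at least 0.5, else 'female'."""
    x = vectorize(sentence, vocab)
    p_male = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return "male" if p_male >= 0.5 else "female"


# Hypothetical toy data, invented for illustration only.
TRAIN = [
    ("Hi! My name is Jack", "male"),
    ("My name is Tom", "male"),
    ("Hello, I am Jill", "female"),
    ("I am Mary, hello", "female"),
]

sentences = [s for s, _ in TRAIN]
labels = [1.0 if c == "male" else 0.0 for _, c in TRAIN]
vocab = build_vocabulary(sentences)
X = [vectorize(s, vocab) for s in sentences]
weights, bias = train_logistic_regression(X, labels)
```

Calling `predict("some new sentence", vocab, weights, bias)` then labels an unseen string. With only four training sentences the model just memorizes which words appeared with which class, so a real dataset and a held-out test set are needed before trusting any accuracy figure.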