How do I create a set of ngrams in Spark?

Updated: 2024-10-14 16:26:49

Question

I am extracting n-grams from a Spark 2.2 DataFrame column using Scala, like this (trigrams in this example):

val ngram = new NGram().setN(3).setInputCol("incol").setOutputCol("outcol")

How do I create a single output column that contains all of the 1-grams through 5-grams? It might be something like:

val ngram = new NGram().setN(1:5).setInputCol("incol").setOutputCol("outcol")

but that doesn't work, because setN takes a single Int. I could loop through N and create a new DataFrame for each value of N, but this seems inefficient. Can anyone point me in the right direction, as my Scala is ropey?

Recommended answer

If you want to combine these into vectors, you can rewrite the Python answer by zero323 in Scala.

import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // needed for toDF; already in scope in spark-shell

def buildNgrams(inputCol: String = "tokens",
                outputCol: String = "features", n: Int = 3) = {

  // One NGram stage per order 1..n, each writing to its own column
  val ngrams = (1 to n).map(i =>
    new NGram().setN(i)
      .setInputCol(inputCol).setOutputCol(s"${i}_grams")
  )

  // One CountVectorizer per n-gram column
  val vectorizers = (1 to n).map(i =>
    new CountVectorizer()
      .setInputCol(s"${i}_grams")
      .setOutputCol(s"${i}_counts")
  )

  // Concatenate the per-order count vectors into a single feature vector
  val assembler = new VectorAssembler()
    .setInputCols(vectorizers.map(_.getOutputCol).toArray)
    .setOutputCol(outputCol)

  new Pipeline().setStages((ngrams ++ vectorizers :+ assembler).toArray)
}

val df = Seq((1, Seq("a", "b", "c", "d"))).toDF("id", "tokens")

Result

buildNgrams().fit(df).transform(df).show(1, false)
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |id |tokens      |1_grams     |2_grams        |3_grams       |1_counts                       |2_counts                 |3_counts           |features                             |
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |1  |[a, b, c, d]|[a, b, c, d]|[a b, b c, c d]|[a b c, b c d]|(4,[0,1,2,3],[1.0,1.0,1.0,1.0])|(3,[0,1,2],[1.0,1.0,1.0])|(2,[0,1],[1.0,1.0])|[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]|
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
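The features vector in the result above has 9 entries because, for this single row of distinct tokens, each CountVectorizer learns a vocabulary of 4, 3 and 2 entries respectively, and VectorAssembler concatenates them. The following is a minimal plain-Scala sketch of that arithmetic, with no Spark required. The object name FeatureWidthSketch and the helper distinctNgramCounts are hypothetical names chosen for illustration, and the sketch assumes the vocabulary is simply the set of distinct n-grams (CountVectorizer's default settings on a corpus this small).

```scala
object FeatureWidthSketch {
  // Hypothetical helper: number of distinct i-grams for each i in 1..maxN.
  // Assumption: vocabulary size equals the number of distinct n-grams.
  def distinctNgramCounts(tokens: Seq[String], maxN: Int): Seq[Int] =
    (1 to maxN).map { i =>
      tokens
        .sliding(i)            // windows of i consecutive tokens
        .filter(_.size == i)   // ignore a short window when tokens.size < i
        .map(_.mkString(" "))  // same space-joined form NGram produces
        .toSet
        .size
    }

  def main(args: Array[String]): Unit = {
    val counts = distinctNgramCounts(Seq("a", "b", "c", "d"), 3)
    // Per-order vocabulary sizes and their sum (the assembled vector width)
    println(s"per-order sizes: $counts, total width: ${counts.sum}")
  }
}
```

On a larger corpus the width would be the sum of the fitted vocabulary sizes, which depend on CountVectorizer settings such as vocabSize and minDF.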

This could be simpler with a UDF:

import org.apache.spark.sql.functions.udf

// For each i in 1..n, slide a window of size i over the tokens,
// drop any short window, and join each window with spaces
val ngram = udf((xs: Seq[String], n: Int) =>
  (1 to n).map(i =>
    xs.sliding(i).filter(_.size == i).map(_.mkString(" "))
  ).flatten)

// Register the UDF so it can be called from SQL
spark.udf.register("ngram", ngram)

val ngramer = new SQLTransformer().setStatement(
  """SELECT *, ngram(tokens, 3) AS ngrams FROM __THIS__"""
)

ngramer.transform(df).show(false)
// +---+------------+-----------------------------------------+
// |id |tokens      |ngrams                                   |
// +---+------------+-----------------------------------------+
// |1  |[a, b, c, d]|[a, b, c, d, a b, b c, c d, a b c, b c d]|
// +---+------------+-----------------------------------------+
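The core of the UDF above is ordinary Scala collection code, so it can be tried without a Spark session. The sketch below restates that logic as a plain function. The object name NgramSketch and the function name allNgrams are hypothetical, chosen for illustration. The filter step matters because sliding(i) yields one shorter window when the input has fewer than i elements.

```scala
object NgramSketch {
  // Hypothetical helper: all 1-grams through maxN-grams, space-joined,
  // in order of increasing n. Not part of any Spark API.
  def allNgrams(tokens: Seq[String], maxN: Int): Seq[String] =
    (1 to maxN).flatMap { i =>
      tokens
        .sliding(i)            // windows of i consecutive tokens
        .filter(_.size == i)   // drop the short window when tokens.size < i
        .map(_.mkString(" "))  // join each window into one n-gram string
    }

  def main(args: Array[String]): Unit = {
    // Same input as the DataFrame example, with maxN = 3
    allNgrams(Seq("a", "b", "c", "d"), 3).foreach(println)
  }
}
```

For the 1-to-5 case in the question, the same function would be called with maxN = 5; in the SQL statement that corresponds to ngram(tokens, 5).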

Published: 2023-05-23 15:05:06
Link: https://www.elefans.com/category/jswz/34/1326426.html