When the input data is text (sentences), we usually want to split it into words before further processing, and that is what the Tokenizer class is for. Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.

RegexTokenizer allows more advanced tokenization based on regular expression matching. By default, the parameter pattern (default: "\s+") is used as the delimiter to split the input text. Alternatively, users can set the gaps parameter to false, indicating that the pattern matches the tokens themselves rather than the splitting gaps; the class then finds all matching occurrences and returns them as the result. A calling example follows.
import org.apache.spark.SparkConf
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TokenizerExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setMaster("local[*]").setAppName(this.getClass.getSimpleName)
    val spark = SparkSession
      .builder
      .config(sparkConf)
      .appName("TokenizerExample")
      .getOrCreate()

    val sentenceDataFrame = spark.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (1, "I wish Java could use case classes"),
      (2, "Logistic,regression,models,are,neat")
    )).toDF("id", "sentence")

    // Simple tokenizer: lowercases and splits on whitespace.
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

    // Regex tokenizer: splits on non-word characters.
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

    // UDF to count the tokens produced for each sentence.
    val countTokens = udf { (words: Seq[String]) => words.length }

    val tokenized = tokenizer.transform(sentenceDataFrame)
    tokenized.select("sentence", "words")
      .withColumn("tokens", countTokens(col("words"))).show(false)

    val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
    regexTokenized.select("sentence", "words")
      .withColumn("tokens", countTokens(col("words"))).show(false)

    spark.stop()
  }
}
Output:
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+
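Note how the first table's last row stays a single token: Tokenizer only splits on whitespace, so the comma-separated sentence passes through whole, while RegexTokenizer splits it into five words. To see why `setPattern("\\W")` and `setPattern("\\w+").setGaps(false)` are equivalent here, the split-on-gaps versus match-the-tokens distinction can be sketched with plain Scala regexes, no Spark required (this is an illustrative sketch, not RegexTokenizer's actual implementation):

```scala
object GapsVsTokens {
  def main(args: Array[String]): Unit = {
    val sentence = "Logistic,regression,models,are,neat"

    // gaps = true (the default): the pattern matches the separators,
    // so the string is split wherever the pattern occurs.
    val byGaps = sentence.split("\\W").filter(_.nonEmpty).toSeq

    // gaps = false: the pattern matches the tokens themselves,
    // so every occurrence of the pattern becomes a token.
    val byTokens = "\\w+".r.findAllIn(sentence).toSeq

    // Both strategies yield the same five tokens for this input.
    println(byGaps)
    println(byTokens)
  }
}
```

(One difference the sketch does not show: RegexTokenizer also lowercases its output by default, which is why the Spark tables above contain `logistic` rather than `Logistic`.)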