A set of Japanese tokenizers. Tokenizers backed by Kuromoji and KNP are currently supported.
In your build.sbt:
resolvers += "en-japan Maven OSS" at "http://dl.bintray.com/en-japan/maven-oss"
libraryDependencies += "com.enjapan" %% "japanese-tokenizers" % "0.0.5"
Example:
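import com.enjapan.preprocessing.japanese.tokenizers.KuromojiTokenizer
// Two sample sentences to tokenize
val document = List("京都大学に行った。","飲み過ぎて二日酔いになりました。")
// stopPOS: parts of speech to filter out (particles, auxiliary verbs, symbols, sentence-final particles)
val tokenizer = new KuromojiTokenizer(stopPOS = Set(List("助詞"), List("助動詞"), List("記号"), List("終助詞")))
val tokenized = document.map(tokenizer.tokenize)
// Print the tokens of each sentence, separated by commas
tokenized.foreach(tokens => println(tokens.mkString(",")))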
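
Sketch of stop-POS filtering:
The example above passes stopPOS as a set of lists. The following is a minimal, self-contained sketch of how such a filter could work, assuming each list is matched as a prefix of a token's part-of-speech hierarchy. That matching rule is an assumption, not confirmed library behaviour. Token, StopPosSketch and filterStopPOS are hypothetical names made up for illustration and are not part of japanese-tokenizers, and the sample tokens are annotated by hand rather than produced by Kuromoji.

```scala
// Hypothetical token type for illustration; not the library's own class.
final case class Token(surface: String, pos: List[String])

object StopPosSketch {
  // Assumption: a token is dropped when its POS hierarchy starts with
  // any of the lists in stopPOS (prefix match).
  def filterStopPOS(tokens: List[Token], stopPOS: Set[List[String]]): List[Token] =
    tokens.filterNot(t => stopPOS.exists(stop => t.pos.startsWith(stop)))

  // Hand-annotated tokens for "京都大学に行った。" (not real tokenizer output).
  val sample: List[Token] = List(
    Token("京都大学", List("名詞", "固有名詞", "組織")),
    Token("に", List("助詞", "格助詞")),
    Token("行っ", List("動詞", "自立")),
    Token("た", List("助動詞")),
    Token("。", List("記号", "句点"))
  )

  val stopPOS: Set[List[String]] =
    Set(List("助詞"), List("助動詞"), List("記号"))

  // Surfaces of the tokens that survive the filter.
  val kept: List[String] = filterStopPOS(sample, stopPOS).map(_.surface)
}
```

With the hand-annotated sample, StopPosSketch.kept evaluates to List("京都大学", "行っ"): the particle, the auxiliary verb and the punctuation mark are removed. Under this prefix rule a more specific entry such as List("助詞", "格助詞") would remove only case particles.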