japanese-tokenizers


A set of Japanese tokenizers. Tokenizers backed by Kuromoji and KNP are currently supported.

How to install

Add the following to your build.sbt:

resolvers += "en-japan Maven OSS" at "http://dl.bintray.com/en-japan/maven-oss"

libraryDependencies += "com.enjapan" %% "japanese-tokenizers" % "0.0.5"

How to use

Example:

import com.enjapan.preprocessing.japanese.tokenizers.KuromojiTokenizer

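// Each element is one sentence to tokenize.
val document = List("京都大学に行った。","飲み過ぎて二日酔いになりました。")
// stopPOS lists the parts of speech to exclude: particles (助詞), auxiliary verbs (助動詞),
// symbols (記号), and sentence-final particles (終助詞).
val tokenizer = new KuromojiTokenizer(stopPOS = Set(List("助詞"), List("助動詞"), List("記号"), List("終助詞")))

// Tokenize every sentence, then print each sentence's tokens on one comma-separated line.
val tokenized = document.map(tokenizer.tokenize)
tokenized.foreach(tokens => println(tokens.mkString(",")))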
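
To make the idea behind stopPOS more concrete, here is a minimal, library-independent sketch of part-of-speech filtering. It does not use this library's API: the Token case class, the filterStopPos helper, and the hand-written token list are all hypothetical, and the prefix-matching rule (a stop entry such as List("助詞") removes every token whose part-of-speech hierarchy starts with it) is an assumption about how such a filter might behave, not a description of KuromojiTokenizer's internals.

```scala
// Hypothetical token type: surface form plus its part-of-speech hierarchy
// (coarsest category first). Not part of japanese-tokenizers.
final case class Token(surface: String, pos: List[String])

object StopPosSketch {
  // Assumed semantics: drop a token when any stop entry is a prefix of its POS hierarchy.
  def filterStopPos(tokens: List[Token], stopPOS: Set[List[String]]): List[Token] =
    tokens.filterNot(t => stopPOS.exists(stop => t.pos.startsWith(stop)))

  def main(args: Array[String]): Unit = {
    // Hand-written tokens for illustration only, not actual tokenizer output.
    val tokens = List(
      Token("京都大学", List("名詞", "固有名詞")),
      Token("に", List("助詞", "格助詞")),
      Token("行く", List("動詞", "自立")),
      Token("た", List("助動詞")),
      Token("。", List("記号", "句点"))
    )
    val stopPOS = Set(List("助詞"), List("助動詞"), List("記号"))

    val kept = filterStopPos(tokens, stopPOS)
    println(kept.map(_.surface).mkString(",")) // prints 京都大学,行く
  }
}
```

Representing each stop entry as a list lets a caller filter at any depth of the hierarchy: List("助詞") removes all particles in this sketch, while List("助詞", "格助詞") would remove only case particles.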