Scala bindings for the Hugging Face Tokenizers library, written in Rust.
import io.brunk.tokenizers.Tokenizer
val tokenizer = Tokenizer.fromPretrained("bert-base-cased")
val encoding = tokenizer.encode("Hello, y'all! How are you 😁 ?", addSpecialTokens=true)
println(encoding.length)
// 13
println(encoding.ids)
// ArraySeq(101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 100, 136, 102)
println(encoding.tokens)
// ArraySeq([CLS], Hello, ,, y, ', all, !, How, are, you, [UNK], ?, [SEP])
libraryDependencies += "io.brunk.tokenizers" %% "tokenizers" % "<version>"
//> using lib "io.brunk.tokenizers::tokenizers:<version>"
Copy coordinates from Maven Central for Scala 2.13 or Scala 3.
Currently, we can only load and run pre-trained tokenizers. Training is not yet possible.
- Install bleep
- Install Rust and Cargo
-
bleep compile bleep test