peoplepattern / lib-text   0.3.2

A little text processing library for Scala.

Scala versions: 2.11 2.10

lib-text

Overview

This is a small text processing library that supports language identification, tokenization and stopword filtering, and provides some useful helper functions. The tokenization has been tuned to work well with text conventions common in social media such as Twitter, and handles URLs, emoji, hashtags, emails and @-mentions cleanly. Stopword filtering is currently supported for:

  • German
  • English
  • Spanish
  • French
  • Indonesian
  • Japanese
  • Malay
  • Dutch
  • Portuguese
  • Swedish
  • Turkish
  • Arabic

More to come.

Usage

Add to your project dependencies:

resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"

libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3.2"

Example

import com.peoplepattern.text.Implicits._

val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"

txt.lang
// Some(en)

txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)

txt.terms
// Set(print, personalised, black, copy, friday, time)

txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)

txt.termBigrams
// Set(black friday, personalised print)
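
To make the relationship between tokens, terms and term bigrams in the output above more concrete, here is a minimal, self-contained sketch in plain Scala. It is not lib-text's implementation: the `TermSketch` object, its helper names and the tiny stopword set are hypothetical and exist only for illustration. It assumes a term is a lowercased, letters-only token that is not a stopword, that hashtags are the only extra items kept by the "plus" variant, and that a bigram is a pair of adjacent tokens that are both terms.

```scala
object TermSketch {
  // Hypothetical stopword set, just large enough for the sample sentence.
  // The library ships its own per-language lists; this is not one of them.
  val stopwords: Set[String] =
    Set("did", "you", "get", "your", "with", "of", "on", "if", "not", "there's", "still")

  // Assumption: a term consists only of letters (and apostrophes) and is not a stopword.
  // This excludes punctuation, hashtags and URLs.
  def isTerm(token: String): Boolean = {
    val t = token.toLowerCase
    t.nonEmpty && t.forall(c => c.isLetter || c == '\'') && !stopwords(t)
  }

  // Assumption: a hashtag is '#' followed by at least one character.
  def isHashtag(token: String): Boolean =
    token.startsWith("#") && token.length > 1

  // Lowercased set of terms.
  def terms(tokens: Seq[String]): Set[String] =
    tokens.filter(isTerm).map(_.toLowerCase).toSet

  // Terms plus lowercased hashtags.
  def termsPlus(tokens: Seq[String]): Set[String] =
    terms(tokens) ++ tokens.filter(isHashtag).map(_.toLowerCase)

  // Pairs of adjacent tokens where both members are terms.
  def termBigrams(tokens: Seq[String]): Set[String] =
    tokens.sliding(2).collect {
      case Seq(a, b) if isTerm(a) && isTerm(b) => s"${a.toLowerCase} ${b.toLowerCase}"
    }.toSet
}
```

Applied to the token vector shown above, this sketch yields the same sets as the sample output for terms, termsPlus and termBigrams. With other input it may well differ from the library, since the real stopword lists and token classification are more extensive.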

License

lib-text is open source and licensed under the Apache License 2.0.

Acknowledgements

Developed with ❤️ at People Pattern Corporation
