Data qualification tools developped by HUPI.
import io.hupi.datascience_tools._
In this folder, you will find some tools for outliers detection, discretisation, correlation analysis and text correction.
This package has (for the moment) 4 tools :
1. Outliers detection
We use some statistical test and/or some hypothesis to determine if there is outliers in the RDD. Then there is a function to replace this points with some features like mean, mediane, nearest quartile, nearest "maximum", ...
A short preview :
Outliers.detection_test ( variable, method = "esd", alpha = 0.05, nbOutliers = 10 )
Outliers.replace( variable, outliers, replace = "quartile" )
2. Discretisation
This function allows you to discretized a quantitative variable with many options.
A short preview :
val discretize_data = Discretisation.apply_on ( variable, method = "quantile", nbClasse = None, direction = "right", labelsNames = List("NUMBERS") )
3. Correlation
There is a lot of function on dataframe which compute some features about correlation.
A short preview :
Correlation.findBestCor( data, schemaData, method = "pearson", test = "fisher", alpha = 0.05, numberOfCouples = 5, direction = "positive", level = "most" )
4. Text correction
With this function you can directly correct your text, it mostly removes typing mistakes.
A short preview :
val texte = "Nous soummes des speciallistes du big-data."
val correct = SpellingError.correctSentence (input = texte, secteur = "quotidien", sc)
The "Example.txt" file contains a short example for some of this function.
Created by HUPI - Anthony LAFFOND and Minh-tu Nguyen