oriollopezmassaguer / dataframe

A simple implementation of an in-memory DataFrame in Scala

DataFrame

Example: manipulating plain-text files from the Titanic dataset (https://www.kaggle.com/c/titanic/data)

  // Data from https://www.kaggle.com/c/titanic/data
  import models.dataframe._

  // Reading a plain text file with passenger data (tab separated)
  val passenger_data: DataFrame = DataFrame("data/plain_text/input/passenger.tsv")

  // Keep only the rows for male passengers (like a SQL WHERE clause)
  val male_passengers: DataFrame = passenger_data.filter("Sex", "male")

  // Export the filtered data
  male_passengers.toText("data/plain_text/output/passenger_male.tsv")

  // Generate a new DataFrame with only 3 fields of the original one (SQL projection)
  val passenger_data_projected: DataFrame = passenger_data.project("Age", "PassengerId", "Sex")
  passenger_data_projected.toText("data/plain_text/output/passenger_projected.tsv")

  // Reading a plain text file with survival data (tab separated)
  val gender_model_data: DataFrame = DataFrame("data/plain_text/input/gendermodel.tsv")

  // Join the passenger data with the survival data
  // on PassengerId.
  // This is an inner join (SQL INNER JOIN): only rows whose join fields match are kept
  val innerjoin_data = passenger_data.join(gender_model_data, "PassengerId", "PassengerId2")
  innerjoin_data.toText("data/plain_text/output/passenger_inner_join_survival.tsv")

  // Join the passenger data with the survival data
  // on PassengerId.
  // This is a left join (SQL LEFT JOIN): all rows of the left DataFrame are kept,
  // whether or not they have a match in the right DataFrame
  val outer_join_data = passenger_data.join_left(gender_model_data, "PassengerId", "PassengerId2")
  outer_join_data.toText("data/plain_text/output/passenger_outer_join_survival.tsv")
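
The relational operations used above (filter, projection, inner join, left join) can be illustrated with plain Scala collections. This is only a minimal sketch of the semantics, not this library's implementation: the `RelationalSketch` object, the `Row` type and all function names are hypothetical, and rows are assumed to be simple maps from field name to string value.

```scala
object RelationalSketch {
  // Assumption: a row is a map from field name to string value.
  type Row = Map[String, String]

  // Selection: keep rows whose `column` equals `value`.
  def filter(rows: Seq[Row], column: String, value: String): Seq[Row] =
    rows.filter(_.get(column).contains(value))

  // Projection: keep only the named fields of every row.
  def project(rows: Seq[Row], columns: String*): Seq[Row] =
    rows.map(row => row.filter { case (name, _) => columns.contains(name) })

  // Inner join: one merged row per left/right pair with equal key values.
  def innerJoin(left: Seq[Row], right: Seq[Row],
                leftKey: String, rightKey: String): Seq[Row] =
    for {
      l   <- left
      key <- l.get(leftKey).toSeq
      r   <- right if r.get(rightKey).contains(key)
    } yield l ++ r

  // Left join: like the inner join, but unmatched left rows are kept unchanged.
  def leftJoin(left: Seq[Row], right: Seq[Row],
               leftKey: String, rightKey: String): Seq[Row] =
    left.flatMap { l =>
      val matches = right.filter { r =>
        l.get(leftKey).exists(key => r.get(rightKey).contains(key))
      }
      if (matches.isEmpty) Seq(l) else matches.map(l ++ _)
    }

  // Tiny made-up sample in the spirit of the Titanic files.
  val passengers: Seq[Row] = Seq(
    Map("PassengerId" -> "1", "Sex" -> "male", "Age" -> "22"),
    Map("PassengerId" -> "2", "Sex" -> "female", "Age" -> "38")
  )
  val survival: Seq[Row] = Seq(
    Map("PassengerId2" -> "1", "Survived" -> "0")
  )
}
```

With this sample, the inner join keeps only passenger 1 (the only one present on both sides), while the left join keeps both passengers and leaves passenger 2 without a `Survived` field.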

Example: compound data from the DrugBank dataset of approved drugs

  // Data from http://www.drugbank.ca/downloads#structures
  import models.dataframe._

  // Reading approved drugs in DrugBank (using RDKit libraries)
  val approved_drugs: DataFrame = DataFrame("data/drugbank/input/approved.sdf")

  // Compute molecular weight (MW) and LogP for the approved drugs in DrugBank
  val approved_drugs_mw_logp = approved_drugs
    .addMW
    .addLogP

  // Export the DataFrame with the new fields
  // to SDF
  approved_drugs_mw_logp.toSDF("data/drugbank/output/approved_mw_logp.sdf")
  // to TSV
  approved_drugs_mw_logp.toText("data/drugbank/output/approved_mw_logp.tsv")
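
Conceptually, a step like `addMW` appends a field computed from each row. The sketch below shows that idea only. It does not use RDKit and is not how this library computes MW: it assumes a hypothetical `Formula` field holding a flat molecular formula (no brackets or charges), uses a small illustrative table of average atomic masses, and all names (`DerivedColumnSketch`, `molecularWeight`, `addColumn`) are made up for the example.

```scala
object DerivedColumnSketch {
  // Assumption: a row is a map from field name to string value.
  type Row = Map[String, String]

  // Average atomic masses for a few elements (illustrative subset only).
  val atomicMass: Map[String, Double] =
    Map("H" -> 1.008, "C" -> 12.011, "N" -> 14.007, "O" -> 15.999, "S" -> 32.06)

  // An element symbol followed by an optional count, e.g. "C9" or "O".
  private val Token = "([A-Z][a-z]?)(\\d*)".r

  // Molecular weight of a flat, well-formed formula such as "C9H8O4".
  // Returns None if no element is found or an element is missing from the table.
  def molecularWeight(formula: String): Option[Double] = {
    val parts = Token.findAllMatchIn(formula).map { m =>
      val count = if (m.group(2).isEmpty) 1 else m.group(2).toInt
      atomicMass.get(m.group(1)).map(_ * count)
    }.toList
    if (parts.nonEmpty && parts.forall(_.isDefined)) Some(parts.flatten.sum)
    else None
  }

  // Add a field to every row for which `compute` yields a value;
  // rows where it yields None are returned unchanged.
  def addColumn(rows: Seq[Row], name: String)(compute: Row => Option[String]): Seq[Row] =
    rows.map(row => compute(row).fold(row)(value => row + (name -> value)))

  // Made-up sample rows; "Xx" is not a known element, so that row gets no MW.
  val drugs: Seq[Row] = Seq(
    Map("Name" -> "Aspirin", "Formula" -> "C9H8O4"),
    Map("Name" -> "Unknown", "Formula" -> "Xx2")
  )

  val drugsWithMw: Seq[Row] = addColumn(drugs, "MW") { row =>
    row.get("Formula")
      .flatMap(molecularWeight)
      .map(mw => BigDecimal(mw).setScale(3, BigDecimal.RoundingMode.HALF_UP).toString)
  }
}
```

Returning `Option` from the computation keeps rows with unparseable input instead of failing the whole table, which is one possible design choice among several.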

Setup

To try out the software you need sbt (http://www.scala-sbt.org/):

$ sbt console

scala> models.ExampleTitanic

scala> models.ExampleDrugBank