A powerful, cross-platform data manipulation and analysis library for Scala 3. Tabula provides a pandas-like API for working with tabular data, supporting everything from basic data loading to advanced statistical operations.
- Cross-platform: Runs on JVM, JavaScript (Node.js), and Scala Native
- Multiple data sources: CSV files, tab-separated files, PostgreSQL databases, and in-memory data
- Type-safe operations: Automatic type inference with support for integers, floats, strings, booleans, and timestamps
- Statistical analysis: Built-in descriptive statistics, z-scores, percentiles, and more
- Data manipulation: Filtering, sorting, sampling, splitting, and reshaping operations
- Performance-optimized: Efficient operations on large datasets
Add tabula to your build.sbt:
libraryDependencies += "io.github.edadma" %%% "tabula" % "0.0.1"
For cross-platform projects, tabula supports:
- JVM: Full feature set including PostgreSQL connectivity
- Scala.js: Core data operations for browser/Node.js environments
- Scala Native: High-performance compiled operations
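For example, a minimal cross-built project definition might look like the sketch below (it assumes the sbt-crossproject, Scala.js, and Scala Native plugins are already added to project/plugins.sbt; the project name and Scala version are placeholders):
// build.sbt (sketch; plugin setup in project/plugins.sbt is assumed)
lazy val myapp = crossProject(JVMPlatform, JSPlatform, NativePlatform)
  .in(file("."))
  .settings(
    scalaVersion := "3.3.1", // illustrative version
    libraryDependencies += "io.github.edadma" %%% "tabula" % "0.0.1"
  )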
import io.github.edadma.tabula.*
// From CSV file
val iris = Dataset.fromCSVFile("iris.csv")
// From in-memory data
val sales = Dataset(
  Seq("product", "revenue", "units"),
  Seq(
    Seq("Widget A", 1250.50, 45),
    Seq("Widget B", 890.25, 32),
    Seq("Widget C", 2100.75, 67)
  )
)
// From a Map
val temperatures = Dataset(Map(
  "city" -> Seq("New York", "London", "Tokyo"),
  "temp_c" -> Seq(22.5, 18.2, 25.1),
  "humidity" -> Seq(65, 78, 82)
))
// Display dataset info
iris.info()
// <class io.github.edadma.tabula.Dataset>
// 150 rows; 5 columns
// #  Column        Non-Null Count  Datatype
// 0  sepal_length  150             float
// 1  sepal_width   150             float
// 2  petal_length  150             float
// 3  petal_width   150             float
// 4  species       150             string
// View first/last rows
iris.head() // First 5 rows
iris.tail(10) // Last 10 rows
// Get descriptive statistics
iris.describe.print()
//               count    mean     std     min     25%     50%     75%     max
// sepal_length  150.0  5.8433  0.8281  4.3000  5.1000  5.8000  6.4000  7.9000
// sepal_width   150.0  3.0573  0.4359  2.0000  2.8000  3.0000  3.3000  4.4000
// ...
// Access columns
val sepalLength = iris("sepal_length")
val species = iris.species // Dynamic column access
// Filter by conditions
val largeSepals = iris(iris("sepal_length") > 6.0)
val setosa = iris(iris("species") == "setosa")
// Complex filtering
val filtered = iris(
  (iris("sepal_length") > 5.0) &&
  (iris("petal_width") < 2.0)
)
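Because a filter returns another dataset, these operations compose; a small sketch using only calls that appear elsewhere in this README:
// Mean sepal length of just the setosa rows
val setosaMean = iris(iris("species") == "setosa")("sepal_length").mean
// Descriptive statistics restricted to the filtered rows
iris(iris("sepal_length") > 6.0).describe.print()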
// Select specific rows by index
val firstTen = iris.iloc(0 to 9)
val sample = iris.sample(20) // Random sample of 20 rows
// Column statistics
iris("sepal_length").mean // Mean value
iris("sepal_length").std // Standard deviation
iris("sepal_length").min // Minimum value
iris("sepal_length").max // Maximum value
// Z-score normalization: each value becomes (value - column mean) / column std
val normalized = iris.zcode
// Remove outliers: keep only rows where every column's |z-score| is below 3
val cleaned = iris((iris.zcode.abs < 3).all)
// Sort by column
val sorted = iris.sort("sepal_length")
// Add new columns
val enhanced = iris.append(
  (iris("sepal_length") * iris("sepal_width")).rename("sepal_area")
)
// Drop columns
val reduced = iris.drop("sepal_width", "petal_width")
// Sample and split data
val shuffled = iris.shuffle
val Array(train, test) = iris.split(80, 20) // 80% train, 20% test (120 and 30 rows for iris)
// Load timestamp data
val timeseries = Dataset(
  Seq("timestamp", "value"),
  Seq(
    Seq("2023-01-01 09:00:00", 100.5),
    Seq("2023-01-01 10:00:00", 102.3),
    Seq("2023-01-01 11:00:00", 98.7)
  )
)
// Time-based operations (Instant comes from the standard library)
import java.time.Instant

val recent = timeseries(timeseries("timestamp") > Instant.parse("2023-01-01T09:30:00Z"))
import io.github.edadma.tabula.PG
val connection = PG.connect("localhost", "mydb", "user", "password", 5432)
val results = PG.query(
  "SELECT * FROM sales WHERE revenue > 1000",
  connection
)
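Assuming the query result comes back as an ordinary Dataset (this README does not pin down the return type), the operations shown earlier should apply directly:
// Hedged sketch: treat the query result like any other dataset
results.head()                            // preview the first rows
val avgRevenue = results("revenue").mean  // column statistics as usual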
// Correlation and regression metrics
val predictions = model.predict(testData) // model and testData stand in for a fitted model and held-out data
val error = testData.error("predictions", "actual")
val r2Score = testData.r2("predictions", "actual")
val adjustedR2 = testData.adjustedR2("predictions", "actual")
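These are presumably the standard goodness-of-fit definitions; as a reference, here is a self-contained sketch of those formulas in plain Scala (an illustration, not tabula's actual implementation):
// R² = 1 - SS_res / SS_tot, computed from predicted and actual values
def r2(pred: Seq[Double], actual: Seq[Double]): Double =
  val meanY = actual.sum / actual.length
  val ssRes = pred.zip(actual).map((p, a) => (a - p) * (a - p)).sum // residual sum of squares
  val ssTot = actual.map(a => (a - meanY) * (a - meanY)).sum        // total sum of squares
  1.0 - ssRes / ssTot

// Adjusted R² penalizes model complexity: k is the number of predictors
def adjustedR2(pred: Seq[Double], actual: Seq[Double], k: Int): Double =
  val n = actual.length
  1.0 - (1.0 - r2(pred, actual)) * (n - 1).toDouble / (n - k - 1)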
// Value counts and frequency analysis
val speciesCounts = iris.counts("species")          // for iris: 50 rows per species
val speciesFreqs = iris.countsNormalize("species")  // relative frequencies (one third each)
Tabula automatically infers column types but also supports explicit type specification:
import io.github.edadma.tabula.Type.*
val dataset = Dataset(
  Seq("id", "score", "active", "created"),
  data, // data: a Seq of rows matching the declared column types
  types = Seq(IntType, FloatType, BoolType, TimestampType)
)
- Use sample() for working with large datasets during development
- Leverage iloc() for efficient row selection by index
- Consider splitting large datasets with split() for parallel processing
- Use shuffle() before splitting for better data distribution, as sketched below
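A short sketch combining the last two tips, using only calls shown earlier:
// Shuffle first so both partitions draw uniformly from the data, then split
val Array(train, test) = iris.shuffle.split(80, 20)
// During development, iterate on a small random sample instead of the full dataset
val dev = iris.sample(20)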
- Dataset.fromCSVFile(path) - Load from CSV file
- Dataset.fromTabFile(table, path) - Load from tab-separated file
- Dataset(columns, data) - Create from sequences
- Dataset(map) - Create from Map of column name to values
- dataset(columnName) - Select column
- dataset.iloc(indices) - Select rows by position
- dataset.loc(labels) - Select rows by label
- dataset(booleanMask) - Filter rows by boolean condition
- dataset.describe - Descriptive statistics
- dataset.mean, dataset.std, dataset.min, dataset.max - Basic column statistics
- dataset.zcode - Z-score normalization
- dataset.count() - Count non-null values
- dataset.sort(column) - Sort by column
- dataset.drop(columns...) - Remove columns
- dataset.append(other) - Add columns from another dataset
- dataset.sample(n) - Random sample
- dataset.split(percentages...) - Split into multiple datasets
This project is licensed under the ISC License. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit issues and pull requests.
- Initial release
- Cross-platform support (JVM, JS, Native)
- Core data manipulation operations
- Statistical analysis functions
- CSV and PostgreSQL data sources