Spark Universal Migrator is a Scala/Spark library for full-load table-by-table migration from Oracle to Hive. It captures an Oracle snapshot SCN, reads source rows through Spark JDBC using ROWID range queries built from Oracle extent metadata, applies an explicit Oracle-to-Spark schema policy, and writes the result into Hive through a temporary table plus INSERT OVERWRITE.
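To make the read strategy concrete, here is a minimal sketch of the kind of per-partition query this approach implies. It is illustrative only: the helper name, the extent parameters (`objId`, `fileId`, block bounds), and the literal SCN are assumptions, not the library's actual internals. Oracle's `DBMS_ROWID.ROWID_CREATE` builds ROWID bounds from extent metadata, and `AS OF SCN` pins every chunk to the same snapshot.

```scala
object RowidRangeSketch {
  // Hypothetical helper: builds one ROWID-bounded snapshot query for a single
  // extent chunk. The real library derives its chunks from Oracle extent
  // metadata; the identifiers below are illustrative only.
  def chunkQuery(owner: String, table: String, scn: Long,
                 objId: Long, fileId: Long, startBlock: Long, endBlock: Long): String =
    s"""SELECT * FROM $owner.$table AS OF SCN $scn
       |WHERE ROWID BETWEEN
       |  DBMS_ROWID.ROWID_CREATE(1, $objId, $fileId, $startBlock, 0)
       |  AND DBMS_ROWID.ROWID_CREATE(1, $objId, $fileId, $endBlock, 32767)""".stripMargin

  def main(args: Array[String]): Unit = {
    // One chunk of HR.EMPLOYEES, read at a fixed snapshot SCN.
    val q = chunkQuery("HR", "EMPLOYEES", scn = 1234567L,
      objId = 75123L, fileId = 4L, startBlock = 128L, endBlock = 255L)
    println(q)
  }
}
```

Running one such query per extent range gives Spark a set of non-overlapping partitions that together cover the table, all consistent as of the captured SCN.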
- Apache Spark 3.5.7
- Scala 2.12.x
- Oracle as the source system
- Hive metastore / Hive-enabled Spark as the target system
- Oracle JDBC driver support on the runtime classpath
- Full-load data migration
- Table-by-table execution
- Range-based reads using Oracle ROWID
- Snapshot-based reads through a captured SCN
- Oracle type handling modes:
  - `spark`: keep the Spark JDBC inferred types for ambiguous Oracle `NUMBER`
  - `oracle`: profile Oracle `NUMBER` columns without precision/scale metadata
  - `skip`: avoid extra profiling and fall back to `StringType` for ambiguous `NUMBER`
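The three modes can be read as a small decision table. The sketch below is illustrative only, not the library's implementation: the helper name is invented, and Spark type names are returned as plain strings so the example stays dependency-free.

```scala
object NumberTypeSketch {
  // Illustrative decision table for resolving an ambiguous Oracle NUMBER
  // column (one with no precision/scale metadata). Hypothetical helper;
  // Spark types are represented as strings to avoid a spark-sql dependency.
  def resolve(mode: String, sparkInferred: String, profiledScale: Option[Int]): String =
    mode match {
      case "spark"  => sparkInferred                        // trust Spark JDBC inference
      case "oracle" => profiledScale match {                // use profiled column data
        case Some(0)     => "LongType"                      // only integral values observed
        case Some(scale) => s"DecimalType(38,$scale)"       // fractional values observed
        case None        => "StringType"                    // profiling inconclusive
      }
      case "skip"   => "StringType"                         // no profiling at all
      case other    => sys.error(s"unknown typeCheck mode: $other")
    }

  def main(args: Array[String]): Unit = {
    println(resolve("skip", "DecimalType(38,10)", None))      // StringType
    println(resolve("oracle", "DecimalType(38,10)", Some(2))) // DecimalType(38,2)
  }
}
```

The trade-off: `oracle` costs an extra profiling pass but yields tighter types, while `skip` is cheapest and always safe because any `NUMBER` value round-trips through a string.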
Out of scope:
- CDC / change capture
- Incremental load orchestration
- Schema evolution management across runs
- Environment-variable based configuration
- Standalone CLI entrypoint in this repository
Run the test suite with:
```
sbt test
```

The test suite covers SQL generation, schema conversion, Spark session creation, Spark-side JDBC load composition, and Hive overwrite behavior.
Use the NewSpark.migrate(...) API from your own application entrypoint:
```scala
import queukat.spark_universal.NewSpark

object ExampleMigration {
  def main(args: Array[String]): Unit = {
    NewSpark.migrate(
      url = "jdbc:oracle:thin:@//localhost:1521/ORCL",
      oracleUser = "your_oracle_username",
      oraclePassword = "your_oracle_password",
      tableName = "employees",
      owner = "HR",
      hivetable = "employees_hive",
      numPartitions = 8,
      fetchSize = 1000,
      typeCheck = "spark"
    )
  }
}
```

- The library expects Oracle users with access to the required metadata views used to compute extent ranges.
- Unsupported Oracle types fail fast during schema conversion instead of being silently downgraded.
- Temporary Hive table names are generated uniquely per migration run.
- Logging stays on the existing `slf4j` facade, so the host Spark application keeps control over the final logging backend.
- The library now emits stage-oriented log messages such as `[MIGRATE]`, `[SCHEMA]`, `[LOAD]`, and `[HIVE]` for easier scanning.
- ANSI color can be forced on or off with `-Dspark.universal.log.color=true|false`; by default the library only colors logs when it detects an interactive terminal.
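The notes above mention unique temporary table names and the temp-table-plus-`INSERT OVERWRITE` write path. A minimal sketch of that flow, with the caveat that the naming scheme and SQL shape here are assumptions for illustration, not the library's actual identifiers:

```scala
import java.util.UUID

object HiveOverwriteSketch {
  // Hypothetical name generator: one unique staging table per migration run,
  // so concurrent or retried runs never collide.
  def tempName(target: String): String =
    s"${target}_tmp_${UUID.randomUUID().toString.replace("-", "")}"

  // Hypothetical final step: atomically replace the target table's contents
  // from the staging table.
  def overwriteSql(target: String, temp: String): String =
    s"INSERT OVERWRITE TABLE $target SELECT * FROM $temp"

  def main(args: Array[String]): Unit = {
    val tmp = tempName("employees_hive")
    println(overwriteSql("employees_hive", tmp))
  }
}
```

Staging into a throwaway table first means a failed load leaves the target untouched; the overwrite only runs once all partitions have been written.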
The repository includes GitHub Actions workflows for CI and Maven Central publishing. CI runs `sbt test`, and publishing runs `sbt +publishSigned` on release.