Table of Contents generated with DocToc
- dbevolv
- Usage
- Writing database migrations
- Building more complex upgrade / downgrade scripts
- Migration design guidelines
- Rebasing a database
- Getting the list of tenants
- Computing the database name / schema name / index name / keyspace (depending on underlying db kind)
- Testing your newly added script locally before committing
- Project examples
- Upgrading / downgrading a database
- Inspecting the migrations inside a schema manager
- Getting the list of already installed migrations in a database
- Using a test instance in automated tests
- Development
Allows to evolve data store instances. Supports automatic testing, multi-tenancy, test database generation, custom scripts, big data stores.
- cassandra
- elasticsearch
- mysql
Create a git repo with the following structure:
/db.conf
/build.sbt
/version.sbt
/.gitignore
/project/plugins.sbt
/build.properties
/migrations/0010/
/0020/
/0030/
/0040/
/0050/
The db.conf
should contain the description of the data store schema. You must also specify the connection strings for all the environments. For example:
database_kind = cassandra
has_instance_for_each_tenant = true
schema_name = reverse_geo
app_name = reverse_geo-schema-manager
create_database_statement = "CREATE KEYSPACE @@DATABASE_NAME@@ WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"
test_configurations = [
{
tenant = "mycustomer1"
},
{
tenant = "mycustomer2"
}
]
dev {
host = "my-dev-cassandra-host1,my-dev-cassandra-host2,my-dev-cassandra-host3"
schema_version = "0050"
}
qa {
host = "<the qa host>"
schema_version = "0040"
}
preprod {
host = "<the preprod host>"
schema_version = "0030"
}
sandbox {
host = "<the sandbox host>"
create_database_statement = "CREATE KEYSPACE @@DATABASE_NAME@@ WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 }"
schema_version = "0030"
}
prod {
host = "<the prod host>"
create_database_statement = "CREATE KEYSPACE @@DATABASE_NAME@@ WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }"
schema_version = "0030"
}
Here are the different parameters you can configure:
- database_kind: which kind of data store we are targeting. See "Supported data stores" for valid values.
- schema_name: the logical name of this database schema.
- app_name: the name of this schema manager (required by app-util).
- schema_version: the migration version the given environment is supposed to be at. If not specified, all migrations will be applied. Specifying it is mandatory for dev, qa, preprod, sandbox, and prod.
- docker_namespace: optional. The namespace under which the various docker images will be tagged. Ex: if set to myregistry, the docker images will be tagged as myregistry/xyz.
- host: the host or hosts name(s) to connect to.
- port: the port to connect to. Leave empty for default port.
- create_database_statement: The CQL / SQL / HQL statement to use if the database does not even exists when running the schema manager. The
@@DATABASE_NAME@@
place holder will automatically be replaced by the actual schema / keyspace name (see also "Computing the database name / schema name / index name / keyspace" below). - name_provider_class: See "Computing the database name / schema name / index name / keyspace" below.
- test_configurations: configurations for which to generate test instances during the
buildTestContainer
task. This allows you to have various keyspaces / indices / databases within the test containers.
For multi-tenant databases:
- has_instance_for_each_tenant: whether this database have a distinct instance for each of your tenants. Default is 'false' (the database is a 'global' one).
- tenant_repository_class: when
has_instance_for_each_tenant
is true, you must supply a tenant repository. See 'Getting the list of tenants' below. - tenant_configuration_provider_class: when
has_instance_for_each_tenant
is true, you can supply an optional configuration provider that will inject a different property depending of the tenant. See 'Tenant specific configuration' below. - tenant: only used within
test_configurations
, defines a tenant name for which a test db instance should be created.
For Cassandra stores only:
- max_schema_agreement_wait_seconds: sets the maximum time to wait for schema agreement before returning from a DDL query (default: 30).
For Elasticsearch stores only:
- shard_number: how many shards the index should have.
- replica_number: in how many additional replicas each shard should be replicated (0 means no replication).
For MySQL stores only:
- username: the username to use to connect to the data store instance.
- password: the password to use to connect to the data store instance.
Note: most of the settings can have a default value at the top, but can be overriden for a given environment. See for example create_database_statement
in the above example.
The build.sbt
file should activate the Dbevolv
SBT plugin that will take care of everything:
enablePlugins(DbevolvPlugin)
organization := "your organization name"
The version.sbt
file should contain the initial version of this particular schema manager. Always 1.0.0 for new project, this will be automatically managed in Jenkins after each build:
version in ThisBuild := "1.0.0"
The build.properties
file should contain which SBT version to use:
sbt.version=0.13.11
The plugins.sbt
should point to this plugin on Artifactory (the funky piece of code make sure to always use the latest version available from Artifactory):
addSbtPlugin("com.mnubo" % "dbevolv-sbt-plugin" % "1.0.11")
The directories names in /migrations
constitute the migration versions. Migrations will be applied in the lexical order of those directory names. Ex: when asking dbevolv to upgrade to version '0002' in the above example, '0001' will be executed first, then '0002'.
Migration directories must contain 2 files named upgrade.???
and downgrade.???
. The extension depend on the data store type. For example, for Cassandra:
/migrations/0001/upgrade.cql
/downgrade.cql
The upgrade file should contain what it takes to upgrade a given database to the given version. The downgrade file should contain what it takes to downgrade from the given version to the previous version.
Each statement can be laid out on multiple lines, and be terminated by a ';' character. Lines starting with a '#' character will be considered as comments and will be ignored. Empty lines are ignored.
If you need complex logic, you can create a custom Java / Scala class and reference it as if it was a statement, with the '@@' keyword at the begining of the line. Example:
CREATE TABLE .... ;
# Some comment
@@com.mnubo.platform_cassandra.UpgradeFields;
Your class should be located at the usual Maven/SBT location. In this example: src/main/scala/com/mnubo/platform_cassandra/UpgradeFields.java. It must have a constructor with no parameters, and an execute method taking a 2 parameters.
- the connection to the database. Exact type of the connection depends on the data store type.
- the name of the database (postgres) / schema (mysql) / keyspace (cassandra) / index (elasticsearch)
Example:
package com.mnubo.platform_cassandra;
public class UpgradeFields {
public void execute(com.datastax.driver.core.Session connection, String dataseName) {
// Your upgrade logic here
}
}
Note: you can also use this trick in downgrade scripts.
If your custom script needs additional dependencies, you can add them in a build.sbt
file through the libraryDependencies SBT key. See SBT documentation
You migrations MUST be (and will be tested for):
- Idempotency. The schema manager must be able to run a migration twice without any effect on the end result. This is critical when something goes wrong during the application on a migration in production. We must be able to retry a fixed version of a faulty migration, or the same migration after a corruption is fixed.
- Perfect rollback. When we have dozens of namespaces, the likelyhood of failure in one of them is great. In that case, the schema manager must be able to rollback all the already migrated namespaces to stay consistent. Even if the upgrades involved a loss of information. In that situation, the upgrade must store the lost information somewhere to be able to retrieve it when rolling back.
- Immutable in production. Once a migration has been applied to production, you cannot modify the migration anymore. If you do, the schema manager will refuse to execute any further migrations.
- Forward compatible for the application in production. Obviously, it must not break the current version of the application. In other words, your migration must support both the new and the old version of your application.
Sometimes, especially when a database came through a lot of migrations, you are in a situation where lots of databases and columns are created in earlier migrations to be removed in later migrations, making the database quite long to create. This is especially problematic in multi-tenant databases that gets created for a new tenant. It can also happen that there is too many migrations, and that makes the build pretty long.
A solution to that problem is to 'rebase' the migrations. It means taking the result of all of those migrations, and make a single script having the same end result as them. dbevolv helps you do that in a safe way.
Begin by creating a new migration directory, with a new version number. Put a single script rebase.*ql
in it. Do not create a rollback.*ql
file, rolling back a rebase migration is not supported. Ex:
/migrations/2000/rebase.cql
dbevolv does not support the automatic filling of that script right now. So you will have to use the existing data store tools to forge it for you from the latest test image (see below).
As always, you can test locally that your rebase script is sound by running:
sbt buildTestContainer
Rebase migrations are treated a bit differently than the others. Lets take an example where we do have the following migrations:
0010
0020
0030
0100 (rebase)
0110
0200 (rebase)
0210
- first, dbevolv will make sure that the database resulting from a rebase script is the same as the one resulting from all the previous migrations.
- your rebase migrations do not need to be idempotent.
- when applied on an existing database (version <=
0030
in our example), the rebase scripts will not be applied, but all metadata about previous migrations up to the rebase will be erased. This is done for each rebase encountered along the way. For example, if the database had these migrations installed before:[0010, 0020]
, then after running the schema manager it would have[0200, 0210]
. - when applied on a new database, dbevolv will start at the latest rebase, and only apply further migrations. In our example, it would apply only
0200
and0210
because200
is the latest migration of typerebase
. - WARNING!: once rebased, you cannot go back to previous migrations anymore. Which means that rolling back a database will only bring you to the previous rebase, even if you asked to rollback to a previous migration. For example, if a database is at version
0210
in our example, rolling back to0030
will actually only bring the database to0200
.
Cleanup: after a given rebase
migration has been applied to all environments, you can safely delete the previous migration directories from the build.
In order for dbevolv to know for which tenants to create and maintain a database instance, you need to provide a class implementing TenantRepository. You can place the class in src/main/java/...
or src/main/scala/...
, or just reference the jar in the dependencies in your build.sbt
:
libraryDependencies += "mycompany" % "mytenantrepo" % "1.0.0"
The constructor must have one and one argument only, which is the typesafe config loaded from the db.conf
file. This allows you to configure your class easily.
package mycompany.tenants
import com.mnubo.dbevolv._
class MyTenantRepository(config: Config) extends TenantRepository {
// Configuration specific to this particular repository
private val host = config.getString("tenantdb.host")
private val port = config.getInt("tenantdb.port")
private val dbConnectionPool = ...
override def fetchTenants = {
using(dbConnectionPool) { connection =>
// Pseudo JDBC like API
connection
.execute("SELECT customer_name FROM customer")
.map(row => row.getString("customer_name"))
.sorted
}
}
}
Then, you could add your repository specific configuration in the db.conf
file. In the previous fictitious example, it would look like:
has_instance_for_each_tenant = true
tenant_repository_class = "mycompany.tenants.MyTenantRepository"
tenantdb.host = "mydbhost"
tenantdb.port = 3306
Sometimes, you need specific configuration for certain tenant. For example, large customer might require more Elasticsearch shards in their indices. You can achieve that by supplying a configuration provider. As for the tenant repository, you can reference the jar with your class in the dependencies:
libraryDependencies += "mycompany" % "mytenantconfigprovider" % "1.0.0"
The constructor must have one and one argument only, which is the typesafe config loaded from the db.conf
file.
package mycompany.tenants
import com.mnubo.dbevolv._
class MyTenantConfiguration(config: Config) extends TenantConfigurationProvider {
override def configFor(tenant: String) = {
// load config from a database or a config file for this particular tenant
}
override def close() = ...
}
Then, you could reference this class in the db.conf
file. In the previous fictitious example, it would look like:
has_instance_for_each_tenant = true
tenant_configuration_provider_class = "mycompany.tenants.MyTenantConfiguration"
Note: you can provide an empty config for a given tenant. dbevolv will simply apply the default configuration from the db.conf
file for this tenant.
The actual database / keyspace name will be computed the following way:
- for global databases: the logical schema_name will be used.
- Ex: myappdb
- for databases per tenant: the name will be suffixed with the customer's / tenant name.
- Ex: myappdb_mycustomer
Sometimes, this is not suitable. For example, QA keyspace names might be totally custom. Or historical keyspaces might be jammed together. For all these use cases, you can customize the keyspace name provider. For example:
package com.mnubo.ingestion
import com.typesafe.config.Config
class LegacyDatabaseNameProvider extends DatabaseNameProvider {
private val default = new DefaultDatabaseNameProvider
def computeDatabaseName(schemaLogicalName: String, tenant: Option[String], config: Config) =
tenant
}
And then, in your db.conf
file, you need to override the default database name provider in the relevant environments:
prod {
host = "<the prod host>"
name_provider_class = "com.mnubo.ingestion.LegacyDatabaseNameProvider"
}
Just run:
sbt buildTestContainer
This will build the test container, through all the migration scripts. You can then look at 'Using a test instance in automated tests' to connect to this instance and verify it is working well.
To get usage:
docker run -it --rm -e ENV=<environment name> <schema_name>-mgr:latest --help
This should result to something like:
Upgrades / downgrades the mydatabase database to the given version for all the tenants.
Usage: docker run -it --rm -v $HOME/.docker/:/root/.docker/ -v /var/run/docker.sock:/var/run/docker.sock -v $(which docker):$(which docker) -e ENV=<environment name> mydatabase-mgr:latest [options]
-v <value> | --version <value>
The version you want to upgrade / downgrade to. If not specified, will upgrade to latest version.
-t <value> | --tenant <value>
The tenant you want to upgrade / downgrade to. If not specified, will upgrade all tenants.
--history
Display history of database migrations instead of migrating the database.
--help
Display this schema manager usage.
Note:
the volume mounts are only necessary when upgrading a schema. You can omit them when downgrading, getting help, or display the history.
Example:
docker run -it --rm -v $HOME/.docker/:/root/.docker/ -v /var/run/docker.sock:/var/run/docker.sock -v $(which docker):$(which docker) -e ENV=dev mydatabase:latest --version 0004
Note: the help message is slightly different for the databases that don't have one instance by tenant (global databases).
The schema manager will upgrade one tenant at a time. For each tenant, it will apply (or downgrade) all the necessary migration to reach the target version. If one of the tenants upgrade fail, it stopped. It is recommended to rollback all tenants to the origin version immediately, so the faulty migration could be fixed and reapplied to all of the migrations. Since migrations are checksumed, you cannot have a system with different flavours of the same migrations. This would make any subsequent upgrades or downgrades to fail immediately.
The schema manager will also perform some validation before starting to upgrade. It will check that the schema of the target instance match the expected schema (tables, columns, types).
If the database has been corrupted so that smooth migration is impossible, you will see a message explaining the issue(s) encountered and how to fix it:
18:10:04.988 [main] ERROR com.mnubo.dbevolv.Table - Table mytable does not contain a column somecolumn (type = text)
18:10:04.989 [main] ERROR com.mnubo.dbevolv.Table - Table mytable does not contain a column someothercolumn (type = double)
Exception in thread "main" java.lang.Exception: Oops, it seems the target schema of orb3a1 is not what it should be... Please call your dearest developer for helping de-corrupting the database.
at com.mnubo.dbevolv.DatabaseMigrator$.upgrade(DatabaseMigrator.scala:91)
at com.mnubo.dbevolv.DatabaseMigrator$.migrate(DatabaseMigrator.scala:69)
...
In this particular example, to repair the database, you need to create the objectmodel
and hdfs_import_period_sec
columns in the odainterpolationparams
table. You should then be able to restart the upgrade.
docker run -it --rm --entrypoint=/bin/bash <schema_name>-mgr:latest
ls -la /app/migrations
docker run -it --rm -e ENV=<environment name> <schema_name>-mgr:latest --history
Example output in dev on the mydatabase Cassandra database:
History of myfirstcustomer @ host1,host2,host3:
Version Date Checksum
0010 2015-05-11T00:00:00.000Z 555f57888cf9bea47e97bf6c9b7e9d3f
0020 2015-05-11T00:00:00.000Z c123fc10e716daf6275dfe67efa1efac
History of mysecondcustomer @ host1,host2,host3:
Version Date Checksum
0010 2015-05-11T00:00:00.000Z 555f57888cf9bea47e97bf6c9b7e9d3f
0020 2015-05-11T00:00:00.000Z c123fc10e716daf6275dfe67efa1efac
History of mythirdcustomer @ host1,host2,host3:
Version Date Checksum
0010 2015-05-11T00:00:00.000Z 555f57888cf9bea47e97bf6c9b7e9d3f
0020 2015-05-11T00:00:00.000Z c123fc10e716daf6275dfe67efa1efac
Each time a new migration is pushed to Gitlab, Jenkins will generate a test database instance with all the tables up to date. To start it:
docker run -dt -p <database kind standard port>:<desired_port> test-<schema_name>:latest
For example, with the Cassandra reverse_geo database:
docker run -dt -p 40155:9042 test-reverse_geo:latest
This will start a Cassandra instance, with a reverse_geo
keyspace (the logical database name) containing all the reverse_geo tables up to date. You can point your tests to use the 40155 port on the DOCKER_HOST in order to create a session.
The schema manager builder is actually a SBT plugin. To test the sbt plugin, we are using the scripted sbt plugin (yes, a pluging to test a plugin...). To launch the (quite long) tests, do:
sbt library/publishLocal dbevolvElasticsearch/publishLocal dbevolvElasticsearch2/publishLocal scripted
And go fetch a cup of coffee, you'll have time.
If you want to runs tests only on one kind of data store, specify the test build directory you want to fire (relative to src/sbt-test:
sbt library/publishLocal dbevolvElasticsearch/publishLocal dbevolvElasticsearch2/publishLocal "scripted schema-manager-generator/cassandradb"
Documentation for the scripted plugin is not the best. You can find a tutorial here: Testing SBT plugins