Apache Cassandra Spark Connector

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®.

This is a fork of datastax/spark-cassandra-connector that adds features specific to ScyllaDB and to the needs of the ScyllaDB Migrator.

Changes compared to the original library

Adds support for skipping some token ranges when reading a table, and records in a Spark accumulator the token ranges that have been written.
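To illustrate the idea behind this change, here is a minimal sketch in plain Scala with no Spark or connector dependency. The names `TokenRange`, `TokenRangeProgress`, `remaining`, and `copyRanges` are hypothetical and are not the fork's API, and a mutable set stands in for the Spark accumulator. It only shows how recording written ranges lets a later run skip them.

```scala
// Hypothetical model of a token range; not the connector's own type.
final case class TokenRange(start: Long, end: Long)

object TokenRangeProgress {
  // Ranges still to be read: everything not already recorded as written.
  def remaining(all: Seq[TokenRange], written: Set[TokenRange]): Seq[TokenRange] =
    all.filterNot(written.contains)

  // Processes each range and records it; the mutable set plays the role
  // that a Spark accumulator would play in a real job.
  def copyRanges(ranges: Seq[TokenRange], write: TokenRange => Unit): Set[TokenRange] = {
    val written = scala.collection.mutable.Set.empty[TokenRange]
    ranges.foreach { r =>
      write(r)      // placeholder for writing the rows of this range
      written += r  // recorded only after the write returned normally
    }
    written.toSet
  }
}
```

For example, if a first run wrote two of three ranges before stopping, passing the recorded set to `remaining` returns only the third range, so a restarted job would read just that one.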

The complete changelog can be viewed here: master...scylla-4.x.

Quick Links

What	Where
Community	Chat with us at Apache Cassandra
Scala Docs	Most Recent Release (3.5.1): Connector API docs, Connector Driver docs
Latest Production Release	3.5.1

News

3.5.1

The latest release of the Spark-Cassandra-Connector introduces support for vector types. This allows developers to work with Cassandra 5.0 and Astra vectors within the Spark ecosystem, which in turn supports processing and analysis of AI and Retrieval-Augmented Generation (RAG) data.

Features

This library lets you expose Cassandra tables as Spark RDDs and Datasets/DataFrames, write Spark RDDs and Datasets/DataFrames to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.

Compatible with Apache Cassandra version 2.1 or higher (see table below)
Compatible with Apache Spark 1.0 through 3.5 (see table below)
Compatible with Scala 2.11, 2.12 and 2.13
Exposes Cassandra tables as Spark RDDs and Datasets/DataFrames
Maps table rows to CassandraRow objects or tuples
Offers customizable object mapper for mapping rows to objects of user-defined classes
Saves RDDs back to Cassandra by implicit saveToCassandra call
Deletes rows and columns from Cassandra by implicit deleteFromCassandra call
Joins with a subset of Cassandra data using joinWithCassandraTable call for RDDs, and optimizes joins with data in Cassandra when using Datasets/DataFrames
Partitions RDDs according to Cassandra replication using repartitionByCassandraReplica call
Converts data types between Cassandra and Scala
Supports all Cassandra data types including collections
Filters rows on the server side via the CQL WHERE clause
Allows for execution of arbitrary CQL statements
Works well with Cassandra Virtual Nodes
Can be used from any language that supports the Datasets/DataFrames API: Python, R, etc.

Version Compatibility

The connector project has several branches, each of which maps to different supported versions of Spark and Cassandra. For previous releases the branch is named "bX.Y" where X.Y is the major+minor version; for example the "b1.6" branch corresponds to the 1.6 release. The "trunk" branch will normally contain development for the next connector release in progress.

Currently, the following branch is actively supported: 4.x (scylla-4.x).

Connector	Spark	Cassandra	Cassandra Java Driver	Minimum Java Version	Supported Scala Versions
4.0.0	3.5.x	2.1.5*, 2.2, 3.x, 4.x, 5.0	4.18.1	8	2.12, 2.13

Hosted API Docs

API documentation for the Scala and Java interfaces is available online:

Latest.

Download

This project is available on the Maven Central Repository. For SBT to download the connector binaries, sources and javadoc, put this in your project SBT config:

libraryDependencies += "com.scylladb" %% "spark-scylladb-connector" % "4.0.0"

The default Scala version for Spark 3.0+ is 2.12; please choose the appropriate build. See the FAQ for more information.
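As a rough illustration of how the dependency line fits into a build, here is a minimal `build.sbt` sketch. Only the connector coordinates come from this README; the Scala patch version, the Spark artifact and version, and the `provided` scope are assumptions chosen to match the Spark 3.5.x row of the compatibility table, so adjust them to your cluster.

```scala
// build.sbt (sketch): everything except the connector coordinates is an assumption
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is usually supplied by the cluster at runtime, hence "provided"
  "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided",
  // %% appends the Scala binary version (_2.12 here) to the artifact name
  "com.scylladb" %% "spark-scylladb-connector" % "4.0.0"
)
```

Because `%%` derives the artifact suffix from `scalaVersion`, switching the build to Scala 2.13 selects the `_2.13` artifact without changing the dependency line.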

Building

See Building And Artifacts

Documentation

Online Training

In DS320: Analytics with Spark, you will learn how to effectively and efficiently solve analytical problems with Apache Spark, Apache Cassandra, and DataStax Enterprise. You will learn about Spark API, Spark-Cassandra Connector, Spark SQL, Spark Streaming, and crucial performance optimization techniques.

Community

Reporting Bugs

New issues may be reported using JIRA. Please include all relevant details including versions of Spark, Spark Cassandra Connector, Cassandra and/or DSE. A minimal reproducible case with sample code is ideal.

Mailing List

Questions and requests for help may be submitted to the user mailing list.

Q/A Exchange

For community help see https://cassandra.apache.org/_/community.html

Contributing

To protect the community, all contributors are required to sign the Apache Software Foundation's Contribution License Agreement.

Tips for Developing the Spark Cassandra Connector

Checklist for contributing changes to the project:

Create a CASSANALYTICS JIRA
Make sure that all unit tests and integration tests pass
Add an appropriate entry at the top of CHANGES.txt
If the change has any end-user impacts, also include changes to the ./doc files as needed
Prefix the pull request description with the JIRA number, for example: "SPARKC-123: Fix the ..."
Open a pull-request on GitHub and await review

Old issues from before the donation to the ASF and the Apache Cassandra project can be found in the SPARKC JIRA.

Testing

Run make help to see all available targets. Common commands:

make compile                       # Compile all modules
make test-unit                     # Run unit tests
make lint                          # Check code with scalafix
make lint-fix                      # Auto-fix scalafix issues
make test-integration-cassandra    # Run integration tests with Cassandra
make test-integration-scylla       # Run integration tests with ScyllaDB

Integration tests require CCM (Cassandra Cluster Manager), Python 3.10+, and Java 17. Install CCM via make install-cassandra-ccm or make install-scylla-ccm.

Version aliases like LATEST, LTS-LATEST, and 4-LATEST are resolved automatically:

CASSANDRA_VERSION=4-LATEST make test-integration-cassandra
SCYLLA_VERSION=LATEST make test-integration-scylla

Or use exact versions:

CASSANDRA_VERSION=4.1.7 make test-integration-cassandra
SCYLLA_VERSION=2024.2.1 make test-integration-scylla

CI/CD

The project uses GitHub Actions. Workflows are in .github/workflows/.

Integration Tests (`integration-tests.yml`)

Runs on every push and PR to scylla-4.x:

Compile -- Compiles all modules
Lint -- Runs scalafix checks
Test Matrix -- Integration tests across database types and versions (ScyllaDB LATEST/LTS-LATEST, Cassandra 3-LATEST/4-LATEST/5-LATEST)

The workflow can also be triggered manually with custom database and Scala version inputs.

Release (`release.yml`)

Manual workflow for publishing to Maven Central. Options: dry-run, skip-tests, target-tag (for re-releases).

The workflow removes -SNAPSHOT from the version, creates a tag, publishes the signed artifact to Maven Central via Sonatype, then bumps the version for the next development iteration.
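The version handling in these steps can be sketched in a few lines of Scala. This is not the workflow's actual implementation: the object `ReleaseVersion` and its methods are hypothetical, the version numbers are illustrative, and the assumption that the next development version increments the last numeric component is ours, since the text above does not say which component is bumped.

```scala
object ReleaseVersion {
  private val SnapshotSuffix = "-SNAPSHOT"

  // "4.0.1-SNAPSHOT" -> "4.0.1"; a version without the suffix is returned unchanged.
  def toRelease(version: String): String =
    version.stripSuffix(SnapshotSuffix)

  // Assumption: the next development version increments the last numeric
  // component of the release and re-appends the snapshot suffix.
  def nextDevelopment(release: String): String = {
    val parts = release.split('.').map(_.toInt)
    (parts.init :+ (parts.last + 1)).mkString(".") + SnapshotSuffix
  }
}
```

Under these assumptions, a development version of `4.0.1-SNAPSHOT` would be released as `4.0.1` and followed by `4.0.2-SNAPSHOT`.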

Debugging CI Failures

Check the failing job's logs in the GitHub Actions tab
Look for test report annotations on the PR
Reproduce locally with the same make target and environment variables
For CCM issues, verify version resolution: make resolve-cassandra-version or make resolve-scylla-version

Branching Model

Our branch scylla-4.x is based on commit dbbf02890605692d163572cda4b2462993754d7b. It introduces binary-incompatible changes compared to the upstream version 3.5.x.

We should occasionally merge the upstream changes into our fork.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Apache Cassandra, Apache Spark, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.
