snappydatainc / snappydata

SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

Website GitHub

Introduction

SnappyData (aka TIBCO ComputeDB community edition) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for unified analytics workloads.

By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources and stream processing all in one unified cluster.

The primary use case of SnappyData is to provide analytics at interactive speeds over large volumes of data with minimal or no pre-processing of the dataset. For instance, often, there is no need to pre-aggregate/reduce or generate cubes over your large data sets for ad-hoc visual analytics. This is made possible by smartly managing data in-memory, dynamically generating code using vectorization optimizations and maximizing the potential of modern multi-core CPUs. SnappyData enables complex processing on large data sets in sub-second timeframes.

SnappyData Positioning

!!!Note SnappyData is not another Enterprise Data Warehouse (EDW) platform, but rather a nimble computational cluster that augments traditional EDWs and data lakes.

Important Capabilities

  • Easily discover and catalog big data sets: You can connect and discover datasets in SQL DBs, Hadoop, NoSQL stores, file systems, or even cloud data stores such as S3 by using SQL, infer schemas automatically and register them in a secure catalog. A wide variety of data formats are supported out of the box such as JSON, CSV, text, Objects, Parquet, ORC, SQL, XML, and more.

  • Rich connectivity: SnappyData is built with Apache Spark inside. Therefore, any data store that has a Spark connector can be accessed using SQL or by using the Spark RDD/Dataset API. Virtually all modern data stores do have Spark connector. see SparkPackages). You can also dynamically deploy connectors to a running SnappyData cluster.

  • Virtual or in-memory data: You can decide which datasets need to be provisioned into distributed memory or left at the source. When the data is left at source, after being modeled as a virtual/external tables, the analytic query processing is parallelized, and the query fragments are pushed down wherever possible and executed at high speed. When speed is essential, applications can selectively copy the external data into memory using a single SQL command.

  • In-memory Columnar + Row store: You can choose in-memory data to be stored in any of the following forms:

    • Columnar: The form that is compressed and designed for scanning/aggregating large data sets.
    • Row store: The form that has an extremely fast key access or highly selective access. The columnar store is automatically indexed using a skipping index. Applications can explicitly add indexes for the row store.
  • High performance: When data is loaded, the engine parallelizes all the accesses by carefully taking into account the available distributed cores, the available memory, and whether the source data can be partitioned to deliver extremely high-speed loading. Therefore, unlike a traditional warehouse, you can bring up SnappyData whenever required, load, process, and tear it down. Query processing uses code generation and vectorization techniques to shift the processing to the modern-day multi-core processor and L1/L2/L3 caches to the possible extent.

  • Flexible rich data transformations: External data sets when discovered automatically through schema inference will have the schema of the source. Users can cleanse, blend, reshape data using a SQL function library (Spark SQL+) or even submit Apache Spark jobs and use custom logic. The entire rich Spark API is at your disposal. This logic can be written in SQL, Java, Scala, or even Python.*

  • Prepares data for data science: Through the use of Spark API for statistics and machine learning, raw or curated datasets can be easily prepared for machine learning. You can understand the statistical characteristics such as correlation, independence of different variables and so on. You can generate distributed feature vectors from your data that is by using processes such as one-hot encoder, binarizer, and a range of functions built into the Spark ML library. These features can be stored back into column tables and shared across a group of users with security and avoid dumping copies to disk, which is slow and error-prone.

  • Stream ingestion and liveness: While it is common to see query service engines today, most resort to periodic refreshing of data sets from the source as the managed data cannot be mutated — for example query engines such as Presto, HDFS formats like parquet, etc. Moreover, when updates can be applied pre-processing, re-shaping of the data is not necessarily simple. In SnappyData, operational systems can feed data updates through Kafka to SnappyData. The incoming data can be CDC events (insert, updates, or deletes) and can be easily ingested into respective in-memory tables with ease, consistency, and exactly-once semantics. The Application can apply smart logic to reduce incoming streams, apply transformations, etc. by using APIS for Spark structured streaming.*

  • Approximate Query Processing(AQP): When dealing with huge data sets, for example, IoT sensor streaming time-series data, it may not be possible to provision the data in-memory, and if left at the source (say Hadoop or S3) your analytic query processing can take too long. In SnappyData, you can create one or more stratified data samples on the full data set. The query engine automatically uses these samples for aggregation queries, and a nearly accurate answer returned to clients. This can be immensely valuable when visualizing a trend, plotting a graph or bar chart.

  • Access from anywhere: You can use JDBC, ODBC, REST, or any of the Spark APIs. The product is fully compatible with Spark 2.1.1. SnappyData natively supports modern visualization tools such as TIBCO Spotfire, Tableau, and Qlikview.

Downloading and Installing SnappyData

You can download and install the latest version of SnappyData from github or you can download the enterprise version that is TIBCO ComputeDB from here. Refer to the documentation for installation steps.

Getting Started

Multiple options are provided to get started with SnappyData. You can run SnappyData on your laptop using any of the following options:

  • On-premise clusters
  • AWS
  • Docker
  • Kubernetes

You can find more information on options for running SnappyData here.

Quick Test to Measure Performance of SnappyData vs Apache Spark

If you are already using Spark, experience upto 20x speedup for your query performance with SnappyData. Try out this test using the Spark Shell.

Documentation

To understand SnappyData and its features refer to the documentation.

Other Relevant content

  • Paper on Snappydata at Conference on Innovative Data Systems Research (CIDR) - Info on key concepts and motivating problems.
  • Another early Paper that focuses on overall architecture, use cases, and benchmarks. ACM Sigmod 2016.
  • TPC-H benchmark comparing Apache Spark with SnappyData
  • Checkout the SnappyData blog for developer content
  • TIBCO community page for the latest info.

Community Support

We monitor the following channels comments/questions:

Stackoverflow Stackoverflow SlackSlack Gitter Gitter Mailing List Mailing List Reddit Reddit JIRA JIRA

Link with SnappyData Distribution

Using Maven Dependency

SnappyData artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

groupId: io.snappydata
artifactId: snappydata-cluster_2.11
version: 1.1.1

Using SBT Dependency

If you are using SBT, add this line to your build.sbt for core SnappyData artifacts:

libraryDependencies += "io.snappydata" % "snappydata-core_2.11" % "1.1.1"

For additions related to SnappyData cluster, use:

libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.1.1"

You can find more specific SnappyData artifacts here

!!!Note If your project fails when resolving the above dependency (that is, it fails to download javax.ws.rs#javax.ws.rs-api;2.1), it may be due an issue with its pom file.
As a workaround, you can add the below code to your build.sbt:

val workaround = {
  sys.props += "packaging.type" -> "jar"
  ()
}

For more details, refer https://github.com/sbt/sbt/issues/3618.

Building from Source

If you would like to build SnappyData from source, refer to the documentation on building from source.

What is the Delta between SnappyData and Apache Spark? <!---Can we use another word instead of Delta.--->

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight.

The SnappyData Approach

Snappy Architecture

SnappyData Architecture

At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation.
The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

Essentially, we turn Spark into an in-memory operational database capable of transactions, point reads, writes, working with Streams (Spark) and running analytic SQL queries. Or, it is an in-memory scale out Hybrid Database that can execute Spark code, SQL, or even Objects.

Streaming Example - Ad Analytics

Here is a stream + Transactions + Analytics use case example to illustrate the SQL as well as the Spark programming approaches in SnappyData - Ad Analytics code example. Here is a screencast that showcases many useful features of SnappyData. The example also goes through a benchmark comparing SnappyData to a Hybrid in-memory database and Cassandra.

Contributing to SnappyData

If you are interested in contributing, please visit the community page for ways in which you can help.