This repository contains the codebases used for recovering bad rows emitted by a Snowplow pipeline.
Snowplow pipelines are non-lossy: if something goes wrong during, for example, schema validation or enrichment, the payloads (alongside the errors that occurred) are stored in a bad rows storage solution, be it a data stream or object storage, instead of being discarded.
The goal of recovery is to fix the payloads contained in these bad rows so that they are ready to be processed successfully by a Snowplow enrichment platform.
For detailed documentation, see [docs.snowplow.io](https://docs.snowplow.io).
The configuration mechanism allows for flexibility while taking into account the most common use cases. A configuration is a self-describing JSON built around three main concepts:
- Steps are individual modifications applied to BadRow payloads as atomic parts of recovery flows (scenarios).
- Conditions are boolean expressions that operate on BadRow fields. If a set of conditions is satisfied, the corresponding steps are applied; otherwise, the next set of conditions is checked. If no conditions match, the row is marked as failed with a missing configuration.
- Flows are sequences of Steps applied one by one.
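As a minimal sketch, a recovery configuration ties a BadRow type to flows of conditions and steps. The schema URIs, operations, and JSONPaths below are illustrative assumptions, not the authoritative schema; see docs.snowplow.io for the exact format:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/4-0-0",
  "data": {
    "iglu:com.snowplowanalytics.snowplow.badrows/adapter_failures/jsonschema/1-0-0": [
      {
        "name": "fix-misspelled-vendor",
        "conditions": [
          {
            "op": "Test",
            "path": "$.processor.artifact",
            "value": { "value": "beam-enrich" }
          }
        ],
        "steps": [
          {
            "op": "Replace",
            "path": "$.raw.vendor",
            "match": "com\\.snowplowanalytcs",
            "value": "com.snowplowanalytics"
          }
        ]
      }
    ]
  }
}
```

Each key under `data` selects the BadRow type to match, and each flow in the array applies its steps only when its conditions hold.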
- Define config
- Encode config (see the sketch after this list)
- Choose runner and deploy:
  - Beam
  - Spark (deprecated)
  - Flink
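A common way to encode the configuration is base64. The jar name, flags, and paths in this sketch are assumptions for illustration; consult the setup guide for the exact invocation for your chosen runner:

```bash
# Base64-encode the recovery configuration so it can be passed as one argument
RECOVERY_CONFIG=$(base64 -w0 recovery_config.json)

# Hypothetical Flink submission: the jar name and flags are illustrative,
# not the definitive CLI for this repository
flink run snowplow-event-recovery-flink.jar \
  --config "$RECOVERY_CONFIG" \
  --input s3://my-bucket/bad-rows \
  --output s3://my-bucket/recovered
```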
There are several extension points for recovery: Steps, Conditions, or additional BadRow types.
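As a sketch of what a custom Step extension might look like in Scala, assuming a simplified Step abstraction that maps a payload to either a failure message or a modified payload (the trait and Payload type below are stand-ins, not this repository's actual interfaces):

```scala
// Simplified stand-ins for the repository's real abstractions
final case class Payload(body: String)

trait Step {
  def recover(payload: Payload): Either[String, Payload]
}

// A custom step that normalizes the payload body to lower case,
// failing the row when there is nothing to recover
object LowercaseBody extends Step {
  def recover(payload: Payload): Either[String, Payload] =
    if (payload.body.isEmpty) Left("empty payload: nothing to recover")
    else Right(payload.copy(body = payload.body.toLowerCase))
}
```

Conditions and additional BadRow types would be extended analogously, by implementing the corresponding abstractions in this repository.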
| Technical Docs | Setup Guide |
|---|---|
| [Technical Docs](https://docs.snowplow.io) | [Setup Guide][setup] |
Copyright 2023 Snowplow Analytics Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.