How to use Stream Processing with TIBCO ComputeDB
TIBCO ComputeDB supports the Structured Streaming model.
The TIBCO ComputeDB structured streaming programming model is the same as Spark structured streaming. The only difference is support for ingesting streaming dataframes into TIBCO ComputeDB tables through a built-in Sink.
TIBCO ComputeDB provides a built-in output Sink which simplifies ingestion of streaming dataframes into TIBCO ComputeDB tables. The Sink supports idempotent writes, which keeps data consistent when failures occur, and it supports all mutation operations: inserts, appends, updates, puts, and deletes.
The output data source name for SnappyData is snappysink. A minimal code example for structured streaming with a socket source and Snappy Sink is available here. You can also refer to the Structured Streaming Quickstart guide.
For more examples, refer to structured streaming examples. The following examples are shown:
|CDCExample.scala||An example explaining CDC (change data capture) use case with SnappyData streaming Sink.|
|CSVFileSourceExampleWithSnappySink.scala||An example of structured streaming depicting CSV file processing with Snappy Sink.|
|CSVKafkaSourceExampleWithSnappySink.scala||An example of structured streaming depicting processing of CSV coming from a Kafka source using Snappy Sink.|
|JSONFileSourceExampleWithSnappySink.scala||An example of structured streaming depicting JSON file processing with Snappy Sink.|
|JSONKafkaSourceExampleWithSnappySink.scala||An example of structured streaming depicting processing of JSON coming from a Kafka source using Snappy Sink.|
|SocketSourceExample.scala||An example showing usage of structured streaming with console Sink.|
|SocketSourceExampleWithSnappySink.scala||An example showing usage of structured streaming with SnappyData.|
This topic includes the following sections:
- Using TIBCO ComputeDB Structured Streaming API
- Handling Inserts, Updates and Deletes
- Event Processing Order
- Sink State Table
- Overriding Default Sink Behavior
- Resetting a Streaming Query
- Best Practices for Structured Streaming
Using TIBCO ComputeDB Structured Streaming API
The following code snippet, from the example, explains the usage of TIBCO ComputeDB's Structured Streaming API:
    val streamingQuery = structDF
      .filter(_.signal > 10) // some transformation on the input dataframe
      .writeStream
      .format("snappysink") // Required to ingest into TIBCO ComputeDB tables
      .queryName("Devices") // Required when using snappysink. Must be unique across the TIBCO ComputeDB cluster.
      .trigger(ProcessingTime("1 seconds"))
      .option("tableName", "devices") // Required: name of the snappy table where data will be ingested.
      .option("checkpointLocation", checkpointDirectory)
      .start()
TIBCO ComputeDB Specific options
The following are TIBCO ComputeDB specific options which can be configured for Structured Streaming:
|tableName||Name of the TIBCO ComputeDB table where the streaming data is ingested. The property is case-insensitive and is mandatory.|
|stateTableSchema||Name of the schema under which TIBCO ComputeDB's internal state table is created. This table is used to track the progress of the streaming queries and enables Snappy Sink to behave in an idempotent manner when a streaming query is restarted after an abrupt failure or planned downtime. This is a mandatory property when security is enabled for the TIBCO ComputeDB cluster. When security is disabled, Snappy Sink uses the APP schema by default to store the sink state table.|
|conflation||This is an optional boolean property with the default value set to false. When set to true, Snappy Sink conflates the events in each batch by key before processing them, as described under Event Processing Order.|
|sinkCallback||This is an optional property which is used to override the default Snappy Sink behavior. To override the default behavior, client code should implement SnappySinkCallback and pass the fully qualified name of the implementing class as the value of this property.|
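The options above are plain string key/value pairs, so they can be assembled in one place before being handed to the writer. The following is a minimal sketch only: the `SnappySinkOptions` object and its `build` method are hypothetical helpers, not part of the product API; only the option names come from this page.

```scala
object SnappySinkOptions {
  // Hypothetical helper: builds the option map for a snappysink writer.
  // tableName is mandatory; the other options are added only when supplied.
  def build(
      tableName: String,
      stateTableSchema: Option[String] = None,
      conflation: Boolean = false,
      sinkCallback: Option[String] = None): Map[String, String] = {
    require(tableName.trim.nonEmpty, "tableName is mandatory")
    val required = Map("tableName" -> tableName)
    val optional = Seq(
      stateTableSchema.map("stateTableSchema" -> _),
      if (conflation) Some("conflation" -> "true") else None, // false is the default, so it is omitted
      sinkCallback.map("sinkCallback" -> _)
    ).flatten
    required ++ optional
  }
}
```

The resulting map could then be passed in a single call, for example `.format("snappysink").options(SnappySinkOptions.build("devices", conflation = true))`, using the standard `options` method of Spark's stream writer.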
Handling Inserts, Updates and Deletes
A common use case for streaming is capturing writes into another store (Operational database such as RDB or NoSQL DB) and streaming the events through Kafka, applying Spark transformations, and ingesting into an analytics datastore such as TIBCO ComputeDB. This pattern is commonly referred to as Change-Data-Capture (CDC).
To support CDC, the source DataFrame must have a column named _eventType. The value in the _eventType column can be any of the following:
- 0 for insert events
- 1 for update events
- 2 for delete events
If the input data follows a different convention for event types, it must be transformed to match the format above.
Records whose _eventType value is not one of the values listed above are skipped.
The target TIBCO ComputeDB table must have key columns defined for a column table or primary key defined for a row table.
An example explaining the CDC use case is available here.
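When the source uses a different event type convention, the transformation mentioned above amounts to a small lookup. The sketch below assumes a hypothetical upstream convention of single-letter operation codes ("c" or "i" for insert, "u" for update, "d" for delete); those codes and the `EventTypeMapping` object are illustrative assumptions, while the target values 0, 1, and 2 are the ones listed above.

```scala
object EventTypeMapping {
  // Values expected in the _eventType column.
  val Insert = 0
  val Update = 1
  val Delete = 2

  // Maps a hypothetical single-letter operation code to an _eventType value.
  // Unknown codes map to None so the caller can decide to drop or log them
  // before they reach the sink.
  def toEventType(op: String): Option[Int] = op.trim.toLowerCase match {
    case "c" | "i" => Some(Insert)
    case "u"       => Some(Update)
    case "d"       => Some(Delete)
    case _         => None
  }
}
```

In a Spark job, the same mapping would typically be applied as a column expression or UDF that produces the _eventType column before the dataframe is written to the sink.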
If the _eventType column is not provided as part of the source dataframe, then the following is observed:
- In a target table with key columns/primary key defined, the put into operation is applied to all events.
- In a target table without key columns/primary key defined, the insert operation is applied to all the events.
Event Processing Order
Currently, the ordering of events across partitions is not supported. Event processing occurs independently in each partition. Hence, your application must ensure that all the events associated with a key are always delivered on the same partition (shard on the key).
If your incoming stream is not partitioned on the key column, the application should first repartition the dataframe on the key column. You can ignore this requirement if your incoming streams are continuously appending (for example, time series data) or if you are replacing data where ordering is irrelevant.
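Repartitioning on the key column can be done with Spark's standard `repartition` method before the dataframe is handed to the sink. This is a minimal sketch: the `KeyPartitioning` object and the `keyColumn` parameter are illustrative names, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object KeyPartitioning {
  // Hash-partitions the dataframe on the key column so that all events
  // for a given key end up in the same partition of each batch.
  def byKey(events: DataFrame, keyColumn: String): DataFrame =
    events.repartition(col(keyColumn))
}
```

The returned dataframe is then used in place of the original one when calling `writeStream`. Repartitioning introduces a shuffle, so it is only worth doing when the source is not already partitioned on the key.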
Within each batch, Snappy Sink processes the events in the following order:
- Processes all delete events (deletes relevant records from the target table)
- Processes all insert events (inserts relevant records into the target table)
- Processes all update events (applies the put into operation)
If the _eventType column is not provided as part of source dataframe, then the events are processed in the following manner:
- If key columns/primary keys are defined for the target table, then all the events are treated as update events and put into operation is performed for all events.
- If key columns/primary keys are not defined for the target table, then all the events are treated as insert events and insert operation is applied for all events.
When the conflation property is set to true, Snappy Sink first conflates the events of each batch, independently in each partition, using the following steps:
- Group all the events in the given partition by key.
- Convert an insert into a put into operation if the last event for a key is an insert and there is more than one event for that key.
- Keep the last event for each key and drop the remaining events.
This results in a batch, where there is at most a single entry per key.
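The conflation steps above can be modeled in a few lines of plain Scala. This is an illustrative sketch and not the Snappy Sink implementation: the `Event` type, the `Op` values, and the `conflate` function are all hypothetical names, and the input is assumed to be the events of a single partition in arrival order.

```scala
object Conflation {
  sealed trait Op
  case object Insert extends Op
  case object Update extends Op
  case object Delete extends Op
  case object PutInto extends Op

  final case class Event(key: String, op: Op, value: String = "")

  // Conflates one partition's events (given in arrival order) so that
  // at most one event per key remains.
  def conflate(events: Seq[Event]): Seq[Event] = {
    val keysInOrder = events.map(_.key).distinct // keep first-seen key order for readable output
    val byKey = events.groupBy(_.key)            // step 1: group by key (arrival order kept within a group)
    keysInOrder.map { key =>
      val forKey = byKey(key)
      val last = forKey.last                     // step 3: keep only the last event
      // step 2: a trailing insert preceded by other events for the key becomes a put into
      if (last.op == Insert && forKey.size > 1) last.copy(op = PutInto) else last
    }
  }
}
```

For instance, a partition containing Insert(k1), Delete(k1), Delete(k2), Insert(k2) conflates to Delete(k1) and PutInto(k2) under this model.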
By default, the conflation property is set to false. Therefore, the event processing semantics only ensure consistency when each key appears at most once in an incoming batch.
For example, if an incoming batch contains an Insert(key1) event followed by a Delete(key1) event, the record for key1 is still present in the target table after the batch is processed. This is because all Delete events are processed before Insert events, as per the event processing order explained above.
In such cases, enable conflation by setting the conflation property to true. With conflation enabled, if a batch contains an Insert(key1) event followed by a Delete(key1) event, Snappy Sink conflates the two events into a single event by selecting the last one, Delete(key1), and only that event is processed for key1. Processing Delete(key1) without first processing Insert(key1) does not result in a failure, because Delete events are ignored if the corresponding records do not exist in the target table.
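The Insert(key1) followed by Delete(key1) case can be reproduced with a small in-memory model of the default processing order. This is a sketch under simplifying assumptions: the target table is modeled as an immutable `Map` keyed by the key column, inserts and updates are both modeled as map writes, and all names (`DefaultOrder`, `applyBatch`, the event classes) are hypothetical.

```scala
object DefaultOrder {
  sealed trait Event { def key: String }
  final case class Insert(key: String, value: String) extends Event
  final case class Update(key: String, value: String) extends Event
  final case class Delete(key: String) extends Event

  // Applies one batch to an in-memory "table" in the default order:
  // all deletes first, then all inserts, then all updates.
  def applyBatch(table: Map[String, String], batch: Seq[Event]): Map[String, String] = {
    val afterDeletes = batch.collect { case Delete(k) => k }
      .foldLeft(table)((t, k) => t - k) // deleting a missing key is a no-op
    val afterInserts = batch.collect { case Insert(k, v) => k -> v }
      .foldLeft(afterDeletes)((t, kv) => t + kv)
    batch.collect { case Update(k, v) => k -> v }
      .foldLeft(afterInserts)((t, kv) => t + kv) // update modeled as put into
  }
}
```

Applying `Seq(Insert("key1", "v1"), Delete("key1"))` to an empty table leaves key1 in the table, because the delete runs before the insert. Applying the conflated batch `Seq(Delete("key1"))` instead leaves the table empty, which is the intended outcome.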
Sink State Table
A replicated row table with the name SNAPPYSYS_INTERNAL____SINK_STATE_TABLE is created by Snappy Sink under the schema specified by the stateTableSchema option, if the table does not already exist. If stateTableSchema is not specified, the sink state table is created under the APP schema. During the processing of each batch, this state is updated.
This table is used by Snappy Sink to maintain the state of the streaming queries. This state is important for maintaining the idempotency of the sink in case of stream failures. The Sink State table contains the following fields:
|stream_query_id||varchar(200)||Primary Key. Name of the streaming query|
|batch_id||long||Batch id of the most recent batch picked up for processing.|
Behavior of Sink State Table in a Secure cluster
When security is enabled for the cluster, the stateTableSchema becomes a mandatory option. Also, when you submit the streaming job, you must have the necessary permissions on the schema specified by stateTableSchema option.
Maintaining Idempotency In Case Of Stream Failures
When stream execution fails, it is possible that the streaming batch was only partially processed. Hence, the next time the stream is started, Spark picks up the partially processed batch again for processing. This can lead to extraneous records in the target table if the batch contains insert events. To overcome this, Snappy Sink keeps the state of a stream query execution as part of the Sink State table.
The key columns in a column table are merely a hint (used to perform put into and delete operations) and do not enforce a unique constraint the way a primary key does for a row table.
Using this state, Snappy Sink can detect whether a batch is a duplicate batch. If a batch is a duplicate batch then Snappy Sink processes all insert events from the batch using put into operation. This ensures that no duplicate records are inserted into the target table.
The above-mentioned behavior is applicable only when the key columns are defined on the target table as key columns are necessary to apply put into operation. When key columns are not defined on the target table, Snappy Sink does not behave in an idempotent manner and it can lead to duplicate records in the target table when the streaming query is restarted after stream failure.
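The duplicate-batch handling described above can be illustrated with a simplified in-memory model. This is a sketch of the idea only, not the actual Snappy Sink logic: it assumes a batch is a possible duplicate when its id equals the batch id already recorded for the query, models the target table as a vector of (key, value) rows with no enforced uniqueness, and uses hypothetical names throughout.

```scala
object SinkStateModel {
  type Row = (String, String) // (key, value)

  // Stand-in for the sink state table: stream_query_id -> batch_id.
  final case class SinkState(lastBatch: Map[String, Long] = Map.empty) {
    // Records the batch and reports whether it had already been picked up before.
    def record(queryName: String, batchId: Long): (SinkState, Boolean) = {
      val possibleDuplicate = lastBatch.get(queryName).contains(batchId)
      (SinkState(lastBatch.updated(queryName, batchId)), possibleDuplicate)
    }
  }

  // Plain insert: appends the row even if the key is already present.
  def insert(table: Vector[Row], row: Row): Vector[Row] = table :+ row

  // Put into: replaces any existing row with the same key.
  def putInto(table: Vector[Row], row: Row): Vector[Row] =
    table.filterNot(_._1 == row._1) :+ row

  // Insert events are applied as put into when the batch may be a duplicate.
  def processInserts(table: Vector[Row], batch: Seq[Row], possibleDuplicate: Boolean): Vector[Row] =
    batch.foldLeft(table) { (t, row) =>
      if (possibleDuplicate) putInto(t, row) else insert(t, row)
    }
}
```

In this model, replaying a batch whose first row was already written yields one row per key when the duplicate flag is set, and a repeated row when it is not. The put into path relies on a key to match on, which is why the guarantee does not hold for tables without key columns.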
Overriding Default Sink Behavior
If required, applications can override the default Snappy Sink semantics by implementing org.apache.spark.sql.streaming.SnappySinkCallback and passing the fully qualified name of the implementing class as the value of the sinkCallback option of Snappy Sink.
The SnappySinkCallback trait contains one method, which must be implemented by the implementing class. This method is called for each streaming batch after checking for possible batch duplication, which is indicated by the possibleDuplicate flag.
A duplicate batch might be picked up for processing in case of failure. When a batch is a possible duplicate, this method should handle the batch in an idempotent manner to avoid data inconsistency.
def process(snappySession: SnappySession, sinkProps: Map[String, String], batchId: Long, df: Dataset[Row], possibleDuplicate: Boolean = false): Unit
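A callback implementing the signature above might look like the following sketch. Several points are assumptions rather than confirmed behavior: that the sink options (including the table name) are passed through in `sinkProps`, that importing `org.apache.spark.sql.snappy._` provides the `putInto` extension on the dataframe writer, and that the target table has key columns. The class name `PutIntoSinkCallback` is hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Row, SnappySession}
import org.apache.spark.sql.snappy._ // assumed to provide the putInto extension on DataFrameWriter
import org.apache.spark.sql.streaming.SnappySinkCallback

// Hypothetical callback that applies every batch with put into.
// Put into is naturally idempotent for keyed tables, so the
// possibleDuplicate flag needs no special handling here.
class PutIntoSinkCallback extends SnappySinkCallback {
  override def process(
      snappySession: SnappySession,
      sinkProps: Map[String, String],
      batchId: Long,
      df: Dataset[Row],
      possibleDuplicate: Boolean): Unit = {
    // Case-insensitive lookup, since the exact casing of keys in sinkProps is an assumption.
    val tableName = sinkProps
      .collectFirst { case (k, v) if k.equalsIgnoreCase("tableName") => v }
      .getOrElse(throw new IllegalArgumentException("tableName option not found in sinkProps"))
    df.write.putInto(tableName)
  }
}
```

The class would be registered on the writer with `.option("sinkCallback", "<fully qualified class name>")`. A callback that performs plain inserts would instead need to branch on possibleDuplicate to stay idempotent.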
Resetting a Streaming Query
Spark saves the progress of a streaming query in the checkpoint directory. In addition, Snappy Sink maintains an internal state in the state table to ensure idempotency of the sink.
Hence, to reset a streaming query, the following actions must be taken to clean the state of the streaming query:
When you use the following steps you may permanently lose the state of the streaming query.
- Delete the checkpoint directory (or start the streaming query with a different checkpoint directory).
- Clear the state from the state table using the following SQL:
    delete from [state_table_schema].snappysys_internal____sink_state_table where stream_query_id = <query_name>;
[state_table_schema] is the schema passed as the stateTableSchema option of Snappy Sink. Omit it if the stateTableSchema option was not provided while defining Snappy Sink.
<query_name> is the name of the query provided while defining the sink.
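The delete statement above can be generated from the query name and the optional schema. The helper below is a hypothetical sketch: `ResetQuery` and `deleteStateSql` are illustrative names, and it assumes the query name is stored in the state table exactly as it was passed to `queryName` and is compared as a quoted string literal.

```scala
object ResetQuery {
  private val StateTable = "snappysys_internal____sink_state_table"

  // Builds the statement that clears the sink state for one streaming query.
  // The schema prefix is added only when a stateTableSchema was configured.
  def deleteStateSql(queryName: String, stateTableSchema: Option[String] = None): String = {
    val qualifiedTable = stateTableSchema.fold(StateTable)(schema => s"$schema.$StateTable")
    val escapedName = queryName.replace("'", "''") // escape single quotes inside the literal
    s"delete from $qualifiedTable where stream_query_id = '$escapedName'"
  }
}
```

The generated statement can be run from any SQL client connected to the cluster, or through the `sql` method of a session object, after the streaming query has been stopped.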
Best Practices for Structured Streaming
Refer to the Best Practices for Structured Streaming.
Limitations of Snappy Sink are as follows:
When the data coming from the source is not partitioned by key columns, then using Snappy Sink may result in inconsistent data. This is because each partition independently processes the data using the above-mentioned logic.
When key columns are not defined on the target table and the input dataframe does not contain the _eventType column, Snappy Sink cannot guarantee idempotent behavior. This is because inserts cannot be converted into put into operations, as there are no key columns on the table. In such a scenario, Snappy Sink may insert duplicate records after an abrupt failure of the streaming job.
The default Snappy Sink implementation does not support partial records for updates. That is, there is no support for merging updates on a few columns into the store. For all update events, the incoming records must provide values for all the columns of the target table.