Data Streaming Cloud Applications
Building a Data Platform
What is the purpose of streaming data?
- Time-critical decisions
- Business intelligence
The time value of data helps businesses make decisions that drive efficiency, reduce costs, and more.
Data streaming latency ranges from milliseconds to seconds to minutes.
Data streaming in milliseconds use cases:
- Push notifications
- Messages between services
Data streaming in seconds use cases:
- Customer experience logs
- Infrastructure logs
- Application logs
- Security logs
- IoT device logs
- CDC (Change Data Capture)
Data streaming in minutes use cases:
- Data lakes
- Data warehouse
Enabling real-time analytics:
Data streaming enables customers to ingest, process, and analyze high volumes of high-velocity data from a variety of sources in real time.
AWS uses Kinesis for real-time streaming
Kinesis Data Streams: Collects and stores data streams for real-time analytics.
Kinesis Data Firehose: Loads data streams into AWS resources.
Kinesis Data Analytics: Analyzes data streams with SQL or Java.
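As a concrete illustration of ingesting into Kinesis Data Streams, here is a minimal producer sketch using boto3. The stream name "clickstream" and the event fields are assumptions; the stream must already exist and AWS credentials must be configured.

```python
# Minimal producer sketch: write one JSON record to a Kinesis data stream.
# Assumes boto3 is installed, AWS credentials are configured, and a stream
# named "clickstream" (hypothetical) already exists.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "page_view", "page": "/home"}

response = kinesis.put_record(
    StreamName="clickstream",        # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],   # records with the same key land on the same shard
)
print(response["ShardId"], response["SequenceNumber"])
```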
Stream Producers:
- Mobile Apps
- Web Clickstream
- Application Logs
- Metering records
- IoT sensors
- Smart Buildings
Stream Ingestion:
- Toolkit and Libraries
- Service Integrations
Stream Storage:
Data is stored in the order it was received for a set retention period and can be replayed any number of times during that period.
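A minimal sketch of replaying a stream from the oldest retained record using the low-level boto3 API. The stream name is hypothetical, and production consumers would typically use the KCL or enhanced fan-out rather than polling a single shard like this.

```python
# Minimal consumer sketch: replay records from the start of one shard.
# Assumes boto3, configured credentials, and a hypothetical single-shard
# stream named "clickstream".
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = "clickstream"  # hypothetical
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]

# TRIM_HORIZON starts at the oldest record still within the retention period,
# which is what makes replay possible.
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if batch["MillisBehindLatest"] == 0:
        break  # caught up to the tip of the stream
    iterator = batch.get("NextShardIterator")
```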
Kinesis Data Streams vs Data Firehose:
Kinesis Data Streams is used for custom, per-record processing with sub-second latency, and it allows a choice of stream processing frameworks.
Kinesis Data Firehose is a serverless option that delivers data to existing analytical tools, with a latency of 60 seconds or more.
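For comparison, a minimal Firehose producer sketch with boto3. The delivery stream name "app-logs" is hypothetical and must already be configured with a destination such as S3 or Redshift.

```python
# Minimal Firehose sketch: send a record to a delivery stream that Firehose
# buffers and loads into its configured destination (e.g., S3 or Redshift).
# Assumes boto3, configured credentials, and a hypothetical delivery stream
# named "app-logs".
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

log_line = {"level": "INFO", "service": "checkout", "message": "order placed"}

firehose.put_record(
    DeliveryStreamName="app-logs",  # hypothetical delivery stream
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
# Firehose buffers by size/time before delivery, which is why its
# end-to-end latency is 60 seconds or more.
```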
Stream processing:
Records are read in the order they were produced, enabling real-time analytics or streaming ETL.
Kinesis Data Analytics SQL use cases:
- Sub-second end-to-end processing latencies
- SQL steps can be chained together in serial or parallel
- Build applications with several queries
- Prebuilt functions like SUM and COUNT DISTINCT
- Continuously running aggregations
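The sketch below illustrates the kind of continuously running, windowed aggregation that Kinesis Data Analytics SQL expresses declaratively; it is plain Python for illustration only, not the KDA API or its SQL syntax.

```python
# Plain-Python illustration of a continuously running aggregation:
# a COUNT per page over one-minute tumbling windows.
from collections import Counter

WINDOW_SECONDS = 60

def tumbling_window_counts(events):
    """events: iterable of (timestamp_seconds, page) tuples in arrival order.
    Yields (window_start, {page: count}) each time a window closes."""
    window_start = None
    counts = Counter()
    for ts, page in events:
        start = ts - (ts % WINDOW_SECONDS)
        if window_start is None:
            window_start = start
        if start != window_start:          # window closed: emit and reset
            yield window_start, dict(counts)
            window_start, counts = start, Counter()
        counts[page] += 1
    if counts:
        yield window_start, dict(counts)

# Example: three events across two one-minute windows.
events = [(0, "/home"), (30, "/home"), (75, "/cart")]
for window, result in tumbling_window_counts(events):
    print(window, result)
```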
Kinesis Data Analytics JAVA use cases:
- Sophisticated Applications
- Uses the Apache Flink engine for stateful processing of data
- Strong data integrity
Use-case data categories:
- Logs: Application or service logs
- eCommerce: Data describing what the customer has purchased
- Interactions: How customers interact with the website
- Products: Website content
Challenges with Data Platform Design:
- Service re-architecture
- Slow and unreliable data processing
- Built for humans vs built as a service
- Private data center limitations
Considerations:
- Migrating to cloud
- Treating data as a service
- Low latency for data processing
Why Streaming?
- Business evaluation requires richer and more timely data
- Solving queries that span multiple microservices
- Machine learning
Streaming workloads:
- API for products: <5 seconds end-to-end latency, Enrichment of data, and unified data model
- Customer notifications: <5 minutes end-to-end latency, PII and data compliance, and exactly-once delivery (see the dedup sketch after this list)
- Analytics: <5 minutes latency
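One common way to approximate exactly-once delivery on top of an at-least-once stream is an idempotent consumer that deduplicates by event ID before acting. A minimal sketch, with an in-memory set standing in for a durable store such as DynamoDB and illustrative field names:

```python
# Minimal idempotent-consumer sketch: deduplicate by event ID so a duplicate
# delivery does not trigger a second customer notification. The in-memory set
# stands in for a durable store (e.g., a table keyed by event_id).
processed_ids = set()

def send_notification(event):
    print(f"notify customer {event['customer_id']} about order {event['order_id']}")

def handle(event):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: skip, so the customer is notified once
    send_notification(event)
    processed_ids.add(event["event_id"])

# The same event delivered twice only triggers one notification.
evt = {"event_id": "e-1", "customer_id": "c-9", "order_id": "o-42"}
handle(evt)
handle(evt)
```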
Stateful transformation use case:
Stateful transformation is required when events must produce aggregated results based on events that happened in the past, for example, counting the total number of orders from existing orders plus each new order event. These results are achieved by keeping previously seen events (or their running aggregates) in memory, appending the aggregated result to the event, and storing it in a database so that the results can be accessed through an API.
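A minimal Python sketch of this pattern, assuming illustrative field names (order_id, total_orders) and a dict standing in for the results database:

```python
# Sketch of the stateful transformation described above: keep a running order
# count in memory, append the aggregate to each incoming order event, and
# write the enriched event to a store that an API could read from.
order_count = 0          # in-memory state carried across events
results_db = {}          # stand-in for the database behind the API

def process_order_event(event):
    global order_count
    order_count += 1                               # existing orders + new order event
    enriched = {**event, "total_orders": order_count}
    results_db[event["order_id"]] = enriched       # API reads results from here
    return enriched

process_order_event({"order_id": "o-1", "amount": 20})
process_order_event({"order_id": "o-2", "amount": 35})
print(results_db["o-2"]["total_orders"])  # -> 2
```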
Goals:
- Single data platform for marketing and APIs for products
- Low-latency, stateful streaming
- <15 seconds end-to-end data latency across all scenarios
- Engineering excellence using CI/CD, unit, integration and functional testing, DR and rebuild capabilities, monitoring, and alerting.
Replay events data platform use cases:
- If business logic changes, replay and reprocess all events from day zero
- Disaster recovery: if the database is lost, restore it by re-running all events from the data platform from day zero
- If a new integration such as Redshift is added, the full data history needs to flow through along with the latest data