Data Streaming Cloud Applications
Building a Data Platform
What is the purpose of streaming data?
- Time-critical decisions
- Business intelligence
The time value of data helps businesses make decisions that drive efficiency, reduce costs, and more.
Data streaming latency ranges from milliseconds to seconds to minutes.
Data streaming in milliseconds use cases:
- Push notifications
- Messages between services
Data streaming in seconds use cases:
- Customer experience logs
- Infrastructure logs
- Application logs
- Security logs
- IoT device logs
- CDC (Change Data Capture)
Data streaming in minutes use cases:
- Data lakes
- Data warehouse
Enabling real-time analytics:
Data streaming enables customers to ingest, process, and analyze high volumes of high-velocity data from a variety of sources in real time.
AWS uses Kinesis for real-time streaming
Kinesis Data Streams: Collects and stores data streams for real-time analytics.
Kinesis Data Firehose: Loads data streams into AWS resources.
Kinesis Data Analytics: Analyzes data streams with SQL or Java.
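As a concrete illustration of ingesting into Kinesis Data Streams, here is a minimal producer sketch using boto3. The stream name "clickstream" and the event fields are assumptions; the stream must already exist and AWS credentials must be configured.

```python
# Minimal producer sketch: write one JSON record to a Kinesis data stream.
# Assumes boto3 is installed, AWS credentials are configured, and a stream
# named "clickstream" (hypothetical) already exists.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "page_view", "page": "/home"}

response = kinesis.put_record(
    StreamName="clickstream",        # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],   # records with the same key land on the same shard
)
print(response["ShardId"], response["SequenceNumber"])
```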
Stream Producers:
- Mobile Apps
- Web Clickstream
- Application Logs
- Metering records
- IoT sensors
- Smart Buildings
Stream Ingestion:
- Toolkit and Libraries
- Service Integrations
Stream Storage:
Data is stored in the order it was received for a set retention period and can be replayed any number of times during that period.
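A minimal sketch of replaying a stream from the oldest retained record using the low-level boto3 API. The stream name is hypothetical, and production consumers would typically use the KCL or enhanced fan-out rather than polling a single shard like this.

```python
# Minimal consumer sketch: replay records from the start of one shard.
# Assumes boto3, configured credentials, and a hypothetical single-shard
# stream named "clickstream".
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = "clickstream"  # hypothetical
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]

# TRIM_HORIZON starts at the oldest record still within the retention period,
# which is what makes replay possible.
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if batch["MillisBehindLatest"] == 0:
        break  # caught up to the tip of the stream
    iterator = batch.get("NextShardIterator")
```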
Kinesis Data Streams vs Data Firehose:
Kinesis Data Streams is used for custom, per-record processing with sub-second latency, and it allows a choice of stream processing frameworks.
Kinesis Data Firehose is a serverless option that delivers data to existing analytical tools, with a latency of 60 seconds or more.
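For comparison, a minimal Firehose producer sketch with boto3. The delivery stream name "app-logs" is hypothetical and must already be configured with a destination such as S3 or Redshift.

```python
# Minimal Firehose sketch: send a record to a delivery stream that Firehose
# buffers and loads into its configured destination (e.g., S3 or Redshift).
# Assumes boto3, configured credentials, and a hypothetical delivery stream
# named "app-logs".
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

log_line = {"level": "INFO", "service": "checkout", "message": "order placed"}

firehose.put_record(
    DeliveryStreamName="app-logs",  # hypothetical delivery stream
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
# Firehose buffers by size/time before delivery, which is why its
# end-to-end latency is 60 seconds or more.
```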
Stream processing:
Records are read in the order they were produced, enabling real-time analytics or streaming ETL.
Kinesis Data Analytics SQL use cases:
- Sub-second end-to-end processing latencies
- SQL steps can be chained together in serial or parallel
- Build applications with several queries
- Prebuilt functions like SUM and COUNT DISTINCT
- Continuously running aggregations
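The sketch below illustrates the kind of continuously running, windowed aggregation that Kinesis Data Analytics SQL expresses declaratively; it is plain Python for illustration only, not the KDA API or its SQL syntax.

```python
# Plain-Python illustration of a continuously running aggregation:
# a COUNT per page over one-minute tumbling windows.
from collections import Counter

WINDOW_SECONDS = 60

def tumbling_window_counts(events):
    """events: iterable of (timestamp_seconds, page) tuples in arrival order.
    Yields (window_start, {page: count}) each time a window closes."""
    window_start = None
    counts = Counter()
    for ts, page in events:
        start = ts - (ts % WINDOW_SECONDS)
        if window_start is None:
            window_start = start
        if start != window_start:          # window closed: emit and reset
            yield window_start, dict(counts)
            window_start, counts = start, Counter()
        counts[page] += 1
    if counts:
        yield window_start, dict(counts)

# Example: three events across two one-minute windows.
events = [(0, "/home"), (30, "/home"), (75, "/cart")]
for window, result in tumbling_window_counts(events):
    print(window, result)
```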
Kinesis Data Analytics JAVA use cases:
- Sophisticated Applications
- Uses the Apache Flink engine for stateful processing of data
- Strong data integrity
Use-case data categories:
- Logs: Application or service logs
- eCommerce: Data describing what the customer has purchased
- Interactions: How customers interact with the website
- Products: Website content
Challenges with Data Platform Design:
- Service re-architecture
- Slow and unreliable data processing
- Built for humans vs built as a service
- Private data center limitations
Considerations:
- Migrating to cloud
- Treating data as a service
- Low latency for data processing
Why Streaming?
- Business evaluation requires richer and more timely data
- Solving queries that span multiple microservices
- Machine learning
Streaming workloads:
- API for products: <5 seconds end-to-end latency, Enrichment of data, and unified data model
- Customer notifications: <5 minutes end-to-end latency, PII and data compliance, and exactly-once delivery (see the dedup sketch after this list)
- Analytics: <5 minutes latency
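One common way to approximate exactly-once delivery on top of an at-least-once stream is an idempotent consumer that deduplicates by event ID before acting. A minimal sketch, with an in-memory set standing in for a durable store such as DynamoDB and illustrative field names:

```python
# Minimal idempotent-consumer sketch: deduplicate by event ID so a duplicate
# delivery does not trigger a second customer notification. The in-memory set
# stands in for a durable store (e.g., a table keyed by event_id).
processed_ids = set()

def send_notification(event):
    print(f"notify customer {event['customer_id']} about order {event['order_id']}")

def handle(event):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: skip, so the customer is notified once
    send_notification(event)
    processed_ids.add(event["event_id"])

# The same event delivered twice only triggers one notification.
evt = {"event_id": "e-1", "customer_id": "c-9", "order_id": "o-42"}
handle(evt)
handle(evt)
```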
Stateful transformation use case:
Stateful transformation is required when events must produce aggregated results based on events that happened in the past, for example, counting the total number of orders from existing orders plus each new order event. These results are achieved by keeping previously seen events (or their running aggregates) in memory, appending the aggregated result to the event, and storing it in a database so that the results can be accessed through an API.
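A minimal Python sketch of this pattern, assuming illustrative field names (order_id, total_orders) and a dict standing in for the results database:

```python
# Sketch of the stateful transformation described above: keep a running order
# count in memory, append the aggregate to each incoming order event, and
# write the enriched event to a store that an API could read from.
order_count = 0          # in-memory state carried across events
results_db = {}          # stand-in for the database behind the API

def process_order_event(event):
    global order_count
    order_count += 1                               # existing orders + new order event
    enriched = {**event, "total_orders": order_count}
    results_db[event["order_id"]] = enriched       # API reads results from here
    return enriched

process_order_event({"order_id": "o-1", "amount": 20})
process_order_event({"order_id": "o-2", "amount": 35})
print(results_db["o-2"]["total_orders"])  # -> 2
```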
Goals:
- Single data platform for marketing and APIs for products
- Low-latency, stateful streaming
- <15 seconds end-to-end data latency across all scenarios
- Engineering excellence using CI/CD, unit, integration and functional testing, DR and rebuild capabilities, monitoring, and alerting.
Replay events data platform use cases:
- If business logic changes, replay and reprocess all events from day zero
- Disaster recovery: if the database is lost, restore it by re-running all events from the data platform from day zero
- If a new integration such as Redshift is added, the full data history needs to flow through along with the latest data