Each function pushed data into a new Pulsar topic. This post explores the reasoning and process behind migrating streaming workflows from a highly distributed, complex stream processing architecture to a simpler one based on Apache Pulsar and Pulsar Functions. I look forward to exploring this functionality with Pulsar Functions soon. The state-of-the-art real-time data storage and processing approach.

An example of Pulsar Functions writing to many sinks

There are many exciting sinks I could use with Pulsar Functions, and while Pulsar I/O handles most of them, writing directly from Pulsar Functions could be advantageous for some of my pipelines. Pulsar was originally developed and deployed inside Yahoo as the consolidated messaging platform connecting critical Yahoo applications, such as Yahoo Finance, Yahoo Mail, and Flickr, to data. This is insufficient for any form of stream processing use case where both input and output are from Pulsar. For the latter case, tools like Spark, Heron, and Flink seemed like a no-brainer, but for the simple case, it was questionable whether to adopt a complex topology with distributed state just to do small computations on streams of data where the order of the data does not matter.
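To make the "simple case" concrete, here is a minimal sketch of a stateless Pulsar Function that applies a small transformation and publishes the result to more than one topic. It assumes the Pulsar Functions Python SDK (`Function` base class and `context.publish`); the topic names, the `FanOutFunction` class, and the `RecordingContext` test double are hypothetical and exist only for illustration, not as the pipeline described in this post.

```python
try:
    from pulsar import Function  # base class from the Pulsar Functions Python SDK
except ImportError:
    class Function:  # stand-in so the sketch can be exercised without the SDK installed
        pass

# Hypothetical output topics; replace with topics from your own tenant/namespace.
OUTPUT_TOPICS = [
    "persistent://public/default/events-archive",
    "persistent://public/default/events-alerts",
]


class FanOutFunction(Function):
    """Stateless, order-independent transform that publishes to several topics."""

    def process(self, input, context):
        # The "small computation": normalize the incoming payload.
        transformed = str(input).strip().upper()
        # Publish the result explicitly to each output topic.
        for topic in OUTPUT_TOPICS:
            context.publish(topic, transformed)
        # Publishing is done explicitly above, so nothing is returned.
        return None


class RecordingContext:
    """Minimal test double for the runtime context (not part of Pulsar)."""

    def __init__(self):
        self.published = []

    def publish(self, topic_name, message):
        self.published.append((topic_name, message))
```

Because the function holds no state and does not depend on message order, it can be checked locally by calling `FanOutFunction().process(" hi ", RecordingContext())` and inspecting what the context recorded, before deploying it to a cluster.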
Presentation of Apache Pulsar and its capabilities for developing stream processing applications, given at Paris Open Source Summit 2019.

My original stream processing architecture for everything.

While gathering the requirements for this new system, it became evident that not all stream processing is created equal. Stream Vision is an application that connects your mobile devices to Yukon or Pulsar observation devices via an integrated Wi-Fi interface. Supports isolation, authentication, authorization, and quotas. Persistent message storage based on Apache BookKeeper.
I decided to narrow down my list and research tools that would enable a simple stream processing topology for these cases. My experiment had the following parameters. Since the Pulsar nodes run on Kubernetes, we could (in theory) use the load balancer to spin Pulsar nodes up or down in response to demand. Over a week of steady work on this cluster, I observed system metrics and cluster behavior to note any anomalies and to test the overall resilience of the system. Pulsar with Pulsar Functions helped achieve a much simpler stream topology than Spark and Kinesis for this style of streaming job.
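The shape of such a simplified topology can be sketched without a cluster: each stage is a plain function that reads from one topic and writes to the next. The snippet below is a local, in-memory simulation only; the topic names, the `clean` and `enrich` stages, and the `run_pipeline` helper are hypothetical, and a real deployment would register each stage as a Pulsar Function with its own input and output topics.

```python
from collections import defaultdict, deque

# Hypothetical topic names, for illustration only.
RAW, CLEANED, ENRICHED = "raw-events", "cleaned-events", "enriched-events"


def clean(message):
    """Stage 1: normalize whitespace and case."""
    return message.strip().lower()


def enrich(message):
    """Stage 2: attach the message length."""
    return f"{message}|len={len(message)}"


# Each entry wires one function between an input topic and an output topic.
TOPOLOGY = [(RAW, clean, CLEANED), (CLEANED, enrich, ENRICHED)]


def run_pipeline(messages):
    """Drain each topic through its function, in topology order (in-memory stand-in)."""
    topics = defaultdict(deque)
    topics[RAW].extend(messages)
    for source, function, sink in TOPOLOGY:
        while topics[source]:
            topics[sink].append(function(topics[source].popleft()))
    return list(topics[ENRICHED])
```

The whole topology is just the `TOPOLOGY` list: adding a stage means adding one function and one pair of topics, with no shared state to coordinate between stages.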
Stream processing with Apache Pulsar. Deploy on bare metal or Kubernetes. I'm actively experimenting with this.