The HASTE team are pleased to announce a new publication on the arXiv pre-print service: ‘Apache Spark Streaming and HarmonicIO: A Performance and Architecture Comparison’. We performed a benchmark analysis to compare two stream processing frameworks: Apache Spark, which is popular and widely used in industry, and our own framework, HarmonicIO (presented this summer at IEEE Cloud 2018 in San Francisco).
Previous studies have demonstrated that Apache Spark, Flink and related frameworks can perform stream processing at very high message frequencies, but they tend to focus on small messages with a computationally light ‘map’ stage for each message, which is a common enterprise use case (for example, processing JSON documents). In academic HPC contexts, we often want to analyze larger messages with more CPU-intensive computations. Our study adds to these benchmarks by broadening the domain to include such processing loads: larger messages (leading to network-bound throughput) and computationally intensive map phases (leading to CPU-bound throughput). The aim is to evaluate the applicability of these frameworks to scientific computing applications.
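To make the distinction between network-bound and CPU-bound throughput concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the paper or from its benchmarking toolset. The function name, the two-bottleneck model, and all numbers (link speed, core count, CPU cost per message) are illustrative assumptions, and the model deliberately ignores framework overhead, which is what the study actually measures.

```python
def max_message_rate(message_bytes, cpu_seconds_per_message,
                     link_bytes_per_second, worker_cores):
    """Return (upper bound on messages/second, limiting resource).

    Hypothetical two-bottleneck model: the rate is capped either by how
    many messages fit through the network link, or by how many 'map'
    computations the available cores can finish per second.
    """
    # Network cap: link bandwidth divided by the size of one message.
    network_limit = link_bytes_per_second / message_bytes

    # CPU cap: each core completes 1 / cpu_seconds_per_message messages/s.
    if cpu_seconds_per_message > 0:
        cpu_limit = worker_cores / cpu_seconds_per_message
    else:
        cpu_limit = float("inf")  # a free map stage never limits the rate

    if network_limit <= cpu_limit:
        return network_limit, "network"
    return cpu_limit, "cpu"


if __name__ == "__main__":
    # Assumed setup: a 10 Gbit/s link (1.25e9 bytes/s) and 8 worker cores.
    link = 1_250_000_000
    cores = 8

    # Large message, trivial map stage: the link is the bottleneck.
    print(max_message_rate(10_000_000, 0.0, link, cores))

    # Small message, expensive map stage: the CPUs are the bottleneck.
    print(max_message_rate(1_000, 0.25, link, cores))
```

Under these assumed numbers, the first case is limited by the network and the second by the CPUs. A real framework will fall short of both bounds, and how far short it falls in each regime is the kind of question the benchmarks address.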
We find that relative performance varies considerably across this domain, and that the chosen means of stream source integration has a big impact. Most interestingly, we find that Spark performs very well for large (~10 MB) and small (~1 KB) message sizes, but for medium-sized messages it can be outperformed by HarmonicIO in some configurations. These message sizes are relevant to HASTE because such file sizes are typical of microscopy applications.
We offer recommendations for choosing and configuring the frameworks, and present a benchmarking toolset developed for this study.
The pre-print is available at: https://arxiv.org/abs/1807.07724