Doctor of Philosophy
The colossal amount of data generated daily is growing exponentially. A variety of applications, including stock trading, banking systems, health care, the Internet of Things (IoT), and social media networks, have created an unprecedented volume of real-time stream data, estimated to reach billions of terabytes in the near future. As a result, we are living in the so-called Big Data era and witnessing a transition to the so-called IoT era. Enterprises and organizations face the challenge of interpreting enormous raw data streams in order to understand their data better and thus make efficient, well-informed (i.e., data-driven) decisions. Researchers have designed distributed data stream processing systems that can process data directly in near real time. To extract valuable information from raw data streams, analysts need to create and implement data stream processing applications structured as directed acyclic graphs (DAGs). The infrastructure of distributed data stream processing systems, as well as the various requirements of stream applications, imposes new challenges. Cluster heterogeneity in a distributed environment means that the resources available for task execution and data transmission differ across the cluster, which makes optimal scheduling an NP-complete problem. Scheduling streaming applications plays a key role in optimizing system performance, particularly in maximizing the frame rate, that is, the number of data set instances that can be processed per unit of time. The scheduling algorithm must consider data locality, resource heterogeneity, and communication and computation latencies. When an application is mapped onto heterogeneous, distributed cluster resources, the latency at the computation or transmission bottleneck must be minimized. Recent work on task scheduling for distributed data stream processing systems has a number of limitations.
Most current schedulers are not designed to manage heterogeneous clusters. They also lack the ability to consider both task and machine characteristics in scheduling decisions. Furthermore, current default schedulers do not allow the user to control data locality in application deployment.
In this thesis, we investigate the problem of scheduling streaming applications in a heterogeneous cluster environment and develop the maximum throughput scheduler algorithm (MT-Scheduler) for streaming applications. The proposed algorithm uses dynamic programming to efficiently map the application topology onto a heterogeneous distributed system based on computing and data transfer requirements, while also taking into account the capacity of the underlying cluster resources. The proposed approach maximizes system throughput by identifying and minimizing the time incurred at the computing/transfer bottleneck. The MT-Scheduler supports scheduling applications that are structured as a DAG, such as Amazon Timestream, Google MillWheel, and Twitter Heron. We conducted experiments using three Storm microbenchmark topologies in both simulated and real Apache Storm environments. To evaluate performance, we compared the proposed MT-Scheduler with the simulated round-robin and the default Storm scheduler algorithms. The results indicated that the MT-Scheduler outperforms the default round-robin approach in terms of both average system latency and throughput.
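The bottleneck idea described in the abstract can be illustrated with a small sketch. This is not the MT-Scheduler itself, whose details are not given here. It is a simplified dynamic program under stated assumptions: the topology is a linear pipeline rather than a general DAG, tasks placed on the same node do not compete for its capacity, and transfers within a node cost nothing. The function name `min_bottleneck` and all parameters are hypothetical.

```python
from math import inf


def min_bottleneck(work, data, power, bandwidth):
    """Smallest achievable bottleneck time for a linear pipeline (illustrative only).

    work[i]         -- computing requirement of task i
    data[i]         -- data sent from task i to task i + 1
    power[j]        -- processing power of node j
    bandwidth[k][j] -- link bandwidth from node k to node j (k != j)

    Assumptions (not from the thesis): linear pipeline, no capacity
    sharing between tasks on one node, free transfers within a node.
    """
    n_tasks, n_nodes = len(work), len(power)
    # best[j]: minimal bottleneck for tasks 0..i with task i placed on node j
    best = [work[0] / power[j] for j in range(n_nodes)]
    for i in range(1, n_tasks):
        nxt = []
        for j in range(n_nodes):
            compute = work[i] / power[j]
            cand = inf
            for k in range(n_nodes):
                transfer = 0.0 if k == j else data[i - 1] / bandwidth[k][j]
                # The bottleneck of a mapping is its slowest stage so far.
                cand = min(cand, max(best[k], transfer, compute))
            nxt.append(cand)
        best = nxt
    return min(best)


# Hypothetical three-task pipeline on two heterogeneous nodes.
bottleneck = min_bottleneck(
    work=[4, 8, 2],
    data=[10, 5],
    power=[2, 4],
    bandwidth=[[0, 5], [5, 0]],
)
throughput = 1 / bottleneck  # data set instances processed per unit of time
```

Under these assumptions, throughput is the reciprocal of the bottleneck time, so minimizing the slowest computing or transfer stage maximizes the frame rate.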
This dissertation is available for download only to the SIUC community. Current SIUC affiliates may also access it off campus by searching Dissertations & Theses @ Southern Illinois University Carbondale from ProQuest. Others should contact the interlibrary loan department of their local library or ProQuest's Dissertation Express service.