Load-Balanced Multi-Job Scheduling for Heterogeneous CPU-GPU Systems
Heterogeneous systems consisting of a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) are now prevalent. The advent of heterogeneous programming models such as the Open Computing Language (OpenCL) has made it possible to execute data-parallel applications on the GPU. As a result, developers are increasingly porting applications to OpenCL to accelerate their execution. However, mapping all such applications to the GPU creates severe load imbalance between the CPU and the GPU in a multi-job scenario. This results in longer job execution times and lower system throughput.
This thesis addresses the load imbalance problem that arises when scheduling applications in a heterogeneous environment. Among its contributions, it presents a novel scheduling mechanism, named Enhanced OpenCL Scheduler (E-OSched), that maps OpenCL applications onto a heterogeneous system in a load-balanced manner. Load balance is achieved by considering both the computation requirements of applications and the processing power of the heterogeneous processors when making scheduling decisions.
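The idea of weighing computation requirements against processing power can be illustrated with a minimal sketch. This is not the E-OSched algorithm itself, which the abstract does not specify; it is a generic greedy earliest-completion heuristic, and the names `Device`, `throughput`, `busy_until`, and `schedule`, as well as the work estimates, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    throughput: float          # assumed: work units processed per second
    busy_until: float = 0.0    # time at which the device becomes idle
    jobs: list = field(default_factory=list)


def schedule(jobs, devices):
    """Map (job_name, work) pairs to devices, largest job first.

    Each job goes to the device on which it would finish earliest,
    given the work already queued there and the device's throughput.
    """
    for name, work in sorted(jobs, key=lambda j: j[1], reverse=True):
        best = min(devices, key=lambda d: d.busy_until + work / d.throughput)
        best.busy_until += work / best.throughput
        best.jobs.append(name)
    return devices
```

Under these assumptions, a slower CPU still receives work once the GPU queue is long enough that the CPU would finish a job sooner, which is the intuition behind avoiding an all-GPU mapping.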
This thesis also examines the impact of applications’ device suitability on multi-job scheduling in a heterogeneous CPU-GPU system. A machine learning-based application classifier, called Troodon, has been developed that classifies each application as suitable for either CPU or GPU execution. In addition, a speedup predictor has been developed that predicts the speedup obtained when an application is executed on its suitable device rather than on the non-suitable device. Load-balanced mapping of jobs to heterogeneous devices is ensured by adopting the E-OSched scheduling mechanism.
This thesis also presents a kernel fusion technique that increases GPU utilization by fusing kernels with small data sizes. A machine learning-based fusion classifier has been developed that classifies jobs as either fusion-suitable or fusion-unsuitable. Thereafter, the pair of fusion-suitable kernels that yields the highest speedup over their serial execution is fused.
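The pair-selection step can be sketched as follows. This is only an illustration under stated assumptions: `suitable` stands in for the fusion classifier and `speedup` for an estimate of fused versus serial execution, both hypothetical callables supplied by the caller; the thesis's actual models and features are not described in this abstract.

```python
from itertools import combinations


def pick_fusion_pair(kernels, suitable, speedup):
    """Return (pair, gain) for the best fusion candidate, or (None, 1.0).

    suitable(k)   -> bool   hypothetical fusion classifier
    speedup(a, b) -> float  hypothetical fused-vs-serial speedup estimate
    """
    candidates = [k for k in kernels if suitable(k)]
    best_pair, best_gain = None, 1.0   # a gain of 1.0 means no benefit
    for a, b in combinations(candidates, 2):
        gain = speedup(a, b)
        if gain > best_gain:
            best_pair, best_gain = (a, b), gain
    return best_pair, best_gain
```

Starting the comparison at a gain of 1.0 means no pair is fused when every candidate pair would run slower fused than serially; that threshold is a design choice of this sketch.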