Sažetak | Heterogeni računalni sustav s programirljivim poljem logičkih elemenata za učenje stabla odluke U ovom radu prikazan je heterogeni računalni sustav i novi hibridni algoritam za učenje stabla odluke Dataflow decision tree construction – DF‑DTC. Algoritam DF‑DTC zasnovan je na algoritmu C4.5. Heterogeni sustav sadrži koprocesor izveden programirljivim poljem logičkih elemenata (FPGA, engl. field programmable gate array). Razrada arhitekture koprocesora i hibridnog algoritma DF‑DTC provedena je metodologijom programsko-sklopovskog suobliokovanja. U koprocesoru je izvedena obrada nominalnih atributa skupa za učenje, a u algoritam su uvedene prilagodbe podatkovnih struktura, te podrška za višedtretveno izvođenje. Vrednovanje performansi provedeno je mjerenjem ukupnog vremena izvršavanja rada programa, te mjerenjem vremena izvršavanja ključnih dijelova algoritma. Pri vrednovanju su korišteni sintetički skupovi za učenje, te skupovi za učenje javno dostupni na UCI repozitoriju. Performanse DF‑DTC-a uspoređene su s performansama postojeće programske implementacije algoritma EC4.5. Ubrzanje obrade nominalnih atributa na DF‑DTC-u iznosi u prosjeku 3,00 puta u usporedbi s programskom implementacijom EC4.5. Za cjelokupno izvršavanje programa najbolje ubrzanje iznosi 1,18 puta. Izvedba DF‑DTC-a za pokazala je potencijal FPGA-a kao platforme za ubrzanje učenja stabla odluke. |
Sažetak (engleski) | Heterogeneous computing system with field programmable gate array coprocessor for decision tree learning Heterogeneous computer systems comprise of a general purpose CPU and a special purpose processor, most commonly a GPU. Another platform that has recently gain traction in computing is field programmable gate arrays (FPGA). FPGAs provide a platform for implementing custom coprocessors tailored to the specific applications. The applications of interest in this thesis are data mining applications – more specifically, decision tree learning. Goal of this research is development of a novel FPGA coprocessor architecture suitable for implementation of decision tree learning algorithms, and definition and implementation of a hybrid algorithm adapted for the FPGA coprocessor. Chapter 1 (Introduction) presents motivation of the work and introduces fundamental concepts. Also, the objective of the research, and scientific contributions are presented. Chapter 2 (Field programmable gate arrays – FPGA) gives a short introduction to FPGA devices, their internal structure, and their use in computing systems. It also provides an outline of main features of Xilinx Virtex-6 FPGA, which is used for implementation of the coprocessor. In Chapter 3 (FPGA implementations of data mining algorithms) a survey of current state of the art of using FPGAs as hardware accelerators for data mining is presented. Three algorithms were surveyed: decision tree learning, support vector machines, and k-means clustering. Only FPGA implementations of learning phase of the algorithms were studied. The reviewed implementations for all algorithms were analyzed, and several common characteristics of implementations were identified. Chapter 4 (Software implementations of C4.5 decision tree learning algorithm) presents decision trees and decision tree learning algorithms in general terms, and in detail the C4.5 decision tree learning algorithm. Two implementations of the C4.5 algorithm were profiled: the original C4.5Release8, and its modified version EfficientC4.5 (EC4.5). Profiling revealed that the bulk of execution time for larger sets (≥20,000 items) is spent in three processes: frequency matrix computation for nominal attributes, finding the best split of the set for each nominal attribute, and splitting the input dataset into subsets. EC4.5 performs better than C4.5 due to optimized handling of numerical attributes, and is the better basis for development of the hybrid algorithm. Chapter 5 (Heterogeneous computer system for decision tree learning) presents design and implementation of the FPGA coprocessor. A description of the targeted heterogeneous computer system is given, with emphasis on the FPGA platform used for implementation of the coprocessor. The platform and its development tools are designed to utilize the dataflow model of computation. Using the results of algorithm profiling, two kernels for the coprocessors were defined and implemented – ComputeFreq, and GoupItems kernels. The ComputeFreq kernel implements computation of frequency matrix for nominal attributes, and three variants were developed with varying degree of parallelism. Third variant (ComputFreq MS2), which is used in the coprocessor, calculates frequency matrices for six attributes in parallel, and uses parallelism to compensate for internal latencies which would otherwise limit performance. The GroupItems kernel implements splitting input set into subsets. The two kernels form the core of the coprocessor. Along with the kernels, the coprocessor contains communication links to the CPU’s main memory over the PCIe bus, and a memory controller which provides access to FPGA attached DDR3 SDRAM. Chapter 6 (Hybrid algorithm for decision tree learning) presents the hybrid algorithm for tree learning DF-DTC – Dataflow Decision Tree Construction. DF-DTC is based on a multithreaded implementation EC4.5, which was developed to test different approaches in parallelization of the algorithm, as well as to asses possible gains. Two parallelization strategies were implemented – distributed items, and distributed attributes – and performance for both strategies was evaluated. The distributed attributes strategy was shown to be the superior one, and was selected as basis for development of the hybrid algorithm. DF-DTC implements a subset of EC4.5’s features. The most notable difference is that DF-DTC can work only with datasets which don’t contain unknown values. The algorithm flow is somewhat modified to accommodate the initial data transfer from CPU to coprocessor, as well as utilization of coprocessor to compute frequency matrices. Computation of frequency matrices can be executed on coprocessor or on CPU, depending on size of the input subset. Also, the hybrid algorithm implements a procedure for dataset synchronization, which ensures that contents of subsets in main memory and in coprocessor memory match. Chapter 7 (Performance evaluation of the heterogeneous computer system and the hybrid algorithm) presents the procedure and results of performance evaluation of the developed computer system and algorithm. The performance evaluation consists of two parts: evaluation of ComputeFreq kernel in isolation, and evaluation of DF-DTC executed on heterogeneous system. Evaluation of ComputeFreq kernel in isolation was carried out an all variants, and the results were compared to the multithreaded software implementation. Performance of the kernel is dependent on size of the dataset, with best performance achieved on largest datasets. Best speedup achieved is 14.7, while average speedup is 3.00. Evaluation of the DF-DTC is carried out in three parts: subset processing, number of items scaling, and number of attributes scaling. Performance was compared to the multihreaded EC4.5. Best total speedup is 1.18, achieved for datasets with 200 attributes and 2,000,000 items. Analysis of the evaluation results brought to light previously unaccounted for bottlenecks. Possible avenues for improving performance of the DF‑DTC are discussed. Chapter 8 (Conclusion) restates the significant results of the research, lays out scientific contributions, and proposes future work. This thesis presents a novel heterogeneous computer system and hybrid algorithm for decision tree learning DF-DTC. The key process which is accelerated using the FPGA coprocessor is frequency matrix computation. The kernel of the coprocessor provides an average speedup 3.00 times compared to multithreaded software implementation. Total speedup of the DF‑DTC is up to 1.18, compared to multithreaded EC4.5. Analysis of performance evaluation results shed light to causes of the lower than expected speedups, and guiding principles for improving the algorithm and coprocessor were derived from them. Development, implementation and performance evaluation of the coprocessor for decision tree learning has shown that FPGA is overall a suitable platform for acceleration of decision tree learning algorithms. |