While streaming data is often placed into the context of "Big Data" (or more specifically "Fast Data") wherein one-pass algorithms are used, true data streams offer additional hurdles due to their dynamic, evolving, and non-stationary nature. In this paper, we describe a new approach to using ensembles for stream classification. While the core method is straightforward, it is specifically designed to adapt quickly with very little overhead to the dynamic and evolving nature of data streams generated from non-stationary functions. Our method, M3 , is based on a weighted majority ensemble of heterogeneous model types where model weights are updated on-line using Reinforcement Learning techniques.
The SluiceBox method described in this dissertation
aims to adapt to novel classes, feature evolution, and concept drift while predicting
data instance labels and adhering to the constraints of continuous data streams.
The research presented here details the challenges found in data stream mining,
and explores the theoretical requirements for detecting emerging novel classes. It also
presents a framework for data stream experimentation as an extension of the Waikato
University MOA framework. Finally, the theoretical observations are tested within
the framework using an implementation of a new online stream classifier method
known as SluiceBox.
As the proliferation of constant data feeds increases from social media, embedded sensors, and other sources, the capability to provide predictive concept labels to these data streams will become ever more important and lucrative. However, the dynamic, non-stationary nature, and effectively infinite length of data streams pose additional challenges for stream data mining algorithms. The sparse quantity of training data also limits the use of algorithms that are heavily dependent on supervised training. To address all these issues, we propose an incremental semi-supervised method that provides general concept class label predictions, but it also tracks concept clusters within the feature space using an innovative new online clustering algorithm.
In our current work, we have designed a multi-tiered ensemble based method HSMiner to address aforementioned challenges to label instances in an evolving Big Data Stream. However, this method requires building large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk which is very costly. Thus, HSMiner may face scalability issue in case of classifying Big Data Stream. To address this problem, we propose three approaches to build these large number of AdaBoost ensembles using MapReduce based parallelism.
Mining streaming data becomes challenging when using a piece-wise or online approach, however, due to concept drift and feature evolution. As the stream progresses, features may be added, removed, or change in the range of possible values, which is known as feature evolution. The defining concepts for a label or class may also migrate over the span of the data stream. Novel classes that were not known a priori can also appear within the data stream. Our solution consists of an adaptive supervised ensemble for predicting instance labels, and a stream clustering approach to monitor concept defining characteristics and novel class development without regard to label (i.e. unsupervised).
In our current work we have proposed a multi-tiered ensemble based fast and robust method, which rapidly learns the concepts in a data stream, predicts labels for new data with strong accuracy, and agilely tracks the dynamic changes in the evolving concepts and feature space. Bottleneck of our current work is, it needs to build ADABOOST ensemble for each numeric feature. This can face scalability issue as number of features can be very large at times in data stream. In this paper we propose a method to parallelize the independent parts of that work using MapReduce framework. This increases scalability and achieve a significant speedup without compromising classification accuracy.
Static data mining assumptions with regard to features and labels often fail the streaming context. Features evolve, concepts drift, and novel classes are introduced. Our approach -- named HSMiner (Hierarchical Stream Miner) -- takes a hierarchical decomposition approach to the ensemble classifier concept. By breaking the classification problem into tiers, we can better prune the irrelevant features and counter individual classification error through weighted voting and boosting. In addition, the atomic decomposition of feature inputs enables straightforward mapping to distributing the ensemble among resources in the network.
As ISR sensors proliferate in theater, the autonomous and dynamic control of those sensors, especially those on board unmanned aircraft systems, will become a necessity as personnel and resource constraints drive the shift in operator control from manual tasking and prioritization to system supervision. This paper explores the experimentation, analysis, and viability of a decentralized, auction-less market-based algorithm to fulfill such a need.
Modern high performance computing is dependent on parallel processing systems. Most current benchmarks reveal only the high level computational throughput metrics, which may be sufficient for single processor systems, but can lead to misrepresentation of true system capability for parallel systems. A new benchmark is therefore proposed. CLUE (Cluster Evaluator) uses a cellular automata algorithm to evaluate the scalability of parallel processing machines. The benchmark also uses algorithmic variations to evaluate individual system components' impact on the overall serial fraction and efficiency.
CA are useful tools in modeling and simulation. However, the more complex a CA is, the longer it takes to run in typical environments. A dedicated CA machine solves this problem by allowing all cells to be computed in parallel. In this paper we present simple yet useful hardware (FPGA) that can perform the required computations in constant time complexity.
Additional publications and presentations at closed conferences.
Create a website - Go here