Research Interests

My research interests lie in data mining, machine learning and to an ever increasing extent, statistics. I enjoy solving the multidisciplinary challenges that arise in data mining and developing efficient algorithms that implement the solutions in large or high dimensional data sets. Applications include: Itemset and association rule mining; Uncertain and probabilistic databases; Rule mining; Classification, in particular in imbalanced datasets; Statistical approaches in data mining and machine learning (for example, the integration of rigorous statistical approaches); Spatial-temporal data mining; Data streams and fault tolerant data mining; Feature generation and selection; Clustering; Graph mining; Mining hetrogeneous and mixed data;

Within this context I've also developed frameworks and algorithms that solve a wide range of problems at the abstract level (for example, using vectorised computational models or graph based algorithms). This was demonstrated in my Dr. rer. nat. thesis for example.

I am also increasingly interested in parallel, randomised and approximation algorithms to deal with the inherent complexity of data mining and machine learning problems.

top | home

Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases

The following summary is based on the abstract of my Dr. rer. nat. thesis.

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining (DM) algorithms to efficiently find interesting patterns in large databases. Often, such patterns are used to make automatic and intelligent decisions based on data, which is also a major focus of Machine Learning (ML) research.

This thesis concerns itself with three inter-related themes: Generalised interaction and rule mining; the incorporation of statistics into novel data mining and machine learning approaches; and probabilistic frequent pattern mining in uncertain databases.

Generalised Interaction and Rule Mining

An interaction describes an effect that variables have -- or appear to have -- on each other. Interaction mining is the process of mining structures on variables describing their interaction patterns -- usually represented as sets, graphs or rules. Interactions may be complex, represent both positive and negative relationships, and the presence of interactions can influence another interaction or variable in interesting ways. Finding interactions is useful in domains ranging from social network analysis, marketing, the sciences, e-commerce, to statistics and finance. Many data mining tasks may be considered as mining interactions, such as clustering; frequent itemset mining; association rule mining; classification rules; graph mining; flock mining; etc. Interaction mining problems can have very different semantics, pattern definitions, interestingness measures and data types. Solving a wide range of interaction mining problems at the abstract level, and doing so efficiently -- ideally more efficiently than with specialised approaches, is a challenging problem.

This thesis introduces and solves the Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems. GIM and GRM use an efficient and intuitive computational model based purely on vector valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are flexible components of vectorised frameworks. By separating the semantics of a problem from the algorithm used to mine it, the frameworks allow both to vary independently of each other. This makes it easier to develop new methods by focusing purely on a problem's semantics and removing the burden of designing an efficient algorithm. By encoding interactions as vectors in the space (or a sub-space) of samples, they provide an intuitive geometric interpretation that inspires novel methods. By operating in time linear in the number of interesting interactions that need to be examined, the GIM and GRM algorithms are optimal. The use of GRM or GIM provides efficient solutions to a range of problems in this thesis, including graph mining, counting based methods, itemset mining, clique mining, a clustering problem, complex pattern mining, negative pattern mining, solving an optimisation problem, spatial data mining, probabilistic itemset mining, probabilistic association rule mining, feature selection and generation, classification and multiplication rule mining.

Statistical Data Mining Methods

Data mining is a hypothesis generating endeavour, examining large databases for patterns suggesting novel and useful knowledge to the user. Since the database is a sample, the patterns found should describe hypotheses about the underlying process generating the data. In searching for these patterns, a DM algorithm makes additional hypothesis when it prunes the search space. Natural questions to ask then, are: "Does the algorithm find patterns that are statistically significant?" and "Did the algorithm make significant decisions during its search?". Such questions address the quality of patterns found though data mining and the confidence that a user can have in utilising them. Finally, statistics has a range of useful tools and measures that are applicable in data mining. In this context, this thesis incorporates statistical techniques -- in particular, non-parametric significance tests and correlation -- directly into novel data mining approaches. This idea is applied to statistically significant and relatively class correlated rule based classification of imbalanced data sets; significant frequent itemset mining; mining complex correlation structures between variables for feature selection; mining correlated multiplication rules for interaction mining and feature generation; and conjunctive correlation rules for classification. The application of GIM or GRM to these problems lead to efficient and intuitive solutions.

Probabilistic Frequent Itemset Mining in Uncertain Databases

Frequent itemset mining (FIM) is a fundamental problem in data mining. While it is usually assumed that the items occurring in a transaction are known for certain, in many applications the data is inherently noisy or probabilistic; such as adding noise in privacy preserving data mining applications, aggregation or grouping of records leading to estimated purchase probabilities, and databases capturing naturally uncertain phenomena. The consideration of existential uncertainty of item(sets) makes traditional techniques inapplicable. Prior to the work in this thesis, itemsets were mined if their expected support is high. This returns only an estimate, ignores the probability distribution of support, provides no confidence in the results, and can lead to scenarios where itemsets are labeled frequent even if they are more likely to be infrequent. Clearly, this is undesirable. This thesis proposes and solves the Probabilistic Frequent Itemset Mining (PFIM) problem, where itemsets are considered interesting if the probability that they are frequent is high. The problem is solved under the possible worlds model and a proposed probabilistic framework for PFIM. Novel and efficient methods are developed for computing an itemset's exact support probability distribution and frequentness probability, using the Poisson binomial recurrence, generating functions, or a Normal approximation. Incremental methods are proposed to answer queries such as finding the top-k probabilistic frequent itemsets. A number of specialised PFIM algorithms are developed, with each being more efficient than the last: ProApriori is the first solution to PFIM and is based on candidate generation and testing. ProFP-Growth is the first probabilistic FP-Growth type algorithm and uses a proposed probabilistic frequent pattern tree (Pro-FPTree) to avoid candidate generation. Finally, the application of GIM leads to GIM-PFIM; the fastest known algorithm for solving the PFIM problem. It achieves orders of magnitude improvements in space and time usage, and leads to an intuitive subspace and probability-vector based interpretation of PFIM.

top | home

Mining Complex Spatio-temporal Movement Patterns

The following summary is based on the abstract of my PhD thesis.

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining algorithms in order to find interesting pattern instances in large databases, and to do so effciently. Spatio-Temporal Data Mining (STDM) is a sub-field of DM and KDD that deals with data containing both spatial and temporal dimensions. The particular semantics of both space and time, as well as the unique challenges involved in spatio-temporal applications requires the development of specialised algorithms and techniques in order to extract useful descriptive and predictive patterns from such data.

Moving Object Data (MOD) is a type of spatio-temporal data that has received much interest due to the growth in applications generating such data and the corresponding opportunities to discover valuable knowledge. A MOD database captures the location of objects over time and is often delivered as a data stream. Examples include animal tracking data from GPS collars, location data provided by cell phone networks, sensor networks, remote sensing technologies, the use of Radio Frequency Identification (RFID) tags in a distributed setting, traffic networks, mobile computing devices and location aware services. In the more general case where an object is not restricted to a physical entity, other sources such as web site traffic or streaming financial data can be interpreted as MOD. This thesis presents new methods for the extraction of complex spatio-temporal patterns describing object movement behaviour in streaming MOD. Particular emphasis is placed on the ability to mine group behaviour as well as explicit fault tolerance of both the proposed patterns and the algorithms used to mine them. The simplest patterns introduced describe interesting behaviours and movements made by groups of objects within or between regions of interest. High Traffic Regions (HTRs) include regions from which objects tend to leave (sources), regions objects tend to migrate to (sinks), regions where objects tend to remain (stationary regions) or regions through which they pass (thoroughfares). More complex behaviour is captured by Spatio-Temporal Association Rules (STARs), which describe how groups of objects move between regions of interest over time. Traditional Association Rule Mining (ARM) cannot be applied to this setting since items and transactions do not map to the spatio-temporal domain nor its semantics. While HTRs and STARs successfully find interesting short term patterns, it is difficult to reliably identify longer term movements without specialist algorithms. Movements made by objects and groups are typically noisy and overlap each other, which makes a set of non-sequential patterns difficult to interpret in large and dynamic data sets. Long term behaviours can also be very complex: Objects may split up and rejoin later, leading to space and time gaps; long paths may be of interest such as roads and migration paths, which should be found even if no object traverses their entire length in the given database. While complex behaviours are the most interesting to the user, they are also the most difficult to mine. Accordingly, a method is introduced for mining complex sequences of spatio-temporal patterns. When used with STARs as sequence elements, the patterns are called k-STARs. This method allows useful and otherwise hidden behaviours to be discovered in real data sets. Missing values and other faults are a significant challenge in this domain, as they can easily break an otherwise real sequence and prevent interesting patterns from being found. The method presented is explicitly developed with this in mind and provides excellent fault and noise tolerance. A lattice over the resulting complex sequences allows users to visualize and interpret the resulting patterns easily, in addition to performing drill down and roll up operations for exploratory data analysis.

The complex sequence mining method is not limited to its k-STARs application and can be generalised to a meta mining algorithm. Accordingly, other pattern types or heterogeneous pattern mixtures may be used as sequence elements. For example, existing object mobility pattern mining algorithms may be used as input, or the approach may be generalised and extended to areas outside the spatio-temporal domain, with financial market and web traffic data being two examples. By exploiting the approach in this way, new complex sequence mining methods can be developed that immediately inherit the fault and noise tolerance aspects of the algorithm.

All algorithms in this thesis rely on theoretical results to achieve a run time that is proved to be optimal, which typically means the run time is linear in the number of patterns found to be interesting. A number of novel data structures are introduced that aid in achieving this. All algorithms are experimentally evaluated for performance. Additionally, all algorithms are successfully applied to find verifiable and meaningful patterns in a real world database.

Streaming Data

Streaming data demands that an algorithm uses only a single pass over the data, that no storage of the data is possible and that the algorithm must be fast enough for real time processing. This poses considerable challenges for data mining algorithms. All algorithms in this thesis are streaming (on-line) algorithms.

Fault Tolerant Data Mining

Fault tolerant data mining refers to methods where the presence of faults in the data has little effect on the chances that an otherwise interesting pattern is found. Such approaches are particularly important in spatio-temporal data mining, where issues like distributed data collection and transmission increase the chances of faults occurring and where the data is inherently noisy. Unfortunately, while it is prevalent in the literature, relying on the traditional KDD model where pre-processing 'cleans' the data prior to the data mining step was found to be naive in such applications -- especially when complex and sequential patterns are of interest. This makes it important to design methods that explicitly cater for faults and noise, such as those proposed in this thesis.

Mining Group Behaviour

In MOD databases, group behaviour refers to objects performing similar movements in a similar time frame. Group behaviour is of practical interest for a number of reasons: Individuals' movements are often insignificant in the application or too noisy to be of use; there are often too many objects to make mining individual movements either meaningful or possible; there is a specific interest in group behaviour such as animal flocks or herds; and it turns out that many patterns of interest can emerge only when many groups of objects are considered. The methods in this thesis are designed primarily to mine group behaviour.

top | home