|
Published Research ProblemsNOTE: Unfortunately the information on this page has not been updated for a long time and is thus is rather out of date. Please refer to my publications for information. This page (should) contain a summary of each of Florian Verhein's published research problems. Eventually, it will also point the interested reader to the relevant publications and software. Geometrically Inspired Data MiningGeometrically Inspired Itemset MiningIn the geometric view, an itemset is a vector (itemvector) in the space of transactions. Linear and potentially non-linear transformations can be applied to the itemvectors before mining patterns. Aggregation functions and interestingness measures can be applied to the transformed vectors. Interesting itemset mining can be carried out by instantiating four abstract functions: a transformation (g), an algebraic aggregation operator (o) and measures (f and F). For Frequent Itemset Mining (FIM), g and F are identity transformations, o is intersection and f is the cardinality. Based on this geometric view, a novel algorithm is presented that uses space linear in the number of 1-itemsets to mine all interesting itemsets in a single pass over the data, with no candidate generation. It scales (roughly) linearly in running time with the number of interesting itemsets. FIM experiments show that it outperforms FPGrowth on realistic datasets above a small support threshold (0.29% and 1.2% in experiments). ... Statistics in Data MiningUsing Significant, Positively Associated and Relatively Class Correlated Rules For Associative Classification of Imbalanced Data setsThe application of association rule mining to classification has led to a new family of classifiers which are often referred to as "Associative Classifiers (ACs)". An advantage of ACs is that they are rule-based and thus lend themselves to an easier interpretation. Rule-based classifiers can play a very important role in applications such as medical diagnosis and fraud detection where "imbalanced data sets" are the norm and not the exception. The focus of this work is to extend and modify ACs for classification on imbalanced data sets using only statistical techniques. This work combines the use of statistically significant rules with a new measure, the Class Correlation Ratio (CCR), to build an AC called SPARCCC. Experiments show that in terms of classification quality, SPARCCC performs comparably on balanced data sets and outperforms other AC techniques on imbalanced data sets. It also has a significantly smaller rule base and is much more computationally efficient. Mining Complex, Maximal and Complete Sub-graphs and Sets of Correlated Variables with Applications to Feature Subset SelectionFinding interactions between variables is a fundamental concept in Data Mining. In this work, correlations between variables are considered using Pearson´s product moment correlation coefficient. Of interest are complex, complete, and maximal sub-graphs which describe the correlation structure between variables. This work considers both positive and negative correlations -- complex interactions. It is proved that under a constraint on the minimum level of correlation desired, there are useful guarantees on the structure of the correlations. In particular, the sign of the correlation between variables can be mapped to the variables themselves (i.e. to the vertices). This means that the complete complex sub-graphs can be represented as a complex set, where each element -- a variable with a positive or a negative sign -- is highly positively correlated with every other. This makes the interaction much easier to understand. It is also exploited to develop an algorithm that runs in the same time as if complex interactions were not considered, resulting in significantly improved scalability. Mining maximal sets of variables characterized by the lack of correlations is also briefly considered. The approach is useful for examining complex correlation structures, as well as mining a representative subset of the entire data set. The latter idea is extended to the problem of feature subset selection in a way that gives guarantees on the minimum correlation required for features to be considered interchangeable (redundant), while guaranteeing that the selected features are not correlated with each other. Experiments show the approach performs well. Mining Spatio-Temporal Patterns in Object Mobility DatabasesFlorian started this work during the research component of his undergraduate degree, and subsequently expanded the work significantly during his PhD, publishing one journal article, one conference paper and two workshop papers. Spatio-Temporal Association Rules, Sources, Sinks, Stationary Regions and ThoroughfaresWith the increasing use of wireless communication devices and the ability to track people and objects cheaply and easily, the amount of spatio-temporal data is growing substantially. Many of these applications cannot easily locate the exact position of objects, but they can determine the region in which each object is contained. Furthermore, the regions are fixed and may vary greatly in size. Examples include mobile/cell phone networks, RFID tag readers and satellite tracking. This demands techniques to mine such data. These techniques must also correct for the bias produced by different sized regions. This work provides a comprehensive definition of Spatio-Temporal Association Rules (STARs). STARs describe how objects move between regions over time. Other patterns that are useful for mobility data are also presented; stationary regions and high traffic regions. The latter consists of sources, sinks and thoroughfares. These patterns describe important temporal characteristics of regions and it is shown that they can be considered as special STARs. Spatial support is defined to effectively deal with the problem of different sized regions. Finally, an efficient algorithm -- STAR-Miner -- is presented to find these patterns by exploiting several pruning properties. Sequences of Spatio-Temporal Association Rules (k-STARs)A Spatio-Temporal Association Rule (STAR) describes how objects move between regions over time. Since they describe only a single movement between two regions, it is very difficult to see larger patterns in the data set by considering only the set of STARs. It is especially difficult on complex data sets where the underlying patterns overlap. At best important patterns are missed - by being unable to ``see the forest for the trees´´, and at worst this can lead to false interpretations. This work introduces the k-STAR pattern which describes the sequences of STARs that objects obey. Since a k-STAR captures sequences of object movements it solves these problems. The approach also allows space and time gaps between successive STARs, as well as supporting `replenishable´ k-STARs -- enabling the capture of the rich set of patterns that exist in real world data. Two important measures are defined -- min-l-support and min-l-confidence -- that allow the above to be achieved. Various antimonotonic and weakly anti-monotonic properties for reducing the search space are also presented and exploited. The resulting algorithm is known as k-STAR-Miner. Classifying Public Announcements for User CommunitiesFlorian worked on this project as a Research Assistant to Associate Professor Judy Kay during his third year of undergraduate study. One local conference paper was produced. Abstract: This paper describes our work towards building a smart personal assistant that helps users deal with the ever increasing volume of email. To this end, we filter incoming messages for the user, organising messages in ways that enable the user to deal with the information effectively. We describe an initial corpus of public email built for experiments on learning to classify email. We also report initial results of experiments based on that corpus. These explore the question of how well a classifier can perform the task of classifying a user´s email into categories that reflect the users´ thinking about that email. We also report initial experiments aimed at answering a second question: if a learner has been trained by one user, how effective is it for another user? That is, can we share information about classifications of messages improving collective performance? |