My research focused on a subset of Natural Language Processing (NLP) tasks, specifically the application of statistical machine learning solutions to the real-world tasks of spam filtering and document classification. Application of supervised learning algorithms in best suited to domains where the effort to collect a large number of labelled training examples is feasible. However, there are many domains where labelled examples are scarce due, in part, to the prohibitive expense of collecting a large number ground truth training examples (e.g. expensive tests required to find true label).
Active Learning addresses this problem by allowing the model to select training data from a pool (or stream) of unlabelled data, whereby only the selected examples are labelled. This strategy results in orders of magnitude reductions in labelling effort and allows for wider application of supervised learning as well as also helping to tackle concept drift.
Textual data is naturally high-dimensional, where the number of features used to represent documents can range from tens of thousands to millions. Dimensionality Reduction is a machine learning technique whereby high-dimensional data is represented by a much smaller set of features yet still retains the salient information, Reducing the size of the representation gives significant saving in both computational and memory requirements of a NLP systems. It is important to utilise the correct techniques when applying Active Learning.
Reinforcement Learning is an alternative way of training, whereby instead of supplying labelled training data, the learner is given rewards or punishments based on its actions. I investigated reinforcement learning techniques for use in active learning whereby a number of alternative strategies were available to the learner in each iteration of active learning. The learner utilised reinforcement learning techniques to adaptively choose between the alternative strategies based on their relative performance. The goal was to track, as close as possible, the optimal strategy in each iteration of active learning.
Sentiment Analysis is an interesting NLP task, in which the labels we wish to predict are the opinion (positive or negative) expressed in the given natural language text. I have worked at examining and predicting the polarity of financial blog data. Being able to automatically predict the polarity of text has direct applicability to marketing and advertisement campaigns.