Data Mining - Between Unsupervised and Supervised Techniques

Introduction

The DaM-BUST project focuses on real-life applications for machine learning that do not fit neatly into the traditional distinction between supervised and unsupervised techniques. This research specifically involves the analysis of data from three different application areas: Bioinformatics (genomics & proteomics), Multimedia (image & text) and Manufacturing. We will develop tools and techniques that will help analyse this data.

Research Focus

A wide variety of techniques have been applied to the problems of interest here. In this project we will focus on two broad sets of techniques that complement each other:

  • Kernel Techniques will be used for classification. We will use both Support Vector Machines and basic k-Nearest Neighbour techniques. 
  • Matrix Decomposition techniques will be used for dimension reduction and clustering. We will work on both Spectral and Non-Negative Matrix techniques.
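
As an illustration of the simplest classifier named above, the following is a minimal sketch of k-Nearest Neighbour classification using only the Python standard library. The function names, the Euclidean distance, the value of k and the toy data are assumptions made for this example; they are not taken from the project's software.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Rank training indices by distance to the query and keep the k closest.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(train_points[i], query),
    )[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in two dimensions.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["a", "a", "a", "b", "b", "b"]

if __name__ == "__main__":
    print(knn_predict(train_points, train_labels, (0.05, 0.05)))
    print(knn_predict(train_points, train_labels, (1.0, 0.95)))
```

An odd k is used here so that a two-class vote cannot tie; a practical system would also need to choose the distance measure to suit the data.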

Our research in The Knowledge Discovery Project and in the Muscle Network of Excellence indicates that these are the leading techniques for problems of this type. By focusing on just two related families of techniques, we will establish a high level of expertise in them.

Applications

The research in the DaM-BUST project will focus on applications of Machine Learning techniques in three distinct areas:

Figure: DaM-BUST Project Themes

1. Image Indexing System: Many of the research problems that motivate this research proposal occur in the processing of multimedia content. For this reason we will develop a prototype system for organising personal digital photo collections that will act as a test-bed for our research. This image indexing application is representative of a large set of image analysis problems that arise with images from a range of sources. In the course of this research we will collaborate with HP in Galway on the application of these image annotation techniques to the annotation and analysis of medical images.

Figure: Image indexing application

2. Microarray Data:  The initial focus of the research in Bioinformatics will be on the cross-platform analysis of microarray data in collaboration with Prof. Des Higgins from the Conway Institute in UCD and Dr. Aedin Culhane from the Harvard School of Public Health and Dana-Farber Cancer Research Institute.

3. Inspection and Process Control: A number of interesting challenges for machine learning arise in industrial inspection and testing. For instance, in the automatic inspection of solder joints in electronic assembly, there is a ready supply of images of good joints from working devices, but examples of bad joints are very scarce. A similar situation exists in process monitoring, where there is plenty of data on the process operating correctly, but getting data on the manner in which the process might drift out of tolerance is more problematic. This situation also arises in the food industry, where determining the authenticity of food is an important consideration.
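
The situation described above, with many good examples and almost no bad ones, can be sketched as a one-class problem. The example below is a minimal illustration only, not the project's method: it learns a distance threshold from good examples alone and flags a new item as suspect when it lies much further from its nearest good example than good examples lie from each other. The function names, the `slack` factor and the toy feature vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_threshold(good_points):
    """Largest nearest-neighbour distance observed among the good examples."""
    nn_dists = []
    for i, p in enumerate(good_points):
        # Distance from p to its closest *other* good example.
        nn_dists.append(
            min(euclidean(p, q) for j, q in enumerate(good_points) if j != i)
        )
    return max(nn_dists)


def is_suspect(good_points, threshold, query, slack=2.0):
    """Flag query if its nearest good example is further than slack * threshold."""
    nearest = min(euclidean(query, p) for p in good_points)
    return nearest > slack * threshold


# Hypothetical feature vectors extracted from images of good joints.
good_points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0)]
threshold = fit_threshold(good_points)

if __name__ == "__main__":
    print(is_suspect(good_points, threshold, (1.05, 1.0)))  # close to the good cluster
    print(is_suspect(good_points, threshold, (3.0, 3.0)))   # far from every good example
```

No bad examples are needed to fit the threshold, which is the point of the one-class formulation; the price is that the `slack` factor has to be set without direct evidence of what a defect looks like.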

Research Structure 

The research in the DaM-BUST project will be organised around three central themes:

  1. One-Class Classification Problems
  2. Active Learning
  3. Semi-Supervised Clustering

Project Outputs

The major outputs of this research project are software systems, review documents and peer-reviewed research papers. The subfield of machine learning identified here is distinctive in that it is motivated by unusual problem formulations that arise in real-life situations. Thus, if we find effective solutions to these problems, this research will have a considerable impact.
