Introduction to DMS

Suppose you are interested in the problem of smoking and you want to find out the main characteristics of smokers and how they differ from non-smokers. To do that, you first have to collect data about the population you are interested in, which must include both smokers and non-smokers. For every person you have to collect data (attributes) such as age, sex, education, profession, income and so on. For every person you also have to record whether the person is a smoker or not. In many applications a real data collection phase is not necessary because the data of interest are already available in some form. A typical data collection task is then a search for appropriate sources of data and their combination and/or transformation.

In any case, the final result of the data collection phase is a data file in which every object (a person in the case of the 'smoker' problem) is represented as an example described by a fixed set of attributes such as age, sex and so on. Identification attributes such as names or ID numbers may be included as well. Some attribute values may be unknown. If you do not know the exact value, for example a person's age or profession, just use '?' instead of the attribute value. Each person (example) is represented in the input data file in a separate row.
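
The file layout described above can be sketched in a few lines of Python. This is only an illustration: the column names follow the attributes mentioned in the text, the rows are invented, and the presence of a header row with attribute names is an assumption rather than a documented requirement of the server. The TAB delimiter matches the one used by the prepared example file.

```python
import csv
import io

# Hypothetical sample data: the rows are invented for illustration only.
# '?' marks an unknown attribute value, as described in the text.
SAMPLE = (
    "NAME\tAGE\tSEX\tEDUCATION\tPROFESSION\tINCOME\tSMOKER\n"
    "Ann\t34\tfemale\tuniversity\tteacher\t21000\tno\n"
    "Bob\t?\tmale\thigh_school\t?\t12000\tyes\n"
    "Carl\t51\tmale\tuniversity\tengineer\t30000\tno\n"
)


def read_examples(text):
    """Parse tab-delimited text into a list of dicts, mapping '?' to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    examples = []
    for row in reader:
        examples.append({k: (None if v == "?" else v) for k, v in row.items()})
    return examples


examples = read_examples(SAMPLE)
for ex in examples:
    unknown = [name for name, value in ex.items() if value is None]
    print(ex["NAME"], "unknown attributes:", unknown)
```

Each row becomes one example, and unknown values are kept as `None` so that later steps can decide how to treat them.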

Finally, the collected data should be transformed into a form suitable for data mining. In the 'smoker' problem, the attribute that records whether a person is a smoker or a non-smoker is the target attribute. This means that we are interested in models which relate the property smoker to other attributes of the person. Every data mining task based on knowledge induction must have one and only one target attribute. All other attributes are input attributes, which are used to build the model of the smoker.

After we have selected the target attribute, we must also select the target class. In our domain the target attribute has two classes: smokers and non-smokers. We can select either of these classes as the target (positive) class. The other class (or, in domains in which the target attribute has many classes, all remaining classes together) is the non-target or negative class. The choice depends only on the group for which the model is needed.

The result of the data mining process is one or more models (rules) which describe some of the most important subgroups of the target (positive) class. The models describe differences between the target and the non-target (negative) class, and they are expressed in terms of the input attributes. Note that examples of both the target and the non-target class are mandatory, because induction is a search for differences between the classes. Regardless of which class is selected as the target class, the input data file must contain both smokers and non-smokers.

Here is the same input data file prepared so that 'SMOKER' is the target attribute and the class 'yes' is selected as the target class. This file can be uploaded and tested on this server. The delimiter used is TAB. You can learn about the selection of the target attribute and the target class by comparing the starting and the prepared data file or by reading the data preparation instructions.
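
The role of the target attribute and the target class can be illustrated with a minimal Python sketch. The function name `split_by_target` and the sample examples are hypothetical, and skipping examples whose target value is unknown is an assumption made here, not documented server behaviour.

```python
# Hypothetical examples; None stands for an unknown ('?') value.
EXAMPLES = [
    {"SEX": "male", "INCOME": "12000", "SMOKER": "yes"},
    {"SEX": "female", "INCOME": "21000", "SMOKER": "no"},
    {"SEX": "male", "INCOME": "30000", "SMOKER": "no"},
    {"SEX": "female", "INCOME": "9000", "SMOKER": "yes"},
]


def split_by_target(examples, target_attribute, target_class):
    """Split examples into target (positive) and non-target (negative) lists."""
    positives, negatives = [], []
    for example in examples:
        value = example[target_attribute]
        if value is None:
            # Assumption: an example with an unknown target value is skipped.
            continue
        if value == target_class:
            positives.append(example)
        else:
            # Every other class counts as non-target (negative).
            negatives.append(example)
    if not positives or not negatives:
        raise ValueError("both target and non-target examples are required")
    return positives, negatives


positives, negatives = split_by_target(EXAMPLES, "SMOKER", "yes")
print(len(positives), "target examples,", len(negatives), "non-target examples")
```

Calling the function with `"no"` as the target class swaps the two groups, and data containing only one class is rejected because there would be no differences to search for.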

A model (rule) induced by the server for our 'smoker' problem is:
SMOKER IF SEX is equal male AND INCOME is less than 15000
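
To show how such a rule can be read, the following Python sketch applies it to a handful of invented examples and counts how many smokers and non-smokers it covers. The data and function names are hypothetical, and treating an unknown value as not satisfying a condition is an assumption; how the server itself handles unknown values and measures rule quality is not stated here.

```python
# Hypothetical examples; None stands for an unknown ('?') value.
EXAMPLES = [
    {"SEX": "male", "INCOME": "12000", "SMOKER": "yes"},
    {"SEX": "male", "INCOME": "30000", "SMOKER": "no"},
    {"SEX": "female", "INCOME": "9000", "SMOKER": "no"},
    {"SEX": "male", "INCOME": "14000", "SMOKER": "no"},
    {"SEX": "male", "INCOME": None, "SMOKER": "yes"},
    {"SEX": "female", "INCOME": "11000", "SMOKER": "yes"},
]


def rule_covers(example):
    """SMOKER IF SEX is equal male AND INCOME is less than 15000."""
    sex = example.get("SEX")
    income = example.get("INCOME")
    if sex is None or income is None:
        # Assumption: an unknown value never satisfies a condition.
        return False
    return sex == "male" and float(income) < 15000


covered = [ex for ex in EXAMPLES if rule_covers(ex)]
true_positives = sum(1 for ex in covered if ex["SMOKER"] == "yes")
false_positives = sum(1 for ex in covered if ex["SMOKER"] == "no")

print("covered smokers:", true_positives)
print("covered non-smokers:", false_positives)
```

A useful rule covers many examples of the target class and few examples of the non-target class, which is why both counts are reported.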


© 2001 LIS - Rudjer Boskovic Institute
Last modified: September 08 2015 09:28:57.