Step by Step Preparation of the Meningitis Data File
- 1. step get the data to your machine
- Please start with the original data set
(a true copy of the data set prepared for the
JSAI KDD Challenge 2001). To download the data file,
right-click the link and then select Save As.
- 2. step remove unnecessary lines (comments)
- Remove lines 1, 2, 3, and 126, which contain data preparation comments.
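A minimal Python sketch of this step, assuming the file has been read into a list of lines. The function name drop_comment_lines is hypothetical and not part of the server or the data set.

```python
# Comment lines to drop, numbered from 1 as in the step above.
COMMENT_LINES = {1, 2, 3, 126}


def drop_comment_lines(lines, to_drop=COMMENT_LINES):
    """Return only the lines whose 1-based position is not in to_drop."""
    return [line
            for number, line in enumerate(lines, start=1)
            if number not in to_drop]
```

The line numbers refer to the original file, so all four lines should be removed in one pass. Deleting lines 1 to 3 first would shift the position of line 126.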
- 3. step select delimiter
- A comma is already used as the delimiter in the input file, and it can remain so.
In all following experiments select delimiter type comma, number of
models 1, generalization parameter 1, and deselected
- 4. step substitute all '(' and ')' by '_'
- If the file prepared so far is uploaded to the server, one might expect
the experiment to fail because no target attribute
is specified. But the reported error is E1001 / 23. The problem is the '('
character detected in the first line. Remember, the server reports only
the first error it detects during execution. Other problems with
the input data file can be detected only after the present problem has been fixed.
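The substitution in this step can be sketched in Python as follows. The function name replace_parentheses is hypothetical, and the sketch assumes the whole file content is available as one string.

```python
# Translation table mapping both parenthesis characters to an underscore.
PAREN_TO_UNDERSCORE = str.maketrans({"(": "_", ")": "_"})


def replace_parentheses(text):
    """Replace every '(' and ')' in text with '_'."""
    return text.translate(PAREN_TO_UNDERSCORE)
```

For example, a made-up header field such as NAME(x) would become NAME_x_.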
- 5. step select the target attribute
- At this step the reported error is E1001 / 31 because no target
attribute has been specified so far. Let us suppose that we are interested in
the differences between the diagnoses BACTERIA and VIRUS, and that rules for BACTERIA
as the positive class should be induced. Diag2 is selected
as the target attribute by substituting the string 'Diag2' with the string '!Diag2'.
The positive class is defined by substituting all attribute values 'BACTERIA' in column
four with '!BACTERIA'. The task is not entirely simple
because there are also 'BACTERIA' strings in the third column, which should not be changed.
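One way to change only the target column is to split each line on the delimiter and edit a single field. The sketch below assumes a comma-delimited file whose first line holds the attribute names and whose values contain no commas. The function name mark_target is hypothetical.

```python
def mark_target(lines, target="Diag2", positive="BACTERIA", delimiter=","):
    """Prefix the target attribute name and its positive class value with '!'.

    Only the column named `target` is touched, so identical strings in
    other columns (for example 'BACTERIA' in column three) stay unchanged.
    """
    header = lines[0].rstrip("\n").split(delimiter)
    names = [name.strip() for name in header]
    column = names.index(target)  # raises ValueError if the name is missing
    header[column] = "!" + names[column]

    result = [delimiter.join(header)]
    for line in lines[1:]:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > column and fields[column].strip() == positive:
            fields[column] = "!" + positive
        result.append(delimiter.join(fields))
    return result
```

The returned lines carry no trailing newline, so they can be written back with "\n".join(...).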
- 6. step substitute '-' and '+' characters
- At this step the reported error is E1001 / 35 because '-' is not a
valid input attribute value. The problem can be solved by substituting
the character '-' with the string 'minus' and the character '+' with the string
'plus'. If short names are preferred, the substitutes can be just the
characters 'm' and 'p', respectively.
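A minimal sketch of this substitution in Python. The function name replace_signs is hypothetical. Note that it replaces every occurrence of the two characters, as the step describes, so it assumes the file contains no values (such as negative numbers) in which the sign must be preserved.

```python
def replace_signs(text, minus="minus", plus="plus"):
    """Replace every '-' with `minus` and every '+' with `plus`."""
    return text.replace("-", minus).replace("+", plus)
```

Calling replace_signs(text, "m", "p") gives the short names instead.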
- 7. step IT WORKS but .. eliminate some input attributes
- After these substitutions the server will produce its first rule. It does not seem
very useful because it uses column 3, named DIAG, which
contains the same information as the target attribute in column 4.
The advice is to remove column 3 from the induction process by substituting
the string 'DIAG' with '?DIAG'. In the same way the user can exclude other
attributes and so direct the kind of rules that are induced.
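Excluding attributes can be sketched as an edit of the header line only. The sketch assumes a comma-delimited header, and the function name exclude_attributes is hypothetical. Matching whole field names avoids touching other names that merely contain the same letters.

```python
def exclude_attributes(header_line, names, delimiter=","):
    """Prefix each attribute listed in `names` with '?' in the header line."""
    fields = header_line.rstrip("\n").split(delimiter)
    marked = ["?" + field.strip() if field.strip() in names else field
              for field in fields]
    return delimiter.join(marked)
```

Passing a set with several names, for example {"DIAG", "SEX"} with SEX as a made-up attribute name, excludes all of them in one call.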
- 8. step change delimiter (optional)
- It is rather straightforward to change the comma to a semicolon or
TAB in this input file. But when changing to the space delimiter,
please note that some unknown attribute values exist which
are not explicitly marked by a '?' but by two commas, potentially
separated by one or more spaces. These attribute values must
be changed to '?' when the space delimiter is used.
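The delimiter change, including the handling of unknown values, can be sketched as follows. The function name change_delimiter is hypothetical, and the sketch assumes that no attribute value contains a comma or the new delimiter itself.

```python
def change_delimiter(line, new_delimiter, old_delimiter=","):
    """Re-join the fields of one line with a new delimiter.

    Fields that are empty or contain only spaces (unknown values written
    as two adjacent delimiters) are replaced by an explicit '?'.
    """
    fields = [field.strip() for field in line.rstrip("\n").split(old_delimiter)]
    fields = [field if field else "?" for field in fields]
    return new_delimiter.join(fields)
```

For the space delimiter, call change_delimiter(line, " "). Writing the explicit '?' is harmless for the other delimiters as well.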
© 2001 LIS - Rudjer Boskovic Institute
Last modified: September 09 2015 14:17:42.