Basic system description
(The documentation on this page and related pages describes how we intended to realize the system. Most of the text was written before January 2001. An up-to-date description of the realized system is at DMS Home.)
The Data Mining Server will have two main parts: the implementation of data mining algorithms and the data mining documentation. The documentation part will include general data mining information, user guides for the implemented algorithms, and some expert-level information about experience with using data mining algorithms. The user's guide for the implemented system should be tightly connected with the implementation so that users can easily get the information they need, especially when errors or problems with the submitted data are detected. The general and expert-level parts will include an extensive list of related internet resources.
General concepts
During this project we should try to implement the basic set of ILLM algorithms in a single data mining tool that is easy to use. The documentation part will be broader, but it is essential that the user's guide is clear and well connected with the implementation. Both the implementation and the documentation must be realized so that they can be expanded in the future.
The data mining tool in this implementation is an attribute-based rule induction system for two-class problems. Its input is a data file with learning examples, optionally accompanied by some options and some attribute names. The input file must include examples of two classes: target and non-target (or positive and negative examples). The output is one or more rules that describe the target class examples. The rules should be true (correct) for many target class examples and false for all (or as many as possible) of the non-target class examples. The rules are selected so that they capture general properties of the available training examples. It may therefore be assumed that a) the induced rules describe models of the target class examples in the training set (the input file submitted by the user) and b) the rules can be used as class predictors for unclassified examples.
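The rule quality criterion described above (true for many target examples, false for as many non-target examples as possible) can be illustrated with a minimal sketch. It assumes a rule is a conjunction of attribute conditions; the function names, the attributes, and the five examples are invented for illustration and are not taken from the ILLM implementation.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, object]
# A condition tests the attributes of a single example.
Condition = Callable[[Example], bool]


def rule_covers(rule: List[Condition], example: Example) -> bool:
    """A rule is treated as a conjunction: true only if every condition holds."""
    return all(condition(example) for condition in rule)


def evaluate_rule(rule: List[Condition],
                  examples: List[Tuple[Example, bool]]) -> Tuple[int, int]:
    """Return (covered target examples, covered non-target examples)."""
    covered_targets = sum(
        1 for ex, is_target in examples if is_target and rule_covers(rule, ex))
    covered_non_targets = sum(
        1 for ex, is_target in examples if not is_target and rule_covers(rule, ex))
    return covered_targets, covered_non_targets


# Hypothetical training set: (attribute values, is_target_class).
examples = [
    ({"age": 62, "smoker": "yes"}, True),
    ({"age": 58, "smoker": "yes"}, True),
    ({"age": 45, "smoker": "no"}, True),
    ({"age": 30, "smoker": "yes"}, False),
    ({"age": 66, "smoker": "no"}, False),
]

# Hypothetical rule: age > 50 AND smoker = yes.
rule = [lambda ex: ex["age"] > 50, lambda ex: ex["smoker"] == "yes"]

covered_targets, covered_non_targets = evaluate_rule(rule, examples)
print(covered_targets, covered_non_targets)

# The same rule used as a class predictor for an unclassified example.
print(rule_covers(rule, {"age": 70, "smoker": "yes"}))
```

A rule that covers many target examples and no non-target examples is a good candidate by the criterion above; how the real system searches for such rules is not shown here.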
Implementation
At present we assume that the implementation will have two parts. The first is the basic level, which is extremely easy to use but offers restricted execution possibilities. The second level will enable more complex operations, accept larger example sets, and include some options for user-selectable search properties. The first part is called level A, and it must be realized completely before we start realizing level B. In case of problems, the project may end with only level A realized.
The essential difference between levels A and B is that at level A all information about the user and his/her data is temporary. At level B the user is identified by a user name, which is used to create a separate subdirectory for that user only. This subdirectory will be deleted automatically by the system after a few hours (or a day), but while it exists the user will be able to upload, besides the data file, an options file and test files. At level B the user will also be able to download the rule file, error file, and translation file generated during the induction process.
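The level B idea of a per-user subdirectory that the system deletes after a fixed time can be sketched as follows. This is only an illustration under stated assumptions: the function names, the alphanumeric restriction on user names, the use of the directory modification time, and the six-hour limit are all hypothetical and do not describe the realized server.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def user_directory(base: Path, user_name: str) -> Path:
    """Create (if needed) and return the subdirectory reserved for one user."""
    # Assumption: restricting names to letters and digits keeps them safe as paths.
    if not user_name.isalnum():
        raise ValueError("user name must be alphanumeric")
    path = base / user_name
    path.mkdir(parents=True, exist_ok=True)
    return path


def purge_expired(base: Path, max_age_seconds: float,
                  now: Optional[float] = None) -> List[str]:
    """Delete user subdirectories older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    removed = []
    for entry in list(base.iterdir()):
        if entry.is_dir() and now - entry.stat().st_mtime > max_age_seconds:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)


# Demonstration in a throwaway directory with a hypothetical six-hour limit.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    user_directory(base, "alice")
    # Immediately after creation nothing has expired.
    kept = purge_expired(base, max_age_seconds=6 * 3600)
    # Pretend seven hours have passed by supplying a later "now".
    removed = purge_expired(base, max_age_seconds=6 * 3600,
                            now=time.time() + 7 * 3600)

print(kept, removed)
```

Passing `now` explicitly keeps the clean-up function testable without waiting; a real server would presumably run such a purge periodically.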
Security
The system will not record any user data, but it will not include any special security properties either. In principle the user has no guarantee that his/her data will not be read and stored by the system, or perhaps even by other users of the server. This must be clearly stated at the entry point of the DMS site. If this is a problem for the user, it is his/her responsibility to encode the learning examples so that the important private data cannot be reconstructed. Generally this is not a difficult task. According to the project specification, we should prepare a program for automatic encoding. The encoding itself must be executed by the user on his/her own machine. The encoding program will be available for download from the main server page. At present we assume that there will be an executable version of the program for Windows-based machines and a source version in the C language that can be compiled on different systems. The result of the encoding will be a data file prepared for submission and a translation file that remains on the client machine and is used to interpret the rules generated by the induction process. Besides encoding, the program will also test whether the user data have a form acceptable to the server. The advantage of using the program is that the consistency of the data will be tested before they are uploaded.
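The encoding step described above can be sketched as follows: attribute names and values are replaced by neutral codes, the mapping is kept locally as the translation table, and induced rules are translated back on the client. The code scheme (`A1`, `A1_V1`, ...), the rule syntax, the function names, and the sample data are assumptions made for illustration; the planned program is a C program whose file formats are not specified here, and Python is used only for brevity.

```python
from typing import Dict, List, Tuple


def encode_table(header: List[str], rows: List[List[str]]
                 ) -> Tuple[List[str], List[List[str]], Dict[str, str]]:
    """Replace attribute names and values with neutral codes.

    Returns the coded header, the coded rows, and a translation table
    (code -> original text) that is meant to stay on the client machine.
    """
    translation: Dict[str, str] = {}
    coded_header: List[str] = []
    value_codes: List[Dict[str, str]] = []
    for index, name in enumerate(header, start=1):
        code = f"A{index}"
        translation[code] = name
        coded_header.append(code)
        value_codes.append({})

    coded_rows: List[List[str]] = []
    for row in rows:
        # Simple consistency test performed before any upload.
        if len(row) != len(header):
            raise ValueError("row length does not match the number of attributes")
        coded_row = []
        for column, value in enumerate(row):
            codes = value_codes[column]
            if value not in codes:
                code = f"{coded_header[column]}_V{len(codes) + 1}"
                codes[value] = code
                translation[code] = value
            coded_row.append(codes[value])
        coded_rows.append(coded_row)
    return coded_header, coded_rows, translation


def decode_rule(rule: str, translation: Dict[str, str]) -> str:
    """Translate a whitespace-separated coded rule back to the original terms."""
    return " ".join(translation.get(token, token) for token in rule.split())


# Hypothetical private data.
header = ["diagnosis", "city"]
rows = [["flu", "Zagreb"], ["cold", "Split"], ["flu", "Split"]]

coded_header, coded_rows, translation = encode_table(header, rows)
print(coded_header)
print(coded_rows)

# Hypothetical rule returned by the server in coded form.
decoded = decode_rule("A1 = A1_V1 AND A2 = A2_V2", translation)
print(decoded)
```

Only the coded header and rows would be submitted; without the translation table the codes carry no readable private information, although value frequencies are still visible to the server.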
Language
According to the contract, the final version of the server must be available in both English and Croatian. During realization only English will be used, and in the last project phase a corresponding Croatian page will be generated for every HTML page. Croatian pages will be connected to each other by hyperlinks in the same way as the corresponding English pages. Every English page will contain a link to the corresponding Croatian page and vice versa.
The characteristics and properties of the basic data mining layer are described in Details of Level A.
© 2001 LIS - Rudjer Boskovic Institute
Last modified: September 09 2015 14:17:42.