General information about the Exploratory Clustering server

The goal of clustering is to identify groups of instances such that the similarity of instances within each group is high. The task is difficult because no single similarity measure is appropriate for very different domains. Additionally, the same instances may be grouped in different ways, and the grouping that is optimal under the chosen similarity measure is not necessarily the most interesting or relevant for data analysis. For example, in a set of patient data the strongest similarity may produce a grouping based on the gender of patients, while we are actually interested in a grouping that relates to the severity of some disease.

This means that only the user can say whether a clustering result is interesting or relevant. To facilitate the evaluation of the clustering result, the Exploratory Clustering server produces a report that consists of: a) a list of the examples included in each cluster, b) a list of the most relevant attributes for this clustering result, c) typical values that the instances included in each cluster have for the most relevant attributes. Additionally, if the classification of at least a small part of the instances is known and this information is uploaded to the server, the user receives information about the distribution of classes in each constructed cluster.
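The class distribution per cluster mentioned above can be pictured with a small sketch. It is only an illustration, not the server's code: the in-memory representation (one cluster id and one optional class label per instance, with None standing for an unknown class) and the function name are assumptions.

```python
from collections import Counter

def class_distribution(cluster_of, class_of):
    """Count the known class labels inside each cluster.

    cluster_of: one cluster id per instance (assumed representation).
    class_of:   one class label per instance, or None where the class is unknown.
    """
    if len(cluster_of) != len(class_of):
        raise ValueError("one cluster id and one class entry per instance are required")
    distribution = {}
    for cluster, label in zip(cluster_of, class_of):
        counts = distribution.setdefault(cluster, Counter())
        if label is not None:  # instances without a known class are skipped
            counts[label] += 1
    return distribution

# Made-up example: six instances, two clusters, classes known for four of them.
clusters = [1, 1, 1, 2, 2, 2]
classes = ["mild", "mild", None, "severe", None, "mild"]
result = class_distribution(clusters, classes)
```

A cluster whose counts are dominated by one class can be read as relatively pure with respect to the known classifications.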

If the user is not satisfied with the result, he may start a new iteration. Two options are at his disposal.

A) If the user concludes that the constructed clusters are relevant or interesting but too general or too specific for his application, he may ask the tool either to merge the two most similar clusters in order to reduce the total number of constructed clusters, or to search for a solution consisting of more clusters. The process may be repeated until the user is satisfied with the size and/or purity of the clusters.

B) When the current set of clusters is not appropriate for merging, the user may try to construct a substantially different set of clusters. To do this, the most relevant attribute for the current solution is eliminated from the dataset and the clustering is restarted from the beginning without it. The process may be repeated A-2 times, where A is the total number of attributes in the first data layer.
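The effect of option B on the data can be sketched locally as follows. This is a rough illustration under assumptions: the table is held as a list of rows, the helper names are hypothetical, and the index of the most relevant attribute is taken from the server's report rather than computed here.

```python
def drop_attribute(rows, names, index):
    """Return copies of the rows and attribute names without column `index`."""
    new_rows = [row[:index] + row[index + 1:] for row in rows]
    new_names = names[:index] + names[index + 1:]
    return new_rows, new_names

def max_restarts(num_attributes):
    """Option B can be repeated A-2 times, as stated in the text above."""
    return max(num_attributes - 2, 0)

# Made-up example: three attributes, the second one reported as most relevant.
rows = [[1.0, "a", 5], [2.0, "b", 6]]
names = ["A1", "A2", "A3"]
reduced_rows, reduced_names = drop_attribute(rows, names, 1)
```

With three attributes only one restart is possible, after which two attributes remain.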

The user may improve the quality of clustering by optionally uploading an additional (auxiliary) data layer for the same set of instances. The number and the order of instances must be identical in both layers. When such a data layer is uploaded, the server automatically performs multi-layer clustering, in which similarity in both layers is a necessary condition for clustering instances together. The server accepts both numerical and nominal attributes, and they may include unknown values.
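Because the two layers must list the same instances in the same order, a local check before uploading may be worthwhile. The sketch below assumes that the user keeps, for each layer, a list of instance identifiers in row order; these lists and the function name are hypothetical and are not part of the files the server receives.

```python
def check_layer_alignment(ids_first, ids_auxiliary):
    """Raise ValueError unless both layers list the same instances in the same order."""
    if len(ids_first) != len(ids_auxiliary):
        raise ValueError(
            f"first layer has {len(ids_first)} instances, "
            f"auxiliary layer has {len(ids_auxiliary)}"
        )
    for position, (a, b) in enumerate(zip(ids_first, ids_auxiliary), start=1):
        if a != b:
            raise ValueError(
                f"row {position}: {a!r} in the first layer but {b!r} in the auxiliary layer"
            )

# Made-up identifiers: the same three patients in the same order in both layers.
check_layer_alignment(["p01", "p02", "p03"], ["p01", "p02", "p03"])
```

The call returns silently when the layers are aligned and reports the first mismatching row otherwise.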

To make the obtained results easier to interpret, the user may prepare and upload a file with the names of instances. Each name must be on its own row, and the number of rows must be identical to the number of instances in the data files. If no file with instance names is uploaded, examples are referenced by their position in the data files. Additionally, for each data layer the user may prepare a file with attribute names. Such a file may either contain a single row in which the number of names equals the number of columns in the corresponding data file, or list each name on a separate row. If no file with attribute names is uploaded, attributes are referenced as A1, A2, ..., An, where n is the number of attributes in the layer.
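The two accepted layouts of the attribute names file, and the default A1, A2, ..., An naming, can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, names in the single-row layout are assumed to be separated by whitespace (the text above does not specify the separator), and empty text stands for "no file uploaded".

```python
def read_attribute_names(text, num_columns):
    """Return attribute names from file content, or default names A1..An.

    text:        content of the names file; empty text means no file was given.
    num_columns: number of columns in the corresponding data file.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        # No names supplied: fall back to A1, A2, ..., An.
        names = [f"A{i}" for i in range(1, num_columns + 1)]
    elif len(lines) == 1:
        # Single-row layout: all names on one row (whitespace separator assumed).
        names = lines[0].split()
    else:
        # One name per row.
        names = lines
    if len(names) != num_columns:
        raise ValueError(f"expected {num_columns} attribute names, found {len(names)}")
    return names

default_names = read_attribute_names("", 3)
row_names = read_attribute_names("age weight height", 3)
column_names = read_attribute_names("age\nweight\nheight\n", 3)
```

Under the whitespace assumption, names that themselves contain spaces would need the one-name-per-row layout.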

The implemented clustering technique is based on two main steps: computation of an Example Similarity Table (EST), and grouping of instances that enables the maximal reduction of the Clustering-Related Variability (CRV) score measured on the example similarity table. Details of the algorithm can be found in:
D. Gamberger et al. Clusters of male and female Alzheimer’s disease patients in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database computation, Brain Informatics 2016.

Preparing the data files in the appropriate form is the most critical part of using this service. Please read the instructions very carefully. For practical reasons the server accepts only data files with up to 1000 instances and 1000 attributes. The service has a time limit of 10 minutes, and a job on a dataset with both a large number of instances and a large number of attributes may be terminated before the user receives any useful response.
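The size limits above can be checked locally before uploading. The following is a minimal sketch, assuming the data table has already been loaded into a list of rows; the function name and the in-memory representation are assumptions, and only the two limits of 1000 come from the text.

```python
MAX_INSTANCES = 1000   # limit stated in the text above
MAX_ATTRIBUTES = 1000  # limit stated in the text above

def size_problems(rows):
    """Return a list of problems; an empty list means the table fits the limits."""
    problems = []
    if len(rows) > MAX_INSTANCES:
        problems.append(f"{len(rows)} instances exceed the limit of {MAX_INSTANCES}")
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append("rows have differing numbers of attributes")
    if widths and max(widths) > MAX_ATTRIBUTES:
        problems.append(f"{max(widths)} attributes exceed the limit of {MAX_ATTRIBUTES}")
    return problems

# Made-up example: a small table that passes, and one with too many instances.
small_table = [[0.0, 1.0, 2.0] for _ in range(5)]
too_many_rows = [[0.0] for _ in range(1001)]
small_report = size_problems(small_table)
large_report = size_problems(too_many_rows)
```

Passing this check does not guarantee that the job finishes within the time limit, which depends on the server.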

There is an option to increase the reliability of clustering by estimating the similarity of instances in more detail. Use this option with care, because for large datasets it may result in a computation longer than 10 minutes and no useful response.

After clustering, all uploaded data are removed from the server. If another experiment with the same data is needed, the data file(s) and related optional files must be resubmitted.

Security information

The system does not record any user data, but it also does not include any special security measures. In principle, the user has no guarantee that his/her data will not be read and stored by the system or perhaps even by other users of the server. If this may be a problem for the user, it is his/her responsibility to code the learning examples so that the important private data cannot be reconstructed. Generally, this is not a difficult task.
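One possible way of coding nominal values before upload is sketched below. It is an illustration only, with a hypothetical function name and code format: each distinct value is replaced by an opaque code, and the mapping stays on the user's machine. Instance names can be protected more simply by not uploading a names file, since examples are then referenced by their position.

```python
def encode_nominal(values):
    """Replace each distinct nominal value with an opaque code such as 'v1'.

    Returns the encoded column and the mapping, which should be kept locally
    so that the report can be decoded afterwards.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = f"v{len(codes) + 1}"  # codes follow order of first appearance
        encoded.append(codes[value])
    return encoded, codes

# Made-up column of a nominal attribute.
encoded_column, mapping = encode_nominal(["Zagreb", "Split", "Zagreb"])
```

Equal values receive equal codes, so whether two instances agree on the attribute is preserved by this recoding.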

© 2016 LIS - Rudjer Boskovic Institute
Last modified: January 09 2017 21:21:50.