In this post, we will talk about measuring distance for categorical observations. Categorical dimensions can always be translated into numeric dimensions, and numeric distance metrics continue to be meaningful. However, for purely categorical observations there are some special metrics which can be used.
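As a minimal sketch of translating categorical dimensions into numeric ones, the snippet below one-hot encodes each observation and then applies the ordinary Euclidean distance. The data (a colour and a size dimension) and the helper names `one_hot` and `euclidean` are hypothetical, chosen only for illustration; one-hot encoding is one common translation, not the only possible one.

```python
import math

def one_hot(observation, categories_per_dim):
    """Translate a categorical observation into a 0/1 numeric vector."""
    vector = []
    for value, categories in zip(observation, categories_per_dim):
        # One slot per known category; 1.0 marks the category that is present.
        vector.extend(1.0 if value == c else 0.0 for c in categories)
    return vector

def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical data: two categorical dimensions (colour, size).
categories_per_dim = [["red", "green", "blue"], ["small", "large"]]
a = one_hot(("red", "small"), categories_per_dim)
b = one_hot(("red", "large"), categories_per_dim)
c = one_hot(("blue", "large"), categories_per_dim)

d_ab = euclidean(a, b)  # observations differ in one dimension
d_ac = euclidean(a, c)  # observations differ in both dimensions
```

Under this encoding, observations that disagree in more categorical dimensions end up farther apart, which is the sense in which the numeric metric remains meaningful.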
There are two cases for purely categorical data: where the number of dimensions is not constant across observations, and where it is. An example of the former is text documents, where the number of words is the number of dimensions in each document. Finding distances among documents is one of the most common tasks in text analytics and Natural Language Processing. An example of the latter comes from bioinformatics, where a gene sequence is a constant-length categorical sequence of genotypes.
For the purpose of demonstration in this post, we shall use the following three sentences as three observations among which we want to compute distances. The metrics discussed apply equally well to cases where the number of dimensions is constant.
One of the biggest challenges of this decade is dealing with databases that hold a variety of data types. Variety is one of the key notions in the emerging concept of big data, which is known by the 4 Vs: Volume, Velocity, Variety and Variability [1,2]. Currently, a variety of data types are available in databases, including interval-scaled variables (salary, height), binary variables (gender), categorical variables (religion: Jewish, Muslim, Christian, etc.) and mixed-type variables (multiple attributes with various types). Regardless of data type, the distance measure is a main component of distance-based clustering algorithms. Partitioning algorithms such as k-means and k-medoids, and more recently soft clustering approaches such as fuzzy c-means and rough clustering, depend mainly on distance measures to recognize clusters in a dataset.
In data mining, many techniques use distance measures to some extent. Clustering is a well-known technique for knowledge discovery in various scientific areas, such as medical image analysis [5–7], clustering gene expression data [8–10], investigating and analyzing air pollution data [11–13], power consumption analysis [14–16], and many more fields of study. Improving clustering performance has always been a target for researchers. Since similarity or dissimilarity (distance) measures are the core components of distance-based clustering algorithms, their efficiency directly influences the performance of those algorithms. These algorithms use similarity or distance measures to group similar data points into the same cluster, while dissimilar or distant data points are placed into different clusters. Examples of distance-based clustering algorithms include partitioning clustering algorithms, such as k-means and k-medoids, and hierarchical clustering.
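To make the role of the distance measure concrete, the following sketch shows the assignment step of a partitioning algorithm with the distance passed in as a parameter. The centroids, the data point, and the helper `assign` are hypothetical, and Euclidean and Manhattan distance are used here only as two familiar examples; this is an illustration under those assumptions, not the procedure used in the study.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def assign(point, centroids, distance):
    """Return the index of the centroid closest to `point` under `distance`."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Hypothetical centroids and data point.
centroids = [(3.0, 3.0), (0.0, 5.0)]
point = (0.0, 0.0)

by_euclidean = assign(point, centroids, euclidean)  # index of nearest centroid
by_manhattan = assign(point, centroids, manhattan)  # may differ from the above
```

For this particular point the two measures pick different centroids (Euclidean distances are about 4.24 and 5, Manhattan distances are 6 and 5), so the same data can be clustered differently depending on the measure chosen.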
Although various studies compare similarity/distance measures for clustering numerical data, there are two differences between this study and existing related work. First, the aim of this study is to investigate similarity/distance measures on low-dimensional and high-dimensional datasets and to analyse their behaviour in this context. Second, our datasets come from a variety of applications and domains, while other works are confined to a specific domain. In essence, the target of this research is to compare and benchmark similarity and distance measures for clustering continuous data and to examine their performance when applied to low- and high-dimensional datasets. For the sake of reproducibility, fifteen publicly available datasets [18,19] were used for this study, so that future distance measures can be evaluated and compared with the results of the traditional measures discussed here. These datasets are classified into low- and high-dimensional, and each measure is studied against each category. Before studying similarity or dissimilarity measures, however, it needs to be established that they have a significant influence on clustering quality and are worth studying. Section 3 (methodology) shows that similarity or distance measures have a significant influence on clustering results.
The key contributions of this paper are as follows:
Twelve similarity measures frequently used for clustering continuous data from various fields are compiled in this study to be evaluated in a single framework. Most of these similarity measures have not been examined in domains other than the originally proposed one.
A technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the result of distance-based clustering algorithms.
Similarity measures are evaluated on a wide variety of publicly available datasets. In particular, we evaluate and compare the performance of similarity measures for continuous data on low-dimensional and high-dimensional datasets.
The rest of the paper is organized as follows: section 2 gives a background on distance measures. Section 3 explains the methodology of the study. Experimental results and a discussion are presented in section 4, and section 5 summarizes the contributions of this study.
Background on Distance Measures for Continuous Data
The use of similarity measures is not limited to clustering; in fact, plenty of data mining algorithms use similarity measures to some extent. To reveal the influence of various distance measures on data mining, researchers have done experimental studies in various fields and have compared and evaluated the results generated by different distance measures. Although it is not practical to name a "best" similarity measure in general, a comparison study can shed light on the performance and behavior of measures. For instance, Boriah et al. conducted a comparison study on similarity measures for categorical data and evaluated them in the context of outlier detection. They concluded that the performance of an outlier detection algorithm is significantly affected by the similarity measure. In their research it was not possible to name a best-performing similarity measure, but they analyzed and reported the situations in which a measure has poor or superior performance. In another research work, Fernando et al. reviewed, compared and benchmarked binary-based similarity measures for categorical data. Using several case studies, Deshpande et al. focused on data from a single knowledge area, namely biological data, and conducted a comparison of profile similarity measures for genetic interaction networks. They concluded that the Dot Product is consistently among the best measures across different conditions and genetic interaction datasets.