# Volume 16 How To Detect And Handle Outliers 22.pdf ~REPACK~

When the data become sparse and methods based on the concept of proximity fail to maintain their viability, interpreting the data becomes difficult. This is because the sparsity of high-dimensional data can be comprehended in several ways that suggest that every data instance is an equally good anomaly with regard to the definition of a distance-based algorithm. Aggarwal and Yu [8] developed a method for outlier detection based on an understanding of the nature of projections. The method focused on lower dimensional projections that are locally sparse and cannot be recognized easily by brute-force techniques due to various possible combinations. They developed the method using a naive brute-force algorithm, which is very slow at identifying the meaningful insights because of its exhaustive search of the entire space, and another faster evolutionary algorithm to quickly discover underpinning patterns of dimensions. Due to the exponentially increasing search space, the brute-force technique is computationally weak for problems of even modest complexity. However, the technique has advantages over simple distance-based outliers that cannot overcome the effects of the curse of dimensionality. Other related studies that address the high dimensionality problem in anomaly detection are discussed below.

## Volume 16 How To Detect And Handle Outliers 22.pdf

One of the more effective ways to handle sparse data in high-dimensional space is to employ functions based on dissimilarities of the datapoints. An assessment of small portions of the larger data set could help to identify anomalies that would otherwise be obscured by other anomalies if one were to examine the entire data set as a whole. Measuring the similarity of one data instance to other within a data set is a critical part of low-dimensional anomaly detection procedures. This is because an uncommon data point possesses insufficient data instances that are alike in the data set. Several anomaly detection methods practice Manhattan or Euclidian distance to estimate similarity among data instances [35]; however, when Euclidian distance is used to measure the similarity, the results are often treated as meaningless due to the unwanted nearest neighbors from multiple dimensions. This is due to the distance between two similar data instances and the distance between two non-similar data instances can be almost equal; hence, methods such as k-nearest neighbor with O(\(n^2 m\)) runtime are not viable for high dimensionality data sets unless the runtime is improved. Nevertheless, Euclidean distance is the most common distance metric used to calculate the similarity of low-dimensional data sets. Though Euclidean distance is suitable for low-dimensional data sets, it does not work as effectively in high-dimensional data [60].

Density-based techniques deal with the dense localities of the data space, identified by various regions of lower object density. These are not effective when the dimensionality of the data increases because, as data points are scattered through a large volume, their density decreases due to the curse of dimensionality, making the relevant set of data points harder to find [7]. Chen et al. [62] introduced a density estimator for estimating measures in high-dimensional data and applied this to the problem of recognizing changes in data distribution. They approximated the probability of data \(\mu\), aiming to bypass the curse of dimensionality by utilizing the assumption that \(\mu\) is lying around a low-dimensional subset embedded in a high-dimensional space. However, the estimators they proposed for \(\mu\) are based on a geometric multiscale decomposition of the given data while controlling the overall model complexity. Chen et al. [62] proved strong finite sample performance bounds for various models and target probability measures that are based only on the intrinsic complexity of the data. The techniques implementing this construction are fast and parallelizable, and showed high accuracy in recognizing the outliers.

Wang et al. [86] presented a PCA, as well as separable compression sensing, to identify different matrices. Compressive sensing (or compressed sampling [CSG]) theory was proposed by Candes and Wakin [87], that uses a random measurement matrix to convert a high-dimensional signal to low-dimensional signal until the signal is compressible after which the original signal is restructured from the data of the low-dimensional signal. Moreover, the low-dimensional setting contains the main features of the high-dimensional signal, which means CSG can provide an effective method for anomaly detection in high-dimensional data sets. In the model of Wang et al. [86], abnormalities are more noticeable in a matrix of uncompression compared to a matrix of compression. Hence, their model could attain equal performance in volume anomaly detection, as it used the original uncompressed data and minimized the computational cost significantly.

Many anomaly detection techniques assume that data sets have only a few features, and most are aimed at identifying anomalies in low-dimensional data. Techniques that address the high dimensionality problem when data size is increasing face challenges in anomaly detection due to a range of factors, as outlined in Table 3. Anomalies are masked in high-dimensional space and are concealed in multiple high-contrast subspace projections. The selection of subspace is both vital and complex, especially when data is increasing. Fabien and Kelloer [100] proposed to estimate the contrast of subspaces by enhancing the quality of traditional anomaly rankings by calculating the scores of high-contrast projections as evaluated on real data sets. The factor that affects the nature of distance in high-dimensional space, hindering the anomaly detection process, is distance concentration: all data points become essentially equidistant [101]. Tomasev et al. [102] addressed the problem of clustering in high-dimensional data by evaluating frequently occurring points known as hubs. They developed an algorithm that proved that hubs are the best way of defining the point centrality within a high-dimensional data cluster. RadovanoviÄ‡ et al. [103] provided useful insights using data points, known as anti-hubs; although they appear very infrequently, they clearly distinguish the connection between outliers. The authors evaluated several methods, such as angle-based techniques, the classic k-NN method, density-based outlier detection, and anti-hub-based methods.

Velocity refers to the challenges associated with the continuous generation of data at high speed. A data stream is an infinite set of data instances in which each instance is a set of values with an explicit or implicit time stamp [35]. Data streams are unbounded sequences and the entry rate is continuously high, as the respective variations repeatedly change over time. Data streams reflect the important features of big data in that the aspects of both volume and velocity are temporally ordered, evolving, and potentially infinite. The data may comprise irrelevant attributes that raise problems for anomaly detection, and there are several other factors involved, such as whether the data are from a single source or multiple sources. Multiple data streams are made up of a set of data streams, and every data stream comprises an infinite sequence of data instances accompanied by an explicit or implicit time stamp history. In a single data stream, anomaly detection compares the history of data instances to determine whether an instance is an outlier or anomaly. By contrast, in multiple data streams, data instances are recognized as anomalies by comparing them to the history of data instances from the same stream or other streams. The unique challenges [35, 105, 114,115,116,117] of anomaly detection for data streams are listed in Table 5.

Most anomaly detection strategies assume that there is a finite volume of data generated by an unknown, stationary probability distribution, which can be stored and analyzed in various steps using a batch-mode algorithm [119]. Anomaly detection on streaming data is a highly complex process because the volume of data is unbounded and cannot be stored indefinitely [35]. Data streams are produced at a high rate of generation, which makes the task of detecting anomalies and extracting useful insights challenging. This is a main disadvantage that arises in several application areas [37].

Lozano and Acufia [154] designed two parallel algorithms to identify distance-based anomalies using randomization and a pruning rule to recognize density-based local anomalies. They also constructed parallel versions of Bay and Local Outlier Factor (LOF) procedures, which exhibited good performance in anomaly detection and run time. Bai et al. [34] focused on the issue of distributed density-based anomaly detection for large data. They proposed a grid-based partition algorithm as a data pre-processing technique that splits the data set into grids before distributing these to data nodes in a distributed environment. A distributed LOF computing method was presented for discovering density-based outliers in parallel by utilizing a few network communications. Reilly et al. [155] proposed a PCA-based outlier identification approach that works in a distributed environment, demonstrating robustness in its extraction of the principal components of a training set comprising outliers. Minimum volume elliptical PCA can determine principal components more vigorously in the presence of outliers by building a soft-margin, tiniest volume, ellipse around the data that lessens the effects of outliers in the training set. Local and centralized approaches to outlier detection were also studied. The projected outlier detection technique was reformulated using distributed convex optimization, which splits the issue across a number of nodes. Gunter et al. [147] explored various techniques for identifying outliers in large distributed systems and argued for a lightweight approach to enable real time analysis. No single optimal method was found; therefore, they concluded that combinations of various methods are needed due to the change in effectiveness that depends on the definition of the anomaly.