R Language Study Notes: Outlier Detection

Outlier Detection
This page shows an example of outlier detection with the LOF (Local Outlier Factor) algorithm.
Function lofactor(data, k) in packages DMwR and dprep calculates local outlier factors using the LOF algorithm, where k is the number of neighbors used in the calculation of the local outlier factors.
The LOF algorithm
LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [Breunig et al., 2000]. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (giving an LOF value greater than one), the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
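To make the density comparison concrete, here is a minimal base-R sketch of the LOF computation as I understand it from the definition. It is not the code used by DMwR or dprep. The function name lof_sketch and the toy data are made up for illustration, and Euclidean distance is assumed.

```r
# Illustrative base-R sketch of LOF (hypothetical helper, not the DMwR code).
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  diag(d) <- Inf               # a point is not its own neighbor
  n <- nrow(d)
  # k-distance: distance from each point to its k-th nearest neighbor
  kdist <- apply(d, 1, function(row) sort(row)[k])
  # neighborhood: every point within the k-distance (ties are kept)
  nbrs <- lapply(seq_len(n), function(i) which(d[i, ] <= kdist[i]))
  # local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o))
  lrd <- sapply(seq_len(n), function(i) {
    o <- nbrs[[i]]
    1 / mean(pmax(kdist[o], d[i, o]))
  })
  # LOF: mean density of the neighbors divided by the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

# Toy data: a cluster of 20 points plus one point far away (row 21)
set.seed(1)
toy <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))
scores <- lof_sketch(toy, k = 5)
which.max(scores)   # index of the point with the largest LOF
```

The far-away point has a much lower local density than its neighbors inside the cluster, so its score is well above one, while points inside the cluster score close to one.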
Calculate Outlier Scores
> library(DMwR)
> # remove "Species", which is a categorical column
> iris2 <- iris[,1:4]
> outlier.scores <- lofactor(iris2, k=5)
> plot(density(outlier.scores))
> # pick top 5 as outliers
> outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
> # which points are outliers
> print(outliers)
[1] 42 107 23 110 63
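The selection step relies on order() returning row indices rather than the scores themselves. A small sketch with made-up scores (toy.scores is hypothetical, not output from lofactor()) shows this, along with a threshold-based alternative. The cutoff of 1.5 is an arbitrary assumption, not a recommended value.

```r
# Made-up scores for five points, for illustration only
toy.scores <- c(1.0, 2.5, 0.9, 1.8, 1.1)

# order() returns positions, sorted here from highest to lowest score;
# taking the first two gives the indices of the top-2 points
top2 <- order(toy.scores, decreasing = TRUE)[1:2]
print(top2)

# Alternative: keep every point whose score exceeds a cutoff
# (1.5 is an arbitrary choice for this toy example)
above <- which(toy.scores > 1.5)
print(above)
```

A fixed count always returns the same number of points, whereas a cutoff returns however many points exceed it, which may be none.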
Visualize Outliers with Plots
Next, we show outliers with a biplot of the first two principal components.
> n <- nrow(iris2)
> labels <- 1:n
> labels[-outliers] <- "."
> biplot(prcomp(iris2), cex=.8, xlabs=labels)
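Two details of the code above may be worth checking by hand: how the labels vector ends up holding both numbers and dots, and how much of the variance the first two principal components capture. The sketch below uses only base R. The outlier indices are copied from the output printed earlier, and the name var.share is introduced here for illustration.

```r
iris2 <- iris[, 1:4]
outliers <- c(42, 107, 23, 110, 63)   # indices printed earlier in this post

n <- nrow(iris2)
labels <- 1:n
# Assigning a string coerces the whole vector to character, so the
# outliers keep their row numbers (as text) and all other points become "."
labels[-outliers] <- "."
labels[c(41, 42, 43)]

# Share of total variance explained by each principal component
pca <- prcomp(iris2)
var.share <- pca$sdev^2 / sum(pca$sdev^2)
sum(var.share[1:2])   # share captured by the two components in the biplot
```

If the first two components captured only a small share of the variance, points that look isolated in the biplot might not be isolated in the full four-dimensional data, and vice versa.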
We can also show outliers with a pairs plot as below, where
outliers are labeled with "+" in red.
> pch <- rep(".", n)
> pch[outliers] <- "+"
> col <- rep("black", n)
> col[outliers] <- "red"
> pairs(iris2, pch=pch, col=col)
Parallel Computation of LOF Scores
Package Rlof provides the function lof(), a parallel implementation of
the LOF algorithm. Its usage is similar to that of lofactor() above,
but lof() has two additional features: it supports multiple values
of k and several choices of distance metric. Below is an example
of lof().
> library(Rlof)
> outlier.scores <- lof(iris2, k=5)
> # try with different numbers of neighbors (k = 5, 6, 7, 8, 9 and 10)
> outlier.scores <- lof(iris2, k=c(5:10))
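With several values of k, each point gets one score per k, so the scores need to be combined before ranking. The sketch below assumes the result is a matrix with one row per point and one column per value of k, which is my understanding of what lof() returns and should be checked against the Rlof documentation. The matrix toy.lof and its values are made up, and taking the row maximum is just one possible way to combine the columns.

```r
# Hypothetical LOF scores: rows are points, columns are values of k.
# The numbers are made up for illustration, not produced by lof().
toy.lof <- matrix(c(1.0, 1.1, 1.0,
                    2.4, 1.9, 2.1,
                    0.9, 1.0, 1.3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(NULL, c("5", "6", "7")))

# Combine across k by taking each point's largest score
max.lof <- apply(toy.lof, 1, max)

# Rank the points from most to least outlying by that combined score
ranking <- order(max.lof, decreasing = TRUE)
print(ranking)
```

Using the maximum flags a point if it looks isolated at any neighborhood size. Replacing max with mean in the apply() call would favor points that look isolated consistently across all values of k.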