
Outlier Detection

1. What is an outlier?

From Wikipedia:

In statistics, an outlier is an observation point that is distant from other observations.

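The key word is observation, or equivalently a prior: we rely on human knowledge of the data to decide which points to remove as outliers.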

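2. Data distribution (prior)

Feature points on a tea plant are generally located on the leaves and stems, so their heights differ by no more than 5 cm.

So our expectation is: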

  • Most points fall within a 5 cm band: given a lower bound $low$, $low<x<low+5$
  • Points outside that band, $x<low$ or $x>low+5$, are classified as outliers

The problem therefore reduces to locating the band: finding $low$, or the mean, or $low+5$.
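
One possible way to locate $low$ is to slide a 5 cm window over the sorted heights and keep the window that contains the most points. The sketch below is only an illustration under that assumption, not the method used in this post. The names `find_low` and `split_outliers`, the inclusive bounds (so that the point defining `low` counts as an inlier), and the sample heights are all hypothetical. It uses only the standard library.

```python
from bisect import bisect_right

def find_low(heights, band=5.0):
    """Return the lower bound of the band-wide window that holds the most points."""
    xs = sorted(heights)
    if not xs:
        raise ValueError("heights must not be empty")
    best_low, best_count = xs[0], 0
    for i, low in enumerate(xs):
        # Points in [low, low + band] sit at positions i .. bisect_right(...) - 1.
        count = bisect_right(xs, low + band) - i
        if count > best_count:
            best_low, best_count = low, count
    return best_low

def split_outliers(heights, band=5.0):
    """Split heights into (inliers, outliers, low) using the densest window."""
    low = find_low(heights, band)
    inliers = [x for x in heights if low <= x <= low + band]
    outliers = [x for x in heights if not (low <= x <= low + band)]
    return inliers, outliers, low

# Made-up heights in cm: five points within 5 cm of each other, two far away.
heights_cm = [10.0, 11.2, 12.5, 13.0, 14.9, 30.0, 2.0]
inliers, outliers, low = split_outliers(heights_cm)
print(low, inliers, outliers)
```

Every candidate window starts at a data point, so the search only has to try as many windows as there are points. When two windows tie, the one with the smaller `low` wins.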

3. Implementation

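import numpy as np

# Sample data; 0.5 lies far from the other values
d = np.asarray([0.1, 0.2, 0.3, 0.13, 0.14, 0.5])

# Statistics of the full sample, computed before any filtering
mean = np.mean(d)
var = np.var(d)
std = np.std(d)

# Keep only the points strictly within one standard deviation of the mean
d = d[(d > mean - std) & (d < mean + std)]

print(f'mean: {mean}')
print(f'var: {var}')
print(f'std: {std}')
print(d)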
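
The width of the mean ± std band decides what gets removed, so it can help to make that width a parameter. The following is a minimal sketch, not part of the original code: the function name `filter_by_std` and the multiplier `k` are assumptions made for illustration. It uses the standard library's `statistics.pstdev`, which is the population standard deviation and so matches the default behaviour of `np.std`.

```python
from statistics import fmean, pstdev

def filter_by_std(values, k=1.0):
    """Keep the values strictly inside (mean - k*std, mean + k*std)."""
    mean = fmean(values)
    std = pstdev(values)  # population std, same as np.std with its default ddof=0
    return [v for v in values if mean - k * std < v < mean + k * std]

d = [0.1, 0.2, 0.3, 0.13, 0.14, 0.5]
print(filter_by_std(d, k=1.0))  # drops only 0.5
print(filter_by_std(d, k=2.0))  # keeps every value
```

With `k=1.0` this reproduces the filter above. With `k=2.0` the value 0.5 survives, because that single point inflates the standard deviation enough to fall inside the wider band.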

4. Ref

https://journals.sagepub.com/doi/pdf/10.1177/1094428112470848?casa_token=TRmPuCORiVsAAAAA:I8uslVeIpxILg8b5CK0eVAsFO3q6vgntik8xNM4esP_4jETnXM6x0Luxuqn1RWK9g8EaGQGbn9bqVA
