1. What is an outlier
From Wikipedia:
In statistics, an outlier is an observation point that is distant from other observations.
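The key word is "observation", or rather the prior: we rely on human (domain) knowledge to decide which points to remove as outliers.
2. Data distribution (prior)
Feature points on a tea plant are generally distributed over the leaves and stems, and their heights differ by no more than 5 cm.
So our expectation is:
- Most points lie within a 5 cm band: assuming a lower bound $low$, $low<x<low+5$
- Points outside that range, $x<low$ or $x>low+5$, are classified as outliers
The problem therefore reduces to finding $low$, or the mean, or $low+5$.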
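One possible way to find $low$ is sketched below. It is not the method implemented later in this note; it is a minimal illustration that assumes heights are given in centimetres, that the input is non-empty, and that the best $low$ is the one whose 5 cm band contains the most points. The function name `find_low`, the `band` parameter, and the sample heights are hypothetical. Because each candidate $low$ is itself a data point, the sketch treats the band as inclusive at both ends.

```python
import numpy as np

def find_low(heights, band=5.0):
    """Return (low, count): the lower bound whose window [low, low + band]
    contains the most points. Hypothetical helper, for illustration only."""
    x = np.sort(np.asarray(heights, dtype=float))
    # For each candidate low = x[i], find how many sorted values are <= x[i] + band.
    upper = np.searchsorted(x, x + band, side="right")
    # Points inside the window starting at x[i] are those with index i .. upper[i] - 1.
    counts = upper - np.arange(x.size)
    best = int(np.argmax(counts))
    return x[best], int(counts[best])

# Made-up sample heights in cm: five points close together plus two stray points.
heights_cm = np.array([12.0, 13.5, 14.2, 15.0, 16.1, 30.0, 2.0])
low, n_inside = find_low(heights_cm)

inside = (heights_cm >= low) & (heights_cm <= low + 5.0)
inliers = heights_cm[inside]
outliers = heights_cm[~inside]
print(f'low: {low}, points inside band: {n_inside}')
print(f'outliers: {outliers}')
```

With this sample the densest band starts at 12.0 and holds five points, so 30.0 and 2.0 are flagged. If two bands tie, `np.argmax` picks the lowest one.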
3. Implementation
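```python
import numpy as np

d = np.asarray([0.1, 0.2, 0.3, 0.13, 0.14, 0.5])

# Statistics of the full sample, computed before any filtering
mean = np.mean(d)
var = np.var(d)
std = np.std(d)

# Keep only the points strictly within one standard deviation of the mean
d = d[d > (mean - std)]
d = d[d < (mean + std)]

print(f'mean: {mean}')
print(f'var: {var}')
print(f'std: {std}')
print(d)
```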
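The same mean ± std filter can be wrapped in a small function that also returns a boolean mask, which makes it easy to see which points were rejected. This is only a sketch: the name `filter_by_std` and the width multiplier `k` are assumptions added for illustration, and `k=1.0` reproduces the one-standard-deviation band used above.

```python
import numpy as np

def filter_by_std(values, k=1.0):
    """Keep values strictly inside (mean - k*std, mean + k*std).
    Hypothetical helper; k=1.0 matches the snippet above."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), same as np.std(d)
    mask = (values > mean - k * std) & (values < mean + k * std)
    return values[mask], mask

d = np.asarray([0.1, 0.2, 0.3, 0.13, 0.14, 0.5])
kept, mask = filter_by_std(d)
print(f'kept: {kept}')
print(f'rejected: {d[~mask]}')
```

For the sample array only 0.5 falls outside the band. A larger `k` widens the band and rejects fewer points.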
4. Ref
https://journals.sagepub.com/doi/pdf/10.1177/1094428112470848?casa_token=TRmPuCORiVsAAAAA:I8uslVeIpxILg8b5CK0eVAsFO3q6vgntik8xNM4esP_4jETnXM6x0Luxuqn1RWK9g8EaGQGbn9bqVA