Outlier Detection/Removal Algorithm


So, I’m going to give you a practical way to detect outliers that works with almost every machine learning algorithm. It’s actually really straightforward, and it’s very, very beautiful. Suppose you have this wonderful data set over here, with one outlier. Obviously, you don’t know which point is the outlier, because you haven’t even discovered the linear structure of the data set yet.

The algorithm is very simple. Step one is Train: train with all the data. In our case, that would be linear regression. Step two is Remove: after training, find the points in your training set with the highest residual error and remove those, usually about 10% of your data points. And step three is Train Again, now using the reduced data set. You can actually repeat this if you want, and do it multiple times.

In our example over here, what it means is, the first time we run the regression, we get something that looks approximately like this. And while this is not a good regression, it is good enough to recognize that, if you look at the residual errors of all the data points, this one over here has the largest. There happen to be ten points, so removing 10% would remove exactly one point. So we take this point out over here. Our new regression line would look pretty much like this, which is what you want.
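The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the lecture: the data values, the helper names `fit_line` and `remove_worst_residuals`, and the use of NumPy's `np.polyfit` as the linear regression are all assumptions made for the example.

```python
import numpy as np

def fit_line(x, y):
    """Steps one and three: ordinary least-squares fit of a straight line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def remove_worst_residuals(x, y, fraction=0.1):
    """Step two: drop the given fraction of points with the largest residual error."""
    slope, intercept = fit_line(x, y)
    residuals = np.abs(y - (slope * x + intercept))
    n_remove = int(round(fraction * len(x)))
    # Indices of the points with the smallest residuals, restored to original order.
    keep = np.argsort(residuals)[: len(x) - n_remove]
    keep.sort()
    return x[keep], y[keep]

# Hypothetical data: ten points on y = 2x + 1, with one point moved far off the line.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] = 40.0  # the outlier

# Train, remove 10% (exactly one point out of ten), then train again.
x_clean, y_clean = remove_worst_residuals(x, y, fraction=0.1)
slope, intercept = fit_line(x_clean, y_clean)
print(slope, intercept)
```

To repeat the procedure, as the transcript suggests, you could call `remove_worst_residuals` again on `x_clean` and `y_clean`. Note that every pass discards the same fraction of points whether or not they are true outliers, so the number of passes is a judgment call.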

2 thoughts to “Outlier Detection/Removal Algorithm”

  1. Don't you think this will bias the network toward a particular set of values, and that it will continue in a loop, removing even genuine outlier values?
