Handle outliers with log-based normalization

Carl Gold, Chief Data Scientist at Zuora, recommends log-based normalization for compressing outliers. Because the logarithm of zero is undefined, add one to every value before taking the log. After using the log to compress the data, you can then apply standard normalization techniques, such as subtracting the mean and dividing by the standard deviation.

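import numpy as np
# Add one before taking the log so that zero counts stay finite
log_views = np.log(df.view_count + 1)
# Standardize: subtract the mean and divide by the standard deviation
log_series = (log_views - log_views.mean()) / log_views.std()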
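To see the effect on concrete numbers, here is a minimal self-contained sketch of the same two steps. The view counts are made up for illustration, and np.log1p(x) is used as a shorthand for np.log(x + 1).

```python
import numpy as np

# Hypothetical view counts: one zero and one extreme outlier.
view_count = np.array([0, 3, 12, 25, 40, 1_000_000], dtype=float)

# log1p(x) computes log(1 + x), so a count of zero maps to 0 instead of -inf.
logged = np.log1p(view_count)

# Standardize: subtract the mean, then divide by the standard deviation.
standardized = (logged - logged.mean()) / logged.std()

print(logged.round(2))
print(standardized.round(2))
```

After the log, the outlier is about 14 rather than 1,000,000, so it no longer dwarfs the other values before standardization is applied.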

Alternatively, you could handle outliers with Winsorization, which replaces the most extreme values in a dataset, those falling beyond a chosen percentile, with the nearest value that remains inside it. To implement it with scipy, pass the fraction of values to replace at the low and high ends through the limits parameter.

from scipy.stats.mstats import winsorize
a = np.array([10, 4, 9, 8, 5, 3, 7, 2, 1, 6])
winsorize(a, limits=[0.1, 0.2])

The call returns [8, 4, 8, 8, 5, 3, 7, 2, 2, 6]: the lowest 10% of values (i.e., 1) are replaced by the next lowest value (2), and the highest 20% (i.e., 9 and 10) are replaced by the next highest value (8). By default the input array a itself is left unchanged.
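When only the upper tail contains outliers, as is typical for count data, one option is to set the lower limit to zero so that only the largest values are capped. The sketch below assumes a made-up array of view counts and caps the top 10%.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical view counts with a single extreme value at the top.
view_count = np.array([0, 3, 5, 8, 12, 25, 40, 60, 90, 1_000_000])

# limits=[0, 0.1]: leave the low end alone, replace the top 10% of values.
capped = winsorize(view_count, limits=[0, 0.1])

# winsorize returns a masked array; .data gives the underlying ndarray.
print(capped.data)        # [ 0  3  5  8 12 25 40 60 90 90]
print(view_count.max())   # 1000000, the input is not modified
```

Unlike the log transform, this changes only the capped values and leaves the rest of the distribution as it was.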
