Improving Your Target Analysis with pandas qcut Function

In this excellent video about improving your machine learning workflow, Rebecca Bilbro offers the sage advice to check out your target column before doing feature analysis.

Begin with the end in mind — this way you set up a solid understanding of the target variable before jumping into your effort to predict or classify it. Taking this approach helps you identify potentially thorny problems (e.g. class imbalance) up front.
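
For a categorical target, one quick way to spot class imbalance is to look at the share of rows in each class. The snippet below is a minimal sketch: the `churned` series and its values are made up for illustration and stand in for your own target column.

```python
import pandas as pd

# Hypothetical binary target (assumption); replace with your own target column.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="churned")

# Fraction of rows in each class; a lopsided split signals class imbalance.
class_shares = target.value_counts(normalize=True)
print(class_shares)
```

If one class holds the large majority of rows, that is worth knowing before you pick a metric or a sampling strategy.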

If you’re dealing with a continuous variable, it may be useful to bin your values. Working with five bins lets you view the target through the lens of the Pareto principle, for example by checking how much of the total the top 20% of rows account for. To create quintiles, use the pandas qcut function:

amount_quintiles = pd.qcut(df.amount, q=5)
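
To try this end to end, here is a self-contained sketch on synthetic data. The skewed `amount` column, the random seed, and the `amount_quintile` column name are all assumptions made for illustration. Passing `labels=False` makes qcut return integer bin indices instead of interval labels, which can be easier to filter on.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "amount" column standing in for a real target (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({"amount": rng.lognormal(mean=3.0, sigma=1.0, size=1000)})

# labels=False returns the bin index: 0 is the bottom quintile, 4 is the top.
df["amount_quintile"] = pd.qcut(df["amount"], q=5, labels=False)

# Number of rows that landed in each quintile.
counts = df["amount_quintile"].value_counts().sort_index()
print(counts)

# Share of the total amount contributed by each quintile (a Pareto-style check).
share = df.groupby("amount_quintile")["amount"].sum() / df["amount"].sum()
print(share)
```

On skewed data like this, the top quintile typically accounts for far more than 20% of the total, which is the kind of pattern the Pareto principle describes.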

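Each bin will contain roughly 20% of your dataset; the split is exact only when the row count divides evenly by five and no values tie at a bin edge. If the data contains so many repeated values that two bin edges coincide, qcut raises an error unless you pass duplicates="drop". Comparing the top quintile of your target variable against the bottom quintile often surfaces features that differ sharply between the two groups. This technique serves as a good starting point for determining what might be anomalous about the top (or bottom) performers within your target variable.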
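
One way to make that comparison concrete is to put the feature means of the two groups side by side. This is only a sketch on synthetic data: the `visits` and `tenure_months` features, and the way `amount` is generated from `visits`, are hypothetical choices made so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): "amount" is the target, the other columns are made-up features.
rng = np.random.default_rng(42)
n = 1000
visits = rng.poisson(lam=5, size=n)
df = pd.DataFrame({
    "visits": visits,
    "tenure_months": rng.integers(1, 60, size=n),
    # The target is driven by visits plus noise, so visits should differ across quintiles.
    "amount": visits * 10 + rng.normal(0, 5, size=n),
})

# Assign each row to a quintile of the target (0 = bottom, 4 = top).
df["amount_quintile"] = pd.qcut(df["amount"], q=5, labels=False)

bottom = df[df["amount_quintile"] == 0]
top = df[df["amount_quintile"] == 4]

# Mean of each feature in the bottom and top quintiles, plus the gap between them.
features = ["visits", "tenure_months"]
comparison = pd.DataFrame({
    "bottom_quintile_mean": bottom[features].mean(),
    "top_quintile_mean": top[features].mean(),
})
comparison["difference"] = comparison["top_quintile_mean"] - comparison["bottom_quintile_mean"]
print(comparison)
```

Features with a large gap between the two groups are reasonable candidates for a closer look; a gap by itself does not establish that the feature drives the target.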
