Identifying “At-Risk” Customers with Python and Pandas

We love our customers - especially when they have been growing with us over the years. I’ve spent many hours looking through data like this and it makes me proud to work for a company whose customers get continuous value from its products.

Some customers, however, and for various reasons, show recent periods of negative growth or a plateau in their storage growth. Obsessed with ensuring the best customer fit with our portfolio, we like to identify these anomalies and personally reach out to the customers.

The below is a simple example of identifying “at-risk” customers based on the trend of their last four months’s invoice amount (MRR). In future posts, I’ll explore other methods.

This example covers:

  1. Loading data from our SQL database into a Pandas DataFrame
  2. Grouping the data by customer by month and calculating the total invoice amount
  3. Writing a function that calculates a best fit line to a series and returns its slope and r-squared value
  4. Applying this function to each customer
  5. Filtering for actionable results, in this case, customers to be researched, called or emailed

1. Loading Data

Populate a Pandas DataFrame with the results of a SQL query. In this case, the query returns a subset of line items from customer invoices along with classification tags. Note about 6.1M rows and 24 features (columns) returned.

>>> df = pd.read_sql (<SQL_QUERY>, <DATABASE_CONNECTION>)
>>> df.shape
(6152682, 24)

A sample of the returned DataFrame is as follows. For this post, the interesting columns are cus (unique customer id), monc (last day of the month the invoice was created), and amount (total amount of the line item):

>>> df.cus.nunique ()

>>> df.monc.max ()
Timestamp('2017-05-31 00:00:00')

>>> df[['cus', 'monc', 'amount']].sample (5)
cus monc amount
2173 74623XXXX 2015-01-31 4.000000
8545 62785XXXX 2015-06-30 22.702842
245 90520XXXX 2014-12-31 3.000000
2951 88383XXXX 2014-03-31 4.000000
4527 12940XXXX 2017-04-30 4.000000

2. Grouping

Since the above DataFrame represents the line items of an invoice, group by customer by month created - aggregating the amount field by sum.

>>> a = df.groupby (['cus', 'monc']).amount.sum ().unstack ()
>>> a.shape
(41823, 99)

>>> a.sample (5)


monc 2016-12-31 2017-01-31 2017-02-28 2017-03-31 2017-04-30
74062XXXX NaN NaN NaN NaN NaN
47983XXXX 14.226606 14.078619 14.112420 14.907932 14.962422
39993XXXX NaN NaN NaN NaN NaN
11889XXXX 343.551325 345.919253 351.929715 358.429962 371.082920
11464XXXX 22.720224 22.849535 22.850852 23.010361 23.126976

Note how now each row can be represented as a series of monthly totals for a customer.

3. Writing a function that calculates a best fit line to a series and returns its slope and r-squared value

from scipy import stats
def get_tendency (a, n = 4):
	b = a[-n:].dropna ()
	if len (b) > 2:
		c = stats.linregress (b.reset_index (drop = True).reset_index ())
		return pd.Series ([c.slope, c.rvalue**2])
		return np.nan

4. Applying this function

With Pandas, this is simple and is calculated relatively fast. After calculation, note the column names m (slope) and rsq (r-squared value)

b = a.apply (get_tendency, axis = 1)
b.columns = ['m', 'rsq']

>>> b.shape
(41782, 2)

>>> b.dropna ().sample (5)
m rsq
87270XXXX 0.338661 0.282983
123863XXX -0.885078 0.572635
125119XXX 1.800000 0.600000
52400XXXX 0.080003 0.843176
121516XXX 0.290487 0.796639

5. Filtering for actionable results

Let’s start by targeting the top 10 customers with a negative MRR growth trend and r-squared value over 0.9.

>> b[(b.m < 0) & (b.rsq >= .9)].sort_values ('m').head (10)
                    m       rsq
92091XXXX -310.060129  0.945718
124653XXX  -98.439050  0.957354
62329XXXX  -95.104875  0.934173
61190XXXX  -88.285062  0.929398
115290XXX  -84.566811  0.918093
118413XXX  -63.178696  0.951263
76597XXXX  -61.686508  0.925345
70253XXXX  -52.739355  0.913199
124036XXX  -43.834437  0.901635
72208XXXX  -42.247019  0.935996

Ensuring Customers Receive Value from Your Business’ Products

It’s immensely important for a product company to continually ensure that its customers are receiving value from its products. Identifying anomalies in usage and appropriately reaching out to customers for feedback (and acting on this feedback) is key to long-term success.


  • MRR: Monthly Recurring Revenue
  • Churn: Churn rate, the percentage rate at which customers stop subscribing to a service.
  • SQL Query: Structured Query Language. Used for managing data held in relational databases. e.g. select * from people where awesome = True
  • Pandas DataFrame: Similar to an R Data Frame. A Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary Pandas data structure.

Protect Your Business Data

We are passionate about helping our customers protect their data. We want you to use Jungle Disk to protect yours. Click on Sign Up to get started. It takes less than 5 minutes!

Sign Up