You are now subscribed to our monthly blog digest. Happy reading!

Click anywhere to dismiss ...

Subscribe to the Jungle Disk monthly blog digest:


Identifying "At-Risk" Customers with Python and Pandas

by Michael DeFelice / Behind The Scenes, Data Science, Service Delivery, Strategy / May 12, 2017 / Comments

We love our customers - especially when they have been growing with us over the years. I’ve spent many hours looking through data like this and it makes me proud to work for a company whose customers get continuous value from its products.

Some customers, however, and for various reasons, show recent periods of negative growth or a plateau in their storage growth. Obsessed with ensuring the best customer fit with our portfolio, we like to identify these anomalies and personally reach out to the customers.

The below is a simple example of identifying “at-risk” customers based on the trend of their last four months’s invoice amount (MRR). In future posts, I’ll explore other methods.

This example covers:

  1. Loading data from our SQL database into a Pandas DataFrame
  2. Grouping the data by customer by month and calculating the total invoice amount
  3. Writing a function that calculates a best fit line to a series and returns its slope and r-squared value
  4. Applying this function to each customer
  5. Filtering for actionable results, in this case, customers to be researched, called or emailed

1. Loading Data

Populate a Pandas DataFrame with the results of a SQL query. In this case, the query returns a subset of line items from customer invoices along with classification tags. Note about 6.1M rows and 24 features (columns) returned.

>>> df = pd.read_sql (<SQL_QUERY>, <DATABASE_CONNECTION>)
>>> df.shape
(6152682, 24)

A sample of the returned DataFrame is as follows. For this post, the interesting columns are cus (unique customer id), monc (last day of the month the invoice was created), and amount (total amount of the line item):

>>> df.cus.nunique ()

>>> df.monc.max ()
Timestamp('2017-05-31 00:00:00')

>>> df[['cus', 'monc', 'amount']].sample (5)
            cus        monc     amount
2173  74623XXXX  2015-01-31   4.000000
8545  62785XXXX  2015-06-30  22.702842
245   90520XXXX  2014-12-31   3.000000
2951  88383XXXX  2014-03-31   4.000000
4527  12940XXXX  2017-04-30   4.000000

2. Grouping

Since the above DataFrame represents the line items of an invoice, group by customer by month created - aggregating the amount field by sum.

>>> a = df.groupby (['cus', 'monc']).amount.sum ().unstack ()
>>> a.shape
(41823, 99)

>>> a.sample (5)


monc       2016-12-31  2017-01-31  2017-02-28  2017-03-31  2017-04-30
74062XXXX         NaN         NaN         NaN         NaN         NaN
47983XXXX   14.226606   14.078619   14.112420   14.907932   14.962422
39993XXXX         NaN         NaN         NaN         NaN         NaN
11889XXXX  343.551325  345.919253  351.929715  358.429962  371.082920
11464XXXX   22.720224   22.849535   22.850852   23.010361   23.126976

Note how now each row can be represented as a series of monthly totals for a customer.

3. Writing a function that calculates a best fit line to a series and returns its slope and r-squared value

from scipy import stats
def get_tendency (a, n = 4):
	b = a[-n:].dropna ()
	if len (b) > 2:
		c = stats.linregress (b.reset_index (drop = True).reset_index ())
		return pd.Series ([c.slope, c.rvalue**2])
		return np.nan

4. Applying this function

With Pandas, this is simple and is calculated relatively fast. After calculation, note the column names m (slope) and rsq (r-squared value)

b = a.apply (get_tendency, axis = 1)
b.columns = ['m', 'rsq']

>>> b.shape   
(41782, 2)

>>> b.dropna ().sample (5)
                  m       rsq
87270XXXX  0.338661  0.282983
123863XXX -0.885078  0.572635
125119XXX  1.800000  0.600000
52400XXXX  0.080003  0.843176
121516XXX  0.290487  0.796639

5. Filtering for actionable results

Let’s start by targeting the top 10 customers with a negative MRR growth trend and r-squared value over 0.9.

>> b[(b.m < 0) & (b.rsq >= .9)].sort_values ('m').head (10)
                    m       rsq
92091XXXX -310.060129  0.945718
124653XXX  -98.439050  0.957354
62329XXXX  -95.104875  0.934173
61190XXXX  -88.285062  0.929398
115290XXX  -84.566811  0.918093
118413XXX  -63.178696  0.951263
76597XXXX  -61.686508  0.925345
70253XXXX  -52.739355  0.913199
124036XXX  -43.834437  0.901635
72208XXXX  -42.247019  0.935996

Ensuring Customers Receive Value from Your Business’ Products

It’s immensely important for a product company to continually ensure that its customers are receiving value from its products. Identifying anomalies in usage and appropriately reaching out to customers for feedback (and acting on this feedback) is key to long-term success.


  • MRR: Monthly Recurring Revenue
  • Churn: Churn rate, the percentage rate at which customers stop subscribing to a service.
  • SQL Query: Structured Query Language. Used for managing data held in relational databases. e.g. select * from people where awesome = True
  • Pandas DataFrame: Similar to an R Data Frame. A Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary Pandas data structure.