Industry News by Michael DeFelice May 12, 2017 Identifying "At-Risk" Customers with Python and Pandas We love our customers - especially when they have been growing with us over the years. I’ve spent many hours looking through data like this and it makes me proud to work for a company whose customers get continuous value from its products. Some customers, however, and for various reasons, show recent periods of negative growth or a plateau in their storage growth. Obsessed with ensuring the best customer fit with our portfolio, we like to identify these anomalies and personally reach out to the customers. The below is a simple example of identifying “at-risk” customers based on the trend of their last four months’s invoice amount (MRR). In future posts, I’ll explore other methods. This example covers: Loading data from our SQL database into a Pandas DataFrame Grouping the data by customer by month and calculating the total invoice amount Writing a function that calculates a best fit line to a series and returns its slope and r-squared value Applying this function to each customer Filtering for actionable results, in this case, customers to be researched, called or emailed 1. Loading Data Populate a Pandas DataFrame with the results of a SQL query. In this case, the query returns a subset of line items from customer invoices along with classification tags. Note about 6.1M rows and 24 features (columns) returned. >>> df = pd.read_sql (<SQL_QUERY>, <DATABASE_CONNECTION>) >>> df.shape (6152682, 24) A sample of the returned DataFrame is as follows. For this post, the interesting columns are cus (unique customer id), monc (last day of the month the invoice was created), and amount (total amount of the line item): >>> df.cus.nunique () 41823 >>> df.monc.max () Timestamp('2017-05-31 00:00:00') >>> df[['cus', 'monc', 'amount']].sample (5) cus monc amount 2173 74623XXXX 2015-01-31 4.000000 8545 62785XXXX 2015-06-30 22.702842 245 90520XXXX 2014-12-31 3.000000 2951 88383XXXX 2014-03-31 4.000000 4527 12940XXXX 2017-04-30 4.000000 2. Grouping Since the above DataFrame represents the line items of an invoice, group by customer by month created - aggregating the amount field by sum. >>> a = df.groupby (['cus', 'monc']).amount.sum ().unstack () >>> a.shape (41823, 99) >>> a.sample (5) ... monc 2016-12-31 2017-01-31 2017-02-28 2017-03-31 2017-04-30 cus 74062XXXX NaN NaN NaN NaN NaN 47983XXXX 14.226606 14.078619 14.112420 14.907932 14.962422 39993XXXX NaN NaN NaN NaN NaN 11889XXXX 343.551325 345.919253 351.929715 358.429962 371.082920 11464XXXX 22.720224 22.849535 22.850852 23.010361 23.126976 Note how now each row can be represented as a series of monthly totals for a customer. 3. Writing a function that calculates a best fit line to a series and returns its slope and r-squared value from scipy import stats def get_tendency (a, n = 4): b = a[-n:].dropna () if len (b) > 2: c = stats.linregress (b.reset_index (drop = True).reset_index ()) return pd.Series ([c.slope, c.rvalue**2]) else: return np.nan 4. Applying this function With Pandas, this is simple and is calculated relatively fast. After calculation, note the column names m (slope) and rsq (r-squared value) b = a.apply (get_tendency, axis = 1) b.columns = ['m', 'rsq'] >>> b.shape (41782, 2) >>> b.dropna ().sample (5) m rsq cus 87270XXXX 0.338661 0.282983 123863XXX -0.885078 0.572635 125119XXX 1.800000 0.600000 52400XXXX 0.080003 0.843176 121516XXX 0.290487 0.796639 5. Filtering for actionable results Let’s start by targeting the top 10 customers with a negative MRR growth trend and r-squared value over 0.9. >> b[(b.m < 0) & (b.rsq >= .9)].sort_values ('m').head (10) m rsq cus 92091XXXX -310.060129 0.945718 124653XXX -98.439050 0.957354 62329XXXX -95.104875 0.934173 61190XXXX -88.285062 0.929398 115290XXX -84.566811 0.918093 118413XXX -63.178696 0.951263 76597XXXX -61.686508 0.925345 70253XXXX -52.739355 0.913199 124036XXX -43.834437 0.901635 72208XXXX -42.247019 0.935996 Ensuring Customers Receive Value from Your Business’ Products It’s immensely important for a product company to continually ensure that its customers are receiving value from its products. Identifying anomalies in usage and appropriately reaching out to customers for feedback (and acting on this feedback) is key to long-term success. Definitions MRR: Monthly Recurring Revenue Churn: Churn rate, the percentage rate at which customers stop subscribing to a service. SQL Query: Structured Query Language. Used for managing data held in relational databases. e.g. select * from people where awesome = True Pandas DataFrame: Similar to an R Data Frame. A Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary Pandas data structure.