
Consumer expenditure categorization using FastProp and Relboost

Written by

Patrick Urbanke

Published on

October 22, 2019

Predict which item purchases are gifts. This analysis is based on a public domain data set provided by the U.S. Bureau of Labor Statistics.

The consumer expenditure data set lets us analyze consumers' consumption patterns to predict whether an item was purchased as a gift. We train two prediction pipelines using two of getML's algorithms: the first pipeline uses FastProp, a propositionalization algorithm, and the second uses Relboost, a relational learning algorithm. We show that with relational learning, we can get an AUC of over 90%. The learned features would have been impossible to build by hand or by using brute-force approaches.


Summary:

Prediction type: Classification model
Domain: Retail
Prediction target: Whether a purchase is a gift
Source data: Relational data set, 4 tables
Population size: 398,895
Used algorithms: FastProp, Relboost


The challenge

The Consumer Expenditure Data Set is a public domain data set provided by the U.S. Bureau of Labor Statistics (https://www.bls.gov/cex/pumd.htm). It includes diary entries: American consumers are asked to keep diaries of the products they have purchased each month.

These consumer goods are categorized using a six-digit classification system, the UCC (Universal Classification Code). This system is hierarchical, meaning that each additional digit narrows the category further.

For instance, all UCC codes beginning with ‘200’ represent beverages. UCC codes beginning with ‘20011’ represent beer: ‘200111’ stands for ‘beer and ale’ and ‘200112’ for ‘nonalcoholic beer’ (https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx).
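The hierarchy can be made concrete with a small Python sketch that splits a UCC code into its nested prefixes. The prefix lengths (3, 5 and 6 digits) are an assumption read off the examples above, not an official BLS specification, and the helper names are hypothetical.

```python
# Labels taken from the examples in the text; everything else is illustrative.
LABELS = {
    "200": "beverages",
    "20011": "beer",
    "200111": "beer and ale",
    "200112": "nonalcoholic beer",
}


def ucc_levels(ucc, prefix_lengths=(3, 5, 6)):
    """Return the prefixes of a six-digit UCC code, from coarse to fine.

    The default prefix lengths are an assumption based on the examples above.
    """
    if len(ucc) != 6 or not ucc.isdigit():
        raise ValueError(f"expected a six-digit code, got {ucc!r}")
    return [ucc[:n] for n in prefix_lengths]


def describe(ucc):
    """Look up the known label for each level of the code."""
    return [LABELS.get(prefix, "unknown") for prefix in ucc_levels(ucc)]


print(ucc_levels("200111"))
print(describe("200112"))
```

Treating the prefixes as separate categorical columns is one simple way to let a model generalize across related products.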

The diaries also contain a flag that indicates whether the product was purchased as a gift. The challenge is to predict that flag using other information in the diary entries.


This can be done based on the following considerations:

Some items are less likely to be purchased as gifts than others (for instance, it is unlikely that toilet paper is ever purchased as a gift).
Items that diverge from the usual consumption patterns are more likely to be gifts.
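As a rough illustration of these two considerations, the sketch below builds one hand-made feature for each on a toy data set. The household IDs, the code ‘390110’ and all values are made up, and grouping by the first three digits of the UCC is an arbitrary choice. This is not how getML derives its features.

```python
from collections import Counter, defaultdict

# Hypothetical toy diary entries: (household_id, ucc, is_gift)
entries = [
    ("h1", "200111", 0),
    ("h1", "200111", 0),
    ("h1", "200112", 0),
    ("h1", "390110", 1),
    ("h2", "390110", 0),
    ("h2", "390110", 0),
    ("h2", "200111", 1),
]


def category(ucc, n_digits=3):
    """Coarse product category: the leading digits of the UCC code."""
    return ucc[:n_digits]


# Consideration 1: how often is an item of this category a gift?
totals = Counter()
gifts = Counter()
for _, ucc, is_gift in entries:
    totals[category(ucc)] += 1
    gifts[category(ucc)] += is_gift
gift_rate = {cat: gifts[cat] / totals[cat] for cat in totals}

# Consideration 2: how typical is this category for the household?
per_household = defaultdict(Counter)
for household, ucc, _ in entries:
    per_household[household][category(ucc)] += 1


def household_share(household, ucc):
    """Share of the household's purchases that fall into the item's category."""
    counts = per_household[household]
    return counts[category(ucc)] / sum(counts.values())


print(gift_rate)
print(household_share("h1", "390110"))
```

A low household share marks a purchase that diverges from the household's usual pattern. In practice, a rate computed from the target would have to be estimated on training data only to avoid leakage.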


In total, there are three tables which we find interesting:

EXPD, which contains information on the consumer expenditures, including the target variable GIFT.
FMLD, which contains socio-demographic information on the households.
MEMD, which contains socio-demographic information on each member of the households.
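To give an idea of what propositionalization over these tables means, here is a minimal plain-Python sketch that aggregates member-level rows onto each expenditure row. The column names (household_id, ucc, cost, age) and all values are assumptions for illustration; the sketch does not use getML and is far simpler than FastProp.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical miniature tables; column names and values are assumptions.
expd = [
    {"household_id": 1, "ucc": "200111", "cost": 7.5},
    {"household_id": 2, "ucc": "200112", "cost": 4.0},
]
memd = [
    {"household_id": 1, "age": 34},
    {"household_id": 1, "age": 36},
    {"household_id": 1, "age": 5},
    {"household_id": 2, "age": 71},
]

# Group the peripheral table (MEMD) by the join key.
member_ages = defaultdict(list)
for row in memd:
    member_ages[row["household_id"]].append(row["age"])

# Each aggregation turns the matching peripheral rows into one flat column.
AGGREGATIONS = {"count": len, "avg": mean, "min": min, "max": max}


def propositionalize(population_rows):
    """Attach aggregated member features to every population row."""
    flat = []
    for row in population_rows:
        ages = member_ages.get(row["household_id"], [])
        features = dict(row)
        for name, aggregate in AGGREGATIONS.items():
            features[f"member_age_{name}"] = aggregate(ages) if ages else None
        flat.append(features)
    return flat


flat_table = propositionalize(expd)
for row in flat_table:
    print(row)
```

The result is a single flat table that any standard classifier can consume. A propositionalization algorithm generates a large number of such aggregations automatically instead of relying on a handful of hand-picked ones.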


Result

Without hyperparameter optimization, getML's FastProp achieves an AUC of ~0.91, whereas Relboost achieves an AUC of ~0.92. We transpile the learned features of both pipelines to SQL. This is what a FastProp feature looks like:

A feature generated by Relboost looks like this when transpiled to SQL:


The feature learned by Relboost is mainly based on UCC codes: it uses the UCC code of the product in question (marked t1.UCC) and compares it to the UCC codes of other products that the household has purchased (marked t2.UCC). This means that both the product itself and the household's usual consumption patterns help predict whether this item was purchased as a gift.
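For readers less familiar with the AUC metric quoted in this section, the following sketch computes it from its pairwise definition: the probability that a randomly chosen gift receives a higher score than a randomly chosen non-gift. The labels and scores are made up, and the quadratic loop is meant for illustration rather than for data of this size.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: 1 = purchased as a gift, 0 = not a gift.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values around 0.91 to 0.92 mean that gifts are ranked above non-gifts in roughly nine out of ten such pairs.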

