The interaction of sampling ratio and modelling method in prediction of binary target with rare target class
Date
2009-09-14T07:35:48Z
Authors
Hirschowitz, Steven
Abstract
In many practical predictive data mining problems with a binary target, one of the target
classes is rare. In such a situation it is common practice to decrease the ratio of common to
rare class cases in the training set by under-sampling the common class. The relationship
between the ratio of common to rare class cases in the training set and model performance
was investigated empirically on three artificial and three real-world data sets. The results
indicated that a flexible modelling method without regularisation benefits in both mean and
variance of performance from a larger ratio when evaluated on a criterion sensitive to
overfitting, and benefits in mean but not variance of performance when evaluated on a
criterion less sensitive to overfitting. For an inflexible modelling method and a flexible
method with regularisation, the effects of a larger ratio were less consistent. In no
circumstances, however, was a larger ratio found to be detrimental to model performance,
whichever criterion was used to measure it.
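
The under-sampling step described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function name `undersample_common`, the `ratio` parameter (common cases kept per rare case), and the toy labels are all assumptions made for illustration only.

```python
import random


def undersample_common(labels, ratio, rare_label=1, seed=0):
    """Hypothetical helper: return sorted indices of a training subset that
    keeps every rare case and at most ratio * (number of rare cases)
    randomly chosen common cases."""
    rng = random.Random(seed)
    rare_idx = [i for i, y in enumerate(labels) if y == rare_label]
    common_idx = [i for i, y in enumerate(labels) if y != rare_label]
    # Never ask for more common cases than are available.
    n_keep = min(len(common_idx), int(round(ratio * len(rare_idx))))
    kept_common = rng.sample(common_idx, n_keep)
    return sorted(rare_idx + kept_common)


# Toy data (assumed): 20 rare cases among 1000, i.e. a 49:1 ratio.
labels = [1] * 20 + [0] * 980
idx = undersample_common(labels, ratio=5)
n_rare = sum(labels[i] == 1 for i in idx)
print(n_rare, len(idx) - n_rare)  # prints "20 100", a 5:1 training ratio
```

Varying `ratio` while holding the rare cases fixed gives one way to set up the kind of comparison the abstract describes; the modelling methods and evaluation criteria themselves are outside this sketch.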