Random forest on imbalanced data: an application to churn modeling in health insurance

Date
2017-03-27

Advisor(s)

Mendes, Eduardo Fonseca

Abstract
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, with applications in telecommunications, banking, and car insurance, among other areas. One of the main challenges in this problem is that only a small fraction of customers cancel the service, so we have to deal with highly imbalanced classes. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees, where each tree is fit to a subsample of the data constructed using either under-sampling or over-sampling. We compare the different specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random samples with fewer observations than the original data present better overall performance. Random forests also perform better than the classical logistic regression often used by health insurance companies to model churn.
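The per-tree resampling mentioned in the abstract can be illustrated with a minimal sketch. It covers only the step of drawing one tree's training sample by under-sampling or over-sampling, not the tree fitting or the evaluation metrics. The helper name `balanced_bootstrap`, its parameters, and the toy 5% churn rate are assumptions made for illustration; the exact sampling scheme used in the work may differ.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, mode="under", rng=None):
    """Return row indices for one tree's training sample (hypothetical helper).

    mode="under": draw as many rows per class as the smallest class has.
    mode="over":  draw as many rows per class as the largest class has.
    Rows are drawn with replacement, as in an ordinary bootstrap.
    """
    rng = rng or random.Random()

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    sizes = [len(idx) for idx in by_class.values()]
    if mode == "under":
        n_per_class = min(sizes)
    elif mode == "over":
        n_per_class = max(sizes)
    else:
        raise ValueError("mode must be 'under' or 'over'")

    # Draw the same number of rows from every class.
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_per_class))
    rng.shuffle(sample)
    return sample


# Toy labels: 1 = churned, 0 = stayed (made-up 5% churn rate).
labels = [1] * 50 + [0] * 950
rng = random.Random(0)
under = balanced_bootstrap(labels, "under", rng)
over = balanced_bootstrap(labels, "over", rng)
print(len(under), len(over))  # prints: 100 1900
```

In a forest, a sample like this would be drawn independently for each tree. Under-sampling gives every tree a sample smaller than the original data, while over-sampling gives a larger one with repeated minority-class rows.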
