An Innovative Approach to Predicting Continuous Variables: From Regression to Classification

Introduction: When it comes to predicting continuous variables, a regression model is usually the first thing that comes to mind. Linear regression, for instance, is one of the most commonly used regression models, and it is simple to implement and highly interpretable. Random forest regression, on the other hand, can handle missing data and adapts to interactions and nonlinearity. While these algorithms work well for continuous target variables in a range of scenarios, they provide little information about how confident we can be in the predicted values, which matters especially in real-world applications.

In this article, we explore an unconventional framework for predicting continuous variables together with confidence scores. Instead of framing the prediction as a regression problem, we recast it as a classification problem. This framework gives us more visibility into the predicted results and can be adjusted to different confidence levels. We use revenue estimation as the running example: given a variety of business attributes for many businesses, we illustrate how to predict the revenue of each business at a specific confidence level.

From regression to classification

The framework requires more granular data on confidence scores and estimated revenue, which we can obtain by transforming the problem into a Binary Relevance (BR) problem. BR decomposes the target variable into a set of binary classification tasks, with a model learned independently for each task (Madjarov et al., 2012¹).
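
To make the decomposition concrete, here is a minimal sketch of how a continuous revenue target might be split into binary tasks. It assumes that each binary task asks whether revenue is at least some threshold. That choice, the helper name `decompose_target`, the threshold values, and the sample revenues are all hypothetical illustrations, not details taken from the text above.

```python
from typing import Dict, List, Sequence


def decompose_target(
    revenue: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, List[int]]:
    """Turn one continuous target into one binary label vector per threshold.

    Assumption: a label is 1 when revenue is at least the threshold, else 0.
    Each label vector would then be the target of its own binary classifier.
    """
    return {t: [1 if r >= t else 0 for r in revenue] for t in sorted(thresholds)}


# Hypothetical revenues and thresholds, for illustration only.
revenue = [120_000, 480_000, 950_000, 2_300_000]
thresholds = [250_000, 500_000, 1_000_000]

binary_targets = decompose_target(revenue, thresholds)
for threshold, labels in binary_targets.items():
    print(f"revenue >= {threshold}: {labels}")
```

Under this assumption, each threshold yields an independent binary target, so a separate classifier could be trained per threshold and its predicted probability read as a confidence score for "revenue is at least this much".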
