Sage Journals: Discover world-class research

Abstract

This paper introduces the general philosophy of the Operational Data Analytics (ODA) framework for data‐based decision modeling. The fundamental development of this framework lies in establishing the direct mapping from data to decision by identifying the appropriate class of operational statistics. The efficient decision making relies on a careful balance between data integration and decision validation. Through a canonical decision making problem under uncertainty, we show that the existing approaches (including statistical estimation and then optimization, retrospective optimization, sample average approximation, regularization, robust optimization, and robust satisficing) can all be unified through the lens of the ODA formulation. To make the key concepts accessible, we demonstrate, using a simple running example, how some of the existing approaches may become equivalent under the ODA framework, and how the ODA solution can improve the decision efficiency, especially in the small sample regime.

Keywords

data‐integrated decision Operational Data Analytics operational statistics small samples

INTRODUCTION

With the increasingly available data from business operations and the recent emphasis on analytics, significant efforts are made to develop data‐driven or data‐integration approaches to improve business operations (Bertsimas & Kallus, 2020; Feng & Shanthikumar, 2022; Mĭsić & Perakis, 2020; Simchi‐Levi, 2014). As a departure from the traditional decision‐making framework, which assumes full structural and statistical characterizations of the operating systems, the emerging trend is to use data to supplement imperfect structural or statistical knowledge.

In practice, even when the volume of the historical data is large, the number of repeatable instances can be limited. When launching a new product, entering a new market, building a new production line, or offering a new service, firms need to formulate strategies with limited experience. Even for a matured operating process, the external or internal environment may have changed due to unanticipated events. It is thus important for practical purposes that the design of the data analytics approach can produce efficient and robust performance in the small‐sample regime.

The intent of this paper is to provide an overview of a general framework, named the Operational Data Dnalytics (ODA), for predictive and prescriptive analysis based on limited data. The philosophy of the ODA framework emphasizes that the way of data‐integration should directly capture the natural relationship between the decision and the data implied by the ultimate performance measure for the decision‐making problem. Such a direct data integration, through identifying the appropriate class of operational statistics (Chu et al., 2008), must leverage the structural properties of the underlying stochastic decision model to ensure efficient solution quality in a finite sample regime. As it is in the case for any data‐based approach, the input of the ODA framework consists of the knowledge‐centric validation domain and the underlying data‐generation model. The domain of validation contains all potential models, within which we believe the true model lies. The knowledge of the business is translated into structural and statistical assumptions, which define this domain. The data‐generation model determines the amount of information contained in the observed data.

The two pillars, differentiating the ODA framework from the existing approaches, are the data‐integration model and the validating model. There is a delicate trade‐off between these two models in the ODA framework to best leverage the knowledge of the operating system. We present the ODA framework for decision‐making problems that reveal some homogeneous property. Such problems arise in many operational contexts including inventory planning, pricing strategy, quality configuration, and service design (see the examples described in Section 2.3). We show that the existing approaches (e.g., predict and then optimization, retrospective optimization, regularization, robust optimization, and robust satisficing) produce solutions that belong to a class of homogeneous functions of the data. Thus, the existing approaches can be generalized by the ODA framework, within which the data‐integration model is a subclass of homogeneous operational statistics. With the ODA framework, the operational statistics within the homogeneous class is validated against the ultimate performance measure, leading to superior efficiency over the existing solutions. The formulation of both the data‐integration model and the validating model depends on the amount of knowledge we possess about the system (i.e., the knowledge‐centric validation domain), and we demonstrate the application of the ODA framework under different levels of knowledge.

When we do not know anything about the statistical characterization of the system (i.e., in the fully nonparametric setting), we can formulate the data‐integration model by sequentially boosting a given oracle solution (derived from some aforementioned existing approaches). Boosting extends a single given solution to a family of operational statistics within the homogeneous class. Using a sample‐average approximation of the objective to validate the decision within the data‐integration model leads to significantly improved performance over the given oracle solution. When we know the distribution family but not the distributional parameters, it is possible to directly optimize the ultimate performance over the entire homogeneous class. The resulting decision, called the parametric ODA solution, is uniformly optimal over the entire homogeneous class. In other words, the parametric ODA solution dominates those derived from the existing approaches for any sample size. When there are additional unknown parameters, one may first estimate these parameters and then apply the parametric ODA solution. Our numerical experiments suggest that the knowledge of the distributional family can significantly improve the solution quality.

We further prove that no other data‐integrated solution can dominate the homogeneous class in the sense of uniform optimality. Moreover, any solution is inferior to the homogeneous operational statistics in “average” performance. These results provide strong theoretical justification for the ODA data‐integration model of the canonical problem—All existing solutions are special cases of the homogeneous operational statistics, and it is not necessary to go beyond the homogeneous class when developing data‐integrated decisions.

The remainder of the paper is organized as follows. In the next section, we describe a generic decision‐making problem and the philosophy of the ODA framework. In Section 3, we summarize the existing approaches of data integration and identify their common property. Using this property, we show how the existing approaches can be generalized using the nonparametric ODA framework and the value of statistical knowledge in deriving a uniformly optimal solution in Section 4. Section 5 discusses the ODA solution in general contexts, and Section 6 concludes the study.

THE DATA‐DRIVEN DECISION PROBLEM

In this section, we present a generic decision‐making problem under uncertainty and introduce the general philosophy of the ODA framework.

The problem

Consider the following decision‐making problem: $\begin{matrix} y^{*} = arg \max {ϕ (y) = V [ψ (y, X)] : y \in Y}, \end{matrix}$ 1where the decision y is chosen from the feasible set $Y$ and X is a random variable (or vector) with support $supp (X) = X$ . The set $Y$ is often a subset of the Euclidean space and sometimes a subset of a well‐defined functional space. We can think of $ψ (y, x)$ as the payoff or profit when a decision y is made, and the realization of X is x. The value $V [ψ (y, X)]$ is what the decision‐maker uses for evaluating the decision performance. For example, if the interest is the expected profit, then $V = E$ is the expectation taken over the distribution of the uncertain events involved. Though ψ is known explicitly, we often lack the full characterization of V (i.e., ϕ is unknown due to a lack of knowledge of the probability measure to characterize X). Instead, only some partial knowledge, $K$ , of the probability measure of X (e.g., the parametric distribution family, the support $X$ , the first or second moment, or the marginal distributions) and a set of data D from the sample space $D$ are available.

Throughout our discussion, we focus on problems that exhibit some homogeneous property as stated in the assumption below. Assumption 1

For $Y = R_{+}; X = R_{+}$ and some fixed $γ, η \in R$ we have $ψ (c_{0}^{γ} y, c_{0} x) = c_{0}^{η} ψ (y, x), y \in Y; x \in X, \forall c_{0} \in R_{+} .$

The homogeneous property described in Assumption 1 is not uncommon. Many operating systems exhibit this property. We name a few examples. (1)

The newsvendor model. When the realized demand is x and an order quantity of y is chosen, the newsvendor profit is $ψ (y, x) = p \min {y, x} - cx$ , where p is the unit selling price and c is the unit production cost. Clearly, $ψ (c_{0} y, c_{0} x) = c_{0} ψ (y, x)$ .

(2)

The pricing problem. The product demand curve is $D (p) = {(x - β p)}^{γ}, γ > 0$ , where $p \in [\underset{̲}{p}, \bar{p}]$ is the product price and X is the random market potential. The revenue is $ψ (p, x) = p {(x - β p)}^{γ}$ . Clearly, $ψ (c_{0} p, c_{0} x) = c_{0}^{γ + 1} ψ (p, x)$ . The pricing problem is widely studied in the revenue management context (e.g., Besbes & Zeevi, 2009).

(3)

The quality choice. Consider a production process with random yield. A quality level of y leads to random output quality that generates a revenue $X y$ . The cost of choosing quality y is $c y^{1 + 1 / γ}, γ > 0$ , where c is some cost coefficient. Then, the profit of this decision‐making problem is $ψ (y, x) = x y - c y^{1 + 1 / γ}$ . It is easy to show that $ψ (c_{0}^{γ} y, c_{0} x) = c_{0}^{γ + 1} ψ (y, x)$ . This model has been applied in the marketing literature to study product positioning (e.g., Banker et al., 1998) and salesforce effort (e.g., Chu & Lai, 2013).

(4)

The queueing system. Consider a $G / G / c / k$ service system with c servers and k buffers. The interarrival time distribution is $F_{A} (\cdot | Λ)$ , and the interservice time distribution is $F_{S} (\cdot | μ)$ , where Λ is the unknown arrival rate and μ is the service rate. The time‐average profit is $ψ (μ, Λ) = (p - b) λ_{e} (μ, Λ) / (c μ) - w L_{q} (μ, Λ)$ , where $λ_{e} (μ, Λ)$ is the steady‐state effective arrival rate and $L_{q} (μ, Λ)$ is the steady‐state queue length. One can show that $ψ (c_{0} μ, c_{0} Λ) = ψ (μ, Λ)$ . The $G / G / c / k$ queue is a critical building block of service speed design (see, e.g., Anand et al., 2011; Burnetas, 2022).

As our discussion unfolds, it will become clear that the property of the decision‐making problem is essential to developing the data‐integration model and to derive an efficient decision.

An example: Estimation of the mean

To introduce the philosophy of the ODA framework, we take the example of estimating the mean of a random variable.

Example 1 Mean estimation

Consider a random variable X defined in space $(Ω, F, P)$ . We obtain n independently and identically distributed (i.i.d.) observations $D = \vec{X} = {X_{1}, X_{2}, …, X_{n}}$ from the cumulative distribution of X, denoted by $F_{X}$ , which we do not know. We would like to find the mean $E [X]$ . That is, the decision to choose an estimate for $E [X]$ . A commonly used estimator of $E [X]$ is the sample average $\begin{matrix} {\hat{μ}}_{\vec{X}} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} . \end{matrix}$ This estimator is unbiased and consistent. There are, certainly, other candidate estimators (Halmos, 1946). How to choose an appropriate estimate depends on many considerations including the knowledge we have about $F_{X}$ and the ultimate goal of the estimation.

Our knowledge is specified as the statistical domain of validation. Here, we assume that $F_{X}$ has a finite mean and a finite variance, that is, $\begin{matrix} F_{v} & = & {F_{X} : R \to [0, 1] : \lim_{x \to - \infty} F_{X} (x) = 0, \\ \lim_{x \to \infty} F_{X} (x) & = & 1, | E [X] | < \infty, Var [X] < \infty} . \end{matrix}$ 2In addition, we may have some statistical knowledge $K$ of $F_{X}$ . For example, we may know the distribution of $X / E [X]$ or $X - E [X]$ , or the support $X$ . The knowledge is used to refine $F_{v}$ , leading to the domain of validation $\begin{matrix} D_{v} = F_{v} \cap K . \end{matrix}$ We recognize that any estimate is a statistic, that is, a function of the data. The potential choice set of a statistic is $\begin{matrix} \bar{G} = {y : R^{n} \to R} . \end{matrix}$ An appropriate selection of an element $y (\cdot)$ within $\bar{G}$ depends on the objective of the estimation. We may choose some function ψ that measures the risk (Wald, 1949) due to the deviation of the estimate from the true mean, that is, $E [ψ (y (\vec{X}), E [X])]$ . Suppose we apply the widely used squared error, that is, $ψ (y (\vec{X}), E [X]) = - E [L (y (\vec{X}), E [X])]$ , where $L (y, x) = {(y - x)}^{2}$ , as the validation model. Then the estimation problem becomes $\begin{matrix} \min {E [L (y (\vec{X}), E [X])], y \in \bar{G}} . \end{matrix}$ 3That is, the problem of estimating the mean boils down to choosing a statistic y from $\bar{G}$ , so that the expected loss is minimized. We should emphasize that the expected loss, that is, the first expectation in (3), is evaluated over the random sample $\vec{X}$ . It is apparent that we cannot directly compute the expected loss in the validation model as we do not know the exact value of $E [X]$ . Instead, we have to approximate the validation model by some validating model. One way is to replace $E [X]$ in (3) by the observed sample $\vec{X}$ and replace the expectation by the sample average. This leads to the empirical loss function: $\begin{matrix} \hat{L} (y (\vec{X}), \vec{X}) = \frac{1}{n} \sum_{i = 1}^{n} L (y (\vec{X}), X_{i}) = \frac{1}{n} \sum_{i = 1}^{n} {(y (\vec{X}) - X_{i})}^{2} . \end{matrix}$ 4Then, the estimation problem is approximated to $\begin{matrix} \min {\hat{L} (y (\vec{X}), \vec{X}), y \in \bar{G}} . \end{matrix}$ 5It is easy to see that the solution to this problem is $y_{ES} (\vec{X}) = {\hat{μ}}_{\vec{X}}$ .

When deriving the above solution, we have used the empirical loss to approximate the objective of the validation model to deal with the challenge of unknown $E [X]$ . An alternative to this approach is to restrict our attention to a smaller class of statistics: $\begin{matrix} G_{SCA} = \{y \in \bar{G} : y (\vec{x}) = α \frac{1}{n} \sum_{i = 1}^{n} x_{i} = α {\hat{μ}}_{\vec{x}}, α \in R_{+}\} \subset \bar{G} . \end{matrix}$ 6The choice of $G_{SCA}$ is intuitive—It gives the freedom of choosing a scaled average instead of fixing the scaling factor to be one. The corresponding validation model becomes $\begin{matrix} \min {E [L (y (\vec{X}), E [X])], y \in G_{SCA}}, \end{matrix}$ 7with the first expectation taken over both the random sample $\vec{X}$ . Although $y_{ES} (\vec{X}) = {\hat{μ}}_{\vec{X}} \in G_{SCA}$ , the optimal statistic for this problem is (a special case of the problem analyzed in Section 4.2) $\begin{matrix} y_{SCA} (\vec{X}) = \frac{1}{1 + \frac{Cv {[X]}^{2}}{n}} {\hat{μ}}_{\vec{X}}, \end{matrix}$ 8where $Cv [X]$ is the coefficient of variation of X. This is the estimate proposed by Searls (1964). The optimal scaled statistic y _SCA is in general not the same as $y_{ES} (\vec{X})$ . For example, if X follows exponential distribution, then the domain of validation becomes $D_{v} = {F_{X} \in F_{v} : F_{X} (x) = 1 - e^{- λ x}, x \in R_{+}, λ \in R_{+}}$ , and the optimal scaled statistic becomes $y_{SCA} (\vec{X}) = \frac{n}{n + 1} {\hat{μ}}_{\vec{X}} \neq {\hat{μ}}_{\vec{X}} = y_{ES} (\vec{X})$ . Thus, $y_{ES} (\vec{X})$ is inadmissible (i.e., inferior to some solution within $G_{SCA}$ in minimizing the expected loss), while $y_{SCA} (\vec{X})$ is uniformly optimal within the scaled class $G_{SCA}$ .

The above discussion is summarized in Figure 1. From this example, we observe that validating the solutions within the general space of statistics is usually impossible (i.e., the problem (3) cannot be solved) due to the lack of statistical knowledge. To derive a solution, we may either approximate the validation model (i.e., the objective) with some validating model, which leads to the solution $y_{ES} (\vec{X})$ , or restrict to a smaller class of statistics $G_{SCA}$ within which a uniformly optimal statistic can be identified. Though $y_{ES} (\vec{X}) \in G_{SCA}$ , the uniformly optimal solution in the class $G_{SCA}$ is not $y_{ES} (\vec{X})$ . It is interesting to note that the uniformly optimal solution $y_{SCA} (\vec{X}) \in G_{SCA}$ is a biased estimator for any finite sample, while the unbiased estimator is inadmissible (which is dominated by the biased estimator). The reader is further referred to Remark 1 in the Supporting Information for a discussion on how $y_{SCA} (\vec{X})$ compares against the well‐known Stein's unbiased risk estimate.

FIGURE 1

Mean estimation.

An overview of the ODA framework

Referring to the example of mean estimation, we can formalize the key elements in the ODA framework and highlight its differences from the conventional approaches. When the full characterization of the model defined in (1) is not available and data D are collected instead, we need to articulate (i) the domain of validation and (ii) the data‐generation model. The domain of validation $D_{v}$ is determined based on our knowledge of the system. It defines the candidate set of models within which we believe the true model lies. The data‐generation model explains how the observed data D is produced from the underlying model. In the mean‐estimation example, we believe that $D = \vec{X}$ is generated by taking i.i.d. draws from the true distribution $F_{X}$ .

The decision‐making problem boils down to selecting a statistic $y (D)$ of the data D. The decision $y (D)$ is called an operational statistic because it is an implementable statistic, and it is chosen from the functional space $\bar{G} = {y : D \to Y}$ . The ultimate goal of this decision‐making problem is to maximize the performance measure $ϕ (y) = V [ψ (y (D), X)]$ against the data D, that is, $\begin{matrix} \max {ϕ (y) : y \in \bar{G}} . \end{matrix}$ 9As suggested by the mean‐estimation example, it is generally impossible to solve the validation model defined in (9) because we cannot fully evaluate V as we do not know $F_{X}$ . Thus, we need to find ways to utilize the data D to substitute the missing characterization of $F_{X}$ .

The conventional approach is to approximate the payoff ϕ based on the observed D. In the mean‐estimation example, we have used the empirical loss $\hat{L}$ to approximate the objective in the validation model; recall Figure 1. Alternatively, we may choose the robust loss (i.e., ${E_{F} [X] : L (F, {\hat{F}}_{X | \vec{X}}) \leq β}$ , where ${\hat{F}}_{X | \vec{X}}$ is the empirical distribution based on the data $\vec{X}$ and $β > 0$ is some threshold) to replace the loss function L, and then derive the optimal solution of the approximated problem. Different approaches such as Bayesian estimation, retrospective optimization, sample average, regularization (including decision‐dependent penalty coefficient), and robust optimization (including decision‐dependent uncertainty set) fit into this framework. As we see from the mean‐estimation example, the resulting solution, though often consistent, can be inadmissible.

In an essential departure from the conventional approaches, the ODA approach recognizes that there may be some desired property of the operational statistics inherent in the decision‐making problem. Such a property allows us to define how data should be mapped into the decision, leading to the data‐integration model. In other words, the ODA data‐integration model identifies the potential statistics $G_{Y} \subset \bar{G}$ of interest, within which we would choose an “optimal” solution. As we see from the mean‐estimation example, the formulation of $G_{Y}$ should depend on our knowledge of the system, and thus the domain of validation. When implementing an operational statistic $y_{Y} \in G_{Y}$ , we would obtain a payoff $ψ (y_{Y} (D), X)$ . Ideally, we would like to choose the best $y_{Y} \in G_{Y}$ , that is, the one that solves the following optimization problem: $\begin{matrix} y_{Y}^{*} = arg \max {ϕ (y_{Y}) = V [ψ (y_{Y} (D), X)] : y_{Y} \in G_{Y}} . \end{matrix}$ 10This is the validation model. It is easy to see that the objective function, $ϕ (y_{Y}) = V [ψ (y_{Y} (D), X)]$ , is the actual value we obtain by implementing an operational statistic $y_{Y}$ . Note that the mapping V (e.g., the expectation when $V = E$ ) is taken over all possible realizations of the data D and the underlying random parameter X.

When we have sufficient knowledge $K$ of $F_{X}$ , we may appropriately choose the data‐integration model $G_{Y}$ so that a uniformly optimal solution $y_{Y}^{*}$ can be derived. In the mean‐estimation example, we have seen that y _SCA defined in (8) optimizes the validation model for the data‐integration model $G_{SCA}$ defined in (6), implying that the sample average is inadmissible. When we do not have sufficient knowledge, however, we have to appropriately select a validating model, say $\hat{ϕ} (y_{Y})$ , which we can evaluate with the knowledge we have to surrogate the validation model. As we have seen from the mean‐estimation example, it is important to note that the choices of the validating model $\hat{ϕ} (y_{Y})$ and the data‐integration model $G_{Y}$ must be coordinated to ensure the quality of the solution: $\begin{matrix} {\hat{y}}_{Y}^{*} = arg \max {\hat{ϕ} (y_{Y}) : y_{Y} \in G_{Y}} . \end{matrix}$ 11In summary, the ODA framework consists of two pillars: (i)

Data integration. We recognize that the decision is eventually a function or a statistic of the data. Given that the data are random, the solution belongs to a class of functionals of measures. The class $G_{Y}$ of operational statistics must fully explore the domain knowledge to capture the desirable structure connecting the data and the decision. The appropriate formulation of the data‐integration model is essential to ensure decision efficiency in a small sample regime.

(ii)

Decision validation. The decision must be validated against the ultimate performance measure (e.g., expected profit or utility). Ideally, we would like to choose the best operational statistic for the validation model. This may not be possible due to the lack of knowledge. Instead, we identify the appropriate validating models as approximations of the validation model. The validating models should be evaluated with the partial knowledge $K$ and the data D we have about the system.

The development of the ODA framework for specific applications should make sure that the data‐integration model and the decision validation are matched with each other, as we have seen from the mean‐estimation model. The ODA framework produces a solution $y_{Y} \in G_{Y}$ , which is a function over the space of observable samples. To implement this solution, we collect the data d (a specific realization of D) and execute the decision $y_{Y} (d)$ , as illustrated in Figure 2.

FIGURE 2

The ODA framework.

It is worth clarifying how the ODA framework differs from the conventional approaches (see Section 3). In a nutshell, the conventional approaches focus on identifying ways to modify or correct the objective function given the partial knowledge of the system, and then a decision is derived based on optimizing the adjusted objective. The ODA framework, in contrast, emphasizes the direct relationships between the data and the decision, restricts the data‐decision relationship based on partial knowledge, and then identifies the best function that maps the decision to the data through validating. The existing approaches can be thought of as special cases of the ODA approach and can be unified under the ODA approach. For example, we will see that, while the traditionally penalized robust optimization and robust satisficing are seen as different methods, within the framework of ODA, they become equivalent under mild conditions.

Another subtle, yet important, distinction of the ODA approach, compared with the conventional approaches, lies in the treatment of the solution space as a functional space (i.e., through the set of operational statistics). Though such a difference does not seem apparent in the form of the eventually implemented solutions, it determines the level of “flexibility” that one has in “optimizing” the decision. This is evident from the mean‐estimation example in the previous subsection.

THE EXISTING APPROACHES

In this section, we discuss several existing prescriptive solutions of (1) and summarize some common properties of the solutions derived.

We focus on the situation where X is a random variable. The data consist of n i.i.d. observations $\vec{X} = {X_{i}, i = 1, 2, …, n}$ of X. Let $\begin{matrix} {\hat{μ}}_{\vec{X}} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} and {\hat{σ}}_{\vec{X}} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{μ}}_{\vec{X}} - X_{i})}^{2}} \end{matrix}$ be the sample average and sample standard deviation, respectively. We also use ${\hat{F}}_{X | \vec{X}}$ to denote the empirical distribution function of X, that is, ${\hat{F}}_{X | \vec{X}} (x) = \frac{1}{n} \sum_{i = 1}^{n} I_{{X_{i} \leq x}} \sim B i n o m i a l (n, F_{X} (x))$ , and the empirical probability mass function of X is $\begin{matrix} {\hat{f}}_{X | \vec{X}} (x) = \frac{\sum_{j = 1}^{n} I_{{x = X_{j}}}}{n} . \end{matrix}$ 12To make the idea explicit and to compare the results across different approaches, we will use a simple example described below throughout our discussion.

The running example: quality design

Suppose that a custom‐made product built with a quality level of y generates a random revenue $R = X y$ , where X is a nonnegative random variable with support $supp (X) = X \subseteq R_{+}$ , cumulative distribution function $F_{X}$ , density or mass function $f_{X}$ , mean $E_{F_{X}} [X] = μ_{X}$ , variance ${Var}_{F_{X}} [X] = \int_{x = 0}^{\infty} {(x - μ_{X})}^{2} d F_{X} (x) = σ_{X}^{2}$ , and coefficient of variation ${Cv}_{F_{X}} [X] = σ_{X} / μ_{X}$ . The profit generated from a product with quality y is $\begin{matrix} ψ (y, X) = X y - c y^{1 + \frac{1}{γ}}, y \geq 0, \end{matrix}$ where $c y^{1 + 1 / γ}$ (with $γ \geq 1$ ) is the production cost at quality level y. Note that the support $X$ could be continuous, discrete, or mixed. The expected profit is $\begin{matrix} ϕ (y, f_{X}) = E_{f_{X}} [ψ (y, X)] = μ_{X} y - c y^{1 + \frac{1}{γ}}, y \geq 0 . \end{matrix}$ We have sold n products to n customers at quality levels $y_{i}, i = 1, 2, …, n$ , and earned revenues $R_{i}, i = 1, 2, …, n$ , correspondingly. Then, $\vec{X} = (X_{i} = R_{i} / y_{i}, i = 1, 2, …, n)$ is a vector of n i.i.d. draws of X. Our goal is to find a decision y that maximizes the expected profit of a future product.

Basic statistical methods

A big portion of the existing data‐driven modeling applies some statistical or machine learning methods to estimate the distribution $F_{X}$ or the profit function ϕ based on the data. Then the decision is derived by optimizing the estimated profit function. Recent examples can be found in Biggs et al. (2023), Chuang et al. (2023), Lei et al. (2023), and Saldanha et al. (2023). In these studies, the development of forecasting or predictive models is critical for decision efficiency.

Predict and then Optimize (PTO).

The classical way is to first estimate the unknown parameters based on the data, and then optimize the value computed using the estimated parameters to derive the optimal decision. Specifically, the theoretically optimal decision with the knowledge of $F_{X}$ is $\begin{matrix} y^{} [F_{X}] = arg \max {ϕ (y) : y \in Y} . \end{matrix}$ Because we do not fully know $F_{X}$ , we need to use the data, $\vec{X}$ , to generate an estimator. Commonly used statistical methods include the minimum variance estimator, the maximum likelihood estimator, the method of moments, and the Bayesian estimator.

In our running example, $y^{} [F_{X}] = arg \max {E [X] y - c y^{1 + 1 / γ} : y \geq 0} = {(\frac{E [X]}{c} \frac{γ}{1 + γ})}^{γ}$ . To implement this solution, we need to know the mean $E [X]$ . If we use the sample average ${\hat{μ}}_{\vec{X}}$ to replace the mean $E [X]$ , then the implemented solution is $\begin{matrix} y_{PTO} (\vec{X}) = {(\frac{{\hat{μ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ} . \end{matrix}$ 13We may use alternative statistical methods to obtain an estimate, depending on our knowledge of $F_{X}$ . For example, suppose we know $F_{X}$ up to some parameter θ. That is, we know ${F_{X} (\cdot | θ), θ \in S_{Θ}}$ but not the exact value of $θ \in S_{Θ}$ . Then, we can apply the Bayesian estimation and derive the posterior density for a given prior of the parameter $f_{Θ} (\cdot)$ as $\begin{matrix} f_{X}^{Bayes} (x | \vec{X}) & = & \int_{η \in S_{Θ}} f_{X} (x | η) \frac{f_{Θ} (η) Π_{i = 1}^{n} f_{X} (X_{i} | η)}{\int_{α \in S_{Θ}} f_{Θ} (α) Π_{i = 1}^{n} f_{X} (X_{i} | α) d α} d η, \\ x & \in & X, \end{matrix}$ 14which is the Bayesian density of $f_{X}$ . Thus, the estimated profit function becomes $\begin{matrix} ϕ_{BO} (y, f_{Θ} | \vec{X}) = \int_{x \in X} ψ (y, x) f_{X}^{Bayes} (x | \vec{X}) d x, y \in Y . \end{matrix}$ The optimized solution with the Bayesian estimate is $y_{BO} (\vec{X}, f_{Θ}) = arg \max {ϕ_{BO} (y, f_{Θ} | \vec{X}) : y \in Y}$ .

Retrospective Optimization (RTO).

Taking the retrospective view, we ask the question: If we could go back in time and choose some decision y for the past n instances of the problem, how could we make the most profit? To answer this question, we need to estimate the profit as a function of y based on the data we have collected. If the realization of X is observable, for each realized $X_{i}$ and the corresponding decision $y_{i}$ , we obtain a profit of $ψ_{i} = ψ (y_{i}, X_{i})$ . In this case, we know $ψ (y, X_{i})$ explicitly given that we know ψ. If there is censoring, latency, or endogeneity, we may observe, instead of X, some value of $ξ (y, X)$ for a given decision y. In this case, the expectation of ψ as a function of the decision for the observed data cannot be explicitly computed. With the observations $ξ_{i} = ξ (y_{i}, X_{i}), i = 1, 2, …, n$ , we need to estimate $E [ψ (y, X) | (ψ_{i}, ξ_{i}, y_{i}) : i = 1, 2, …, n)]$ . A relevant situation is discussed in Section 5.1.

The retrospective objective for our problem with observable $\vec{X}$ , which is the average profit that we could have earned with a decision y, is $\begin{matrix} ϕ_{RTO} (y, \vec{X}) = \frac{1}{n} \sum_{i = 1}^{n} ψ_{i} = \frac{1}{n} \sum_{i = 1}^{n} ψ (y, X_{i}), y \geq 0, \end{matrix}$ and the optimal retrospective decision is $\begin{matrix} y_{RTO} (\vec{X}) = arg \max {ϕ_{RTO} (y, \vec{X}), y \geq 0} . \end{matrix}$ In our running example, we can observe $\vec{X}$ and thus the predictive model for customer i is ${\tilde{ψ}}_{i} (y) = ψ (y, X_{i})$ . Hence, the retrospective objective becomes $\begin{matrix} ϕ_{RTO} (y, \vec{X}) = {\hat{μ}}_{\vec{X}} y - c y^{1 + \frac{1}{γ}}, y \geq 0, \end{matrix}$ and the optimal retrospective quality level is $y_{RTO} (\vec{X}) = {(\frac{{\hat{μ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ}$ .

Empirical Optimization (Em).

We can use the available data to estimate the empirical distribution function in (12) and use it to replace the true distribution in computing the optimal decision (e.g., Kleywegt et al., 2002; Smith & Winkler, 2006). The empirical profit is $\begin{matrix} ϕ_{Em} (y, \vec{X}) = E_{{\hat{f}}_{X | \vec{X}}} ψ (y, X) . \end{matrix}$ We may also generate a smoothed empirical density using Kernel estimation. The smooth Parzen estimator of the density for X is $\begin{matrix} {\hat{f}}_{X | \vec{X}}^{P} (x) = \frac{1}{n λ} \sum_{i = 1}^{n} κ_{λ} (x, x_{i}), \end{matrix}$ where $λ > 0$ is the width and $κ_{λ} (x, x_{0}) = (1 / \sqrt{2 π}) e^{- \frac{| x_{0} {- x |}^{2}}{2 λ^{2}}}$ is the Gaussian kernel.

In our running example, the empirical optimization leads to the quality level $y_{Em} (\vec{X}) = {(\frac{{\hat{μ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ}$ .

Sample‐Average Approximation (SAA).

One way to come up with a solution for the problem is to use the sample average to approximate the expectation in the objective. This method has been widely applied (e.g., Charikar et al., 2005; Homem‐de‐Mello & Bayraksan, 2014; Levi et al., 2007; Swamy & Shmoys, 2005). Specifically, the sample average of the objective is $\begin{matrix} ϕ_{SAA} (y, \vec{X}) = \frac{1}{n} \sum_{i = 1}^{n} ψ (y, X_{i}) . \end{matrix}$ In the running example, we can obtain the optimal SAA solution as $\begin{matrix} y_{SAA} (\vec{X}) = {(\frac{{\hat{μ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ} . \end{matrix}$ 15We observe that in the running example, $y_{SAA} (\vec{X}) = y_{PTO} (\vec{X}) = y_{RTO} (\vec{X}) = y_{Em} (\vec{X})$ . In general, these solutions may be different. For example, an alternative estimate of the mean or an alternative predictive model can make y _RTO different from y _PTO and y _SAA. Also, when there is censoring, latency, or endogeneity in the data (e.g., Huh et al., 2009), y _RTO can be different from y _PTO, while the SAA estimate may not be properly defined.

A major shortcoming of the basic statistical methods lies in the risk associated with the random sample. Especially when the sample size is small, the statistical estimates can perform poorly. To address the issue of overfitting, two approaches are widely used, namely regularization and robust optimization, which we describe briefly in the next two subsections.

Regularization of empirical distribution (Re)

To regularize the estimation, a common way is to penalize solutions that lead to a high variance or standard deviation (e.g., Ho & Hanasusantoy, 2022). The variance of the empirical profit is $\begin{matrix} {Var}_{{\hat{f}}_{X | \vec{X}}} [ψ (y, X)] = \frac{1}{n} \sum_{i = 1}^{n} {(ψ (y, X_{i}) - ϕ_{Em} (y, \vec{X}))}^{2} . \end{matrix}$ Then, the regularized objective, for some suitably chosen $β \in R_{+}$ , can be one of the following: $\begin{matrix} ϕ_{Re : Var} (y, \vec{X}, β) & = & ϕ_{Em} (y, \vec{X}) - β {Var}_{{\hat{f}}_{X | \vec{X}}} [ψ (y, X)], \\ ϕ_{Re : Std} (y, \vec{X}, β) & = & ϕ_{Em} (y, \vec{X}) - β \sqrt{{Var}_{{\hat{f}}_{X | \vec{X}}} [ψ (y, X)]} . \end{matrix}$ In the running example, we have $ϕ_{Em} (y, \vec{X}) = {\hat{μ}}_{\vec{X}} y - c y^{1 + 1 / γ}$ and ${Var}_{{\hat{f}}_{X | \vec{X}}} [ψ (y, X)] = {\hat{σ}}_{\vec{X}}^{2} y^{2}$ . Therefore, the regularized objectives become $\begin{matrix} ϕ_{Re : Var} (y, \vec{X}, β) & = & {\hat{μ}}_{\vec{X}} y - c y^{1 + \frac{1}{γ}} - β {\hat{σ}}_{\vec{X}}^{2} y^{2}, \\ ϕ_{Re : Std} (y, \vec{X}, β) & = & {\hat{μ}}_{\vec{X}} y - c y^{1 + \frac{1}{γ}} - β {\hat{σ}}_{\vec{X}} y . \end{matrix}$ The regularized solutions are $\begin{matrix} y_{Re : Var} (\vec{X}, β) & \in & \{y : {\hat{μ}}_{\vec{X}} - c \frac{1 + γ}{γ} y^{\frac{1}{γ}} - 2 β {\hat{σ}}_{\vec{X}}^{2} y = 0\}, \end{matrix}$ 16 $\begin{matrix} y_{Re : Std} (\vec{X}, β) & = & {(\frac{{\hat{μ}}_{\vec{X}} - β {\hat{σ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ} . \end{matrix}$ 17

Robust optimization (Ro)

Although the empirical distribution ${\hat{F}}_{X | \vec{X}}$ could be a good guess for the true distribution $F_{X}$ , optimizing the payoff evaluated based on this guess may lead to inefficient decisions (Smith & Winkler, 2006). Robust optimization (Ro) (Gilboa & Schmeidler, 1989; Lim et al., 2006; Lu & Shen, 2021; Zhao et al., 2018) is a popular approach to address such an issue. The idea is to balance the information and the risk involved in the empirical distribution. For that, we need to define a measure for the closeness between the guess and the truth. This measure determines an uncertainty set for potential models (e.g., Ben‐Tal et al., 2006; Bertsimas et al., 2018; Liu et al., 2023). The uncertainty set reduces to the range of unknown parameters in the parametric setting (e.g., Cao & Gao, 2021; Y. Wang, Zhang, et al., 2023; Zhu et al., 2022). When we do not know the distribution family of $F_{X}$ , the uncertainty set contains a collection of distribution functions, which is the situation we discuss here.

We will consider two general approaches to define the closeness of a probability distribution function to the empirical distribution. One, using divergence measures, evaluates the distance of distribution functions over the support of the empirical distribution. The other, using the mass transport measures, computes the transfer of probability mass over the support of the true distribution.

Robust optimization with divergence measures

To define an uncertainty set of distributions that are close enough to the empirical distribution, we need to apply some measure of deviation between two distributions. For demonstration, we use the popular h‐divergence measure of f from ${\hat{f}}_{X | \vec{X}}$ defined as $\begin{matrix} d_{h - D} (f, {\hat{f}}_{X | \vec{X}}) = \sum_{x \in X_{\vec{X}}} h (\frac{f (x)}{{\hat{f}}_{X | \vec{X}} (x)}) {\hat{f}}_{X | \vec{X}} (x), \end{matrix}$ 18where h is a convex function with $h (1) = 0$ and $X_{\vec{X}}$ is a set such that $supp (f) \subseteq supp ({\hat{f}}_{X | \vec{X}}) \subseteq X_{\vec{X}}$ . We shall note that other distance measures like the Wasserstein metric (e.g., Mohajerin Esfahani & Kuhn, 2018) can also be applied to form the uncertainty set; see Section 3.4.

Given that h is convex, it is easy to see (by applying Jensen's inequality) that $\begin{matrix} d_{h - D} (f, {\hat{f}}_{X | \vec{X}}) \geq h (\sum_{x \in X_{\vec{X}}} \frac{f (x)}{{\hat{f}}_{X | \vec{X}} (x)} {\hat{f}}_{X | \vec{X}} (x)) = h (1) = 0 . \end{matrix}$ That is, the h‐divergence measure is positive and takes a value of zero only when $f = {\hat{f}}_{X | \vec{X}}$ . However, this measure is critically dependent on the values that the probability mass function is defined over.

For illustration, suppose that the support of ${\hat{f}}_{X | \vec{X}}$ takes k values (i.e., $supp ({\hat{f}}_{X | \vec{X}}) = {{\tilde{x}}_{1}, {\tilde{x}}_{2}, …, {\tilde{x}}_{k}} \subseteq X_{X}$ ), and we use the example of $h (x) = {(1 - x)}^{2}, x \geq 0$ , which gives the Chi‐square distance $\begin{matrix} d_{χ^{2}} (f, {\hat{f}}_{X | \vec{X}}) = \sum_{i = 1}^{k} {(1 - \frac{f ({\tilde{x}}_{i})}{{\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i})})}^{2} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) = \sum_{i = 1}^{k} δ_{i}^{2} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}), \end{matrix}$ where $\begin{matrix} δ_{i} \equiv 1 - \frac{f ({\tilde{x}}_{i})}{{\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i})}, i = 1, 2, …, k . \end{matrix}$ 19We can verify that the domain of the vector $\vec{δ} = {δ_{i}}_{i = 1}^{k}$ is $Δ ({\hat{f}}_{X | \vec{X}})$ , where $\begin{matrix} Δ (f) & = & \{\vec{δ} \in R_{+}^{k} : - (1 / f ({\tilde{x}}_{i}) - 1) \leq δ_{i} \leq 1, \\ i & = & 1, 2, …, k, \sum_{i = 1}^{k} δ_{i} f ({\tilde{x}}_{i}) = 0\} \end{matrix}$ is defined for any probability mass function f over the domain ${{\tilde{x}}_{1}, {\tilde{x}}_{2}, …, {\tilde{x}}_{k}}$ . Then, the maximum Chi‐square distance with respect to such probability mass function f is $\begin{matrix} d_{χ^{2}}^{\max} (f) = \max \{\sum_{i = 1}^{k} δ_{i}^{2} f ({\tilde{x}}_{i}) : \vec{δ} \in Δ (f)\} = \frac{(1 - f^{\min})}{f^{\min}}, \end{matrix}$ where $f^{\min} = \min {f ({\tilde{x}}_{i}), i = 1, 2, …, k}$ . We can choose a value $d_{0} \in [0, d_{χ^{2}}^{\max} ({\hat{f}}_{X | \vec{X}})]$ to determine the set of distributions that are close to ${\hat{f}}_{X | \vec{X}}$ . Specifically, let $F_{v}$ denote the set of all densities and $\begin{matrix} F_{χ^{2}} ({\hat{f}}_{X | \vec{X}}, d_{0}) & = & \{f \in F_{v} : supp (f) \in X_{\vec{X}}, d_{χ^{2}} (f, {\hat{f}}_{X | \vec{X}}) \leq d_{0}\}, \\ 0 & \leq & d_{0} \leq d_{χ^{2}}^{\max} ({\hat{f}}_{X | \vec{X}}) . \end{matrix}$ In other words, we believe that the true distribution of X lies within d ₀, in terms of the Chi‐square divergence measure, from the empirical distribution.

It is worth noting that a shortcoming of this approach is that unless $X_{\vec{X}} = X$ (i.e., the data cover the entire domain of the true distribution), $F_{X} \notin F_{χ^{2}} ({\hat{f}}_{X | \vec{X}}, d_{0})$ no matter how we choose d ₀. With the closeness measure defined, there are two essentially equivalent forms of robust optimization commonly analyzed. One directly modifies the objective and the other imposes a constraint.

Penalized Robust Optimization (Ro:d‐P).

Under this approach, we penalize the objective with the deviation from the likely true distribution. Specifically, we solve for some $β \geq 0$ $\begin{matrix} \max_{y} \min_{f} {ϕ (y, f) + β d_{χ^{2}} (f, {\hat{f}}_{X | \vec{X}}) : \\ f \in F_{χ^{2}} ({\hat{f}}_{X | \vec{X}}, d_{χ^{2}}^{\max} ({\hat{f}}_{X | \vec{X}})), y \in Y} . \end{matrix}$ Now go back to the running example. Determining any potentially true distribution F with support $supp (f) \subseteq X_{\vec{X}}$ is equivalent to choosing a vector $\vec{δ}$ in (19) that satisfies $\sum_{i = 1}^{k} δ_{i} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) = 0$ . The penalized robust objective can then be written as $\begin{matrix} ϕ_{Ro : d - P} (y, \vec{δ}, \vec{X}, β) & = & ϕ_{Em} (y, \vec{X}) - y \sum_{i = 1}^{k} ({\hat{μ}}_{\vec{X}} - {\tilde{x}}_{i}) δ_{i} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) \\ + β \sum_{i = 1}^{k} δ_{i}^{2} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) . \end{matrix}$ The distribution minimizing this objective satisfies $δ_{i} (y, β) = \frac{{\hat{μ}}_{\vec{X}} - {\tilde{x}}_{i}}{2 β} y$ . Substituting this into the penalized robust objective function, we obtain $\begin{matrix} ϕ_{Ro : d - P} (y, \vec{δ} (y, β), \vec{X}, β) = ϕ_{Em} (y, \vec{X}) - \frac{1}{4 β} {\hat{σ}}_{\vec{X}}^{2} y^{2} . \end{matrix}$ Thus, the decision maximizing the above expression is $\begin{matrix} y_{Ro : d - P} (\vec{X}, β) \in \{y : {\hat{μ}}_{\vec{X}} - c \frac{1 + γ}{γ} y^{\frac{1}{γ}} - \frac{{\hat{σ}}_{\vec{X}}^{2}}{2 β} y = 0\} . \end{matrix}$ In this example, the penalized robust optimization leads to the same solution as that of the regularization using the variance.

Constrained Robust Optimization (Ro:d‐C).

Under this approach, we restrict the potential set of distribution within a certain distance, β, to the empirical distribution. Specifically, we solve $\begin{matrix} \max_{y} \min_{f} {ϕ (y, f) : f \in F_{χ^{2}} ({\hat{f}}_{X | \vec{X}}, β), y \in Y} . \end{matrix}$ In our running example, the objective is $\begin{matrix} ϕ_{Ro : d - C} (y, \vec{δ}, \vec{X}) = ϕ_{Em} (y, \vec{X}) - y \sum_{i = 1}^{k} ({\hat{μ}}_{\vec{X}} - {\tilde{x}}_{i}) δ_{i} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}), \end{matrix}$ and the constraints are $\begin{matrix} \sum_{i = 1}^{k} δ_{i}^{2} {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) \leq β, \\ 0 \leq (1 - δ_{i}) {\hat{f}}_{X | \vec{X}} ({\tilde{x}}_{i}) \leq 1, i = 1, 2, …, k . \end{matrix}$ Recognizing the fact that the optimal $\vec{δ}$ must make the first constraint binding, we derive $δ_{i} (y, β) = \sqrt{β} \frac{{\tilde{x}}_{i} - {\hat{μ}}_{\vec{X}}}{{\hat{σ}}_{\vec{X}}}$ . Substituting this expression to the objective, we obtain $\begin{matrix} ϕ_{Ro : d - C} (y, \vec{δ} (y, β), \vec{X}) = ϕ_{Em} (y, \vec{X}) - \sqrt{β} {\hat{σ}}_{\vec{X}} y . \end{matrix}$ Thus, the maximizer is $\begin{matrix} y_{Ro : d - C} (\vec{X}, β) = {(\frac{{\hat{μ}}_{\vec{X}} - \sqrt{β} {\hat{σ}}_{\vec{X}}}{c} \frac{γ}{1 + γ})}^{γ} . \end{matrix}$ In this example, the constrained robust optimization leads to the same solution as that of the regularization using the standard deviation.

Robust optimization with mass transport (Ro:m)

An alternative to using the divergence measure is to dispersing the probability mass around the empirical distribution. Let $β > 0$ be the budget of mass that can be moved. A candidate distribution $F : [0, \infty) \to [0, 1]$ deviates from ${\hat{F}}_{X | \vec{X}}$ by at most β amount of mass: $\begin{matrix} F_{m} ({\hat{F}}_{X | \vec{X}}, β) = \{F : \int_{0}^{1} | {\hat{F}}_{X | \vec{X}}^{i n v} (α) - F^{i n v} (α) | d α \leq β\}, \end{matrix}$ where we use $g^{i n v}$ to denote the inverse of a monotone function $g : R \to R$ .

With a mass transport budget of β, the constrained robust optimization problem is $\begin{matrix} \max_{y} \min_{F} {ϕ (y, F) : F \in F_{m} ({\hat{F}}_{X | \vec{X}}, β)} . \end{matrix}$ In our running example, it is easy to see that the worst distribution would simply use the budget to reduce the mean as much as possible, suggesting a mass transport to reduce the mean by β. Therefore, the objective function becomes $\begin{matrix} ϕ_{Ro : m} (y, \vec{X}, β) = ({\hat{μ}}_{\vec{X}} - β) y - c y^{1 + \frac{1}{γ}}, y \geq 0 . \end{matrix}$ We derive $y_{Ro} : m (\vec{X}, β) = {(\frac{{\hat{μ}}_{\vec{X}} - β}{c} \frac{γ}{1 + γ})}^{γ}, β \geq 0 .$

Robust satisficing (RS)

Taking a different view point from the robust optimization, Sim et al. (2022) study the approach of RS. They define an uncertainty set $F_{W} ({\hat{f}}_{X} | \vec{X}, d_{0}) = {f \in F_{v} : d_{W} (f, {\hat{f}}_{X | \vec{X}}) \leq d_{0}}$ using the Wasserstein distance: $\begin{matrix} d_{W} (f_{a}, f_{b}) & = & inf {E_{f_{X_{a}, X_{b}}} [| X_{a} - X_{b} |^{p}] : X_{a} \sim f_{a}, X_{b} \sim f_{b}}, \\ p & \in & (0, 1] . \end{matrix}$ The RS model is $\begin{matrix} \min & κ \end{matrix}$ 20 $\begin{matrix} s.t. τ - E_{f} [ψ (y, X)] & \leq & κ d_{W} ({\hat{f}}_{X | \vec{X}}, f), f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0}), \\ y & \geq & 0 . \end{matrix}$ 21Let $y_{RS} (\vec{X}, τ)$ denote the solution to this model and $κ_{RS} (τ)$ the corresponding objective. Long et al. (2023) observe that the RS model, with appropriately chosen values of τ, may attain a solution that improves Pareto efficiency from the solution of the robust optimization model. We demonstrate the relationship between the two models with our way of representation.

At the optimum, the constraint in (21) holds for any $f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0})$ . We must have $\begin{matrix} τ & \leq & \min_{f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0})} {E_{f} [ψ (y_{RS} (\vec{X}, τ), X)] + κ_{RS} (τ) d_{W} ({\hat{f}}_{X | \vec{X}}, f)} . \end{matrix}$ Because the right‐hand side is strictly increasing in κ, the above constraint must be binding. For fixed τ and $κ = κ_{RS} (τ)$ , $y_{RS} (\vec{X}, τ)$ must maximize the right‐hand side. To see that, suppose there exists a y ₀ such that $\begin{matrix} \min {E_{f} [ψ (y_{0}, X)] + κ_{RS} (τ) d_{W} ({\hat{f}}_{X | \vec{X}}, f) : f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0})} \\ > \min {E_{f} [ψ (y_{RS} (\vec{X}, τ), X)] + κ_{RS} (τ) d_{W} ({\hat{f}}_{X | \vec{X}}, f) : \\ f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0})} = τ . \end{matrix}$ Then, there exists a $κ_{0} < κ_{RS} (τ)$ such that the above inequality holds weakly. This contradicts the fact that $κ_{RS} (τ)$ is the optimal objective. Therefore, we must have $\begin{matrix} y_{RS} (\vec{X}, τ) & = & arg \max {\min_{f \in F_{W} ({\hat{f}}_{X | \vec{X}}, d_{0})} {E_{f} [ψ (y_{RS} (\vec{X}, τ), X)] \\ + κ_{RS} (τ) d_{W} ({\hat{f}}_{X | \vec{X}}, f)}, y \geq 0} . \end{matrix}$ Thus, y _RS solves the penalized robust optimization problem when the hyperparameter is $β = κ_{RS} (τ)$ . We shall note that this relationship does not depend on the explicit form of the distance measure $d_{W}$ .

Crossvalidation

So far, we have discussed several conventional approaches that derive some data‐integrated solution $y_{S} (\vec{X}, β), S \in {PTO$ , Re:Var, Re:Std, Ro:d‐P, Ro:d‐C, Ro:m} for some hyperparameter $β \geq 0$ . Alternatively, we can think of each approach as discussed in Sections 3.2 and 3.3 as a way to identify a specific class of operational statistics: $\begin{matrix} G_{S} = \{g : X^{\times n} \times R_{+} \to Y : g (\vec{x}, β) = y_{S} (\vec{x}, β), β \in B\}, \end{matrix}$ 22where $B \subseteq R_{+}$ is the range of the hyperparameter. The optimal operational statistic $y_{S}^{}$ for method S is obtained by optimizing the validation model: $\begin{matrix} \max {E [ψ (y (\vec{X}), X)] : y \in G_{S}} . \end{matrix}$ There are two commonly used approaches to empirically determine β and then determine the final operational statistic using cross‐validation.
k ‐Fold cross‐validation. We randomly partition the data $\vec{x}$ into k subsets of equal sizes, ${I_{1}, I_{2}, …, I_{k}}$ , with $\cup_{ℓ = 1}^{k} I_{ℓ} = {1, 2, …, n}$ and $I_{ℓ_{1}} \cap I_{ℓ_{2}} = \emptyset$ for $ℓ_{1} \neq ℓ_{2}$ . Define ${\vec{x}}^{ℓ} = {x_{i} : i \notin I_{ℓ}}$ and $I_{- ℓ} = {1, 2, …, n} ∖ I_{ℓ}$ . Then, the hyperparameter β computed by solving $\begin{matrix} \max_{β \in B} \sum_{ℓ = 1}^{k} \sum_{j \in I_{- ℓ}} ψ (y_{S} ({\vec{x}}^{ℓ}, β), x_{j}) . \end{matrix}$

Bootstrapping. From ${\hat{f}}_{X | \vec{X}}$ , we generate k samples, each consisting of n i.i.d. realizations, ${\vec{x}}^{(ℓ)} = {x_{1}^{(ℓ)}, x_{2}^{(ℓ)}, …, x_{n}^{(ℓ)}}, ℓ \in {1, 2, …, k}$ . The parameter β is computed by solving $\begin{matrix} \max_{β \in B} \sum_{ℓ = 1}^{k} \sum_{j = 1}^{n} ψ (y_{S} ({\vec{x}}^{(ℓ)}, β), x_{j}) . \end{matrix}$

An important principle here is to ensure that the data used to form the statistics and those used for validation should be made independent (to the extent possible). There have been many studies comparing the empirical effectiveness of k‐fold cross‐validation and bootstrapping (see, e.g., Arlot, 2010; Efron & Tibshirani, 1997; Kohavi, 1995). In many situations, bootstrapping exhibits better performance over k‐fold cross‐validation. Leave‐one‐out validation, a special case of k‐fold cross‐validation with $k = n$ , demonstrates good predictive and prescriptive performance. Our numerical experiments (see the next section) generally confirm these observations.

Once the value of $β^{} (\vec{X})$ is determined through cross‐validation, the implementable solution $y_{S}^{} (\vec{X}) = y_{S} (\vec{X}, β^{} (\vec{X}))$ is obtained. It is important to recognize that no matter which of cross‐validation method is applied, the solution derived using any of the conventional approaches discussed before satisfies the following property (see the details in Remark 2 in the Supporting Information): $\begin{matrix} y_{S}^{} (α \vec{X}) = α^{γ} y_{S}^{} (\vec{X}) for α > 0 and some constant γ > 0 . \end{matrix}$ 23As a direct consequence of Assumption 1, this relationship makes intuitive sense and is essential to develop the ODA solution.

GENERALIZATION WITH ODA

In the previous section, we have seen that the existing data‐integrated approaches can be viewed as special cases of the ODA framework with the data‐integration model being some specific functions parameterized by β and the validating model being the average profit used in cross‐validation.

Specifically, from Equation (23), all the solutions derived using the methods discussed in the previous section are specific elements of the following class of homogeneous operational statistics: $\begin{matrix} G_{H} = {g : X^{\times n} \to Y, g (α \vec{x}) = α^{γ} g (\vec{x}), α \in R_{+}} . \end{matrix}$ 24Thus, it makes sense to seek solutions within this class. Under the ODA framework, we need to choose a subclass from $G_{H}$ to form the data‐integration model. This choice depends on our past experience in operating the system and our knowledge of $F_{X}$ . We can unify the existing solutions with the ODA framework using sequential boosting.

Improvement with sequential boosting

Based on Assumption 1, the homogeneous property of the payoff function, it is expected that the decision should exhibit some scaling property with respect to the data. Specifically, if the data are scaled by c ₀ (i.e., define $Z =^{d} X / c_{0}$ ), then the decisions should be scaled correspondingly: $\begin{matrix} y^{} [F_{Z}] & = & arg \max {E [ψ (y, Z)], y \geq 0} \\ = & c_{0}^{γ} arg \max {E [ψ (y, X)], y \geq 0} = c_{0}^{γ} y^{} [F_{X}] . \end{matrix}$ This observation suggests that we may potentially improve the solutions obtained in the previous section by sequential boosting. Specifically, for some given $y_{S}^{} (\vec{X})$ , we define the boosted class $\begin{matrix} G_{B : S} = {g : X^{\times n} \times R_{+} \to Y, g (\vec{X}) = α y_{S}^{} (\vec{X}), α \in R_{+}} \subseteq G_{H} . \end{matrix}$ 25We shall remark that there can be other ways to boost a specific solution (e.g., Feng et al., 2022). Here we use scaling, a natural way to expand the solution for our problem, as a demonstration.

We take the solutions generated from the existing approaches, and look for the optimal boosting parameter $α^{*}$ through cross validation. We demonstrate the results for the running example through simulation and experiment with different gamma distributions that have the same mean, but different coefficients of variation. The expected profit obtained through implementing the conventional solution is denoted by $ϕ_{S}$ , and that through implementing the sequentially boosted solution is denoted by $ϕ_{S}^{B}$ . We report the average profit, the standard deviation of the profit, and the minimum profit out of 10,000 simulated samples for each problem instance.

As a general observation, we find that bootstrapping (Tables 1 and 5 in the Supporting Information) usually outperforms k‐fold cross‐validation (Tables 7 and 9 in the Supporting Information). Among the results generated by k‐fold cross‐validation, the best performance is observed when $k = n$ (Tables 6 and 8 in the Supporting Information). These observations are in line with the existing literature; recall Section 3.5.

Two observations are common from different cross‐validation approaches. First, sequential boosting generally improves the solution quality, especially when the sample size is small ( $n = 10$ ). More importantly, sequential boosting can reduce the variability of the profit—The standard deviation of the simulated profit under the boosted solution is smaller than that under the conventional solution. This suggests that boosting improves the performance robustness of the solution.

Second, the improvement over the average and variability of the profit obtained from sequential boosting is more significant when the inherent variability is higher (i.e., $Cv [X]$ is larger) or when the profit function is more concave in the decision (i.e., $γ = 2$ as opposed to $γ = 1$ ). Note that the concavity of the profit function amplifies the effect of variability in view of Jensen's inequality. Moreover, the effect of sample size should be viewed in relation to the level of variability—A larger sample size is needed to understand a system with higher variability. The observations here imply that sequential boosting is effective, especially in the small‐sample regime.

We note that the sequentially boosted solutions are nonparametric. That is, they do not require any knowledge of $F_{X}$ , other than it is within the domain $F_{v}$ of all distributions of nonnegative random variables with finite mean and variance. Our next step is to examine the value of additional statistical knowledge in improving the solution quality.

Value of statistical knowledge and the parametric ODA solution

In this subsection, we show that it is possible to derive a uniformly optimal ODA solution when additional statistical knowledge becomes available. This is a powerful result for two reasons. First, in many situations, the decision‐maker may have additional knowledge of the random component X. Distributional knowledge can often be obtained from the experience of operating similar domains in the past (see examples described by, e.g., Gallien et al., 2015; Hu et al., 2019). As we demonstrate below, the additional knowledge would allow us to expand the data‐integration model from the boosted class $G_{B : S}$ defined in (25) to the entire homogeneous class $G_{H}$ defined in (24). This expansion enhances the flexibility in solution validation. We establish below the uniform optimality of the parametric ODA solution, which dominates any solution derived from the existing approaches for any sample size. Moreover, our result suggests that the parametric ODA solution produces superior performance with very small samples. Second, the parametric ODA solution gives a natural benchmark for any solution derived under the nonparametric setting. One may refine the nonparametric approaches based on the properties of the parametric ODA solution.

In our problem, there can be two scenarios where additional statistical knowledge becomes useful for decision‐making. Assumption 2 Distributional information

(1)
$F_{X}$ is known up to the scale parameter. That is, $X = θ Z$ or $F_{X} (X | θ) = F_{Z} (\frac{X}{θ})$ with $F_{Z}$ known and $θ \in R_{+}$ unknown.
(2)
$F_{X}$ comes from a known distribution family with unknown parameters.

The general data‐integration model for our problem under Assumption 1 is the class of the operational statistics $G_{H}$ defined in (24). When implementing a solution $y \in G_{H}$ , we expect to make a profit of $\begin{matrix} E_{θ} [ψ (y (\vec{X}), X)] & = & \int_{\vec{x} \in R_{+}^{n}} (\int_{x = 0}^{\infty} ψ (y (\vec{x}), x) f_{X} (x | θ) d x) \\ \times (\prod_{i = 1}^{n} f_{X} (x_{i} | θ)) d \vec{x} . \end{matrix}$ 26Ideally, we would like to identify the operational statistic that maximizes the expected profit of implementing that solution: $\begin{matrix} \max {E_{θ} [ψ (y (\vec{X}), X)] : y \in G_{H}} . \end{matrix}$ 27This is the validation model of our problem.

It turns out that, under Assumption 2.1, it is possible to explicitly solve the validation model regardless of the value of θ (see Theorem 1). The operational statistic derived is uniformly optimal. With Assumption 2.2, however, directly validating (27) is not possible as there are additional unknown parameters. One may first estimate these unknown parameters and then use them to replace the true parameters in the uniformly optimal operational statistic. Alternatively, one may appropriately reduce the data‐integration model to a subset of $G_{H}$ and derive the optimal operational statistic within the subset. Next, we discuss these different scenarios in detail.

Under Assumption 2.1, Liyanage and Shanthikumar (2005) and Chu et al. (2008, 2023) have analyzed the problem in the context of newsvendor models. Because of the homogeneity of the operational statistics in $G_{H}$ , one can focus on solving the optimal decision y only over a subset of samples that have a unit norm. Then the solution for any other sample can be obtained through proper scaling. This approach allows for solving the problem for any observed sample as a scalar optimization problem. This is the major development by Chu et al. (2008, 2023). Here we present an alternative view of data integration in the ODA framework. In particular, any operational statistic in $G_{H}$ is associated with a distribution ${\hat{f}}_{X}^{OS}$ . Implementing an operational statistic from $G_{H}$ is equivalent to updating the distribution of X using the data $\vec{X}$ with a specific prior. With this updated distribution, we can estimate the expected profit function and derive the optimal decision for the observed sample. This observation is stated in the next theorem. Theorem 1 Parametric ODA estimation

Under Assumptions 1 and 2.1, for any $y_{OS} \in G_{H}$ , $\begin{matrix} E_{θ} [ψ (y_{OS} (\vec{X}), X)] = \int_{\vec{x} \in R_{+}^{n}} ϕ_{ODA} (y_{OS} (\vec{x}) | \vec{X}) \prod_{i = 1}^{n} f_{X} (x_{i} | θ) d \vec{x}, \end{matrix}$ where $\begin{matrix} ϕ_{ODA} (y | \vec{x}) & = & \int_{s = 0}^{\infty} ψ (y, s) {\hat{f}}_{X}^{OS} (s | \vec{x}) d s, \end{matrix}$ 28 $\begin{matrix} {\hat{f}}_{X}^{OS} (s | \vec{x}) & = & \frac{\int_{ϑ = 0}^{\infty} f_{X} (s | ϑ) [Π_{i = 1}^{n} f_{X} (x_{i} | ϑ)] \frac{1}{ϑ^{η + 1}} d ϑ}{\int_{α = 0}^{\infty} [Π_{i = 1}^{n} f_{X} (x_{i} | α)] \frac{1}{α^{η + 1}} d α} . \end{matrix}$ 29

In view of Theorem 1, the optimal operational statistic can be derived by maximizing $ϕ_{ODA} (y | \vec{x})$ , a scalar function. It is straightforward to see that the resulting solution is uniformly optimal over $G_{H}$ under Assumptions 1 and 2.1. Connecting Theorem 1 with the Bayesian estimate in (14), we observe that the optimal way to utilize the distributional knowledge is to update the distribution of X with an improper prior $f_{Θ} (θ) = 1 / θ^{η + 1}$ and then optimize the estimated profit. This result allows us to efficiently compute the solution through simulation (see Algorithm 1).

ALGORITHM 1

Simulation for (28) to obtain $ϕ_{ODA} (y, η | \hat{\vec{X}})$ .

1: Compute posterior $p (\cdot | \hat{\vec{X}})$ for Θ

2: $b = 0$

3: for $j \leftarrow 1, 2, …, n_{ϑ}$ do

4: $ϑ_{j} = (j / n_{ϑ}) (\underset{̲}{ϑ} - \bar{ϑ})$ , where $[\underset{̲}{ϑ}, \bar{ϑ}]$ is the range of the unknown parameter in computation

5: $a_{j} = ϑ_{j}^{- η - 1} \times \prod_{i = 1}^{n} f_{X} (x_{i} / ϑ_{j} | 1) / ϑ_{j}$

6: $b \leftarrow b + a_{j}$

7: for $j \leftarrow 1, 2, …, n_{ϑ}$ do $p (ϑ_{j} | \hat{\vec{X}}) = a_{j} / b$

8: Estimate the objective ϕ_ODA

9: Sample $x_{k}, k = 1, 2, …, n_{x}$ from $f_{X} (\cdot | 1)$

10: $v a l u e = 0$

11: for $j \leftarrow 1, 2, …, n_{ϑ}$ do

12: $v_{j} = \frac{1}{n_{x}} \sum_{k = 1}^{n_{x}} ψ (y, ϑ_{j} x_{k})$

13: $v a l u e = v a l u e + v_{j} \cdot p (ϑ_{j} | \hat{\vec{X}})$

14: $ψ_{ODA} (y, η | \hat{\vec{X}}) = v a l u e$

Moreover, Theorem 1 provides a way to estimate the profit function, $ϕ_{ODA} (\cdot | \vec{x})$ , allowing the decision‐maker to understand how the performance is affected by adjusting the decision. This is evident from Table 4 in the Supporting Information, where we demonstrate that the ODA approach produces a more accurate estimation of the objective function than the SAA does, especially when the inherent uncertainty is high.

The derivation of the optimal operational statistic in Theorem 1 relies on Assumption 2.1 that only θ is unknown. When there are other unknown parameters (i.e., the situation under Assumption 2.2), we may first estimate those parameters before applying the solution in Theorem 1 or reduce the class of operational statistics. We demonstrate the implementation of the ODA solution under different statistical knowledge using the running example. In the numerical example, we choose gamma distribution as the distribution family where $F_{X}$ comes from, though a similar analysis can be conducted for any known distribution family. For the purpose of comparison, we also examine the boosted solution derived from the basic statistical methods discussed in Section 3.1 but with the knowledge of the distributional family. In particular, we consider the functional space $\begin{matrix} G_{SCA} & = & \{g : R^{n} \times R_{+} \to R_{+}, g (α, \vec{x}) = α y_{SAA} (\vec{x}) \\ = & α {(\frac{{\hat{μ}}_{\vec{x}}}{c} \frac{γ}{1 + γ})}^{γ}\} . \end{matrix}$ 30It is straightforward to see that $G_{SCA} \subseteq G_{H}$ . The objective function in the validation model can then be computed as $\begin{matrix} E [ψ (y \vec{(X)}, X)] & = & E [α X {(\frac{{\hat{μ}}_{\hat{\vec{X}}}}{c} \frac{γ}{1 + γ})}^{γ} - c α^{1 + \frac{1}{γ}} {(\frac{{\hat{μ}}_{\hat{\vec{X}}}}{c} \frac{γ}{1 + γ})}^{γ + 1}] \\ = & \frac{γ^{γ}}{c^{γ} {(1 + γ)}^{γ}} E [α X {\hat{μ}}_{\hat{\vec{X}}}^{γ} - α^{1 + \frac{1}{γ}} \frac{γ}{1 + γ} {\hat{μ}}_{\hat{\vec{X}}}^{γ + 1}] . \end{matrix}$ The optimal boosting parameter can be obtained as $\begin{matrix} α_{SCA} = {(\frac{E [X] E [{(\frac{1}{n} \sum_{i = 1}^{n} X_{i})}^{γ}]}{E [{(\frac{1}{n} \sum_{i = 1}^{n} X_{i})}^{γ + 1}]})}^{γ} = {(\frac{E [{(\frac{1}{n} \sum_{i = 1}^{n} Z_{i})}^{γ}]}{E [{(\frac{1}{n} \sum_{i = 1}^{n} Z_{i})}^{γ + 1}]})}^{γ}, \end{matrix}$ 31where $Z_{i}$ are i.i.d. copies of $X / E [X]$ , which has mean one. If $F_{Z}$ follows $G a m m a (k, θ)$ , we have $T =^{d} \sum_{i = 1}^{n} Z_{i}$ follows $G a m m a (n k, θ)$ and it is straightforward to derive $E [T^{γ}] = θ^{γ} \frac{Γ (n k + γ)}{Γ (n k)}$ . Thus, $\begin{matrix} α_{SCA} = \frac{n}{θ (n k + γ)} = \frac{1}{1 + \frac{γ}{n} \frac{1}{k}} . \end{matrix}$ 32We note that the optimal boosting parameter α_SCA depends only on the shape parameter k and the sample size n, but not on the unknown parameter θ and the observed data $\vec{X}$ . In particular, when $γ = 1$ , we have $α_{SCA} = 1 / {(1 + (Cv [X])}^{2} / n)$ , which coincides with (8) in the mean‐estimation example.

When both the scale, θ, and the shape, k, of X are unknown, to derive the solution under either $G_{H}$ or $G_{SCA}$ , we need to estimate the shape parameter k. There are several ways to obtain the estimator $\hat{k}$ as detailed in Remark 3 in the Supporting Information. It is important to note the difference between $G_{SCA}$ and $G_{H}$ . With the former, the solution obtained through (32) only depends on $f_{X}$ through the parameter $Cv [X] = 1 / \sqrt{k}$ . The solution for the latter, in contrast, utilizes the distributional knowledge of $f_{X} (\cdot | θ)$ , as suggested by Theorem 1. Certainly, the more information is used, the better the solution performance is, a general observation from the numerical experiments reported in Table 2.

TABLE 1
Performance improvement with sequential boosting ( $γ = 1$ and $ϕ^{} = 1.125$ )..

ϕ_PTO $ϕ_{PTO}^{B}$ ϕ_Re:Var $ϕ_{Re : Var}^{B}$ ϕ_Re:Std $ϕ_{Re : Std}^{B}$ ϕ_Ro:M $ϕ_{Ro : M}^{B}$

$G a m m a (0.3, 5)$ $Cv [X] =$ 1.83

$n =$ 10 Ave 0.750 0.826 0.740 0.830 0.812 0.824 0.761 0.822

Stdev 0.749 0.464 0.746 0.464 0.528 0.441 0.714 0.462

Min −15.651 −8.661 −15.651 −8.661 −11.457 −8.661 −15.441 −8.661

$n =$ 20 Ave 0.937 0.958 0.891 0.971 0.954 0.956 0.941 0.957

Stdev 0.320 0.234 0.315 0.231 0.251 0.230 0.308 0.233

Min −5.341 −4.106 −5.341 −4.106 −4.299 −3.822 −5.341 −4.106

$G a m m a (0.5, 3)$ $Cv [X] =$ 1.41

$n =$ 10 Ave 0.902 0.931 0.904 0.933 0.924 0.929 0.907 0.930

Stdev 0.398 0.273 0.395 0.273 0.302 0.265 0.383 0.273

Min −6.547 −3.680 −6.547 −3.680 −5.221 −3.379 −6.547 −3.680

$n =$ 20 Ave 1.014 1.021 0.990 1.026 1.019 1.020 1.016 1.021

Stdev 0.180 0.144 0.182 0.142 0.150 0.143 0.174 0.144

Min −1.236 −0.847 −1.236 −0.847 −0.953 −0.844 −1.236 −0.847

$G a m m a (1, 1.5)$ $Cv [X] =$ 1.00

$n =$ 10 Ave 1.013 1.020 1.019 1.020 1.018 1.019 1.015 1.020

Stdev 0.174 0.147 0.166 0.147 0.155 0.146 0.169 0.146

Min −1.118 −0.979 −1.118 −0.979 −1.061 −0.979 −1.118 −0.979

$n =$ 20 Ave 1.068 1.070 1.063 1.070 1.069 1.070 1.069 1.070

Stdev 0.084 0.075 0.087 0.074 0.077 0.075 0.082 0.075

Min −0.098 −0.104 −0.098 −0.104 −0.061 −0.104 −0.061 −0.104

Note: $X \sim G a m m a (k, θ)$ with shape parameter k and scale parameter θ. $E [X] = k θ = 1.5$ , $γ = 1$ , and $c = 0.5$ so that $ϕ (y, X) = X y - 0.5 y^{2}$ . The true optimum is attained at $y^{} = 1.5$ with $ϕ^{} = 1.125$ . Bootstrapping is used for crossvalidation with $k = 100$ samples generated from the empirical distribution. The results are based on 10,000 random samples of $\vec{x}$ .

TABLE 2
Performance of the ODA solutions.

$G a m m a (0.1, 15)$ $G a m m a (0.3, 5)$ $G a m m a (0.5, 3.0)$ $G a m m a (1.0, 1.5)$

$Cv [X] = 3.16$ $Cv [X] = 1.83$ $Cv [X] = 1.41$ $Cv [X] = 1.00$

Ave Stdev Min Ave Stdev Min Ave Stdev Min Ave Stdev Min

$n = 5$ ϕ_PTO −1.25 8.79 −201.27 0.37 1.94 −65.16 0.69 0.86 −14.49 0.90 0.41 −5.35

Known $f_{Z}$ and k

ϕ_OS 0.42 0.37 0.00 0.73 0.32 0.00 0.86 0.26 0.01 0.97 0.17 0.17

ϕ_SCA 0.37 0.61 −15.16 0.68 0.53 −18.77 0.81 0.35 −5.22 0.93 0.25 −2.65

Known $f_{Z}$ and unknown k

ϕ_OS:M 0.24 2.08 −81.56 0.68 0.62 −20.59 0.81 0.42 −7.33 0.93 0.26 −4.30

ϕ_OS:LH 0.39 1.10 −83.54 0.68 0.55 −19.22 0.81 0.39 −7.21 0.93 0.26 −2.88

ϕ_OS:GG 0.39 1.14 −83.81 0.69 0.56 −20.80 0.82 0.40 −7.31 0.93 0.26 −2.91

$ϕ_{OS : GG - d}$ 0.37 0.69 −55.97 0.65 0.36 −7.56 0.76 0.30 −3.28 0.88 0.22 −2.41

Unknown $f_{Z}$ and k

ϕ_SCA 0.03 3.04 −81.75 0.62 0.95 −43.87 0.79 0.50 −7.41 0.92 0.29 −4.34

$n = 10$ ϕ_PTO 0.02 3.18 −84.13 0.76 0.70 −10.43 0.90 0.40 −9.14 1.01 0.18 −2.77

Known $f_{Z}$ and k

ϕ_OS 0.62 0.35 0.00 0.89 0.23 0.03 0.98 0.17 0.18 1.04 0.10 0.41

ϕ_SCA 0.57 0.57 −15.57 0.85 0.32 −4.09 0.94 0.24 −5.09 1.02 0.14 −1.76

Unknown $f_{Z}$ and unknown k

ϕ_OS:M 0.61 0.38 −6.53 0.87 0.27 −7.11 0.95 0.21 −3.35 1.03 0.13 −1.84

ϕ_OS:LH 0.62 0.38 −6.47 0.87 0.28 −6.71 0.95 0.21 −3.36 1.03 0.13 −1.85

ϕ_OS:GG 0.59 0.36 −5.45 0.86 0.25 −4.29 0.95 0.19 −3.16 1.02 0.12 −0.64

$ϕ_{OS : GG - d}$ 0.57 0.73 −17.82 0.86 0.34 −6.49 0.95 0.22 −3.38 1.02 0.13 −1.77

Unknown $f_{Z}$ and k

ϕ_SCA 0.48 1.20 −30.50 0.83 0.42 −7.58 0.93 0.27 −4.59 1.02 0.15 −2.18

$n = 20$ ϕ_PTO 0.57 1.20 −30.50 0.94 0.32 −5.98 1.01 0.18 −2.30 1.07 0.09 −0.22

Known $f_{Z}$ and k

ϕ_OS 0.80 0.29 0.01 1.00 0.15 0.17 1.04 0.10 0.32 1.08 0.06 0.58

ϕ_SCA 0.75 0.42 −10.40 0.99 0.21 −3.42 1.02 0.14 −1.39 1.07 0.07 0.02

Known $f_{Z}$ and unknown k

ϕ_OS:M 0.77 0.40 −8.44 0.97 0.18 −2.67 1.03 0.13 −0.53 1.07 0.07 −0.14

ϕ_OS:LH 0.79 0.29 −1.77 0.99 0.16 −0.64 1.03 0.12 −0.43 1.08 0.07 −0.09

ϕ_OS:GG 0.80 0.29 −1.75 0.99 0.16 −0.46 1.03 0.12 −0.49 1.08 0.07 −0.11

$ϕ_{OS : GG - d}$ 0.73 0.57 −12.15 0.96 0.22 −3.92 1.02 0.14 −1.58 1.07 0.08 −0.14

Unknown $f_{Z}$ and k

ϕ_SCA 0.79 0.29 −1.25 0.99 0.16 −0.30 1.03 0.11 −0.40 1.07 0.06 −0.10

Note: The true distribution is $f_{X} (x) = x^{k - 1} e^{- x / θ} / (Γ (k) θ^{k})$ , and $X =^{d} θ Z$ . $γ = 1$ and $c = 0.5$ such that the objective is $ψ (y, x) = x y - 0.5 y^{2}$ . The true optimal solution is $y^{} = 1.5$ , and the true optimal value is $ϕ^{} (θ) = 1.125$ . $ϕ_{j}$ is the optimal objective value under $G_{j}$ , $j \in {OS, SCA}$ . When k is unknown, OS:j estimates $\hat{k}$ using the moment estimation for $j = M$ , the likelihood estimation for $j = HL$ , the likelihood estimation with generalized gamma distributio‐n for $j = GG$ and the de‐biased maximum likelihood estimation with generalized gamma distribution for $j = GG - d$ . When $f_{Z}$ and k are unknown, $SCA$ estimates $Cv [X]$ using the method of moments. The result is based on 10,000 simulated samples of $\vec{X}$ .

The profit obtained from implementing the predict‐and‐then‐optimize solution in (13) is ϕ_PTO. From Table 2, it is clear that this solution is worse than any operational statistic derived by applying the ODA framework. The performance of ϕ_SCA with an unknown distribution family (i.e., known F), though the worst among all ODA solutions, is close enough to the uniformly optimal profit ϕ_OS with known $f_{Z}$ and k when $n = 20$ .

In all the instances reported, ϕ_OS under the known shape parameter k outperforms any other solutions. This is because the optimal solution obtained from Theorem 1 is uniformly optimal under Assumption 2.1. In particular, we observe that the average profit under this solution can be significantly closer to the true optimal profit than those under other solutions, especially when the sample size is extremely small (i.e., $n = 5$ ) or the inherent variability is high (i.e., $Cv [X]$ is high). More importantly, the standard deviation of the resulting profit is much smaller than those under other methods in the small‐sample regime. These observations suggest the effectiveness and robustness of the ODA approach as well as the value of statistical knowledge in improving decision quality.

In the event that the shape parameter k is unknown, interestingly, the maximum likelihood estimator using the generalized gamma distribution outperforms other estimators, even though this estimator is biased. This observation suggests that the unbiasedness of parameter estimation may not lead to effective decision‐making, underscoring the importance of validating against the true performance measure (i.e., the profit obtained from the implemented solution).

When the distribution family of $F_{X}$ is unknown, the solution under $G_{SCA}$ (with $Cv [X]$ estimated using the method of moments) significantly outperforms the predict‐and‐then‐optimize approach, ϕ_PTO. This suggests that utilizing even the partial property of the solution structure in the data‐integration model can significantly improve the decision quality, without additional statistical knowledge.

In the nonparametric case (i.e., when $f_{Z}$ and k are unknown), it is interesting to compare ϕ_SCA in Table 2 with the boosted solutions reported in Table 1. We find that ϕ_SCA generally outperforms the boosted solutions. The former identifies the structure in (31) of the optimal boosting parameter within the scaled class, while the data drive boosting through cross‐validation in the latter.

DISCUSSION

We have presented the ODA framework under different levels of statistical knowledge in a generic setting. In this section, we discuss several implications of the ODA framework in broader contexts.

Incorporating covariates

In reality, we may have additional contextual information about the system uncertainty. Suppose the uncertain component in the system, denoted by $\vec{X}$ , is a m‐dimensional random vector $\vec{X} = {X_{j}, j = 1, 2, …, m}$ . We can observe a set of covariates $\vec{v} \in V \subset R^{d}$ , where $V$ is the domain of the d‐dimensional vector. The data available consist of the pairs ${({\vec{V}}_{i}, {\vec{X}}_{i})}_{i = 1}^{n} = (V, X)$ . The data‐generation model specifies the stochastic functions ${X_{j} (\vec{v}) = g_{j} (\vec{v}) + Z_{j}, \vec{v} \in V}, j = 1, 2, …, m$ , with $E [Z_{j}] = 0$ and $g \in G$ . For ease of exposition, consider a linear payoff function ψ with the goal of maximizing the expected profit (Elmachtoub & Grigas, 2022), that is, $\begin{matrix} \max \{E [\sum_{j = 1}^{m} X_{j} (\vec{v}) y_{j}] : \vec{y} \in Y \subseteq R^{m}\}, \end{matrix}$ 33where $Y$ is some convex set of feasible decisions.

To solve this problem, we need to come up with an appropriate vector of statistics of the data $(V, X)$ . The classical approach is to find an estimate through minimizing some loss function L: $\begin{matrix} {\vec{g}}^{ES} (V, X) = arg \min \{\sum_{i = 1}^{n} L (\hat{\vec{g}} ({\vec{V}}_{i}), {\vec{X}}_{i}) : \hat{\vec{g}} \in G\} . \end{matrix}$ Then, the decisions are optimized in (33) with $g_{j} (\vec{v})$ replaced by $g_{j}^{ES} (V, X)$ .

The smart‐predict‐and‐optimize approach proposed by Elmachtoub and Grigas (2022), instead, evaluates the empirical risk of using $\hat{\vec{g}}$ to replace $\vec{g}$ as $\begin{matrix} R^{Em} (\hat{\vec{g}}, \vec{g}) = \sum_{i = 1}^{n} L (\sum_{j = 1}^{m} X_{i : j} y_{j}^{RTO} (\hat{g} ({\vec{V}}_{i})), \sum_{j = 1}^{m} X_{i : j} y_{j}^{RTO} ({\vec{X}}_{i})), \end{matrix}$ where ${\vec{y}}^{RTO} (\vec{x}) = arg \max {\sum_{j = 1}^{m} x_{j} y_{j} : \vec{y} \in Y}$ is assumed to be unique. It is easy to see that ${\vec{y}}^{RTO}$ is the retrospective optimal solution for a single observation, $\vec{x}$ . The smart prediction is then derived as $\begin{matrix} {\hat{\vec{g}}}_{SPO : (V, X)} (\vec{v}) = arg \min {R^{Em} (\hat{\vec{g}}, \vec{g}) : \hat{\vec{g}} \in G}, \end{matrix}$ and the implemented solution is $\begin{matrix} {\vec{y}}_{SPO} (\vec{v}, V, X) = {\vec{y}}^{RTO} ({\hat{\vec{g}}}_{SPO : (V, X)} (\vec{v})) . \end{matrix}$ To see that this solution fits into the ODA framework, we define the class of operational statistics: $\begin{matrix} Y_{OS : SPO} = \{{{\vec{y}}_{\vec{g}} (\vec{v}) = {\vec{y}}^{RTO} (\vec{g} (\vec{v})), \vec{v} \in V}, \vec{g} \in G\} . \end{matrix}$ Clearly, ${\vec{y}}_{SPO} \in Y_{OS : SPO}$ . A natural validating model is to assess the average profit obtained by implementing some ${\vec{y}}_{\vec{g}} \in Y_{OS : SPO}$ : $\begin{matrix} ϕ_{SPO} ({\vec{y}}_{\vec{g}}) = \sum_{i = 1}^{n} L (\sum_{j = 1}^{m} X_{i : j} y_{\vec{g} : j} ({\vec{V}}_{i}), \sum_{j = 1}^{m} X_{i : j} y_{j}^{RTO} ({\vec{X}}_{i})) . \end{matrix}$ The ODA approach would choose the following operational statistic: $\begin{matrix} {\vec{y}}_{OS : SPO} = arg \min \{ϕ_{SPO} ({\vec{y}}_{\vec{g}}) : {\vec{y}}_{\vec{g}} \in Y_{OS : SPO}\} . \end{matrix}$ Next, we demonstrate that ${\vec{y}}_{SPO}$ can be inadmissible under the ODA framework using the running example. For $γ = 1$ , we can recast the quality design problem in the form of (33) as $\begin{matrix} \max \{E [X (\vec{v})] y - \frac{1}{2} c y_{0} : y \geq 0; y_{0} \geq y^{2}\} . \end{matrix}$ Here we introduce the decision y ₀ to linearize the objective to fit into the formulation in (33). For a given observation $X = x$ , the retrospective optimal solution is $y^{RTO} (x) = \frac{x}{c}$ and $y_{0}^{RTO} (x) = {(\frac{x}{c})}^{2}$ . If we consider the squared error as the loss, that is, $L (s, t) = {(s - t)}^{2}$ . Then the empirical risk becomes $\begin{matrix} R^{Em} (\hat{g}, g) & = & \sum_{i = 1}^{n} {(X_{i} \frac{\hat{g} ({\vec{V}}_{i})}{c} - \frac{c}{2} \frac{{(\hat{g} ({\vec{V}}_{i}))}^{2}}{c^{2}} - \frac{1}{2} \frac{X_{i}^{2}}{c})}^{2} \\ = & \frac{1}{4 c^{2}} \sum_{i = 1}^{n} {(X_{i} - \hat{g} ({\vec{V}}_{i}))}^{4} . \end{matrix}$ It is easy to see that a solution that minimizes the empirical risk is an element of the following homogeneous class of operational statistics: $\begin{matrix} Y_{OS : H} & = & {y_{OS} (\vec{v}, V, α X) \\ = & α y_{OS} (\vec{v}, V, X), α \in R_{+}, V \in V^{n}, X \in R_{+}^{n}} . \end{matrix}$ Clearly, the solution $y_{SPO}^{}$ is inadmissible over the data‐integration model $Y_{OS : H}$ .

Theoretical justification of the homogeneous operational statistics

Our observations from Section 3 suggest that various existing solutions are special cases of the homogeneous operational statistics, $G_{H}$ , defined in (24). This observation has guided us to expand the existing solutions to obtain improved performance. Moreover, our analysis of the parametric ODA solution in Section 4.2 suggests that the uniform optimality can be attained within the homogeneous class. In this subsection, we offer additional theoretical support to justify that for any decision‐making problem satisfying Assumption 1, it is not necessary to look beyond the homogeneous operational statistics. Thus, the performance of any data‐integrated solution is upper‐bounded by that of the parametric ODA solution derived in Theorem 3.

Consider any solution $y \in \bar{G}$ of the problem; recall that $\bar{G}$ contains any function that maps the space of observed data to the space of feasible decisions. Note that we can always express the underlying random parameter X by scaling some unit random variable Z, that is, $X =^{d} θ Z$ for some $θ \in R_{+}$ and $E [Z] = 1$ . Then implementing y as a function of the random data leads to an expected profit of $\begin{matrix} E [ψ (y (\vec{X}), X)] = E [ψ (y (θ \vec{Z}), θ Z)], \end{matrix}$ where $\vec{Z} = (Z_{1}, Z_{2}, …, Z_{n}) = (X_{1} / θ, X_{2} / θ, …, X_{n} / θ)$ . Applying Assumption 1, the theoretically optimal profit is $\begin{matrix} \max_{y_{0} \in Y} {E [ψ (y_{0}, X)]} = θ^{η} \max_{y_{0} \in Y} {E [ψ (y_{0}, Z)]} \end{matrix}$ The best choice of $y \in \bar{G}$ , when the adversarial chooses the worst θ, maximizes the competitive ratio: $sup_{y \in \bar{G}} inf_{θ \in R_{+}} \frac{E [ψ (y (θ \vec{Z}), θ Z)]}{θ^{η} \max_{y_{0} \in Y} {E [ψ (y_{0}, Z)]}} .$ Theorem 2 Uniform dominance

		ϕ_PTO	$ϕ_{PTO}^{B}$	ϕ_Re:Var	$ϕ_{Re : Var}^{B}$	ϕ_Re:Std	$ϕ_{Re : Std}^{B}$	ϕ_Ro:M	$ϕ_{Ro : M}^{B}$
$G a m m a (0.3, 5)$	$Cv [X] =$ 1.83
$n =$ 10	Ave	0.750	0.826	0.740	0.830	0.812	0.824	0.761	0.822
	Stdev	0.749	0.464	0.746	0.464	0.528	0.441	0.714	0.462
	Min	−15.651	−8.661	−15.651	−8.661	−11.457	−8.661	−15.441	−8.661
$n =$ 20	Ave	0.937	0.958	0.891	0.971	0.954	0.956	0.941	0.957
	Stdev	0.320	0.234	0.315	0.231	0.251	0.230	0.308	0.233
	Min	−5.341	−4.106	−5.341	−4.106	−4.299	−3.822	−5.341	−4.106
$G a m m a (0.5, 3)$	$Cv [X] =$ 1.41
$n =$ 10	Ave	0.902	0.931	0.904	0.933	0.924	0.929	0.907	0.930
	Stdev	0.398	0.273	0.395	0.273	0.302	0.265	0.383	0.273
	Min	−6.547	−3.680	−6.547	−3.680	−5.221	−3.379	−6.547	−3.680
$n =$ 20	Ave	1.014	1.021	0.990	1.026	1.019	1.020	1.016	1.021
	Stdev	0.180	0.144	0.182	0.142	0.150	0.143	0.174	0.144
	Min	−1.236	−0.847	−1.236	−0.847	−0.953	−0.844	−1.236	−0.847
$G a m m a (1, 1.5)$	$Cv [X] =$ 1.00
$n =$ 10	Ave	1.013	1.020	1.019	1.020	1.018	1.019	1.015	1.020
	Stdev	0.174	0.147	0.166	0.147	0.155	0.146	0.169	0.146
	Min	−1.118	−0.979	−1.118	−0.979	−1.061	−0.979	−1.118	−0.979
$n =$ 20	Ave	1.068	1.070	1.063	1.070	1.069	1.070	1.069	1.070
	Stdev	0.084	0.075	0.087	0.074	0.077	0.075	0.082	0.075
	Min	−0.098	−0.104	−0.098	−0.104	−0.061	−0.104	−0.061	−0.104

		$G a m m a (0.1, 15)$	$G a m m a (0.3, 5)$	$G a m m a (0.5, 3.0)$	$G a m m a (1.0, 1.5)$
$n = 5$	ϕ_PTO	−1.25	8.79	−201.27	0.37	1.94	−65.16	0.69	0.86	−14.49	0.90	0.41	−5.35
Known $f_{Z}$ and k
	ϕ_OS	0.42	0.37	0.00	0.73	0.32	0.00	0.86	0.26	0.01	0.97	0.17	0.17
	ϕ_SCA	0.37	0.61	−15.16	0.68	0.53	−18.77	0.81	0.35	−5.22	0.93	0.25	−2.65
Known $f_{Z}$ and unknown k
	ϕ_OS:M	0.24	2.08	−81.56	0.68	0.62	−20.59	0.81	0.42	−7.33	0.93	0.26	−4.30
	ϕ_OS:LH	0.39	1.10	−83.54	0.68	0.55	−19.22	0.81	0.39	−7.21	0.93	0.26	−2.88
	ϕ_OS:GG	0.39	1.14	−83.81	0.69	0.56	−20.80	0.82	0.40	−7.31	0.93	0.26	−2.91
	$ϕ_{OS : GG - d}$	0.37	0.69	−55.97	0.65	0.36	−7.56	0.76	0.30	−3.28	0.88	0.22	−2.41
Unknown $f_{Z}$ and k
	ϕ_SCA	0.03	3.04	−81.75	0.62	0.95	−43.87	0.79	0.50	−7.41	0.92	0.29	−4.34
$n = 10$	ϕ_PTO	0.02	3.18	−84.13	0.76	0.70	−10.43	0.90	0.40	−9.14	1.01	0.18	−2.77
Known $f_{Z}$ and k
	ϕ_OS	0.62	0.35	0.00	0.89	0.23	0.03	0.98	0.17	0.18	1.04	0.10	0.41
	ϕ_SCA	0.57	0.57	−15.57	0.85	0.32	−4.09	0.94	0.24	−5.09	1.02	0.14	−1.76
Unknown $f_{Z}$ and unknown k
	ϕ_OS:M	0.61	0.38	−6.53	0.87	0.27	−7.11	0.95	0.21	−3.35	1.03	0.13	−1.84
	ϕ_OS:LH	0.62	0.38	−6.47	0.87	0.28	−6.71	0.95	0.21	−3.36	1.03	0.13	−1.85
	ϕ_OS:GG	0.59	0.36	−5.45	0.86	0.25	−4.29	0.95	0.19	−3.16	1.02	0.12	−0.64
	$ϕ_{OS : GG - d}$	0.57	0.73	−17.82	0.86	0.34	−6.49	0.95	0.22	−3.38	1.02	0.13	−1.77
Unknown $f_{Z}$ and k
	ϕ_SCA	0.48	1.20	−30.50	0.83	0.42	−7.58	0.93	0.27	−4.59	1.02	0.15	−2.18
$n = 20$	ϕ_PTO	0.57	1.20	−30.50	0.94	0.32	−5.98	1.01	0.18	−2.30	1.07	0.09	−0.22
Known $f_{Z}$ and k
	ϕ_OS	0.80	0.29	0.01	1.00	0.15	0.17	1.04	0.10	0.32	1.08	0.06	0.58
	ϕ_SCA	0.75	0.42	−10.40	0.99	0.21	−3.42	1.02	0.14	−1.39	1.07	0.07	0.02
Known $f_{Z}$ and unknown k
	ϕ_OS:M	0.77	0.40	−8.44	0.97	0.18	−2.67	1.03	0.13	−0.53	1.07	0.07	−0.14
	ϕ_OS:LH	0.79	0.29	−1.77	0.99	0.16	−0.64	1.03	0.12	−0.43	1.08	0.07	−0.09
	ϕ_OS:GG	0.80	0.29	−1.75	0.99	0.16	−0.46	1.03	0.12	−0.49	1.08	0.07	−0.11
	$ϕ_{OS : GG - d}$	0.73	0.57	−12.15	0.96	0.22	−3.92	1.02	0.14	−1.58	1.07	0.08	−0.14
Unknown $f_{Z}$ and k
	ϕ_SCA	0.79	0.29	−1.25	0.99	0.16	−0.30	1.03	0.11	−0.40	1.07	0.06	−0.10

No solution in $\bar{G}$ can dominate the optimal operational statistic within the homogeneous class $G_{H}$ in the competitive ratio. That is, $\begin{matrix} sup_{y \in \bar{G}} inf_{θ \in R_{+}} \frac{E [ψ (y (θ \vec{Z}), θ Z)]}{θ^{η} \max_{y_{0} \in Y} {E [ψ (y_{0}, Z)]}} \\ \leq sup_{y \in G_{H}} \frac{E [ψ (y (\vec{Z}), Z)]}{\max_{y_{0} \in Y} {E [ψ (y_{0}, Z)]}} . \end{matrix}$

Theorem 2 is a powerful result. It states that no data‐integrated decision can uniformly dominate the homogeneous class of operational statistics in terms of the competitive ratio to the theoretically optimal profit. In other words, when looking for the decision without much knowledge of θ, one need not look for alternative classes of data‐integration models as long as the problem satisfies Assumption 1.

Though it is not possible to find a solution that uniformly dominates the homogeneous class, for some specific problem instance (i.e., some specific distribution of X), it is possible that a solution y performs better than the optimal homogeneous operational statistics y _OS. In such a situation, we need to relax the requirement of uniform optimality. One way is to look at the average performance: $\begin{matrix} \lim_{a \to \infty} \frac{1}{a} \int_{0}^{a} E [ψ (y (θ \vec{Z}), θ Z)] d θ . \end{matrix}$

Theorem 3 Average dominance

Suppose $ψ (\cdot, x)$ is concave. Then for any $y \in \bar{G} ∖ G_{H}$ , there exists a $y_{H} \in G_{H}$ such that $\lim_{a \to \infty} \frac{1}{a} \int_{0}^{a} θ^{- η} E [ψ (y (θ \vec{Z}), θ Z)] d θ \leq E [ψ (y_{H} (\vec{Z}), Z)]$ .

The general view of the homogeneous profit function

The property of the profit function described in Assumption 1 plays a pivotal role in determining the structure of the ODA solution. In this subsection, we establish the connection of Assumption 1 to the Euler equation, directly leading to the derivation of the ODA solution for multidecision problems. Lemma 1

A twice‐differentiable function $ψ : R^{m} \to R$ satisfies $\begin{matrix} \sum_{i = 1}^{m} v_{i} x_{i} \frac{\partial}{\partial x_{i}} ψ (\vec{x}) = ψ (\vec{x}), \forall x \in R_{+}^{m} \end{matrix}$ 34for some fixed $\vec{v} \in R_{+}^{m}$ , if and only if, for any $c_{0} \in R_{+}$ , $\begin{matrix} ψ (c_{0}^{v_{1}} x_{1}, c_{0}^{v_{2}} x_{2}, …, c_{0}^{v_{m}} x_{m}) = c_{0} ψ (\vec{x}) . \end{matrix}$ 35

Note that Assumption 1 corresponds to the two‐dimensional ( $m = 2$ ) case of (35) if we normalize η to one by appropriately scaling c ₀. Condition (34) is a generalized form of the Euler equation, which sets $\vec{v} = \vec{1}$ . Lemma 1 suggests that the homogeneous assumption needed for our ODA solution can be viewed as a generalized version of the Euler equation. Moreover, Lemma 1 directly hinges on the application of the ODA framework to multidecision problems. To derive the ODA solution, we need to identify the associated vector $\vec{v}$ , which can be derived by solving the following system of linear equations (provided that a solution exists): $\begin{matrix} \vec{v} [\begin{matrix} x_{1} \frac{\partial^{2}}{\partial x_{1}^{2}} ψ & x_{1} \frac{\partial^{2}}{\partial x_{1} x_{2}} ψ & \dots & x_{1} \frac{\partial^{2}}{\partial x_{1} x_{n}} ψ \\ x_{2} \frac{\partial^{2}}{\partial x_{2} x_{1}} ψ & x_{2} \frac{\partial^{2}}{\partial x_{2}^{2}} ψ & \dots & x_{2} \frac{\partial^{2}}{\partial x_{2} x_{n}} ψ \\ ⋮ & ⋮ & ⋮ & ⋮ \\ x_{n} \frac{\partial^{2}}{\partial x_{n} x_{1}} ψ & x_{n} \frac{\partial^{2}}{\partial x_{n} x_{2}} ψ & \dots & x_{n} \frac{\partial^{2}}{\partial x_{n}^{2}} ψ \end{matrix}] \\ - \vec{v} [\begin{matrix} \frac{\partial}{\partial x_{1}} ψ & 0 & \dots & 0 \\ 0 & \frac{\partial}{\partial x_{2}} ψ & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ \\ 0 & 0 & \dots & \frac{\partial}{\partial x_{n}} ψ \end{matrix}] = [\begin{matrix} \frac{\partial}{\partial x_{1}} ψ \\ \frac{\partial}{\partial x_{2}} ψ \\ ⋮ \\ \frac{\partial}{\partial x_{n}} ψ \end{matrix}] . \end{matrix}$

CONCLUDING REMARKS

In this paper, we provide a comprehensive summary of the ODA framework for data‐based decision‐making. It is clear from our discussion that the ODA framework can be used for both predictive and prescriptive modeling, as estimation is also a decision‐making problem (see, e.g., Feng et al., 2022). The design of the ODA framework features a delicate balance between the data‐integration model and the validating model. To achieve this balance, the ODA framework must leverage the domain knowledge of the operating system and capture the structure of the inherent optimization model. We demonstrate that the ODA framework unifies various existing data‐based approaches and produces efficient performance in the small‐sample regime.

We shall point out that, though the nonparametric ODA solution exhibits improvement over those derived from the existing approaches, the gap to the theoretically optimal solution (with full statistical knowledge of the system) can still be significant with small samples. One way to reduce this gap is to look for similar systems with ample data and design transfer‐learning solutions (Bastani et al., 2022; Weiss et al., 2016) based on the knowledge of the similar systems. Development of the transfer learning solution for our canonical model boils down to determining the boosting parameter using the data from similar systems. Moreover, with our ODA framework, one can leverage the structure of the parametric ODA solution when utilizing the transferred data to improve the decision efficiency (Feng et al., 2023).

As an introductory piece, we analyze a generic decision problem in this paper. We should point out that the philosophy of the ODA framework can be applied to richer contexts and more complex problems. For example, one may examine how the adoption of the ODA approach, as opposed to other existing approaches, for decision‐making may affect the interaction among supply chain members in vertical and horizontal relationships (e.g., Loots & den Boer, 2023; Miklós‐Thal & Tucker, 2019)

The ODA framework can be easily adapted to situations involving censoring, latency, endogeneity, or unknown structural characterization of the model by integrating various methods of objective learning (Lim et al., 2006). Another important direction to expand the ODA framework is to consider dynamic learning when the data are becoming available over time and are consequences of previous decisions (e.g., Besbes et al., 2014; B. Chen et al., 2021; B. Chen, Simchi‐Levi, Wang, et al., 2022; X. Chen, Simchi‐Levi, & Wang , 2022). This requires a careful balance between dynamic data‐integrated modeling and decision validation to achieve performance efficiency. Expanding the ODA framework to consider feature data for customized decision‐making (e.g., Hopp et al., 2018; X. Wang. Li, et al., 2023; Yin et al., 2023) is another important direction for future research.

Footnotes

ORCID

Qi Feng

References

Anand

K. S.

Paç

M. F.

Veeraraghavan

(2011). Quality–speed conundrum: Trade‐offs in customer‐intensive services. Management Science, 57(1), 40–56.

Arlot

Celisse

(2010). A survey of cross‐validation procedures for model selection. Statistics Surveys, 4, 40–79.

Banker

R. D.

Khosla

Sinha

K. K.

(1998). Quality and competition. Management Science, 44(9), 1179–1192.

Bastani

Simchi‐Levi

Zhu

(2022). Meta dynamic pricing: Transfer learning across experiments. Management Science, 68(3), 1865–1881.

Ben‐Tal

Boyd

Nemirovski

(2006). Extending scope of robust optimization: Comprehensive robust counterparts of uncertain problems. Mathematical Programming, 107(1), 63–89.

Bertsimas

Gupta

Kallus

(2018). Data‐driven robust optimization. Mathematical Programming, 167(2), 235–292.

Bertsimas

Kallus

(2020). From predictive to prescriptive analytics. Management Science, 66(3), 1025–1044.

Besbes

Gur

Zeevi

(2014). Stochastic multi‐armed‐bandit problem with non‐stationary rewards. In Advances in neural information processing systems (NIPS 2014: Proceedings of the 27th International Conference on Neural Information Processing Systems) (Vol. 1, pp. 199–207). MIT Press.

Besbes

Zeevi

(2009). Dynamic pricing without knowing the demand function: Risk bounds and near optimal algorithms. Operations Research, 57(6), 1407–1420.

10.

Biggs

Hariss

Perakis

(2023). Constrained optimization of objective functions determined from random forests. Production and Operations Management, 32(2), 397–415.

11.

Burnetas

(2022). Learning and data‐driven optimization in queues with strategic customers. Queueing Systems, 100, 517–519.

12.

Cao

Gao

(2021). Contextual decision‐making under parametric uncertainty and data‐driven optimistic optimization . Optimization Online. https://optimization‐online.org/2021/10/8634/

13.

Charikar

Chekuri

Pál

(2005). Sampling bounds for stochastic optimization. In Chekuri

Jansen

Rolim

J. D. P.

Trevisan

(Eds.), Approximation, randomization and combinatorial optimization. Algorithms and techniques (pp. 257–269). Proceedings of APPROX 2005 and RANDOM 2005. Springer.

14.

Chen

Chao

Shi

(2021). Nonparametric learning algorithms for joint pricing and inventory control with lost sales and censored demand. Mathematics of Operations Research, 46(2), 726–756.

15.

Chen

Simchi‐Levi

Wang

Zhou

(2022). Dynamic pricing and inventory control with fixed ordering cost and incomplete demand information. Management Science, 68(8), 5557–6354.

16.

Chen

Simchi‐Levi

Wang

(2022). Privacy‐preserving dynamic personalized pricing with demand learning. Management Science, 68(7), 4755–5555.

17.

Chu

Feng

Shanthikumar

J. G.

Shen

Z.‐J. M.

(2023). Solving the price‐setting newsvendor problem with parametric operational data analytics (ODA) [Working paper].

18.

Chu

L. Y.

Lai

(2013). Salesforce contracting under demand censorship. Manufacturing & Service Operations Management, 15(2), 320–334.

19.

Chu

L. Y.

Shanthikumar

J. G.

Shen

Z. J. M.

(2008). Solving operational statistics via a Bayesian analysis. Operations Research Letters, 36, 110–116.

20.

Chuang

Y.‐T.

Zargoush

Ghazalbash

Samiedaluie

Kuluski

Guilcher

(2023). From prediction to decision: Optimizing long‐term care placements among older delayed discharge patients. Production and Operations Management, 32(4), 1041–1058.

21.

Efron

Tibshirani

(1997). Improvements on cross‐validation: The 632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.

22.

Elmachtoub

A. N.

Grigas

(2022). Smart “predict, then optimize”. Management Science, 68(1), 9–26.

23.

Feng

Shanthikumar

J. G.

(2023). Transfer learning, cross learning and co‐learning across newsvendor systems [Working paper].

24.

Feng

Shanthikumar

J. G.

(2022). Developing operations management data analytics. Production and Operations Management, 31(12), 4544–4557.

25.

Feng

George Shanthikumar

Xue

(2022). Consumer choice models and estimation: A review and extension. Production and Operations Management, 31(2), 847–867.

26.

Gallien

Mersereau

A. J.

Garro

Mora

A. D.

Vidal

(2015). Initial shipment decisions for new products at Zara. Operations Research, 63(2), 269–286.

27.

Gilboa

Schmeidler

(1989). Maxmin expected utility with non‐unique prior. Journal of Mathematical Economics, 18, 141–153.

28.

Halmos

P. R.

(1946). The theory of unbiased estimation. The Annuals of Mathematical Statistics, 17(1), 34–43.

29.

C. P.

Hanasusantoy

G. A.

(2022). On data‐driven prescriptive analytics with side information: A regularized Nadaraya– Watson approach, 32(4), 1205–1222.

30.

Homem‐de‐Mello

Bayraksan

(2014). Monte Carlo sampling‐based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1), 56–85.

31.

Hopp

Wang

(2018). Big data and precision medicine revolution. Production and Operations Management, 27(9), 1647–1664.

32.

Acimovic

Erize

Thomas

D. J.

Van Mieghem

J. A.

(2019). Forecasting new product life cycle curves: Practical approach and empirical analysis. Manufacturing & Service Operations Management, 21(1), 66–85.

33.

Huh

W. T.

Janakiraman

Muckstadt

H. A.

Rusmevichientong

(2009). An adaptive algorithm for finding the optimal base‐stock policy in lost sales inventory systems with censored demand. Mathematics of Operations Research, 34(2), 397–416.

34.

Kleywegt

A. J.

Shapiro

Homem‐de Mello

(2002). The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2), 479–502.

35.

Kohavi

(1995). A study of cross‐validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th international joint conference on artificial intelligence (Vol. 2, pp. 1137–1143). Morgan Kaufmann.

36.

Lei

Geng

Zhang

Liu

Shen

Z.‐J. M.

(2023). New product life cycle curve modeling and forecasting with product attributes and promotion: A Bayesian functional approach. Production and Operations Management, 32(2), 655–673.

37.

Levi

Roundy

R. O.

Shmoys

D. B.

(2007). Provably near‐optimal sampling‐based policies for stochastic inventory control models. Mathematics of Operations Research, 32(4), 821–839.

38.

Lim

A. E. B.

Shanthikumar

J. G.

Shen

Z. J. M.

(2006). Model uncertainty, robust optimization and learning. INFORMS Tutorial, 2006, 66–94.

39.

Liu

Zheng

Liu

Zhang

Z.‐H.

(2023). From share of choice to buyers' welfare maximization: Bridging the gap through distributionally robust optimization. Production and Operations Management, 32(4), 1205–1222.

40.

Liyanage

Shanthikumar

J. G.

(2005). A practical inventory control policy using operational statistics. Operations Research Letters, 33, 341–348.

41.

Long

D. Z.

Sim

Zhou

(2023). Robust satisficing. Operations Research, 71(1), 61–82.

42.

Loots

denBoer

A. V.

(2023). Data‐driven collusion and competition in a pricing duopoly with multinomial logit demand. Production and Operations Management, 32(4), 1169–1186.

43.

Shen

Z.‐J. M.

(2021). A review of robust operations management with model uncertainty. Production and Operations Management, 30(6), 1927–1943.

44.

Miklós‐Thal

Tucker

(2019). Collusion by algorithm: Does better demand prediction facilitate better coordination between sellers? Management Science, 65(4), 1552–1561.

45.

Mĭsić

V. V.

Perakis

(2020). Data analytics in operations management: A review. Manufacturing & Service Operations Management, 22(1), 158–169.

46.

Mohajerin Esfahani

Kuhn

(2018). Data‐driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1), 1–52.

47.

Saldanha

J. P.

Price

B. S.

Thomas

D. J.

(2023). A nonparametric approach for setting safety stock levels. Production and Operations Management, 32, 1150–1168.

48.

Searls

D. T.

(1964). The utilization of a known coefficient of variation in the estimation procedure. Journal of American Statistical Association, 59, 1225–1226.

49.

Sim

Tang

Zhou

Zhu

(2022). The analytics of robust satisficing NUS Business School, National University of Singapore, Singapore.

50.

Simchi‐Levi

(2014). OM forum–OM research: From problem‐driven to data‐driven research. Manufacturing & Service Operations Management, 16(1), 2–10.

51.

Smith

J. E.

Winkler

R. L.

(2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3), 311–322.

52.

Swamy

Shmoys

D. B.

(2005). Sampling‐based approximation algorithms for multi‐stage stochastic optimization. In Proceedings of the 46th Annual IEEE Symposium on the Foundations of Computer Science (pp. 357–366). IEEE Press.

53.

Wald

(1949). Statistical decision functions. The Annals of Mathematical Statistics, 20(2), 165–205.

54.

Wang

Kopalle

P. K.

(2023). When does it pay to invest in pricing algorithms? Production and Operations Management . Advance online publication. https://doi.org/10.1111/poms.13924

55.

Wang

Zhang

Zhou

Tang

(2023). Feature‐driven robust surgery scheduling? Production and Operations Management, 32(6), 1921–1938.

56.

Weiss

Khoshgoftaar

T. M.

Wang

(2016). A survey of transfer learning. Journal of Big Data, 3, 9.

57.

Yin

Jiang

Zhou

S. X.

(2023). Effects of consumers' context‐dependent preferences on product bundling. Production and Operations Management, 32(6), 1674–1691.

58.

Zhao

Chakrabarti

Muthuraman

(2018). Unified classical and robust optimization for least squares NUS Business School, National University of Singapore, Singapore. https://ssrn.com/abstract=3182422

59.

Zhu

Xie

Sim

(2022). Joint estimation and robustness optimization. Management Science, 68(3), 1659–1677.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.21 MB

1:	Compute posterior $p (\cdot \| \hat{\vec{X}})$ for Θ
2:	$b = 0$
3:	for $j \leftarrow 1, 2, …, n_{ϑ}$ do
4:	$ϑ_{j} = (j / n_{ϑ}) (\underset{̲}{ϑ} - \bar{ϑ})$ , where $[\underset{̲}{ϑ}, \bar{ϑ}]$ is the range of the unknown parameter in computation
5:	$a_{j} = ϑ_{j}^{- η - 1} \times \prod_{i = 1}^{n} f_{X} (x_{i} / ϑ_{j} \| 1) / ϑ_{j}$
6:	$b \leftarrow b + a_{j}$
7:	for $j \leftarrow 1, 2, …, n_{ϑ}$ do $p (ϑ_{j} \| \hat{\vec{X}}) = a_{j} / b$
8:	Estimate the objective ϕ_ODA
9:	Sample $x_{k}, k = 1, 2, …, n_{x}$ from $f_{X} (\cdot \| 1)$
10:	$v a l u e = 0$
11:	for $j \leftarrow 1, 2, …, n_{ϑ}$ do
12:	$v_{j} = \frac{1}{n_{x}} \sum_{k = 1}^{n_{x}} ψ (y, ϑ_{j} x_{k})$
13:	$v a l u e = v a l u e + v_{j} \cdot p (ϑ_{j} \| \hat{\vec{X}})$
14:	$ψ_{ODA} (y, η \| \hat{\vec{X}}) = v a l u e$

		$G a m m a (0.1, 15)$			$G a m m a (0.3, 5)$			$G a m m a (0.5, 3.0)$			$G a m m a (1.0, 1.5)$
		$Cv [X] = 3.16$			$Cv [X] = 1.83$			$Cv [X] = 1.41$			$Cv [X] = 1.00$
		Ave	Stdev	Min	Ave	Stdev	Min	Ave	Stdev	Min	Ave	Stdev	Min
$n = 5$	ϕ_PTO	−1.25	8.79	−201.27	0.37	1.94	−65.16	0.69	0.86	−14.49	0.90	0.41	−5.35
Known $f_{Z}$ and k
	ϕ_OS	0.42	0.37	0.00	0.73	0.32	0.00	0.86	0.26	0.01	0.97	0.17	0.17
	ϕ_SCA	0.37	0.61	−15.16	0.68	0.53	−18.77	0.81	0.35	−5.22	0.93	0.25	−2.65
Known $f_{Z}$ and unknown k
	ϕ_OS:M	0.24	2.08	−81.56	0.68	0.62	−20.59	0.81	0.42	−7.33	0.93	0.26	−4.30
	ϕ_OS:LH	0.39	1.10	−83.54	0.68	0.55	−19.22	0.81	0.39	−7.21	0.93	0.26	−2.88
	ϕ_OS:GG	0.39	1.14	−83.81	0.69	0.56	−20.80	0.82	0.40	−7.31	0.93	0.26	−2.91
	$ϕ_{OS : GG - d}$	0.37	0.69	−55.97	0.65	0.36	−7.56	0.76	0.30	−3.28	0.88	0.22	−2.41
Unknown $f_{Z}$ and k
	ϕ_SCA	0.03	3.04	−81.75	0.62	0.95	−43.87	0.79	0.50	−7.41	0.92	0.29	−4.34
$n = 10$	ϕ_PTO	0.02	3.18	−84.13	0.76	0.70	−10.43	0.90	0.40	−9.14	1.01	0.18	−2.77
Known $f_{Z}$ and k
	ϕ_OS	0.62	0.35	0.00	0.89	0.23	0.03	0.98	0.17	0.18	1.04	0.10	0.41
	ϕ_SCA	0.57	0.57	−15.57	0.85	0.32	−4.09	0.94	0.24	−5.09	1.02	0.14	−1.76
Unknown $f_{Z}$ and unknown k
	ϕ_OS:M	0.61	0.38	−6.53	0.87	0.27	−7.11	0.95	0.21	−3.35	1.03	0.13	−1.84
	ϕ_OS:LH	0.62	0.38	−6.47	0.87	0.28	−6.71	0.95	0.21	−3.36	1.03	0.13	−1.85
	ϕ_OS:GG	0.59	0.36	−5.45	0.86	0.25	−4.29	0.95	0.19	−3.16	1.02	0.12	−0.64
	$ϕ_{OS : GG - d}$	0.57	0.73	−17.82	0.86	0.34	−6.49	0.95	0.22	−3.38	1.02	0.13	−1.77
Unknown $f_{Z}$ and k
	ϕ_SCA	0.48	1.20	−30.50	0.83	0.42	−7.58	0.93	0.27	−4.59	1.02	0.15	−2.18
$n = 20$	ϕ_PTO	0.57	1.20	−30.50	0.94	0.32	−5.98	1.01	0.18	−2.30	1.07	0.09	−0.22
Known $f_{Z}$ and k
	ϕ_OS	0.80	0.29	0.01	1.00	0.15	0.17	1.04	0.10	0.32	1.08	0.06	0.58
	ϕ_SCA	0.75	0.42	−10.40	0.99	0.21	−3.42	1.02	0.14	−1.39	1.07	0.07	0.02
Known $f_{Z}$ and unknown k
	ϕ_OS:M	0.77	0.40	−8.44	0.97	0.18	−2.67	1.03	0.13	−0.53	1.07	0.07	−0.14
	ϕ_OS:LH	0.79	0.29	−1.77	0.99	0.16	−0.64	1.03	0.12	−0.43	1.08	0.07	−0.09
	ϕ_OS:GG	0.80	0.29	−1.75	0.99	0.16	−0.46	1.03	0.12	−0.49	1.08	0.07	−0.11
	$ϕ_{OS : GG - d}$	0.73	0.57	−12.15	0.96	0.22	−3.92	1.02	0.14	−1.58	1.07	0.08	−0.14
Unknown $f_{Z}$ and k
	ϕ_SCA	0.79	0.29	−1.25	0.99	0.16	−0.30	1.03	0.11	−0.40	1.07	0.06	−0.10