Time-varying decision boundaries: insights from optimality analysis

Psychon Bull Rev (2018) 25:971–996. DOI 10.3758/s13423-017-1340-6

THEORETICAL REVIEW

Gaurav Malhotra (1) · David S. Leslie (2) · Casimir J. H. Ludwig (1) · Rafal Bogacz (3)

Published online: 20 July 2017. © The Author(s) 2017. This article is an open access publication.

Correspondence: Gaurav Malhotra, gaurav.malhotra@bristol.ac.uk
(1) School of Experimental Psychology, University of Bristol, 12a Priory Road, Bristol BS8 1TU, UK
(2) Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
(3) MRC Brain Networks Dynamics Unit, University of Oxford, Oxford, UK

Abstract  The most widely used account of decision-making proposes that people choose between alternatives by accumulating evidence in favor of each alternative until this evidence reaches a decision boundary. It is frequently assumed that this decision boundary stays constant during a decision, depending on the evidence collected but not on time. Recent experimental and theoretical work has challenged this assumption, showing that constant decision boundaries are, in some circumstances, sub-optimal. We introduce a theoretical model that facilitates identification of the optimal decision boundaries under a wide range of conditions. Time-varying optimal decision boundaries for our model are a result only of uncertainty over the difficulty of each trial and do not require decision deadlines or costs associated with collecting evidence, as assumed by previous authors. Furthermore, the shape of optimal decision boundaries depends on the difficulties of different decisions. When some trials are very difficult, optimal boundaries decrease with time, but for tasks that only include a mixture of easy and medium difficulty trials, the optimal boundaries increase or stay constant. We also show how this simple model can be extended to more complex decision-making tasks, such as when people have unequal priors or when they can choose to opt out of decisions. The theoretical model presented here provides an important framework to understand how, why, and whether decision boundaries should change over time in experiments on decision-making.

Keywords  Decision-making · Decreasing bounds · Optimal decisions · Reward rate

Introduction

In many environmental settings, people frequently come across decision-making problems where the speed of making decisions trades off with their accuracy. Consider, for example, the following problem: a financial advisor is employed by a firm to make buy/sell recommendations on their portfolio of assets. All assets seem identical, but the value of some assets is stochastically rising while the value of others is falling. For each correct recommendation (the advisor recommends buy and the asset turns out to be rising, or vice-versa), the advisor receives a fixed commission, and for each incorrect recommendation (the advisor recommends buy and the asset turns out to be falling, or vice-versa) they pay a fixed penalty. In order to make these recommendations, the advisor examines the assets sequentially and observes how the value of each asset develops over time. Each observation takes a finite amount of time and shows whether the value of the asset has gone up or down over this time. Before recommending whether the firm should buy or sell the asset, the advisor can make as many of these up/down observations as they like. However, there is an opportunity cost of time, as the advisor wants to maximize the commission every month by making as many correct recommendations as possible.
How many (up/down) observations should the advisor make for each asset before giving a recommendation?

Sequential decision problems

The type of problem described above is at the heart of sequential analysis and has been investigated by researchers from Bernoulli (1713) and Laplace (1774, 1812) to modern-day statisticians (for a review, see Ghosh, 1991). This problem is also directly relevant to the psychology and neuroscience of decision-making. Many decision-making problems, including perceptual decisions (how long to sample sensory information before choosing an option) and foraging problems (how long to forage at the current patch before moving to the next patch), can be described in the form above. The decision-maker has to make a series of choices and the information needed to make these choices is spread out over time. The decision-maker wants to maximize their earnings by attempting as many decisions as possible in the allocated time. Sampling more information (up/down observations) allows them to be more accurate in their choices, at the expense of the number of decision problems that can be attempted. Therefore, the speed of decisions trades off with their accuracy, and the decision-maker must solve (i) the stopping problem, i.e., decide how much information to sample before indicating their decision, and (ii) the decision problem, i.e., which alternative to choose, in such a way that they are able to maximize their earnings.

The stopping problem was given a beautifully simple solution by Wald (1945b), who proposed the following sequential procedure: after each sample (up/down observation), compute the likelihood ratio, λ_n, of the samples (X_1, ..., X_n) and choose the first alternative (buy) if λ_n ≥ A and the second alternative (sell) if λ_n ≤ B; otherwise continue sampling for n = 1, 2, ..., where A and B are two suitably chosen constants. This procedure was given the name of the sequential probability ratio test (SPRT). Wald (1945a, b) and Wald and Wolfowitz (1948) showed that the SPRT is optimal in the sense that it can guarantee a required level of accuracy (both Type 1 and Type 2 errors are bounded) with a minimum average sample size (number of up/down observations made).
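Wald's stopping rule is compact enough to state in code. The sketch below is a minimal illustration of the SPRT for the advisor's binary up/down observations, assuming a rising asset has up-probability 0.5 + ε and a falling one 0.5 − ε; the drift value, the nominal error rates used to set the thresholds A and B (via Wald's standard approximations), and the function names are illustrative choices, not values taken from this article.

```python
import math
import random

def sprt(observations, eps=0.2, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli 'up' probability of 0.5+eps (rising)
    versus 0.5-eps (falling). Returns ('buy'/'sell'/None, samples used).
    Thresholds use Wald's approximations A ~ (1-beta)/alpha, B ~ beta/(1-alpha)."""
    log_a = math.log((1 - beta) / alpha)        # accept 'rising' when logLR >= log_a
    log_b = math.log(beta / (1 - alpha))        # accept 'falling' when logLR <= log_b
    step_up = math.log((0.5 + eps) / (0.5 - eps))   # logLR increment for one 'up'
    log_lr = 0.0
    for n, obs in enumerate(observations, start=1):
        log_lr += step_up if obs == 1 else -step_up
        if log_lr >= log_a:
            return "buy", n
        if log_lr <= log_b:
            return "sell", n
    return None, len(observations)              # ran out of samples without deciding

# Example: a rising asset (u = 0.7) observed for up to 100 time steps.
random.seed(1)
samples = [1 if random.random() < 0.7 else -1 for _ in range(100)]
print(sprt(samples))
```

Because every observation changes the log-likelihood ratio by the same fixed amount, thresholding λ_n here is equivalent to thresholding the difference between the numbers of up and down observations, which is the form in which evidence is counted throughout the rest of the article.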
This sequential procedure of continuing to sample evidence until a decision variable (the likelihood ratio for SPRT) has crossed a fixed threshold also forms the basis for the most widely used psychological account of decision-making. This account consists of a family of models, which are collectively referred to as sequential sampling models (Stone, 1960; LaBerge, 1962; Laming, 1968; Link & Heath, 1975; Vickers, 1970; Ratcliff, 1978) and have been applied to a range of decision tasks over the last 50 years (for reviews, see Ratcliff & Smith, 2004; Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006). Like the SPRT, sequential sampling models propose that decision-makers solve the stopping problem by accumulating evidence in favor of each alternative until this evidence crosses a decision boundary.[1] Also like the SPRT, the standard sequential sampling account assumes that this decision boundary remains constant during a decision. In fact, Bogacz et al. (2006) showed that, under certain assumptions, including the assumption that all decisions in a sequence are of the same difficulty, the decision-maker can maximize their reward rate by employing the SPRT and maintaining an appropriately chosen threshold that remains constant within and across trials. In the above example, this means that if the financial advisor chose the stopping criterion "stop sampling if you observe three more ups than downs (or vice-versa)", they would stick with this criterion irrespective of whether they have observed ten values of an asset or a hundred.

[1] Throughout this article, we use 'threshold' to refer to a decision boundary that remains constant within and across trials.

A number of recent studies have challenged this account from both an empirical and a theoretical perspective, arguing that in many situations decision-makers decrease the decision boundary with time and that it is optimal for them to do so (Drugowitsch, Moreno-Bote, Churchland, Shadlen, & Pouget, 2012; Huang & Rao, 2013; Thura, Cos, Trung, & Cisek, 2014; Moran, 2015). The intuition behind these studies is that instead of determining the decision boundaries by minimizing the average sample size at a desired level of accuracy (as some formulations of SPRT do), decision-makers may want to maximize the expected reward earned per unit time, i.e., the reward rate. Psychological studies and theories of decision-making generally give little consideration to the reward structure of the environment. Participants are assumed to trade off accuracy and reaction time in some manner that is consistent with the (typically vague) experimenter instructions (e.g., "try to be as fast and accurate as possible"). Models integrating to a fixed threshold often work well in these situations, giving good accounts of participants' accuracy and reaction time distributions. However, it has been shown that models integrating to a fixed threshold lead to a sub-optimal reward rate in heterogeneous environments, i.e., when decisions vary in difficulty (Moran, 2015). This leads to the natural question: how should the decision-maker change their decision boundary with time if their aim is to maximize the reward rate?

Optimal decision boundaries in sequential decision problems

A number of models have been used to compute the optimal decision boundaries in sequential decision-making. These models differ in (a) how the decision problem is formulated, and (b) whether the decision boundary is assumed to be fixed across trials or to vary from one trial to the next.
Rapoport and Burkheimer (1971) modeled the deferred decision-making task (Pitz, Reinhold, & Geller, 1969), in which the maximum number of observations is fixed in advance (and known to the observer) and making each observation carries a cost. There was also a fixed cost for incorrect decisions and no cost for correct decisions. Rapoport and Burkheimer used dynamic programming (Bellman, 1957; Pollock, 1964) to compute the policy that minimized the expected loss and found that the optimal boundary collapsed as the number of observations remaining in a trial decreased. Busemeyer and Rapoport (1988) found that, in such a deferred decision-making task, though people did not appear to follow the optimal policy, they did seem to vary their decision boundary as a function of the number of remaining observations.

A similar problem was considered by Frazier and Yu (2007), but instead of assuming that the maximum number of observations was fixed, they assumed that this number was drawn from a known distribution and that there was a fixed cost for crossing this stochastic deadline. Like Rapoport and Burkheimer (1971), Frazier and Yu showed that under the pressure of an approaching deadline the optimal policy is to have a monotonically decreasing decision boundary, and that the slope of the boundaries increased with a decrease in the mean deadline and an increase in its variability.

Two recent studies analyzed optimal boundaries for a decision-making problem that does not constrain the maximum number of observations. Drugowitsch et al. (2012) considered a very general problem where the difficulty of each decision in a sequence is drawn from a Gaussian or a general symmetric point-wise prior distribution and accumulating evidence comes at a cost for each observation. Using the principle of optimality (Bellman, 1957), Drugowitsch et al. showed that under these conditions the reward rate is maximized if the decision-maker reduces their decision boundaries with time. Similarly, Huang and Rao (2013) used the framework of partially observable Markov decision processes (POMDPs) to show that expected future reward is maximized if the decision-maker reduces the decision boundary with time.

In contrast to the dynamic programming models mentioned above, Deneve (2012) considered an alternative theoretical approach to computing decision boundaries. Instead of assuming that decision boundaries are fixed (though time-dependent) on each trial, Deneve (2012) proposed that the decision boundary is set dynamically on each trial based on an estimate of the trial's reliability. This reliability is used to obtain an on-line estimate of the signal-to-noise ratio of the sensory input and to update the decision boundary. By simulating the model, Deneve found that decision boundaries maximize the reward rate if they decrease during difficult trials but increase during easy trials.
The present analysis

The principal aim of this paper is to identify the minimal conditions needed for time-varying decision boundaries, under the assumption that the decision-maker is trying to maximize the reward rate. We will develop a generic procedure that enables identification of the optimal decision boundaries for any discrete, sequential decision problem of the form described at the beginning of this article. In contrast to the problems considered by Rapoport and Burkheimer (1971) and Frazier and Yu (2007), we will show that the pressure of an approaching deadline is not essential for a decrease in decision boundaries.

In contrast to Drugowitsch et al. (2012), we do not assume any explicit cost for making observations and show that optimal boundaries may decrease even when making observations carries no explicit cost. Furthermore, unlike the very general setups of Drugowitsch et al. (2012) and Huang and Rao (2013), we make several simplifying assumptions in order to identify how the shape of optimal decision boundaries changes with the constituent difficulties of the task. In particular, in the initial exposition of the model, we restrict the difficulty of each decision to one of two possible levels (though see the Discussion for a simple extension to more than two difficulties). In doing so, we reveal three key results: (i) optimal boundaries must decrease to zero if the mixture of difficulties involves some trials that are uninformative; (ii) the shape of optimal boundaries depends on the inter-trial interval for incorrect decisions but not on that for correct decisions (provided the latter is smaller); and (iii) there are conditions under which the optimal decision boundaries increase (rather than decrease) with time within a trial. In fact, we show that optimal decision boundaries decrease only under a very specific set of conditions. This analysis particularly informs the ongoing debate on whether people and primates decrease their decision boundaries, which has focused on analyzing data from existing studies to infer evidence of decreasing boundaries (e.g., Hawkins, Forstmann, Wagenmakers, Ratcliff, & Brown, 2015; Voskuilen, Ratcliff, & Smith, 2016). The evidence on this point is mixed. Our study suggests that such inconsistent evidence may be due to the way decision difficulties in an experiment are mixed, as well as to how the reward structure of the experiment is defined.

Next, we extend this analysis to two situations which are of theoretical and empirical interest: (i) What is the influence of prior beliefs about the different decision alternatives on the shape of the decision boundaries? (ii) What is the optimal decision-making policy when it is possible to opt out of a decision and forego a reward, but be spared the larger penalty associated with an incorrect choice? In each case, we link our results to existing empirical research. When the decision-maker has unequal prior beliefs about the outcome of the decision, our computations show that the optimal decision-maker should dynamically adjust the contribution of the prior to each observation during the course of a trial. This is in line with the dynamic prior model developed by Hanks, Mazurek, Kiani, Hopp and Shadlen (2011) but contrasts with the results observed by Summerfield and Koechlin (2010) and Mulder, Wagenmakers, Ratcliff, Boekel and Forstmann (2012). Similarly, when it is possible to opt out of a decision, the optimal decision-making policy shows that the decision-maker should choose this option only when decisions involve more than one difficulty (i.e., when the decision-maker is uncertain about the difficulty of a decision), and only when the benefit of choosing this option is carefully calibrated.
A theoretical model for optimal boundaries

Problem definition

We now describe a Markov decision process to model the stopping problem described at the beginning of this article. We consider the simplest possible case of this problem, where we: (i) restrict the number of choice alternatives to two (buy or sell), (ii) assume that observations are made at discrete (and constant) intervals, (iii) assume that observations consist of binary outcomes (up or down transitions), and (iv) restrict the difficulty of each decision to one of two possible levels (assets could be rising, or falling, at one of two different rates).

The decision-maker faces repeated decision-making opportunities (trials). On each trial the world is in one of two possible states (the asset is rising or falling), but the decision-maker does not know which at the start of the trial. At a series of time steps t = 1, 2, 3, ... the decision-maker can choose to wait and accumulate evidence (observe whether the value of the asset goes up or down). Once the decision-maker feels sufficient evidence has been gained, they can choose to go, and decide either buy or sell. If the decision is correct (the advisor recommends buy and the asset is rising, or recommends sell and the asset is falling), they receive a reward. If the decision is incorrect, they receive a penalty. Under both outcomes the decision-maker then faces a delay before starting the next trial. If we assume that the decision-maker will undertake multiple trials, it is reasonable that they will aim to maximize their average reward per unit time. A behavioral policy which achieves the optimal reward per unit time will be found using average reward dynamic programming (Howard, 1960; Ross, 1983; Puterman, 2005).

We formalize the task as follows. Let t = 1, 2, ... be discrete points of time during a trial, and let X_t denote the evidence accumulated by the decision-maker at those points in time. The decision-maker's state in a trial is given by the pair (t, X). Note that, in contrast to previous accounts that use dynamic programming to establish optimal decision boundaries (e.g., Drugowitsch et al., 2012; Huang & Rao, 2013), we compute optimal policies directly in terms of evidence and time, rather than (posterior) belief and time. The reasons for doing so are elaborated in the Discussion. In any state, (t, X), the decision-maker can take one of two actions: (i) wait and accumulate more evidence (observe whether the asset value goes up or down), or (ii) go and choose the more likely alternative (buy or sell).

If action wait is chosen, the decision-maker observes the outcome of a binary random variable, δX, where P(δX = 1) = u = 1 − P(δX = −1). The up-probability, u, depends on the state of the world. We assume throughout that u ≥ 0.5 if the true state of the world is rising, and u ≤ 0.5 if the true state is falling. The parameter u also determines the trial difficulty. When u is equal to 0.5, the probability of each outcome is the same (equal probability of the asset value going up or down); consequently, observing an outcome is like flipping an unbiased coin, providing the decision-maker absolutely no evidence about which hypothesis is correct. On the other hand, if u is close to 1 or 0 (the asset value almost always goes up, or almost always goes down), observing an outcome provides a large amount of evidence about the correct hypothesis, making the trial easy. After observing δX, the decision-maker transitions to a new state (t + 1, X + δX), as a result of the progression of time and the accumulation of the new evidence δX. Since the decision-maker does not know the state of the world, and consequently does not know u, the distribution over the possible successor states (t + 1, X ± 1) is non-trivial and is calculated below. In the most general formulation of the model, an instantaneous cost (or reward) would be obtained on making an observation, but throughout this article we assume that rewards and costs are only obtained when the decision-maker selects a go action. Thus, in contrast to some approaches (e.g., Drugowitsch et al., 2012), the cost of making an observation is 0.

If action go is chosen, the decision-maker transitions to one of two special states, C or I, depending on whether the decision made after the go action is correct or incorrect. As with transitions under wait, the probability that the decision is correct depends in a non-trivial way on the current state, and is calculated below. From the states C and I there is no action to take, and the decision-maker transitions to the initial state (t, X) = (0, 0). From state C the decision-maker receives a reward R_C and suffers a delay of D_C; from state I they receive a reward (penalty) of R_I and suffer a delay of D_I.
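Before any optimization, it helps to collect the givens of this problem, the possible up-probabilities with their prior, the rewards and delays attached to C and I, and the horizon, in one place. The sketch below is only a convenience for later illustrations; the field names and the particular numbers (rewards of 1 and 0, delays of 150, a horizon of 70, and the drifts used in the figures) are assumptions for illustration rather than values fixed by the model.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Task:
    """Givens of the wait/go decision problem: possible up-probabilities and
    their prior, rewards for the C and I states, post-decision delays, and the
    horizon at which a trial is forcibly ended."""
    prior: Dict[float, float]     # P(U = u) for each possible up-probability u
    R_C: float = 1.0              # reward on reaching the Correct state
    R_I: float = 0.0              # reward (penalty) on reaching the Incorrect state
    D_C: float = 150.0            # delay after a correct decision
    D_I: float = 150.0            # delay after an incorrect decision
    t_max: int = 70               # forced transition to I after this time step

# A mixed difficulty environment like the one used for Fig. 3c: easy drift 0.20,
# zero-drift difficult trials, easy/difficult and rising/falling equally likely.
mixed_task = Task(prior={0.30: 0.25, 0.50: 0.50, 0.70: 0.25})
```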
In much of the theoretical literature on sequential sampling models it is assumed, perhaps implicitly, that the decision-maker knows the difficulty level of a trial. This corresponds to knowledge that the up-probability of an observation is u = 0.5 + ε when the true state is rising, and u = 0.5 − ε when the true state is falling. However, in ecologically realistic situations, the decision-maker may not know the difficulty level of the trial in advance. This can be modeled by assuming that the task on a particular trial is chosen from several different difficulties. In the example above, it could be that up/down observations come from different sources and some sources are noisier than others. To illustrate the simplest conditions resulting in varying decision boundaries, we model the situation where there are only two sources of observations: an easy source with u ∈ U_e = {1/2 − ε_e, 1/2 + ε_e} and a difficult source with u ∈ U_d = {1/2 − ε_d, 1/2 + ε_d}, where ε_e, ε_d ∈ [0, 1/2] are the drifts of the easy and difficult stimuli, with ε_d < ε_e. Thus, during a difficult trial u is close to 0.5, while for an easy trial u is close to 0 or 1. We assume that these two types of tasks can be mixed in any fraction, with P(U ∈ U_e) the probability that the randomly selected drift corresponds to an easy task in the perceptual environment. For now, we assume that within both U_e and U_d, u is equally likely to be above or below 0.5, i.e., there is equal probability of the asset rising and falling. In the section titled "Extensions of the model" below, we will show how our results generalize to the situation of unequal prior beliefs about the state of the world.

Figure 1a depicts evidence accumulation as a random walk in two-dimensional space, with time along the x-axis and the evidence accumulated, X_1, ..., X_t, based on the series of outcomes +1, +1, −1, +1, along the y-axis. The figure shows both the current state of the decision-maker at (t, X_t) = (4, 2) and their trajectory in this state-space. In this current state, the decision-maker has two available actions: wait or go. As long as they choose to wait, they will make a transition to either (5, 3) or (5, 1), depending on whether the next δX outcome is +1 or −1. Figure 1b shows the transition diagram for the stochastic decision process that corresponds to the random walk in Fig. 1a once the go action is introduced. Transitions under go take the decision-maker to one of the states C or I, and subsequently back to (0, 0) for the next trial.

Fig. 1  (a) Evidence accumulation as a random walk: gray lines show the current trajectory and black lines show possible trajectories if the decision-maker chooses to wait. (b) Evidence accumulation and decision-making as a Markov decision process: transitions associated with the action go are shown as dashed lines, and transitions associated with wait as solid lines. The rewarded and unrewarded states are shown as C and I, respectively (for Correct and Incorrect).

Our formulation of the decision-making problem has stochastic state transitions, decisions available at each state, and transitions from any state (t, X) that depend only on the current state and the selected action. It is therefore a Markov decision process (MDP) (Howard, 1960; Puterman, 2005), with states (t, x) and the two dummy states C and I corresponding to the correct and incorrect choice. A policy is a mapping from states (t, x) of this MDP to wait/go actions. An optimal policy that maximizes the average reward per unit time in this MDP can be determined by using the policy iteration algorithm (Howard, 1960; Puterman, 2005). A key component of this algorithm is to calculate the average expected reward per unit time for fixed candidate policies. To do so, we must first determine the state-transition probabilities under either action (wait/go) from each state for a given set of drifts (Eqs. 6 and 7 below). These state-transition probabilities can then be used to compare the wait and go actions in any given state using the expected reward under each action in that state.
Computing state-transition probabilities

Computing the transition probabilities is trivial if one knows the up-probability, u, of the process generating the outcomes: the probability of transitioning from (t, x) to (t + 1, x + 1) is u, and to (t + 1, x − 1) is 1 − u. However, when each trial is of an unknown level of difficulty, the observed outcomes (up/down) during a particular decision provide information not only about the correct final choice but also about the difficulty of the current trial. Thus, the current state provides information about the likely next state under a wait action, through information about the up-probability, u. Therefore, the key step in determining the transition probabilities is to infer the up-probability, u, based on the current state, and to use this to compute the transition probabilities.

As already specified, we model a task that has trials drawn from two difficulties (it is straightforward to generalize to more than two difficulties): easy trials with u in the set U_e = {1/2 − ε_e, 1/2 + ε_e} and difficult trials with u in the set U_d = {1/2 − ε_d, 1/2 + ε_d} (note that this does not preclude a zero-drift condition, ε_d = 0). To determine the transition probabilities under the action wait, we must marginalize over the set of all possible drifts, U = U_e ∪ U_d:

  p^wait_(t,x)→(t+1,x+1) = P(X_{t+1} = x+1 | X_t = x) = Σ_{u∈U} P(X_{t+1} = x+1 | X_t = x, U = u) · P(U = u | X_t = x),
  p^wait_(t,x)→(t+1,x−1) = 1 − p^wait_(t,x)→(t+1,x+1)     (1)

where U is the (unobserved) up-probability of the current trial. P(X_{t+1} = x+1 | X_t = x, U = u) is the probability that δX = 1 conditional on X_t = x and the up-probability being u; this is simply u (the current evidence level X_t is irrelevant when we also condition on U = u). All that remains is to calculate the term P(U = u | X_t = x).

This posterior probability of U = u at the current state can be inferred using Bayes' law:

  P(U = u | X_t = x) = [P(X_t = x | U = u) · P(U = u)] / [Σ_{ũ∈U} P(X_t = x | U = ũ) · P(U = ũ)]     (2)

where P(U = u) is the prior probability of the up-probability being equal to u. The likelihood term, P(X_t = x | U = u), can be calculated by summing the probabilities of all paths that would result in state (t, x). We use the standard observation about random walks that each of the n_paths paths that reach (t, x) contains (t + x)/2 upward transitions and (t − x)/2 downward transitions. Thus, the likelihood is given by the summation over paths of the probability of seeing this number of upward and downward moves:

  P(X_t = x | U = u) = Σ_paths u^((t+x)/2) (1 − u)^((t−x)/2) = n_paths · u^((t+x)/2) (1 − u)^((t−x)/2)     (3)

Here n_paths is the number of paths from state (0, 0) to state (t, x), which may depend on the current decision-making policy. Plugging the likelihood into Eq. 2 gives

  P(U = u | X_t = x) = [n_paths · u^((t+x)/2) (1 − u)^((t−x)/2) · P(U = u)] / [Σ_{ũ∈U} n_paths · ũ^((t+x)/2) (1 − ũ)^((t−x)/2) · P(U = ũ)]     (4)

Some paths from (0, 0) to (t, x) would have resulted in a decision to go (based on the decision-making policy) and therefore could not actually have resulted in the state (t, x). Note, however, that the number of paths n_paths is identical in both numerator and denominator, so it can be cancelled:

  P(U = u | X_t = x) = [u^((t+x)/2) (1 − u)^((t−x)/2) · P(U = u)] / [Σ_{ũ∈U} ũ^((t+x)/2) (1 − ũ)^((t−x)/2) · P(U = ũ)]     (5)

Using Eq. 1, the transition probabilities under the action wait can therefore be summarized as:

  p^wait_(t,x)→(t+1,x+1) = Σ_{u∈U} u · P(U = u | X_t = x) = 1 − p^wait_(t,x)→(t+1,x−1)     (6)

where the term P(U = u | X_t = x) is given by Eq. 5. Equation 6 gives the decision-maker the probability of an increase or decrease in evidence in the next time step if they choose to wait.
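Equations 5 and 6 translate directly into a few lines of code. The sketch below evaluates the posterior over the up-probability and the resulting wait transition probability for an illustrative mixture (easy drift ε_e = 0.20, zero-drift difficult trials, equal priors); these parameter choices and the function names are assumptions, not prescriptions from the article.

```python
import numpy as np

# Possible up-probabilities and their prior: easy trials (u = 0.30 or 0.70)
# and difficult zero-drift trials (u = 0.50), mixed in equal proportion.
U = np.array([0.30, 0.50, 0.70])
PRIOR = np.array([0.25, 0.50, 0.25])

def posterior_u(t, x):
    """P(U = u | X_t = x), Eq. 5: binomial likelihood times prior;
    the path-count term cancels between numerator and denominator."""
    n_up, n_down = (t + x) / 2, (t - x) / 2
    post = U**n_up * (1.0 - U)**n_down * PRIOR
    return post / post.sum()

def p_wait_up(t, x):
    """Probability that the evidence goes up on the next step (Eq. 6)."""
    return float(np.sum(U * posterior_u(t, x)))

# After 10 samples with 4 more ups than downs, the posterior starts to favor
# the easy rising source and an up becomes more likely than a down.
print(posterior_u(10, 4))   # approx. [0.018, 0.46, 0.52]
print(p_wait_up(10, 4))     # approx. 0.60
```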
Similarly, we can work out the state-transition probabilities under the action go. Under this action, the decision-maker makes a transition to either the correct or the incorrect state. The decision-maker will transition to the Correct state if they choose buy and the true state of the world is rising, i.e., the true u is in U_+ = {1/2 + ε_e, 1/2 + ε_d}, or if they choose sell and the true state of the world is falling, i.e., the true u is in U_− = {1/2 − ε_e, 1/2 − ε_d} (assuming ε_d > 0; see the end of this section for how to handle ε_d = 0).

The decision-maker will choose the more likely alternative: they compare the probability of the unobserved drift U coming from the set U_+ versus coming from the set U_−, given the data observed so far. The decision-maker will respond buy when P(U ∈ U_+ | X_t = x) > P(U ∈ U_− | X_t = x) and respond sell when P(U ∈ U_+ | X_t = x) < P(U ∈ U_− | X_t = x). The probability of these decisions being correct is simply the probability of the true state being rising or falling, respectively, given the information observed so far. Thus, when P(U ∈ U_+ | X_t = x) > P(U ∈ U_− | X_t = x) the probability of a correct decision is P(U ∈ U_+ | X_t = x), and when P(U ∈ U_+ | X_t = x) < P(U ∈ U_− | X_t = x) the probability of a correct answer is P(U ∈ U_− | X_t = x); overall, the probability of being correct is the larger of P(U ∈ U_+ | X_t = x) and P(U ∈ U_− | X_t = x), meaning that the state-transition probabilities for the optimal decision-maker for the action go in state (t, x) are:

  p^go_(t,x)→C = max{ P(U ∈ U_+ | X_t = x), P(U ∈ U_− | X_t = x) },
  p^go_(t,x)→I = 1 − p^go_(t,x)→C     (7)

Assuming that the prior probability of each state of the world is the same, i.e., P(U ∈ U_+) = P(U ∈ U_−),[2] the posterior probabilities satisfy P(U ∈ U_+ | X_t = x) > P(U ∈ U_− | X_t = x) if and only if the likelihoods satisfy P(X_t = x | U ∈ U_+) > P(X_t = x | U ∈ U_−). In turn, this inequality in the likelihoods holds if and only if x > 0. Thus, in this situation of equal prior probabilities, the optimal decision-maker will select buy if x > 0 and sell if x < 0, so that the transition probability p^go_(t,x)→C is equal to P(U ∈ U_+ | X_t = x) when x > 0 and P(U ∈ U_− | X_t = x) when x < 0.

Note that when ε_d = 0, a situation which we study below, the sets U_+ and U_− intersect, with 1/2 being a member of both. This corresponds to the difficult trials having an up-probability of 1/2 whether the true state of the world is rising or falling. Therefore, in the calculations above, we need to replace P(U ∈ U_+ | X_t = x) in the calculation of the transition probability p^go_(t,x)→C with P(U = 1/2 + ε_e | X_t = x) + (1/2) P(U = 1/2 | X_t = x), and P(U ∈ U_− | X_t = x) with P(U = 1/2 − ε_e | X_t = x) + (1/2) P(U = 1/2 | X_t = x).

[2] We will revise this assumption in the section titled "Extensions of the model" below.
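The go transition probability of Eq. 7, including the even split of the zero-drift posterior mass between rising and falling described above, can be computed in the same way. The parameters below are the same illustrative assumptions as in the previous sketch.

```python
import numpy as np

U = np.array([0.30, 0.50, 0.70])      # falling (easy), zero-drift, rising (easy)
PRIOR = np.array([0.25, 0.50, 0.25])

def posterior_u(t, x):
    """P(U = u | X_t = x), Eq. 5."""
    post = U**((t + x) / 2) * (1.0 - U)**((t - x) / 2) * PRIOR
    return post / post.sum()

def p_go_correct(t, x):
    """Probability of transitioning to C under 'go' (Eq. 7). The posterior mass
    on u = 0.5 is split equally between the rising and falling hypotheses."""
    post = posterior_u(t, x)
    p_rising = post[U > 0.5].sum() + 0.5 * post[U == 0.5].sum()
    p_falling = post[U < 0.5].sum() + 0.5 * post[U == 0.5].sum()
    return float(max(p_rising, p_falling))

# Guessing immediately is a coin flip; after 10 samples with x = 4 it is much better.
print(p_go_correct(0, 0))    # 0.5
print(p_go_correct(10, 4))   # approx. 0.75
```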
Finding optimal actions

In order to find the optimal policy, a dynamic programming procedure called policy iteration is used. The remainder of this section provides a sketch of this standard procedure as applied to the model we have constructed. For a more detailed account, the reader is directed towards standard texts on stochastic dynamic programming such as Howard (1960), Ross (1983) and Puterman (2005). The technique searches for the optimal policy amongst the set of all policies by iteratively computing the expected returns for all states under a given policy (step 1) and then improving the policy based on these expected returns (step 2).

Step 1: Compute values of states for a given π

To begin, assume that we have a current policy, π, which maps states to actions and which may not be the optimal policy. Observe that fixing the policy reduces the Markov decision process to a Markov chain. If this Markov chain is allowed to run for a long period of time, it will return an average reward ρ^π per unit time, independently of the initial state (Howard, 1960; Ross, 1983).[3] However, the short-run expected earnings of the system will depend on the current state, so that each state, (t, x), can be associated with a relative value, v^π_(t,x), that quantifies the relative advantage of being in state (t, x) under policy π.

Following the standard results of Howard (1960), the relative value of a state, v^π_(t,x), is the expected value over successor states of the following three components: (i) the instantaneous reward for making the transition, (ii) the relative value of the successor state, and (iii) a penalty term equal to the length of the delay incurred by the transition multiplied by the average reward per unit time. From a state (t, x), under action wait, the possible successor states are (t + 1, x + 1) and (t + 1, x − 1), with transition probabilities given by Eq. 6; under action go, the possible successor states are C and I, with transition probabilities given by Eq. 7; the delay for all of these transitions is one time step, and no instantaneous reward is received. Both C and I transition directly to (0, 0), with reward R_C or R_I and delay D_C or D_I, respectively. The general dynamic programming equations therefore reduce to the following:

  v^π_(t,x) = p^wait_(t,x)→(t+1,x+1) · v^π_(t+1,x+1) + p^wait_(t,x)→(t+1,x−1) · v^π_(t+1,x−1) − ρ^π     if π(t, x) = wait
  v^π_(t,x) = p^go_(t,x)→C · v^π_C + p^go_(t,x)→I · v^π_I − ρ^π     if π(t, x) = go
  v^π_C = R_C + v^π_(0,0) − D_C · ρ^π
  v^π_I = R_I + v^π_(0,0) − D_I · ρ^π     (8)

The unknowns of this system are the relative values v^π_(t,x), v^π_C and v^π_I, and the average reward per unit time ρ^π. The system is underconstrained, with one more unknown (ρ^π) than equations. Note also that adding a constant term to all v^π terms produces an alternative solution to the equations. We therefore identify the solution by fixing v^π_(0,0) = 0 and interpreting all other v^π terms as values relative to state (0, 0).

[3] We assume that all policies considered will eventually go, so that the system is ergodic and the limiting state probabilities are independent of the starting state.
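Equation 8 defines ρ^π implicitly, as part of a linear system in the relative values. For checking a candidate policy, ρ^π can also be computed directly as the expected reward per trial divided by the expected trial duration (the renewal-reward identity), which yields the same number for any policy that eventually goes. The sketch below does this by propagating the probability of reaching each state from (0, 0) under the policy; the rewards, delays, horizon and the example threshold policy are illustrative assumptions rather than values from the article.

```python
import numpy as np

U = np.array([0.30, 0.50, 0.70])
PRIOR = np.array([0.25, 0.50, 0.25])
R_C, R_I, D_C, D_I = 1.0, 0.0, 150.0, 150.0
T_MAX = 200

def posterior_u(t, x):
    post = U**((t + x) / 2) * (1.0 - U)**((t - x) / 2) * PRIOR
    return post / post.sum()

def p_wait_up(t, x):
    return float(np.sum(U * posterior_u(t, x)))

def p_go_correct(t, x):
    post = posterior_u(t, x)
    p_rise = post[U > 0.5].sum() + 0.5 * post[U == 0.5].sum()
    return float(max(p_rise, 1.0 - p_rise))

def average_reward(policy):
    """rho for a fixed policy: expected reward per trial over expected trial
    duration, found by propagating the probability of reaching each state
    (t, x) from (0, 0) under the policy; any state still active at T_MAX goes."""
    reach = {(0, 0): 1.0}
    total_reward, total_time = 0.0, 0.0
    for t in range(T_MAX + 1):
        nxt = {}
        for (tt, x), pr in reach.items():
            if policy(tt, x) == "go" or tt == T_MAX:
                pc = p_go_correct(tt, x)
                total_reward += pr * (pc * R_C + (1 - pc) * R_I)
                total_time += pr * (tt + 1 + pc * D_C + (1 - pc) * D_I)
            else:
                up = p_wait_up(tt, x)
                nxt[(tt + 1, x + 1)] = nxt.get((tt + 1, x + 1), 0.0) + pr * up
                nxt[(tt + 1, x - 1)] = nxt.get((tt + 1, x - 1), 0.0) + pr * (1 - up)
        reach = nxt
    return total_reward / total_time

# Guessing immediately: 0.5 expected reward per ~151 time steps, about 0.0033.
print(average_reward(lambda t, x: "go"))
# A constant evidence threshold of |x| = 4 for the same mixture.
print(average_reward(lambda t, x: "go" if abs(x) >= 4 else "wait"))
```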
8 and the given set of up-probabilities (i.e., the difficulties) and inter- identification v = 0, the relative values of the cor- (0,0) π π rect and incorrect states satisfy v = R − D ρ and trial delays. We show below how changing these variables C C π π leads to a change in the predicted optimal policies and how v = R − D ρ . We therefore see the trade-off between I I choosing to wait, receiving no immediate reward and sim- these policies correspond to decision boundaries that may or may not vary with time based on the value of these ply transitioning to a further potentially more profitable variables. state, and choosing go, in which there is a probability of receiving a good reward but a delay will be incurred. It will go Single difficulty only be sensible to choose go if p is sufficiently (t,x)→C high, in comparison to the average reward ρ calculated We began by computing optimal policies for single diffi- under the current policy π. Intuitively, since ρ is the aver- culty tasks. For the example at the beginning of this article, age reward per time step, deciding to go and incur the this means all rising assets go up during an observation delays requires that the expected return from doing so out- ¯ ¯ weighs the expected opportunity cost Dρ (where D is a period with the same probability, + ,and all falling assets suitably weighted average of D and D ). The new policy go up with the probability − . Figure 2 shows opti- C I new π π can be shown to have a better average reward ρ than ρ mal policies for three different tasks with drifts  = 0.45, (Howard, 1960; Puterman, 2005).  = 0.20 and  = 0, respectively. Panel (a) is a task This policy iteration procedure can be initialized with an that consists exclusively of very easy trials, panel (b) con- arbitrary policy and iterates over steps 1 and 2 to improve sists exclusively of moderately difficult trials and panel (c) new the policy. The procedure stops when the policy π is consists exclusively of impossible (zero drift) trials. The unchanged from π, which occurs after a finite number of inter-trial delay in each case was D = D = 150 (that C I iterations, and when it does so it has converged on an opti- is, the inter-trial delay was 150 times as long as the delay mal policy, π . This optimal policy determines the action between two consecutive up/down observations). The state in each state that maximizes the long-run expected average space shown in Fig. 2 is organized according to number of reward per unit time. samples (time) along the horizontal axis and cumulative evi- For computing the optimal policies shown in this article, dence (X ) along the vertical axis. Each square represents a we initialized the policy to one that maps all states to the possible state and the color of the square represents the opti- action go then performed policy iteration until the algorithm mal action for that state, with black squares standing for go converged. The theory above does not put any constraints and light grey squares standing for wait. The white squares on the size of the MDP—the decision-maker can continue are combinations of evidence and time that will never occur Psychon Bull Rev (2018) 25:971–996 979 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 number of samples number of samples number of samples (a) (b) (c) Fig. 
Predicted optimal policies

The theory developed above gives a set of actions (a policy) that optimizes the reward rate. We now use this theory to generate optimal policies for a range of decision problems of the form discussed at the beginning of this article. The transition probabilities and state values computed in Eqs. 6, 7 and 8 are a function of the set of up-probabilities (U) and the inter-trial delays (D_C and D_I). Hence, the predicted policies will also be a function of the given set of up-probabilities (i.e., the difficulties) and inter-trial delays. We show below how changing these variables leads to a change in the predicted optimal policies, and how these policies correspond to decision boundaries that may or may not vary with time depending on the values of these variables.

Single difficulty

We began by computing optimal policies for single difficulty tasks. For the example at the beginning of this article, this means all rising assets go up during an observation period with the same probability, 1/2 + ε, and all falling assets go up with the probability 1/2 − ε. Figure 2 shows optimal policies for three different tasks with drifts ε = 0.45, ε = 0.20 and ε = 0, respectively. Panel (a) is a task that consists exclusively of very easy trials, panel (b) consists exclusively of moderately difficult trials and panel (c) consists exclusively of impossible (zero-drift) trials. The inter-trial delay in each case was D_C = D_I = 150 (that is, the inter-trial delay was 150 times as long as the delay between two consecutive up/down observations). The state space shown in Fig. 2 is organized according to the number of samples (time) along the horizontal axis and cumulative evidence (X_t) along the vertical axis. Each square represents a possible state, and the color of the square represents the optimal action for that state, with black squares standing for go and light grey squares standing for wait. The white squares are combinations of evidence and time that will never occur during a random walk (e.g., (t, x) = (1, 0)) and do not correspond to a state of the MDP.

Fig. 2  Each panel shows the optimal actions for different points in the state space (evidence against number of samples) after convergence of the policy iteration. Gray squares indicate that wait is the optimal action in that state, while black squares indicate that go is optimal. The inter-trial delays for all three computations were D_C = D_I = 150 and all trials in a task had the same difficulty. The up-probability for each decision in the task was drawn with equal probability from (a) u ∈ {0.05, 0.95}, (b) u ∈ {0.30, 0.70} and (c) u = 0.50.

We can observe from Fig. 2 that, in each case, the optimal policy constitutes a clear decision boundary: the optimal decision is to wait until the cumulative evidence crosses a specific bound. For all values of evidence greater than this bound (at the current point in time), it is optimal to guess the more likely hypothesis. In each panel, the bound is determined by the cumulative evidence, x, which was defined above as the difference between the numbers of up and down observations, |n_u − n_d|. Note that, in all three cases, the decision bound stays constant for the majority of time and collapses only as the time approaches the maximum simulated time step, t_max. We will discuss the reason for this boundary effect below, but the fact that decision bounds remain fixed prior to this boundary effect shows that it is optimal to have a fixed decision bound if the task difficulty is fixed.

In Fig. 2a and b, the optimal policy dictates that the decision-maker waits to accumulate a criterion level of evidence before choosing one of the options. In contrast, Fig. 2c dictates that the optimal decision-maker should make a decision immediately (the optimal action is go in state (0, 0)), without waiting to see any evidence. This makes sense because the up-probability for this computation is u = 1/2; that is, the observed outcomes are completely random and provide no evidence in either direction. So the theory suggests that the decision-maker should not wait to observe any outcomes and should choose an option immediately, saving time and thereby increasing the reward rate.
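The zero-drift case can be checked with a one-line reward-rate calculation. Because observations carry no information when u = 1/2, accuracy is 1/2 however long the decision-maker waits, so waiting only lengthens the trial. Taking illustrative values R_C = 1 and R_I = 0 (the article does not fix these numerically) and D_C = D_I = 150, going immediately yields a reward rate of 0.5/(1 + 150) ≈ 0.0033 per time step, whereas first waiting for 10 uninformative samples yields 0.5/(11 + 150) ≈ 0.0031; every extra sample strictly lowers the rate, so go at (0, 0) is optimal.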
In panels (a) and (b), we can also observe a collapse of the bounds towards the far right of the figure, where the boundary converges to |n_u − n_d| = 0. This is a boundary effect and arises because we force the model to make a transition to the incorrect state if a decision has not been reached before the very last time step, t_max (in this case, t_max = 70). Increasing t_max moved this boundary effect further to the right, so that it always remained close to the maximum simulated time. In order to prevent confusion and exclude this boundary effect from other effects, all the figures for optimal policies presented below are cropped at t = 50: simulations were performed for t_max ≥ 70, but results are displayed only up to t = 50.

In agreement with previous investigations of optimal bounds (Bogacz et al., 2006), the computations also showed that the decision boundaries depend non-monotonically on the task difficulty, with very high drifts leading to narrow bounds and intermediate drifts leading to wider bounds. Note that the height of the decision boundary is |n_u − n_d| = 5 for ε = 0.20 in Fig. 2b, but decreases on making the task easier (as in Fig. 2a) as well as more difficult (as in Fig. 2c). Again, this makes intuitive sense: the height of the decision boundary is low when the task consists of very easy trials because each outcome conveys a lot of information about the true state of the world; similarly, the decision boundary is low when the task consists of very difficult trials because the decision-maker stands to gain more by making decisions quickly than by observing very noisy stimuli.

Mixed difficulties

Next, we computed the optimal policies when a task contained a mixture of two types of decisions with different difficulties. For the example at the beginning of this article, this means some rising assets go up during an observation period with the probability 1/2 + ε_e while others go up with the probability 1/2 + ε_d. Similarly, some falling assets go up with the probability 1/2 − ε_e while others go up with probability 1/2 − ε_d. Figure 3 shows the optimal policy for two single difficulty tasks, as well as for a mixed difficulty task (Fig. 3c) in which trials can be either easy or difficult with equal probability (P(U ∈ U_e) = 1/2). The drift of the easy task is ε_e = 0.20 and that of the difficult task is ε_d = 0.

Fig. 3  Optimal actions for single and mixed difficulty tasks. The inter-trial intervals used for computing all three policies are D_C = D_I = 150. (a) Single difficulty task with the up-probability for each decision drawn from u ∈ {0.30, 0.70}; (b) single difficulty task with u = 1/2; (c) mixed difficulty task with u ∈ {0.30, 0.50, 0.70}, with both easy and difficult trials equally likely, i.e., P(U ∈ U_e) = 1/2.

The optimal policies for the single difficulty tasks (Fig. 3a and b) are akin to the optimal policies in Fig. 2. The most interesting aspect of the results is the optimal policy for the mixed difficulty condition (Fig. 3c). In contrast to the single difficulty conditions, we see that the decision boundary in this condition is time-dependent. Bounds are wide at the start of the trial (|n_u − n_d| = 4) and narrow as time goes on (reaching |n_u − n_d| = 0 at t = 44). In other words, the theory suggests that the optimal decision-maker should start the trial by accumulating information and trying to be accurate, but as time goes on they should lower their criterion and guess. In fact, one can analytically show that the decision boundaries will eventually collapse to |n_u − n_d| = 0 if there is a non-zero probability that one of the tasks in the mixture has zero drift (ε_d = 0) (see Appendix A).
We also explored cases with a mixture of decision difficulties where the difficult decisions had a positive drift (ε_d > 0). Figure 4 shows optimal policies for the same parameters as Fig. 3c, except that the drift of the difficult decisions has been changed to ε_d = 0.02, 0.05, and 0.10, respectively. The drift of the easy decisions remained ε_e = 0.20. Bounds still decrease with time when ε_d = 0.02 and 0.05, but the amount of decrease becomes negligible very rapidly. In fact, when ε_d = 0.10, the optimal policy (at least during the first 50 time steps) is exactly the same as for the single difficulty task with ε_e = 0.20 (compare with Fig. 3a). We explored this result using several different values of the inter-trial interval and consistently found that decision boundaries show an appreciable collapse for only a small range of decision difficulties and, in particular, when one type of decision is extremely difficult or impossible.

Fig. 4  Optimal actions for mixed difficulty tasks with different difficulty levels. Each panel shows a mixed difficulty task with the up-probability for each decision drawn with equal probability from (a) u ∈ {0.30, 0.48, 0.52, 0.70}, (b) u ∈ {0.30, 0.45, 0.55, 0.70}, and (c) u ∈ {0.30, 0.40, 0.60, 0.70}. All other parameters remain the same as in the computations shown in Fig. 3 above.

An intuitive explanation for the collapsing bounds in Figs. 3c and 4a, b could be as follows: the large drift (easier) task (Fig. 3a) has wider bounds than the small drift task (Fig. 3b); with the passage of time, there is a gradual increase in the probability that the current trial pertains to the difficult task if the boundary has not yet been reached. Hence, it would make sense to start with wider bounds and gradually narrow them to the bounds for the more difficult task as one becomes more certain that the current trial is difficult. If this explanation is true, then bounds should decrease for a mixture of difficulties only under the condition that the easier task has wider bounds than the more difficult task.
Increasing bounds

The next set of computations investigated what happens to decision boundaries for a mixed difficulty task when the easier task has narrower bounds than the more difficult task. Like Fig. 3, Fig. 5 shows optimal policies for two single difficulty tasks and for a mixed difficulty task that combines these two difficulties. However, in this case the two single difficulty tasks are selected so that the bounds for the large drift (easy) task (Fig. 5a) are narrower than those for the small drift (difficult) task (Fig. 5b), reversing the pattern used in the set of tasks for Fig. 3. Figure 5c shows the optimal actions in a task where these two difficulty levels are equally likely. In contrast to Fig. 3c, the optimal bounds for this mixture are narrower at the beginning, with |n_u − n_d| = 4, then get wider, reaching |n_u − n_d| = 6, and then stay constant. Thus, the theory predicts that inter-mixing difficulties does not necessarily lead to monotonically collapsing bounds.

Fig. 5  Optimal actions for a mixture of difficulties when the easy task has narrower bounds than the difficult task. The inter-trial delays for all three computations are D_C = D_I = 150. Panels (a) and (b) show optimal policies for single difficulty tasks with the up-probability of each decision chosen from u ∈ {0.05, 0.95} and u ∈ {0.40, 0.60}, respectively. Panel (c) shows the optimal policy in a mixed difficulty task with the up-probability chosen from u ∈ {0.05, 0.40, 0.60, 0.95} and P(U ∈ U_e) = P(U ∈ U_d) = 1/2. Panels (d–f) show the change in the posterior probability P(U ∈ U_+ | X_t = x) with time at the upper decision boundary for conditions (a–c), respectively.

In order to get an insight into why the optimal boundary increases with time in this case, we computed the posterior probability of making the correct decision at the optimal boundary. An examination of this probability showed that although the optimal boundary is lower for the easy task (Fig. 5a) than for the difficult task (Fig. 5b), the posterior P(U ∈ U_+ | X_t = x) at which the choice should be made is higher for the easy task (Fig. 5d) than for the difficult task (Fig. 5e). For the mixed difficulty task, although the optimal boundary increases with time (Fig. 5c), the probability of making a correct choice decreases with time (Fig. 5f).[4] This happens because the posterior probability of the current trial being difficult increases with time. This fits well with the intuitive explanation of time-varying decision boundaries given for collapsing bounds. At the start of the trial, the decision-maker does not know whether the trial is easy or difficult and starts with a decision boundary somewhere between those for the easy and difficult single difficulty tasks. As time progresses and a decision boundary is not reached, the probability of the trial being difficult increases and the decision boundaries approach the boundaries for the difficult task. Since the decision boundaries for the difficult task (ε_d = 0.10) are wider than those for the easy task (ε_e = 0.45) in Fig. 5, this means that the decision boundaries increase with time during the mixed difficulty task.

[4] The sawtooth (zigzag) pattern in Fig. 5(d–f) is a consequence of the discretization of time and evidence. For example, moving from left to right along the boundary in Fig. 5d, the value of evidence oscillates between |n_u − n_d| = 6 and |n_u − n_d| = 7, leading to the oscillation in the value of the posterior probability P(U ∈ U_+ | X_t = x).
drifts but different inter-trial intervals, shows that the opti- We computed the optimal policies for a variety of mix- mal bounds decrease from |n − n |= 5 when the inter-trial u d ing difficulties and found that bounds increase, decrease delay is 150 to |n − n |= 3 inter-trial delay is reduced to u d or remain constant in a pattern that is consistent with this 50. Intuitively, this is because decreasing the inter-trial inter- intuitive explanation: when the task with smaller drift (the vals alters the balance between waiting and going (Eq. 8), more difficult task) has narrower bounds than the task with making going more favorable for certain states. When the larger drift (as in Fig. 3), mixing the two tasks leads to inter-trial interval decreases, an error leads to a compara- either constant bounds in-between the two bounds, or to tively smaller drop in the reward rate as the decision-maker monotonically decreasing bounds that asymptote towards quickly moves on to the next reward opportunity. There- the narrower of the two bounds. In contrast, when the task fore, the decision-maker can increase their reward rate by with the smaller drift has wider bounds than the task with lowering the height of the boundary to go. A compari- larger drift (as in Fig. 5), mixing the two tasks leads to either son of Figs. 3cand 6c shows that a similar result holds constant bounds in-between the two bounds or bounds that for the mixed difficulty condition: decision boundaries still decrease with time, but the boundary becomes lower when inter-trial delay is decreased. The sawtooth (zigzag) pattern in Fig. 5(d–f) is a consequence of the Thus far, we have also assumed that the inter-trial inter- discretization of time and evidence. For example, moving from left to vals for correct and error responses (D and D , respec- C I right along the boundary in Fig. 5d, the value of evidence oscillates tively) are the same. In the next set of computations, we between |n − n |= 6and |n − n |= 7, leading to the oscillation in u d u d the value of the posterior probability P(U ∈ U |X = x). investigated the shape of decision boundaries when making + t evidence evidence evidence Psychon Bull Rev (2018) 25:971–996 983 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Fig. 7 Optimal actions remain the same if D + D remain the same. Each panel shows the optimal policy for up-probability drawn from the C p same set as Fig. 3, but for an inter-trial delay of D = 75 for correct guesses and D = 150 for errors C I an error carried an additional time-penalty, D , so that the Prior beliefs about the world delay is D after correct response and D +D after errors. C C p An unintuitive result from previous research (Bogacz First, consider the assumption that both states of the world et al., 2006) is that different combinations of D and D are equally likely. A key question in perceptual decision- C p that have the same sum (D + D ), lead to the same bound- making is how decision-makers combine this prior belief C p ary. So, for example, the optimal boundaries are the same with samples (cues) collected during the trial. 
Thus far, we have also assumed that the inter-trial intervals for correct and error responses (D_C and D_I, respectively) are the same. In the next set of computations, we investigated the shape of decision boundaries when making an error carried an additional time-penalty, D_p, so that the delay is D_C after a correct response and D_C + D_p after an error. An unintuitive result from previous research (Bogacz et al., 2006) is that different combinations of D_C and D_p that have the same sum (D_C + D_p) lead to the same boundary. So, for example, the optimal boundaries are the same when both correct and incorrect decisions lead to an equal delay of 150 time steps as when correct decisions lead to a delay of 75 time steps but incorrect decisions lead to an additional 75 time steps.

Fig. 7 Optimal actions remain the same if D_C + D_p remains the same. Each panel shows the optimal policy for up-probability drawn from the same set as in Fig. 3, but for an inter-trial delay of D_C = 75 for correct guesses and D_I = 150 for errors.

Results of the computations shown in Fig. 7 indicate that this property generalizes to the case of mixed difficulties. The optimal policies for the single and mixed difficulty tasks in this figure are obtained for up-probability drawn from the same set as in Fig. 3, but with delays of D_C = 75 and D_I = D_C + D_p = 150. Comparing Figs. 3 and 7, one can see that changing the delays has not affected the decision boundaries at all. This is because, even though D_C = 75 in Fig. 7, D_C + D_p was the same as in Fig. 3. Moreover, not only are the boundaries the same for the single difficulty conditions (as previously shown), they are also the same for the corresponding mixed difficulty conditions.

Extensions of the model

The theoretical model outlined above considers a simplified decision-making task, where the decision-maker must choose between two equally likely options. We now show how the above theory generalizes to situations where: (a) the world is more likely to be in one state than the other (e.g., assets are more likely to be falling than rising), and (b) the decision-maker can give up on a decision that appears too difficult (make no buy or sell recommendation on an asset). In each case, the normative model illuminates how sequential sampling models should be adapted for these situations.

Prior beliefs about the world

First, consider the assumption that both states of the world are equally likely. A key question in perceptual decision-making is how decision-makers combine this prior belief with samples (cues) collected during the trial. The effect of prior information on decision-making can be static, i.e., remain constant during the course of a decision, or dynamic, i.e., change as the decision-maker samples more information. Correspondingly, sequential sampling models can accommodate the effect of a prior either in the starting point, if the effect of the prior is static, or in the drift or threshold, if the effect is dynamic (Ashby, 1983; Ratcliff, 1985; Diederich & Busemeyer, 2006; Hanks et al., 2011).

Experiments with humans and animals investigating whether the effect of prior beliefs is constant or changes with time have led to mixed results. A number of recent experiments have shown that shifting the starting point is more parsimonious with the accuracy and reaction times of participants (Summerfield & Koechlin, 2010; Mulder et al., 2012). However, these experiments only consider a single task difficulty. In contrast, when Hanks et al. (2011) considered a task with a mixture of difficulties, they found that data from the experiment were better fit by a time-dependent prior model. Instead of assuming that the effect of a prior bias is a shift in starting point, this model assumes that the prior dynamically modifies the decision variable—i.e., the decision variable at any point is the sum of the drift and a dynamic bias signal that is a function of the prior and increases monotonically with time.

We examined this question from a normative perspective—should the effect of a prior belief be time-dependent if the decision-maker wanted to maximize reward rate? Edwards (1965) has shown that when the reliability of the task is known and constant, the optimal strategy is to shift the starting point. More recently, Huang, Hanks, Shadlen, Friesen and Rao (2012) argued that instead of modeling the data in terms of a sequential sampling model with an adjustment to starting point or drift, the decisions in experiments such as Hanks et al. (2011) can be adequately described by a POMDP model that assumed priors to be distributed according to a (piecewise) Normal distribution and maximized the reward. We will now show that a normative model that maximizes the reward rate, such as the model proposed by Huang et al. (2012), is in fact consistent with sequential sampling models. Whether the effect of the prior in such a model is time-dependent or static depends on the mixture of difficulties. As observed by Hanks et al. (2011), in mixed difficulty situations the passage of time itself contains information about the reliability of stimuli: the longer the trial has gone on, the more unreliable the source of stimuli is likely to be, and decision-makers should increasingly trust their prior beliefs.
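For a single, known difficulty, the Edwards (1965) result can be made concrete by expressing the prior log odds in units of the per-sample log-likelihood ratio; the resulting number is the equivalent head start of the accumulator. The small sketch below is our illustration of this bookkeeping, using the Fig. 8 parameters.

import numpy as np

def equivalent_start_point(prior_plus, u):
    """Starting-point shift carrying the same information as a prior
    P(U in U_+) = prior_plus when the drift is known (up-probability u).
    Each +1 sample adds log(u / (1 - u)) to the log odds, so the prior
    log odds translate into this many samples of head start."""
    prior_log_odds = np.log(prior_plus / (1 - prior_plus))
    llr_per_step = np.log(u / (1 - u))
    return prior_log_odds / llr_per_step

for p in (0.50, 0.70, 0.97):
    print(p, round(equivalent_start_point(p, 0.70), 2))

With u = 0.70, a prior of 0.70 is worth roughly one sample of head start and a prior of 0.97 roughly four, of the same order as the boundary shifts described for Fig. 8 below.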
For the MDP shown in Fig. 1b, in any state (t, x), the effect of having biased prior beliefs is to alter the transition probabilities for the wait as well as the go actions. Changing the prior in Eq. 5 will affect the posterior probability P(U = u | X_t = x), which, in turn, affects the wait transition probability p_(t,x)→(t+1,x+1) in Eq. 6. Similarly, a change in the prior probabilities changes the posteriors P(U ∈ U_+ | X_t = x) and P(U ∈ U_- | X_t = x) in Eq. 7, in turn changing the go transition probability p_(t,x)→C. We argued above that when priors are equal, P(U ∈ U_+) = P(U ∈ U_-), the optimal decision-maker should recommend buy or sell based solely on the likelihoods: i.e., buy whenever x > 0 and sell whenever x < 0. This will no longer be the case when the priors are unequal. In this case, the transition probabilities under the action go will be given by the more general formulation in Eq. 7: buy whenever the posterior probability of rising is larger than that of falling (P(U ∈ U_+ | X_t = x) > P(U ∈ U_- | X_t = x)) and sell otherwise.

Fig. 8 Change in optimal policy during single difficulty tasks with increasingly biased prior beliefs. (a) P(U ∈ U_+) = 0.50; (b) P(U ∈ U_+) = 0.70; (c) P(U ∈ U_+) = 0.97. For all three computations, up-probability is drawn from u ∈ {0.30, 0.70} and the inter-trial intervals are D_C = D_I = 150.

Figure 8 shows how the optimal policy changes when the prior belief changes from both states of the world being equally probable (assets are equally likely to rise and fall) to one state being more probable than the other (assets are more likely to rise than fall). All policies in Fig. 8 are for single difficulty tasks where the difficulty (drift) is fixed and known ahead of time.

We can observe that a bias in the prior beliefs shifts the optimal boundaries: when the prior probabilities of the two states of the world were the same (P(U ∈ U_+) = P(U ∈ U_-)), the height of the boundary for choosing each alternative was |n_u - n_d| = 5 (Fig. 8a). Increasing the prior probability of the world being in the first state to P(U ∈ U_+) = 0.70 reduces the height of the boundary for choosing the first alternative to (n_u - n_d) = 4, while it increases the height of the boundary for choosing the other alternative to (n_u - n_d) = -6 (Fig. 8b). Thus, the optimal decision-maker will make decisions more quickly for trials where the true state of the world matches the prior, but more slowly when the true state and the prior mismatch. Furthermore, note that the increase in boundary in one direction exactly matches the decrease in boundary in the other direction, so that the change in boundaries is equivalent to a shift in the starting point, as proposed by Edwards (1965). Increasing the bias in the prior further (Fig. 8c) increased this shift in boundaries, with the height of the boundary for choosing the first alternative reduced to (n_u - n_d) = 0 when P(U ∈ U_+) = 0.97. In this case, the decision-maker has such a strong prior (that asset values are rising) that it is optimal for them to choose the first alternative (buy an asset) even before making any observations.
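The role of the prior in these transition probabilities can be sketched as follows (illustrative Python using the Fig. 8 single difficulty parameters; not the authors' code): the posterior over u determines both the probability that the next sample is an up-step (the wait transition described for Eq. 6) and which alternative go would select (the comparison of posteriors described for Eq. 7).

import numpy as np

def biased_prior_effects(t, x, us, prior):
    """Posterior over u at state (t, x), the resulting probability that the
    next sample is an up-step, and the alternative that go would select."""
    n_u, n_d = (t + x) / 2, (t - x) / 2
    post = np.array([u**n_u * (1 - u)**n_d for u in us]) * prior
    post /= post.sum()
    p_up_next = float(np.sum(post * us))          # wait transition to (t+1, x+1)
    p_rising  = post[us > 0.5].sum()              # posterior that the state is 'rising'
    choice = "buy" if p_rising > 0.5 else "sell"
    return round(p_up_next, 3), round(float(p_rising), 3), choice

us = np.array([0.30, 0.70])                       # single difficulty, as in Fig. 8
for p_plus in (0.50, 0.70, 0.97):
    prior = np.array([1 - p_plus, p_plus])
    print(p_plus, biased_prior_effects(t=2, x=-2, us=us, prior=prior))

With the illustrative state (t = 2, x = -2), only the strongest prior (0.97) still favors buy despite two downward samples, which is the qualitative situation in which it can be optimal to choose the first alternative with little or no supporting evidence.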
Thus, the optimal policy predicted by the above theory concurs with shifting the starting point when the task difficulty is fixed and known ahead of time. Let us now look at the mixed difficulty condition. Figure 9 shows the optimal policy for a mixed difficulty task with up-probability drawn from the set u ∈ {0.30, 0.50, 0.70} and three different degrees of prior, all biased towards the world being in the first state to varying degrees.

Fig. 9 Optimal policy during mixed difficulty trials with biased prior beliefs. For all computations, the mixture of drifts involves ε_e = 0.20, ε_d = 0 and P(U ∈ U_e) = 1/2. Three different priors are used: the left column uses P(U ∈ U_+) = 0.52, the middle column uses P(U ∈ U_+) = 0.55, and the right column uses P(U ∈ U_+) = 0.70. The first row shows optimal policies, the second row shows the posterior probability of the trial being easy given the state, and the third row shows the posterior probability of the trial having up-probability > 0.50 given the state. For all three computations, the inter-trial intervals are D_C = D_I = 150.

Like the single difficulty case, a prior bias that the world is more likely to be in the first state (asset values are more likely to rise) decreases the boundary for the first alternative (buy) and increases the boundary for the second alternative (sell). However, unlike the single difficulty case, this shift in boundaries is not constant, but changes with time: the optimal policies in Fig. 9 are not simply shifted along the evidence axis (compare with Fig. 8); rather, there are two components to the change in boundary. First, for all values of time, the distance to the upper boundary (for buy) is the same or smaller than in the equal-prior case (e.g., in the third column of Fig. 9, (n_u - n_d) = 2 even at t = 2), and the distance to the lower boundary (for sell) is the same or larger than in the equal-prior case. Second, the shift in boundaries is larger at longer durations (again most evident in the third column of Fig. 9, where it becomes optimal to choose the first alternative with the passage of time, even when the cumulative evidence is negative).
These optimal policies are in agreement with the dynamic prior model developed by Hanks et al. (2011), which proposes that the contribution of the prior increases with time. To see this, consider the assets example again. Note that the prior used for generating the optimal policy in the third column of Fig. 9 corresponds to assets being more likely to rise than fall (P(U ∈ U_+) = 0.70). As time goes on, the cumulative evidence required to choose buy keeps decreasing, while the cumulative evidence required to choose sell keeps increasing. In other words, with the passage of time, increasingly larger evidence is required to overcome the prior. This is consistent with the monotonically increasing dynamic prior signal proposed by Hanks et al. (2011).

The third column in Fig. 9 also shows another interesting aspect of optimal policies for unequal priors. In this case, the bias in the prior is sufficiently strong and leads to a 'time-varying region of indecision': instead of the collapsing bounds observed for the equal-prior condition, the computations show optimal bounds that seem parallel but change monotonically with time. So, for example, when P(U ∈ U_+) = 0.70, it is optimal for the decision-maker to keep waiting for more evidence even at large values of time, provided the current cumulative evidence lies in the grey (wait) region of the state-space. (This pattern seemed to hold even when we increased t_max to 100. Further research would be required to investigate whether this remains the case analytically as t_max → ∞.)

The intuitive reason for this time-varying region of indecision is that, for states in this region, the decision-maker is neither able to infer whether the trial is easy nor able to infer the true state of the world. To see this, we have plotted the posterior probability of the trial being easy in the second row of Fig. 9 and the posterior probability of the first state (rising) being the true state of the world in the third row. The posterior that the trial is easy does not depend on the prior about the state of the world: all three panels in the second row are identical. However, the posterior on the true state of the world does depend on the prior beliefs: as the prior in favor of the world being in the first state (rising) increases, the region of intermediate posterior probabilities is shifted further down with the passage of time. The 'region of indecision' corresponds to an area of overlap in the second and third rows where the posterior probability that the trial is easy is close to 0.5 and the posterior probability that the true state of the world is rising is also close to 0.5 (both in black). Hence, the optimal thing to do is to wait and accumulate more evidence.

Low confidence option

So far, we have considered two possible actions at every time step: to wait and accumulate more information, or to go and choose the more likely alternative. Of course, this is not true of many realistic decision-making situations. For instance, in the example at the beginning of the article, the decision-maker may choose to examine the next asset in the portfolio without making a recommendation if they are unsure about their decision after making a sequence of up/down observations. We now show how the theory outlined above can be extended to such a case: the decision-maker has a third option (in addition to wait and go), which is to pass and move to the next trial with a reduced inter-trial delay. In this case, the MDP in Fig. 1b is changed to include a third option—pass—with a delay D_pass but no immediate reward or penalty: r_ij^pass = 0. The policy iteration is carried out in the same way as above, except that Eqs. 8 and 9 are updated to accommodate this alternative.
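A minimal sketch of how the action set of the backup can be extended with pass is shown below. It assumes an average-reward formulation in which each action pays its expected reward minus ρ times the time it consumes, with the next trial starting from the zero-value state; the specific numbers are illustrative and this is not the authors' Eqs. 8-9.

def action_values(p_correct, v_up, v_down, p_up, rho,
                  D_correct=150, D_error=150, D_pass=20):
    """One-step action values at a state (t, x): wait backs up successor values
    less the per-step time cost, go pays the expected reward less rho times the
    inter-trial delay, and pass pays nothing but consumes the shorter delay."""
    q_wait = p_up * v_up + (1 - p_up) * v_down - rho * 1
    q_go   = (p_correct * (1.0 - rho * D_correct)
              + (1 - p_correct) * (0.0 - rho * D_error))
    q_pass = 0.0 - rho * D_pass
    return {"wait": q_wait, "go": q_go, "pass": q_pass}

# Late in a mixed difficulty trial with no net evidence: confidence is near
# chance and waiting merely postpones a near-chance guess (illustrative
# successor values), so the shorter pass delay gives the highest value.
print(action_values(p_correct=0.5, v_up=-0.10, v_down=-0.10, p_up=0.5, rho=0.004))

With these illustrative numbers, pass dominates both wait and go, which is the qualitative situation in the lower-middle region of the mixed difficulty policies discussed next.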
Kiani and Shadlen (2009) introduced an option similar to this pass action (they call it "opt-out") in an experiment conducted on rhesus monkeys. The monkeys were trained to make saccadic eye movements to one of two targets that indicated the direction of motion of a set of moving dots on the screen (one of which was rewarded). In addition to being able to choose one of these targets, on a random half of the trials the monkeys were presented with a third saccadic target (a "sure target") that gave a small but certain reward. This "opt-out" setting is similar to our extended model with a pass action, with one distinction. Since Kiani and Shadlen (2009) did not use a fixed-time block paradigm in which there was a trade-off between the speed and accuracy of decisions, they had to explicitly reward the "opt-out" action with a small reward. In contrast, we consider a setting where there is an implicit cost of time. Therefore, it is sufficient to reduce the delay for the pass option without associating it with an explicit reward. Kiani and Shadlen (2009) found that the monkeys chose the sure target when their chance of making the correct decision about motion direction was small; that is, when the uncertainty of the motion direction was high.

Figure 10 shows the optimal policy predicted by extending the above theory to include a pass option. For the single difficulty task (Fig. 10a), it is never optimal to choose the pass option. This is because choosing to pass has a cost associated with it (the inter-trial delay on passing) and no benefit—the next trial is just as difficult, so the same amount of information would need to be accumulated.

More interestingly, Fig. 10b and c show the optimal policy for the mixed difficulty task, with the up-probability for each decision chosen from the set u ∈ {0.30, 0.50, 0.70}. In agreement with the findings of Kiani and Shadlen (2009), the theory predicts that the pass action is a function of both evidence and time and is taken only in cases where the decision-maker has waited a relatively long duration and accumulated little or no evidence favoring either hypothesis. An inspection of the posterior probabilities, P(U ∈ U_+ | X_t = x), reveals why it becomes optimal to choose the pass option with the passage of time. It can be seen in Fig. 10e and f that, for a fixed evidence x, as time increases, P(U ∈ U_+ | X_t = x) decreases (in contrast to the single difficulty case, Fig. 10d). Thus, with the increase in time, the confidence of the decision-maker in the same amount of cumulative evidence should diminish, and the expected value of choosing the pass action becomes larger than the expected value of wait or go.

The computations also reveal how the optimal policies depend on the incentive provided for the pass option. In Fig. 10b, the inter-trial interval for the pass action is nearly an eighth of the interval for incorrect decisions, while in Fig. 10c it is approximately a fourth of the interval for incorrect decisions. Since all paths to the pass region are blocked by the go region in Fig. 10c, the theory predicts that decreasing the incentive slightly should result in the optimal decision-maker never choosing the pass option.
Fig. 10 Optimal actions for all states when actions include the pass option. Gray = wait; black = go; red = pass. For all computations, ε_e = 0.20, ε_d = 0, D_C = D_I = 150. (a) The single difficulty case with P(U ∈ U_e) = 1. For (b) and (c), P(U ∈ U_e) = 1/2. For (a) and (b), the inter-trial interval for the pass action is 20 time steps, while for (c) it is 40 time steps. Panels (d-f) show the corresponding posterior probabilities of a drift > 0.50, P(U ∈ U_+ | X_t = x), for the conditions in panels (a-c).

Discussion

Key insights

Previous research has shown that when the goal of the decision-maker is to maximize their reward rate, it may be optimal for them to change their decision boundary with time. In this article, we have systematically outlined a dynamic programming procedure that can be used to compute how the decision boundary changes with time. Several important results were obtained by using this procedure to compute optimal policies under different conditions.
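A compact version of such a procedure is sketched below (a simplified, finite-horizon backward induction in Python with illustrative parameters; the authors' implementation may differ). Given a candidate reward rate ρ, it charges ρ per time step, backs up wait and go values over states (t, x), and then bisects on ρ so that the start state has zero relative value; the go region of the resulting policy gives the boundary.

import numpy as np

US    = np.array([0.05, 0.40, 0.60, 0.95])   # illustrative mixture: drifts 0.45 (easy) and 0.10 (difficult)
PRIOR = np.array([0.25, 0.25, 0.25, 0.25])
D, T_MAX = 150, 50                           # inter-trial delay and horizon

def posterior(t, x):
    n_u, n_d = (t + x) / 2, (t - x) / 2
    w = np.array([u**n_u * (1 - u)**n_d for u in US]) * PRIOR
    return w / w.sum()

def solve(rho):
    """Backward induction with a time cost of rho per step; returns the relative
    value of the start state and the go/wait policy (a sketch, not Eqs. 8-9)."""
    v, policy = {}, {}
    for t in range(T_MAX, -1, -1):
        for x in range(-t, t + 1, 2):
            post = posterior(t, x)
            p_corr = max(post[US > 0.5].sum(), post[US < 0.5].sum())
            q_go = p_corr * 1.0 - rho * D                    # unit reward if correct, then delay D
            if t == T_MAX:
                q_wait = -np.inf                             # horizon: must respond
            else:
                p_up = float(np.sum(post * US))
                q_wait = p_up * v[(t + 1, x + 1)] + (1 - p_up) * v[(t + 1, x - 1)] - rho
            v[(t, x)] = max(q_go, q_wait)
            policy[(t, x)] = "go" if q_go >= q_wait else "wait"
    return v[(0, 0)], policy

lo, hi = 0.0, 1.0 / D                         # bisect on rho so that v(0, 0) = 0
for _ in range(40):
    rho = (lo + hi) / 2
    v0, policy = solve(rho)
    lo, hi = (rho, hi) if v0 > 0 else (lo, rho)

bounds = [min([x for x in range(-t, t + 1, 2) if policy[(t, x)] == "go" and x >= 0],
              default=None) for t in range(T_MAX)]
print(bounds)                                 # height of the go boundary at each time step

The printed list gives the smallest non-negative evidence level at which go is optimal at each time step; re-running it with different drift mixtures and delays is one way to explore the patterns summarized below.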
Firstly, by removing the assumptions about deadlines on decisions and the cost of making observations, we found that neither of these is a pre-requisite for an optimal time-dependent decision boundary. Instead, what was critical was a sequence of decisions with inter-mixed difficulties.

Next, by restricting the levels of difficulty to two, we were able to explore the effect of different difficulties on the shape of the decision boundaries. Our computations showed that optimal decision bounds do not necessarily decrease in the mixed difficulty condition and may, in fact, increase or remain constant. Computations using a variety of different difficulty levels revealed the following pattern: optimal bounds decreased when difficult trials (in mixed blocks) had lower optimal bounds than easy trials; they increased when the pattern was reversed, i.e., when difficult trials had higher optimal bounds than easy trials.

In addition to computing optimal boundaries, we also computed posterior probabilities for various inferences during the course of a decision. These computations provided insight into the reason for the shape of the boundaries under different conditions. Optimal boundaries change with time only in the mixed difficulty condition and not in the single difficulty condition because observations made during the mixed difficulty condition provide the decision-maker with two types of information: in addition to providing evidence about the true state of the world, observations also help the decision-maker infer the difficulty level of the current trial. At the start of the trial, the difficulty level of the current trial is determined by the decision-maker's prior beliefs—e.g., that both easy and difficult trials are equally likely. So the optimal decision-maker starts with decision boundaries that reflect these prior beliefs. As the trial progresses, the decision-maker uses the cumulative evidence, as well as the time spent gathering this evidence, to update the posterior on the difficulty level of the trial. They use this posterior to then update the decision boundary dynamically. In cases where the decision boundary for the difficult trials is lower (higher) than for easy trials, the decision-maker can maximize their reward rate by decreasing (increasing) the decision boundary with time.

Similarly, the model also provided insight into the relationship between the shape of optimal decision boundaries and priors on the state of the world. When priors are unequal, observations in mixed difficulty trials provide three types of information. They can be used to perform the two inferences mentioned above—about the true state of the world and the difficulty of the trial—but, additionally, they can also be used to compute the weight of the prior. Computations showed that it is optimal for the decision-maker to increase the weight of the prior with time, when decisions have inter-mixed difficulties and the decision-maker has unequal priors. A possible explanation for this counter-intuitive finding is that the optimal decision-maker should consider the reliability of signals when calculating how much weight to give the prior. As the number of observations increases, the reliability of the evidence decreases, and the optimal decision-maker should give more weight to the prior. Note that this is the premise on which Hanks et al. (2011) base their "dynamic bias" model. Our computations show how the dynamic bias signal should change with time when the goal of the decision-maker is to maximize the reward rate.
Implications for empirical research

Using the dynamic programming procedure to predict optimal policies provides a strong set of constraints for observing different boundary shapes. In particular, we found that optimal boundaries decreased appreciably only under a limited set of conditions, and only if one type of decision is extremely difficult. This observation is particularly relevant to a number of recent studies that have investigated the shape of decision boundaries adopted by participants in decision-making experiments.

Hawkins et al. (2015) performed a meta-analysis of reaction-time and error-rate data from eight studies using a variety of different paradigms and found that, overall, these data favored a fixed bounds model over collapsing bounds models in humans. Similarly, Voskuilen et al. (2016) carried out a meta-analysis using data from four numerosity discrimination experiments and two motion-discrimination experiments and found that data in five out of six experiments favored fixed boundaries over collapsing boundaries.

The majority of experiments included in these meta-analyses consider mixed difficulty blocks with a larger number of difficulty levels than we have considered so far. For example, one of the studies considered by both Hawkins et al. (2015) and Voskuilen et al. (2016) is Experiment 1 from Ratcliff and McKoon (2008), who use a motion-discrimination task with motion coherence that varies from trial to trial across six different levels (5%, 10%, 15%, 25%, 35%, 50%). It is unclear what the shape of optimal bounds should be for this mixture of difficulties, especially because we do not know what participants were trying to optimize in this study. However, even if participants were maximizing reward rate, we do not know whether they should decrease their decision boundaries under these conditions.

It is possible to extend the above framework to more than two difficulties and make predictions about the shape of optimal boundaries in such settings. However, one problem is that this framework assumes exact knowledge of the different drifts, ε, used in the mixture of difficulties. In the experiments considered by Hawkins et al. (2015) and Voskuilen et al. (2016), we do not know the exact values these drifts take, since the paradigms used in these studies (motion coherence, numerosity judgment, etc.) involve implicit sampling of evidence.

One advantage of the expanded judgment paradigm is that the experimenter is able to directly observe the drift of the samples shown to the participant and compare the decision boundary used by participants with the one predicted by reward-rate maximization. We have recently conducted a large series of experiments in which we adopted this approach, adopting a very explicit reward structure and creating conditions for which the model predicts that boundaries should change when different difficulty levels are mixed (Malhotra, Leslie, Ludwig, & Bogacz, in press; the authors' pre-print accepted for publication is available at https://osf.io/2rdrw/). We found that participants indeed modulated the slope of their decision boundaries in the direction predicted by maximization of reward rate.

In order to understand why our findings contrast with those of Hawkins et al. (2015) and Voskuilen et al. (2016), we extended the model to accommodate any number of difficulty levels. Instead of assuming that the up-probability comes from the set U_e ∪ U_d, we assumed that u ∈ U, where U is a set of up-probabilities with n different drifts, {ε_1, ..., ε_n}. We then attempted to model the set of studies analyzed by Hawkins et al. (2015) and Voskuilen et al. (2016).
As mentioned above, one problem is that we do not know what the actual drift of the evidence was in these studies. Our strategy was to match the observed error rates for the set of difficulties in the original experiment with the error rates for the optimal bounds predicted by a corresponding set of drifts (see Appendix B for details). We found that a range of reasonable mappings between the difficulties used in an experiment and the set of drifts {ε_1, ..., ε_n} gave fairly similar shapes of optimal boundaries.

Another problem is that it is unclear how the inter-trial interval used in the experiments maps on to the inter-trial interval used in the dynamic programming procedure. More precisely, in the dynamic programming procedure the inter-trial interval is specified as a multiple of the rate at which the evidence is delivered. However, due to the implicit sampling in the original experiments, we do not know the relation between the (internal) evidence sampling rate and the inter-trial intervals. Therefore, we computed the optimal policies for a wide range of different inter-trial intervals. As we show below, even though the optimal policy changes with a change in inter-trial interval, the slope of the resulting optimal decision boundaries remains fairly similar across a wide range of intervals.

Table 1 summarizes the conditions used in the experiments, the distribution of these conditions across trials, and a corresponding set of drifts used in the dynamic programming procedure. We also matched the distribution of difficulty levels used in these experiments with the distribution of drifts used for our computations. We chose this set of experiments so that they cover the entire range of mixtures of difficulties considered across the experiments analyzed by Hawkins et al. (2015) and Voskuilen et al. (2016).

Table 1 Set of studies and drifts used to generate optimal policies

Study      | Paradigm   | Conditions                                  | Distribution           | Drifts ({ε_1 ... ε_n})
PHS 05     | Motion     | {0%, 3.2%, 6.4%, 12.8%, 25.6%, 51.2%}       | Uniform                | {0, 0.03, 0.05, 0.10, 0.20, 0.40}
RTM 01     | Distance   | 32 values, range [1.7, 2.4] cm              | Uniform                | 17 values, range [0, 0.50]
R 07       | Brightness | {Bright: 2%, 35%, 45%; Dark: 55%, 65%, 98%} | Uniform                | {0.05, 0.10, 0.20}
RM 08      | Motion     | {5%, 10%, 15%, 25%, 35%, 50%}               | Uniform                | {0.04, 0.07, 0.10, 0.15, 0.20, 0.30}
MS 14      | Color      | {35%, 42%, 46%, 50%, 54%, 58%, 65%}         | Uniform                | {0, 0.05, 0.10, 0.20}
VRS 16: E1 | Numerosity | Range: [21, 80]                             | Piecewise Uniform      | {0, 0.02, 0.04, 0.06, 0.30, 0.32, 0.34, 0.36}
VRS 16: E2 | Numerosity | Range: [3, 98]                              | Approximately Gaussian | {0, 0.05, ..., 0.50}
VRS 16: E3 | Numerosity | Range: [31, 70]                             | Uniform                | {0, 0.02, ..., 0.20}
VRS 16: E4 | Numerosity | Range: [3, 98]                              | Uniform                | {0, 0.02, ..., 0.48}

Notes. Each row shows the set of conditions used in the experiment, the distribution of these conditions across trials, and the set of drift parameters used to compute the optimal policies in Fig. 11. The value given to a condition refers to the motion coherence for the motion discrimination task, to the separation of dots for the distance judgment task, to the proportion of black pixels for the brightness discrimination task, to the percentage of cyan to magenta checkers for the color judgment task, and to the number of asterisks for the numerosity judgment task. For the computation VRS 16: E2, the probability of each drift value ε was equal to (1/Z) N(ε; μ, σ), where N(·) is the probability density of the normal distribution with mean μ = 0 and standard deviation σ = 0.21, and Z is a normalization factor ensuring that the probabilities add up to 1. The names of the studies are abbreviated as follows: PHS 05 (Palmer, Huk, & Shadlen, 2005), RTM 01 (Ratcliff, Thapar, & McKoon, 2001), R 07 (Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007), RM 08 (Ratcliff & McKoon, 2008), MS 14 (Middlebrooks & Schall, 2014), VRS 16 (Voskuilen et al., 2016), with E1...E4 standing for Experiments 1...4, respectively.
Figure 11 shows the optimal policies obtained by using the dynamic programming procedure for each mixture of drifts in Table 1. While the shape of any optimal boundary depends on the inter-trial interval (as discussed above), we found that the slope of the optimal boundaries remained similar for a range of different inter-trial intervals, and the inset in each panel shows how this slope changes with a change in inter-trial interval. The insets also compare this slope (solid, red line) with the flat boundary (dotted, black line) and the optimal slope for a mixture of two difficulties, ε ∈ {0, 0.20}, which leads to rapidly decreasing bounds (dashed, blue line). A (red) dot in each inset indicates the value of the inter-trial interval used to plot the policies in the main plot. All policies have been plotted for the same value of inter-trial interval used in the computations above, i.e., D_C = D_I = 150, except Fig. 11b, which uses D_C = D_I = 300 to highlight a property of optimal policies observed when a task consists of more than two difficulty levels (see below).

Fig. 11 Optimal policies for the mixtures of difficulties used in the experiments considered by Hawkins et al. (2015) and Voskuilen et al. (2016). Insets show the slope of the optimal boundary (measured as the tangent of a line fitting the boundary) across a range of inter-trial intervals for the mixture of drifts that maps onto the experiment (solid, red line) and compare it to flat boundaries (dotted, black line) and the mixture ε ∈ {0.20, 0.50}, which gives a large slope across the range of inter-trial intervals (dashed, blue line). The dot along each solid (red) line indicates the value of the inter-trial interval used to generate the optimal policy shown in the main figure.

Extending the framework to more than two difficulties reveals two important results. First, the optimal bounds are nearly flat across the range of mixed difficulty tasks used in these experiments. In order to monitor the long-term trend of these slopes, we plotted each of the policies in Fig. 11 up to time step t = 100 (in contrast to t = 50 above). In spite of this, we observed very little change in the optimal bounds as a function of the number of samples observed during a trial. Optimal bounds do seem to decrease slightly for some mixed difficulty tasks when they include a substantial proportion of "very difficult" trials (e.g., MS 14 and VRS 16: E2, E3). However, even in these cases, the amount of decrease is small (compare the solid, red line to the dashed, blue line, which corresponds to a mixture that gives a large decrease), and it would be very difficult to distinguish between constant and decreasing bounds based on the reaction-time distributions obtained from these bounds. Second, for some mixtures of difficulties, the optimal bounds are a non-monotonic function of time, where the optimal boundary first increases, then remains constant for some time and finally decreases (see, for example, Fig. 11b). This non-monotonic pattern occurred only when the number of trial difficulties was greater than two.

Clearly, these computations, and in particular their mapping onto the original studies, have to be interpreted with caution, due to the difficulties in translating continuous, implicit sampling paradigms to the discrete expanded judgment framework. Nevertheless, our wide-ranging exploration of the parameter space (mixtures of difficulties and inter-trial intervals) suggests that the optimal boundaries in these experiments may be very close to flat boundaries. In that case, even if participants maximized reward rate in these experiments, it may be difficult to identify subtly decreasing boundaries on the basis of realistic empirical evidence. Of course, we do not know what variable participants were trying to optimize in these studies. The computations above highlight just how crucial an explicit reward structure is in that regard.
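For reference, a slope summary of the kind reported in the Fig. 11 insets can be computed by fitting a straight line to the boundary extracted from a policy. The least-squares fit below is our reading of "tangent of a line fitting the boundary", not necessarily the exact estimator used by the authors.

import numpy as np

def boundary_slope(policy, t_max):
    """Least-squares slope of the upper go boundary of a policy mapping
    (t, x) -> 'go' / 'wait'; values near zero correspond to effectively
    flat bounds."""
    ts, heights = [], []
    for t in range(t_max):
        go_x = [x for x in range(t % 2, t + 1, 2) if policy.get((t, x)) == "go"]
        if go_x:                         # smallest non-negative evidence at which go is optimal
            ts.append(t)
            heights.append(min(go_x))
    slope, _ = np.polyfit(ts, heights, 1)
    return slope

# e.g. boundary_slope(policy, T_MAX) for a policy computed with the earlier sketch.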
Reward-rate maximization

A related point is that many decision-making tasks (including those considered in the meta-analyses mentioned above) do not carefully control the reward structure of the experiment. Many studies instruct the participant simply to be "as fast and accurate as possible". The model we consider in this study is unable to make predictions about the optimal shape of boundaries in these tasks, because it is not clear what the participant is optimizing. It could be that when the goal of the participant is not precisely related to their performance, they adopt a strategy such as "favor accuracy over speed" or "minimize the time spent in the experiment and leave as quickly as possible, without committing a socially unacceptable proportion of errors" (Hawkins, Brown, Steyvers, & Wagenmakers, 2012). On the other hand, it could also be that when people are given instructions that precisely relate their performance to reward, the cost required to estimate the optimal strategy is too high and people simply adopt a heuristic – a fixed threshold – that does a reasonable job during the task.

More generally, one could question whether people indeed try to maximize reward rate while making sequential decisions, and hence the relevance of reward-rate maximizing policies for empirical research. After all, a number of studies have found that people tend to overvalue accuracy and set decision boundaries that are wider than warranted by maximizing reward rate (Maddox & Bohil, 1998; Bohil & Maddox, 2003; Myung & Busemeyer, 1989), especially with an increase in the difficulty of trials (Balci et al., 2011; Starns & Ratcliff, 2012) and with an increase in the speed of the decision-making task (Simen et al., 2009). To explain this behavior, a set of studies has investigated alternative objective functions (Bohil & Maddox, 2003; Bogacz et al., 2006; Zacksenhouse et al., 2010). For example, Bogacz, Hu, Holmes and Cohen (2010) found that only about 30% of participants set the boundaries to the level maximizing reward rate. In contrast, the bounds set by the majority of participants could be better explained by maximization of a modified reward rate which includes an additional penalty (in the form of a negative reward) after each incorrect trial, although no such penalty was given in the actual experiment (Bogacz et al., 2010). Analogous virtual penalties for errors imposed by the participants themselves can easily be incorporated in the proposed framework by making R_I more negative (Eq. 8).
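Such a virtual penalty can be illustrated with the same fixed-bound reward-rate calculation used earlier: making the reward for an error more negative pushes the best fixed bound upward, i.e., towards more cautious behavior. This is a sketch of the idea, with illustrative parameters, not the specific modified reward rate fitted by Bogacz et al. (2010).

def reward_rate_with_penalty(a, u, D, r_error):
    """Reward rate for a fixed bound +/-a when errors earn r_error (<= 0);
    a more negative r_error mimics a self-imposed penalty for mistakes."""
    r = (1 - u) / u
    p_c = (1 - r**a) / (1 - r**(2 * a))
    mean_t = a * (2 * p_c - 1) / (2 * u - 1)
    return (p_c * 1.0 + (1 - p_c) * r_error) / (mean_t + D)

for r_err in (0.0, -1.0, -3.0):
    best = max(range(1, 15), key=lambda a: reward_rate_with_penalty(a, 0.6, 150, r_err))
    print(r_err, best)

With u = 0.6 and D = 150, the best fixed bound grows as the error penalty becomes more negative, which is the qualitative pattern used to explain overly wide empirical boundaries.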
However, understanding the behavior that maximizes reward rate is important for several reasons. Firstly, recent evidence indicates that the decision boundaries adopted by human participants approach reward-rate optimizing boundaries in single difficulty tasks, provided participants get enough training and feedback (Evans & Brown, 2016). This suggests that people use reward rate to learn the decision boundaries over a sequence of trials.

Secondly, the shape of the reward landscape may explain why people adopt more cautious strategies than warranted by maximizing reward rate. In a recent set of experiments, we used an expanded judgment task to directly infer the decision boundaries adopted by participants and found that participants may be choosing decision boundaries that trade off maximizing reward rate against the cost of errors in the boundary setting (Malhotra et al., in press). That is, we considered participants' decision boundaries on a "reward landscape" that specifies how reward rate varies as a function of the height and slope of the decision boundary. We noted that these landscapes were asymmetrical around the maximum reward rate, so that an error in the "wrong" direction would incur a large cost. Participants were generally biased away from this "cliff edge" in the reward landscape. Importantly, across a range of experiments, participants were sensitive to experimental manipulations that modified this global reward landscape. That is, participants shifted their decision boundaries in the direction predicted by the optimal policies shown in Fig. 3 when the task switched from single to mixed difficulties. This happened even when the task was fast-paced and participants were given only a small amount of training on each task. Thus, even though people may not be able to maximize reward rate, they are clearly sensitive to reward-rate manipulations and respond adaptively to such manipulations by changing their decision boundaries.

Lastly, the optimal policies predicted by the dynamic programming procedure above provide a normative target for the (learned or evolved) mechanism used by people to make decisions. Thus, these normative models provide a framework for understanding empirical behavior; if people deviate systematically from these optimal decisions, it will be insightful to understand why, and under what conditions, they deviate from a policy that maximizes the potential reward, and how these alternative objective functions relate to reward-rate maximization.

Assumptions and generalizations

We have made a number of assumptions in this study with the specific aim of establishing the minimal conditions for time-varying decision boundaries and exploring how properties of a decision (such as difficulty) affect the shape of decision bounds.

Firstly, note that in contrast to previous accounts that use dynamic programming to establish optimal decision boundaries (e.g., Drugowitsch et al., 2012; Huang & Rao, 2013), we compute optimal policies directly in terms of evidence and time, rather than (posterior) belief and time. There are two motivations for doing this. Firstly, our key goal here is to understand the shape of optimal decision boundaries for sequential sampling models, which define boundaries in terms of evidence. Indeed, most studies that have aimed to test whether decision boundaries collapse do so by fitting sequential sampling or accumulator models to reaction time and error data (Ditterich, 2006; Drugowitsch et al., 2012; Hawkins et al., 2015; Voskuilen et al., 2016). Secondly, we do not want to assume that the decision-making system necessarily computes posterior beliefs. This means that the decision-making process that aims to maximize reward rate can be implemented by a physical system integrating sensory input. For an alternative approach, see Drugowitsch et al. (2012), who use dynamic programming to compute the optimal boundaries in belief space and then map these boundaries to evidence space.
Next, a key assumption is that policies can be compared on the basis of reward rate. While reward rate is a sensible scale for comparing policies, it may not always be the ecologically rational measure. In situations where the number of observations is limited (e.g., Rapoport & Burkheimer, 1971; Lee & Zhang, 2012) or the time available for making a decision is limited (e.g., Frazier & Yu, 2007), the decision-maker should maximize the expected future reward rather than the reward rate. If the number of decisions is fixed and time is not a commodity, then the decision-maker should maximize accuracy. In general, the ecological situation or the experiment's design will determine the scale on which policies can be compared.

Another simplifying assumption is that the decision-maker's environment remains stationary over time. In more ecologically plausible situations, parameters such as the drift rate, inter-stimulus interval and reward per decision will vary over time. For example, the environment may switch from being plentiful (high expectation of reward) to sparse (low expectation of reward). In these situations, each trial will inform the decision-maker about the state of the environment, and the normative decision-maker should adapt the boundary from trial to trial based on the inferred state. Such an adaptive model was first examined by Vickers (1979), who proposed a confidence-based adjustment of the decision boundary from trial to trial. Similarly, Simen, Cohen and Holmes (2006) proposed a neural network model that continuously adjusts the boundary based on an estimate of the current reward rate.

Recent experiments have shown that participants indeed respond to environmental change by adapting the gain of each stimulus (Cheadle et al., 2014) or the total amount of evidence examined for each decision (Lee, Newell, & Vandekerckhove, 2014). Lee et al. (2014) found that sequential sampling models can capture participant behavior in such environments by incorporating a regulatory mechanism like confidence, i.e., a confidence-based adjustment of the decision boundary. However, they also found large individual differences in the best-fitting model and in the parameters chosen for the regulatory mechanism. An approach that combines mechanistic models, such as those examined by Lee et al. (2014), and normative models, such as the one discussed above, could explain why these individual differences occur and how strategies differ with respect to a common currency, such as the average reward.

Acknowledgements This research was carried out as part of the project 'Decision-making in an unstable world', supported by the Engineering and Physical Sciences Research Council (EPSRC), Grant Reference EP/1032622/1. The funding source had no role other than financial support. Additionally, RB was supported by Medical Research Council grant MC UU 12024/5, and GM and CL were supported by EPSRC grant EP/M000885/1. All authors contributed to the development of the theory, carrying out the computations and writing of the manuscript. All authors have read and approved the final manuscript. All authors state that there are no conflicts of interest that may inappropriately impact or influence the research and interpretation of the findings.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Appendix A: Eventually it is optimal to go at zero

In this appendix we show that, under a mild condition, it will be optimal to guess a hypothesis when the evidence level x is 0 for sufficiently large t. In other words, the bounds do eventually collapse to 0. The situation we envisage is one in which some of the trials in the mixed condition have zero drift, i.e., ε_d = 0, so that

$$p_0 := P\!\left(U = \tfrac12\right) > 0, \qquad P\!\left(U = \tfrac12 + \epsilon_e\right) = P\!\left(U = \tfrac12 - \epsilon_e\right) = \frac{1 - p_0}{2}.$$

We also assume that the decision-maker gets a unit reward for making correct decisions and no reward for making incorrect decisions, and that there is a fixed inter-trial delay D between taking a go action and returning to the zero-value state (0, 0).

We start with some observations. First note that the decision-maker could use a policy which always guesses at t = 0. This policy scores on average 1/2 per trial, and trials take D time units (since there is no time spent gathering evidence). Hence the average reward per unit time of this guessing policy is 1/(2D). The optimal policy π̂ will therefore have average reward per unit time ρ^π̂ ≥ 1/(2D). Similarly, an oracle policy can guess correctly at time 0 and achieve an average reward per unit time of 1/D; since the performance of any policy is bounded above by this perfect instant guessing policy, ρ^π̂ ≤ 1/D. We have shown that

$$\frac{1}{2D} \le \rho^{\hat\pi} \le \frac{1}{D}. \qquad (10)$$

Along similar lines, and recalling that we have fixed v_(0,0) = 0, note that the maximum possible reward resulting from choosing a hypothesis is equal to 1, and there will be a delay of at least D in transitioning from (t, x) to (0, 0), so for any π, x and t

$$v^{\pi}_{(t,x)} \le 1 - \rho^{\pi} D. \qquad (11)$$

We now prove that, for a sufficiently large T, the optimal action in the state (T, 0) is go. This will be true if the value of taking the action go in state (T, 0) is larger than that of taking the action wait. If we denote the value of taking action a in state (t, x) under policy π by Q^π_(t,x)(a), we can write this condition as Q^π_(T,0)(go) > Q^π_(T,0)(wait). Note that Q^π_(T,0)(go) = 1/2 − ρ^π D, since the selected hypothesis is correct with probability 1/2 and there is then a delay of D before returning to the zero-value state (0, 0). Therefore, we would like to prove that Q^π_(T,0)(wait) < 1/2 − ρ^π D.

Now, consider a time window of duration Δ after T. The value of waiting at (T, 0) will depend on one of two future outcomes during this time window: either the decision-maker will choose the action go after τ < Δ steps, achieving a value of Q^π_(T+τ,x)(go) and incurring an additional waiting cost of ρ^π τ, or the decision-maker will still be waiting at time T + Δ, achieving value v^π_(T+Δ,x) but incurring an additional waiting cost of ρ^π Δ. Therefore the value of waiting at (T, 0) is a convex combination of the time-penalized values of these future outcomes, so

$$Q^{\pi}_{(T,0)}(\textit{wait}) \le \max\left\{ \max_{1 \le \tau < \Delta,\ |x| \le \tau} \left\{ Q^{\pi}_{(T+\tau,x)}(\textit{go}) - \rho^{\pi}\tau \right\},\ v^{\pi}_{(T+\Delta,x)} - \rho^{\pi}\Delta \right\}. \qquad (12)$$

Note that by Eq. 11, v^π_(T+Δ,x) ≤ 1 − ρ^π D. Also note that Q^π_(T+τ,x)(go) is the expected instantaneous reward from the action, less a time penalty of ρ^π D. We will show below that, for any η > 0, we can choose T sufficiently large that this expected instantaneous reward is less than or equal to 1/2 + η. Therefore Q^π_(T+τ,x)(go) ≤ 1/2 + η − ρ^π D. Hence

$$Q^{\pi}_{(T,0)}(\textit{wait}) \le \max\left\{ \max_{1 \le \tau < \Delta} \left\{ \tfrac12 + \eta - \rho^{\pi}(\tau + D) \right\},\ 1 - \rho^{\pi}D - \rho^{\pi}\Delta \right\}.$$

For an interval Δ = 2D,

$$Q^{\pi}_{(T,0)}(\textit{wait}) \le \max\left\{ \tfrac12 + \eta - \rho^{\pi},\ 1 - 2\rho^{\pi}D \right\} - \rho^{\pi}D.$$

Now 1 − 2ρ^π D ≤ 0 by Eq. 10, and if we choose η (which is an arbitrary constant) to be such that η < 1/(2D), then η < ρ^π and it follows that

$$Q^{\pi}_{(T,0)}(\textit{wait}) < \tfrac12 - \rho^{\pi}D = Q^{\pi}_{(T,0)}(\textit{go}).$$

The optimal action at (T, 0) is therefore to go.

It remains to show that, if T is sufficiently large, the expected instantaneous reward is bounded by 1/2 + η. The expected instantaneous reward in any state is equal to the size of the reward times the probability of receiving it.
Since we assume that the reward size is one unit and decision-makers receive a reward only for correct decisions, the expected instantaneous reward in a state (t, x) is p^go_(t,x)→C. From Eq. 7, we know that

$$p^{\textit{go}}_{(t,x)\to C} = \max\left\{ P(U \in U_+ \mid X_t = x),\ P(U \in U_- \mid X_t = x) \right\},$$

and in the special case where ε_d = 0, P(U ∈ U_+ | X_t = x) can be replaced by P(U = 1/2 + ε_e | X_t = x) + (1/2) P(U = 1/2 | X_t = x). Recall that each of the paths that reach X_t = x contains n_u = (t + x)/2 upward transitions and n_d = (t − x)/2 downward transitions. From Eq. 5 we have that

$$P(U = u \mid X_t = x) = \frac{u^{n_u}(1-u)^{n_d}\, P(U = u)}{\sum_{\tilde u \in U} \tilde u^{n_u}(1-\tilde u)^{n_d}\, P(U = \tilde u)}.$$

Hence, by Eq. 7, the expected instantaneous reward under the action go when x ≥ 0 (and the hypothesis + is selected) is

$$p^{\textit{go}}_{(t,x)\to C} = P\!\left(U = \tfrac12 + \epsilon_e \mid X_t = x\right) + \tfrac12 P\!\left(U = \tfrac12 \mid X_t = x\right) = \frac{(1 - 4\epsilon_e^2)^{(t-x)/2}(1 + 2\epsilon_e)^x (1 - p_0) + p_0}{(1 - 4\epsilon_e^2)^{(t-x)/2}\left[(1 + 2\epsilon_e)^x + (1 - 2\epsilon_e)^x\right](1 - p_0) + 2p_0}.$$

For fixed x, (1 − 4ε_e²)^((t−x)/2) → 0 as t → ∞, so that the expected reward from going at x converges to 1/2 as t becomes large. Since the maximization in Eq. 12 is over |x| ≤ τ < Δ, we can take T sufficiently large that the expected instantaneous reward from the action go in any state (T + τ, x) with 1 ≤ τ < Δ and |x| ≤ τ is less than 1/2 + η. So for any η we can say the following: for any x, for sufficiently large t ≥ T, the instantaneous reward for going at (t, x) is less than 1/2 + η. An identical calculation holds for x ≤ 0.
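The limiting behaviour derived above is easy to check numerically. The sketch below evaluates the closed-form expression for the expected instantaneous reward at a fixed evidence level and increasing t, using illustrative values ε_e = 0.2 and p_0 = 0.5.

def p_go_correct(t, x, eps_e=0.2, p0=0.5):
    """Expected instantaneous reward for 'go' at (t, x) with x >= 0, using the
    closed-form expression above (zero-drift trials have prior probability p0)."""
    a = (1 - 4 * eps_e**2) ** ((t - x) / 2)
    num = a * (1 + 2 * eps_e)**x * (1 - p0) + p0
    den = a * ((1 + 2 * eps_e)**x + (1 - 2 * eps_e)**x) * (1 - p0) + 2 * p0
    return num / den

for t in (10, 50, 200, 1000):
    print(t, round(p_go_correct(t, x=4), 4))

For fixed x = 4, the value decreases towards 1/2 as t grows, as the proof requires.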
Appendix B: Mapping experimental conditions to drifts

We now describe how we selected a set of drifts corresponding to each study in Table 1. Since these studies do not use an expanded judgment paradigm, we do not explicitly know the values of the drift parameter; instead, these studies specify a measurable property of the stimulus, such as motion coherence, that is assumed to correlate with the drift. However, we know how the participants performed in this task - i.e., their accuracy - during each of these coherence conditions. These accuracy levels constrain the possible range of drift rates that correspond to the motion coherences and can be used to infer an appropriate range of drifts.

Specifically, we used the following method to determine whether any given set of drift rates, {ε_1, ..., ε_n}, approximated the conditions for a study: (i) we used the dynamic programming procedure described in the main text to compute the optimal bounds for a mixed difficulty task with difficulties drawn from the given set of drifts and for a range of inter-trial delays, D; (ii) we simulated decisions by integrating noisy evidence to these optimal bounds, with the drift rate of each trial chosen randomly from the given set of drifts; (iii) we determined the accuracy levels, a_1, ..., a_n, of these simulated decisions; (iv) finally, we compared the accuracy levels in the original study with the accuracy levels for each drift in the simulated decisions. We rejected any given set of drifts that underestimated or overestimated the empirically observed range of accuracies for the chosen range of inter-trial delays. This left us with a set of drifts, shown in Table 1, that approximately matched the level of accuracies in the original study.

For example, Fig. 12 shows the accuracies for decisions simulated in the above manner by computing optimal bounds for two different sets of drifts: {0, 0.03, 0.05, 0.10, 0.20, 0.40} and {0.03, 0.04, 0.06, 0.08, 0.11, 0.15}. Each mixture contains six different difficulties, just like the original study conducted by Palmer et al. (2005). We performed these simulations for a range of inter-trial delays, and Fig. 12 shows three such delays. The (yellow) bar on the left of each panel shows the empirically observed range of accuracies. It is clear that the range of difficulties in Fig. 12b considerably underestimates the empirically observed range of accuracies and is therefore not an appropriate approximation of the difficulties used in the original study. On the other hand, the range of difficulties in Fig. 12a captures the observed range of accuracies for a variety of inter-trial delays.

Figure 12 also illustrates that the mapping between drift rates and error rates is complex, since the parameter space is highly multi-dimensional - with accuracy a function of the inter-trial delay as well as the n values in the set {ε_1 ... ε_n}. In order to choose an appropriate mapping, we explored a considerable range of delays and sets of drifts. While the complexity of this parameter space makes it difficult to be absolutely certain that there is no combination of a set of drifts and delays for which more strongly decreasing boundaries are seen, our observation was that the optimal boundaries shown in Fig. 11 were fairly typical of each study for reasonable choices of parameters. Where more strongly decreasing boundaries were seen, they were (a) still much shallower than the optimal boundaries for the mixture of two difficulties (ε ∈ {0, 0.20}, as conveyed in the insets of Fig. 11), and (b) the predicted accuracy levels did not match those that were empirically observed.

Fig. 12 A comparison of accuracies between decisions simulated from optimal bounds and decisions performed by participants. The two panels show the accuracy of 10,000 simulated decisions during two mixed difficulty tasks, each of which uses a mixture of six difficulty levels but differs in the range of difficulties. Panel (a) uses a large range of drifts, [0, 0.40], while panel (b) uses a comparatively smaller range, [0.03, 0.15]. Squares, circles, and triangles show these accuracies for an inter-trial delay, D, of 100, 200, and 300 time units, respectively. The yellow bar on the left of each panel shows the range of accuracies observed in Experiment 1 of Palmer et al. (2005), which used six different motion coherence levels: {0%, 3.2%, 6.4%, 12.8%, 25.6%, 51.2%}.
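Steps (ii)-(iv) of this matching procedure can be sketched as follows (illustrative Python, not the authors' code; a flat bound stands in for the optimal bound of step (i), which is a simplification, and the candidate drift set is the one compared in Fig. 12a).

import numpy as np

rng = np.random.default_rng(0)

def simulated_accuracies(drifts, bound, n_trials=2000):
    """Steps (ii)-(iii): integrate noisy samples to a bound and record the
    accuracy for each drift level in the candidate set."""
    acc = {}
    for eps in drifts:
        u, correct = 0.5 + eps, 0
        for _ in range(n_trials):
            x = 0
            while abs(x) < bound:
                x += 1 if rng.random() < u else -1
            correct += (x >= bound)
        acc[eps] = correct / n_trials
    return acc

candidate = [0, 0.03, 0.05, 0.10, 0.20, 0.40]
acc = simulated_accuracies(candidate, bound=10)
print(acc)   # step (iv): compare this range with the empirically observed accuracies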
References

Ashby, F. G. (1983). A biased random walk model for two choice reaction times. Journal of Mathematical Psychology, 27(3), 277–.
Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision-making criteria: reward rate ultimately beats accuracy. Attention, Perception, & Psychophysics, 73(2), 640–657.
Bellman, R. (1957). Dynamic programming. Princeton, N.J.: Princeton University Press.
Bernoulli, J. (1713). Ars conjectandi. Impensis Thurnisiorum, fratrum.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision-making: a formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review, 113(4), 700–765.
Bogacz, R., Hu, P. T., Holmes, P. J., & Cohen, J. D. (2010). Do humans produce the speed–accuracy trade-off that maximizes reward rate? The Quarterly Journal of Experimental Psychology, 63(5), 863–.
Bohil, C. J., & Maddox, W. T. (2003). On the generality of optimal versus objective classifier feedback effects on decision criterion learning in perceptual categorization. Memory & Cognition, 31(2), 181–198.
Busemeyer, J. R., & Rapoport, A. (1988). Psychological models of deferred decision-making. Journal of Mathematical Psychology, 32(2), 91–134.
Cheadle, S., Wyart, V., Tsetsos, K., Myers, N., De Gardelle, V., Castañón, S. H., & Summerfield, C. (2014). Adaptive gain control during human perceptual choice. Neuron, 81(6), 1429–1441.
Deneve, S. (2012). Making decisions with unknown sensory reliability. Frontiers in Neuroscience, 6.
Diederich, A., & Busemeyer, J. R. (2006). Modeling the effects of payoff on response bias in a perceptual discrimination task: Bound-change, drift-rate-change, or two-stage-processing hypothesis. Perception & Psychophysics, 68(2), 194–207.
Ditterich, J. (2006). Stochastic models of decisions about motion direction: behavior and physiology. Neural Networks, 19(8), 981–.
Drugowitsch, J., Moreno-Bote, R., Churchland, A. K., Shadlen, M. N., & Pouget, A. (2012). The cost of accumulating evidence in perceptual decision-making. Journal of Neuroscience, 32, 3612–.
Edwards, W. (1965). Optimal strategies for seeking information: Models for statistics, choice reaction times, and human information processing. Journal of Mathematical Psychology, 2(2), 312–.
Evans, N. J., & Brown, S. D. (2016). People adopt optimal policies in simple decision-making, after practice and guidance. Psychonomic Bulletin & Review, 1–10. doi:10.3758/s13423-016-1135-1.
Frazier, P., & Yu, A. J. (2007). Sequential hypothesis testing under stochastic deadlines. In Advances in neural information processing systems (pp. 465–472).
Ghosh, B. K. (1991). A brief history of sequential analysis. Handbook of Sequential Analysis, 1.
Hanks, T. D., Mazurek, M. E., Kiani, R., Hopp, E., & Shadlen, M. N. (2011). Elapsed decision time affects the weighting of prior probability in a perceptual decision task. The Journal of Neuroscience, 31(17), 6339–6352.
Hawkins, G. E., Brown, S. D., Steyvers, M., & Wagenmakers, E.-J. (2012). An optimal adjustment procedure to minimize experiment time in decisions with multiple alternatives. Psychonomic Bulletin & Review, 19(2), 339–348.
Hawkins, G. E., Forstmann, B. U., Wagenmakers, E.-J., Ratcliff, R., & Brown, S. D. (2015). Revisiting the evidence for collapsing boundaries and urgency signals in perceptual decision-making. The Journal of Neuroscience, 35(6), 2476–2484.
Howard, R. A. (1960). Dynamic programming and Markov processes. New York: Wiley.
Frontiers in Neuroscience,6. Accuracy (percent correct) Accuracy (percent correct) 996 Psychon Bull Rev (2018) 25:971–996 Huang, Y., Hanks, T., Shadlen, M., Friesen, A. L., & Rao, R. P. (2012). Ratcliff, R. (1978). A theory of memory retrieval. Psychological How prior probability influences decision-making: a unifying Review, 83, 59–108. probabilistic model. In Advances in neural information processing Ratcliff, R. (1985). Theoretical interpretations of the speed and accu- systems (pp. 1268–1276). racy of positive and negative responses. Psychological Review, Huang, Y., & Rao, R. P. (2013). Reward optimization in the primate 92(2), 212–225. brain: a probabilistic model of decision-making under uncertainty. Ratcliff, R., Hasegawa, Y. T., Hasegawa, R. P., Smith, P. L., & PloS one, 8(1), e53344. Segraves, M. A. (2007). Dual diffusion model for single-cell Kiani, R., & Shadlen, M. N. (2009). Representation of confidence recording data from the superior colliculus in a brightness- associated with a decision by neurons in the parietal cortex. discrimination task. Journal of Neurophysiology, 97(2), 1756– Science, 324, 759–764. 1774. LaBerge, D. (1962). A recruitment theory of simple behavior. Psy- Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: the- chometrika, 27(4), 375–396. ory and data for two-choice decision tasks. Neural Computation, Laming, D. R. J. (1968). Information theory of choice-reaction times. 20(4), 873–922. London: Academic Press. Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sam- Laplace, P.-S. (1774). Memoire ´ sur les suites recurro-r ´ ecurrentes ´ et sur pling models for two-choice reaction time. Psychological Review, leurs usages dans la theorie ´ des hasards. Memoir ´ es de l’Academie ´ 111(2), 333. Royale des Sciences Paris, 6, 353–371. Ratcliff, R., Thapar, A., & McKoon, G. (2001). The effects of aging Laplace, P.-S. (1812). Theorie ´ Analytique des probabilites ´ .Paris: on reaction time in a signal detection task. Psychology and Aging, Courcier. 16(2), 323. Lee, M. D., Newell, B. R., & Vandekerckhove, J. (2014). Modeling Ross, S. (1983). Introduction to stochastic dynamic programming. the adaptation of search termination in human decision-making. New York: Academic Press. Decision, 1(4), 223–251. Simen, P., Cohen, J. D., & Holmes, P. (2006). Rapid decision threshold Lee, M. D., & Zhang, S. (2012). Evaluating the coherence of take-the- modulation by reward rate in a neural network. Neural Networks, best in structured environments. Judgment and Decision Making, 19(8), 1013–1026. 7(4). Simen, P., Contreras, D., Buck, C., Hu, P., Holmes, P., & Cohen, Link, S., & Heath, R. (1975). A sequential theory of psychological J. D. (2009). Reward rate optimization in two-alternative decision- discrimination. Psychometrika, 40(1), 77–105. making: empirical tests of theoretical predictions. Journal of Maddox, W. T., & Bohil, C. J. (1998). Base-rate and payoff effects Experimental Psychology: Human Perception and Performance, in multidimensional perceptual categorization. Journal of Exper- 35(6), 1865. imental Psychology: Learning Memory, and Cognition, 24(6), Starns, J. J., & Ratcliff, R. (2012). Age-related differences in dif- 1459. fusion model boundary optimality with both trial-limited and Malhotra, G., Leslie, D. S., Ludwig, C. J., & Bogacz, R. (in press). time-limited tasks. Psychonomic Bulletin & Review, 19(1), 139– Overcoming indecision by changing the decision boundary. Jour- 145. nal of Experimental Psychology: General, 146(6), 776. 
Stone, M. (1960). Models for choice-reaction time. Psychometrika, Middlebrooks, P. G., & Schall, J. D. (2014). Response inhibition 25(3), 251–260. during perceptual decision-making in humans and macaques. Summerfield, C., & Koechlin, E. (2010). Economic value biases uncer- Attention, Perception, & Psychophysics, 76(2), 353–366. tain perceptual choices in the parietal and prefrontal cortices. Moran, R. (2015). Optimal decision-making in heterogeneous and Frontiers in human neuroscience,4. biased environments. Psychonomic Bulletin & Review, 22(1), 38– Thura, D., Cos, I., Trung, J., & Cisek, P. (2014). Context-dependent 53. urgency influences speed–accuracy trade-offs in decision-making Mulder, M. J., Wagenmakers, E.-J., Ratcliff, R., Boekel, W., & and movement execution. The Journal of Neuroscience, 34(49), Forstmann, B. U. (2012). Bias in the brain: a diffusion model 16442–16454. analysis of prior probability and potential payoff. The Journal of Vickers, D. (1970). Evidence for an accumulator model of psy- Neuroscience, 32(7), 2335–2343. chophysical discrimination. Ergonomics, 13(1), 37–58. Myung, I. J., & Busemeyer, J. R. (1989). Criterion learning in a Vickers, D. (1979). Decision processes in visual perception. Academic deferred decision-making task. The American journal of psychol- Press. ogy, pp. 1–16. Voskuilen, C., Ratcliff, R., & Smith, P. L. (2016). Comparing fixed Palmer, J., Huk, A. C., & Shadlen, M. N. (2005). The effect of stim- and collapsing boundary versions of the diffusion model. Journal ulus strength on the speed and accuracy of a perceptual decision. of Mathematical Psychology, 73, 59–79. Journal of vision, 5(5). Wald, A. (1945a). Sequential method of sampling for deciding Pitz, G. F., Reinhold, H., & Geller, E. S. (1969). Strategies of between two courses of action. Journal of the American Statistical information seeking in deferred decision-making. Organizational Association, 40(231), 277–306. Behavior and Human Performance, 4(1), 1–19. Wald, A. (1945b). Sequential tests of statistical hypotheses. The Annals Pollock, S. M. (1964). Sequential search and detection. Cambridge: of Mathematical Statistics, 16(2), 117–186. MIT. (Unpublished doctoral dissertation). Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential Puterman, M. L. (2005). Markov decision processes: Discrete stochas- probability ratio test. The Annals of Mathematical Statistics, 19(3), tic dynamic programming. New Jersey: Wiley. 326–339. Rapoport, A., & Burkheimer, G. J. (1971). Models for deferred Zacksenhouse, M., Bogacz, R., & Holmes, P. (2010). Robust versus decision-making. Journal of Mathematical Psychology, 8(4), 508– optimal strategies for two-alternative forced choice tasks. Journal 538. of Mathematical Psychology, 54(2), 230–246. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Psychonomic Bulletin & Review Springer Journals

3 How many (up/down) observations should the advisor make MRC Brain Networks Dynamics Unit, University of Oxford, Oxford, UK for each asset before giving a recommendation? 972 Psychon Bull Rev (2018) 25:971–996 Sequential decision problems boundary. Also like the SPRT, the standard sequential sam- pling account assumes that this decision boundary remains The type of problem described above is at the heart of constant during a decision. In fact, Bogacz et al. (2006) sequential analysis and has been investigated by researchers showed that, under certain assumptions, including the from Bernoulli (1713) and Laplace, (1774, 1812) to mod- assumption that all decisions in a sequence are of the same ern day statisticians (for a review, see Ghosh, 1991). This difficulty, the decision-maker can maximize their reward problem is also directly relevant to the psychology and rate by employing the SPRT and maintaining an appro- neuroscience of decision-making. Many decision-making priately chosen threshold that remains constant within and problems, including perceptual decisions (how long to sam- across trials. In the above example, this means that if the ple sensory information before choosing an option) and financial advisor chose the stopping criterion, stop sampling foraging problems (how long to forage at the current patch if you observe three more ups than downs (or vice-versa), before moving to the next patch), can be described in the they stick with this criterion irrespective of whether they form above. The decision-maker has to make a series of have observed ten values of an asset or a hundred. choices and the information needed to make these choices A number of recent studies have challenged this account is spread out over time. The decision-maker wants to max- from both an empirical and a theoretical perspective, argu- imize their earnings by attempting as many decisions as ing that in many situations decision-makers decrease the possible in the allocated time. Sampling more information decision boundary with time and that it is optimal for them (up/down observations) allows them to be more accurate in to do so (Drugowitsch, Moreno-Bote, Churchland, Shadlen, their choices, at the expense of the number of decision prob- & Pouget, 2012; Huang & Rao, 2013; Thura, Cos, Trung, & lems that can be attempted. Therefore the speed of decisions Cisek, 2014;Moran, 2015). The intuition behind these stud- trades off with their accuracy and the decision-maker must ies is that instead of determining the decision boundaries solve (i) the stopping problem, i.e., decide how much infor- based on minimizing the average sample size at a desired mation to sample before indicating their decision, and (ii) level of accuracy (as some formulations of SPRT do), the decision problem, i.e., which alternative to choose, in decision-makers may want to maximize the expected reward such a way that they are able to maximize their earnings. earned per unit time, i.e., the reward rate. Psychological The stopping problem was given a beautifully simple studies and theories of decision-making generally give lit- solution by Wald (1945b), who proposed the following tle consideration to the reward structure of the environment. 
sequential procedure: after each sample (up/down obser- Participants are assumed to trade-off between accuracy and vation), compute the likelihood ratio, λ ,ofthesamples reaction time in some manner that is consistent with the— (X ,...,X ) and choose the first alternative (buy)if λ ≥ typically vague—experimenter instructions (e.g., “try to be 1 n n A and second alternative (sell)if λ ≤ B, otherwise con- as fast and accurate as possible”). Models integrating to a tinue sampling for n = 1, 2,...,where A and B are two fixed threshold often work well for these situations, giving suitably chosen constants. This procedure was given the good accounts for participants’ accuracy and reaction time name the sequential probability ratio test (SPRT). Wald distributions. However, it has been shown that using models (1945a, b) and Wald and Wolfowitz (1948) showed that integrating to a fixed threshold leads to sub-optimal reward the SPRT is optimal in the sense that it can guarantee a rate in heterogeneous environments—i.e., when decisions required level of accuracy (both Type 1 and Type 2 errors vary in difficulty (Moran, 2015). This leads to the natu- are bounded) with a minimum average sample size (number ral question: how should the decision-maker change their of up /down observations made). decision-boundary with time if their aim was to maximize This sequential procedure of continuing to sample evi- the reward rate. dence until a decision variable (likelihood ratio for SPRT) has crossed a fixed threshold also forms the basis for Optimal decision boundaries in sequential decision the most widely used psychological account of decision- problems making. This account consists of a family of models, which are collectively referred to as sequential sampling mod- A number of models have been used to compute the optimal els (Stone, 1960; LaBerge, 1962; Laming, 1968; Link & decision boundaries in sequential decision-making. These Heath, 1975; Vickers, 1970; Ratcliff, 1978) and have been models differ in (a) how the decision problem is formu- applied to a range of decision tasks over the last 50 years lated, and (b) whether the decision boundary is assumed to (for reviews, see Ratcliff & Smith, 2004; Bogacz, Brown, be fixed across trials or vary from one trial to next. Moehlis, Holmes, & Cohen, 2006). Like the SPRT, sequen- tial sampling models propose that decision-makers solve the stopping problem by accumulating evidence in favor 1 Throughout this article, we use ‘threshold’ to refer to a decision of each alternative until this evidence crosses a decision boundary that remains constant within and across trials. Psychon Bull Rev (2018) 25:971–996 973 Rapoport and Burkheimer (1971) modeled the deferred The present analysis decision-making task (Pitz, Reinhold, & Geller, 1969) where the maximum number of observations were fixed The principal aim of this paper is to identify the minimal in advance (and known to the observer) and making each conditions needed for time-varying decision boundaries, observation carried a cost. There was also a fixed cost under the assumption that the decision-maker is trying to for incorrect decisions and no cost for correct decisions. maximize the reward rate. 
We will develop a generic pro- Rapoport and Burkheimer used dynamic programming cedure that enables identification of the optimal decision (Bellman, 1957; Pollock, 1964) to compute the policy that boundaries for any discrete, sequential decision problem minimized the expected loss and found that the optimal described at the beginning of this article. In contrast to the boundary collapsed as the number of observations remain- problems considered by Rapoport and Burkheimer (1971) ing in a trial decreased. Busemeyer and Rapoport (1988) and Frazier and Yu (2007), we will show that the pressure found that, in such a deferred decision-making task, though of an approaching deadline is not essential for a decrease in people did not appear to follow the optimal policy, they decision boundaries. did seem to vary their decision boundary as a function of In contrast to Drugowitsch et al. (2012), we do not number of remaining observations. assume any explicit cost for making observations and A similar problem was considered by Frazier and Yu show that optimal boundaries may decrease even when (2007) but instead of assuming that the maximum num- making observations carries no explicit cost. Furthermore, ber of observations was fixed, they assumed that this unlike the very general setup of Drugowitsch et al. (2012) number was drawn from a known distribution and there and Huang and Rao (2013), we make several simplifying was a fixed cost for crossing this stochastic deadline. assumptions in order to identify how the shape of optimal Like Rapoport and Burkheimer (1971), Frazier and Yu decision boundaries changes with the constituent difficul- showed that under the pressure of an approaching deadline, ties of the task. In particular, in the initial exposition of the the optimal policy is to have a monotonically decreasing model, we restrict the difficulty of each decision to be one decision-boundary and the slope of boundaries increased of two possible levels (though see the Discussion for a sim- with the decrease in the mean deadline and an increase in its ple extension to more than two difficulties). In doing so, we variability. reveal three key results: (i) we show that optimal boundaries Two recent studies analyzed optimal boundaries for a must decrease to zero if the mixture of difficulties involves decision-making problem that does not constrain the max- some trials that are uninformative, (ii) the shape of optimal imum number of observations. Drugowitsch et al. (2012) boundaries depends on the inter-trial interval for incorrect considered a very general problem where the difficulty of decisions but not correct decisions (provided the latter is each decision in a sequence is drawn from a Gaussian smaller) and (iii) we identified conditions under which the or a general symmetric point-wise prior distribution and optimal decision boundaries increase (rather than decrease) accumulating evidence comes at a cost for each observa- with time within a trial. In fact, we show that optimal deci- tion. Using the principle of optimality (Bellman, 1957), sion boundaries decrease only under a very specific set Drugowitsch et al. showed that under these conditions, the of conditions. This analysis particularly informs the ongo- reward rate is maximized if the decision-maker reduces their ing debate on whether people and primates decrease their decision boundaries with time. 
Similarly, Huang and Rao decision boundaries, which has focused on analyzing data (2013) used the framework of partially observed Markov from existing studies to infer evidence of decreasing bound- decision processes (POMDP) to show that expected future aries (e.g., Hawkins, Forstmann, Wagenmakers, Ratcliff, & reward is maximized if the decision-maker reduces the Brown, 2015; Voskuilen, Ratcliff, & Smith, 2016). The evi- decision boundary with time. dence on this point is mixed. Our study suggests that such In contrast to the dynamic programming models men- inconsistent evidence may be due to the way decision dif- tioned above, Deneve (2012) considered an alternative theo- ficulties in the experiment are mixed, as well as how the retical approach to computing decision boundaries. Instead reward structure of the experiment is defined. of assuming that decision boundaries are fixed (though Next, we extend this analysis to two situations which time-dependent) on each trial, Deneve (2012) proposed that are of theoretical and empirical interest: (i) What is the the decision boundary is set dynamically on each trial based influence of prior beliefs about the different decision alter- on an estimate of the trial’s reliability. This reliability is used natives on the shape of the decision boundaries? (ii) What to get an on-line estimate of the signal-to-noise ratio of the is the optimal decision-making policy when it is possible sensory input and update the decision boundary. By sim- to opt-out of a decision and forego a reward, but be spared ulating the model, Deneve found that decision boundaries the larger penalty associated with an incorrect choice? In maximize the reward rate if they decrease during difficult each case, we link our results to existing empirical research. trials, but increase during easy trials. When the decision-maker has unequal prior beliefs about 974 Psychon Bull Rev (2018) 25:971–996 the outcome of the decision, our computations show that the is given by the pair (t, X). Note that, in contrast to pre- optimal decision-maker should dynamically adjust the con- vious accounts that use dynamic programming to establish tribution of prior to each observation during the course of optimal decision boundaries (e.g., Drugowitsch et al., 2012; a trial. This is in line with the dynamic prior model devel- Huang & Rao, 2013), we compute optimal policies directly oped by Hanks, Mazurek, Kiani, Hopp and Shadlen (2011) in terms of evidence and time, rather than (posterior) belief but contrasts with the results observed by Summerfield and time. The reasons for doing so are elaborated in the Dis- and Koechlin (2010) and Mulder, Wagenmakers, Ratcliff, cussion. In any state, (t, X), the decision-maker can take Boekel and Forstmann (2012). Similarly, when it is possible one of two actions:(i) wait and accumulate more evidence to opt-out of a decision, the optimal decision-making pol- (observe asset value goes up/down), or (ii) go and choose icy shows that the decision-maker should choose this option the more likely alternative (buy/sell). only when decisions involve more than one difficulty (i.e., If action wait is chosen, the decision-maker observes the the decision-maker is uncertain about the difficulty of a outcome of a binary random variable, δX,where P(δX = decision) and only when the benefit of choosing this option 1) = u = 1 − P(δX =−1).The up-probability, u, is carefully calibrated. depends on the state of the world. 
We assume throughout that u ≥ 0.5 if the true state of the world is rising,and u ≤ 0.5 if the true state is falling. The parameter u also A theoretical model for optimal boundaries determines the trial difficulty. When u is equal to 0.5, the probability of each outcome is the same (equal probabil- ity of asset value going up/down); consequently, observing Problem definition an outcome is like flipping an unbiased coin, providing the We now describe a Markov decision process to model the decision-maker absolutely no evidence about which hypoth- stopping problem described at the beginning of this arti- esis is correct. On the other hand, if u is close to 1 or 0 cle. We consider the simplest possible case of this problem, (asset value almost always goes up/down), observing an out- where we: (i) restrict the number of choice alternatives to come provides a large amount of evidence about the correct two (buy or sell), (ii) assume that observations are made at hypothesis, making the trial easy. After observing δX the discrete (and constant) intervals, (iii) assume that observa- decision-maker transitions to a new state (t + 1,X + δX), tions consist of binary outcomes (up or down transitions), as a result of the progression of time and the accumula- and (iv) restrict the difficulty of each decision to one of two tion of the new evidence δX. Since the decision-maker does possible levels (assets could be rising (or falling) at one of not know the state of the world, and consequently does not know u, the distribution over the possible successor states two different rates). The decision-maker faces repeated decision-making (t +1,X ±1) is non-trivial and calculated below. In the most opportunities (trials). On each trial the world is in one of two general formulation of the model, an instantaneous cost (or possible states (asset is rising or falling), but the decision- reward) would be obtained on making an observation, but maker does not know which at the start of the trial. At a throughout this article we assume that rewards and costs series of times steps t = 1, 2, 3,... the decision-maker are only obtained when the decision-maker decides to select can choose to wait and accumulate evidence (observe if a go action. Thus, in contrast to some approaches (e.g., value of asset goes up or down). Once the decision-maker Drugowitsch et al., 2012), the cost of making an observation feels sufficient evidence has been gained, they can choose is 0. to go, and decide either buy or sell. If the decision is correct If action go is chosen then the decision-maker transitions (advisor recommends buy and asset is rising or advisor rec- to one of two special states, C or I, depending on whether the decision made after the go action is correct or incorrect. ommends sell and asset is falling), they receive a reward. If the decision is incorrect they receive a penalty. Under both As with transitions under wait, the probability that the deci- outcomes the decision-maker then faces a delay before start- sion is correct depends in a non-trivial way on the current ing the next trial. If we assume that the decision-maker will state, and is calculated below. From the states C and I,there undertake multiple trials, it is reasonable that they will aim is no action to take, and the decision-maker transitions to to maximize their average reward per unit time. A behav- the initial state (t, X) = (0, 0). 
From state C the decision- ioral policy which achieves the optimal reward per unit time maker receives a reward R and suffers a delay of D ; from C C will be found using average reward dynamic programming state I they receive a reward (penalty) of R and suffers a (Howard, 1960; Ross, 1983; Puterman, 2005). delay of D . In much of the theoretical literature on sequential sam- We formalize the task as follows. Let t = 1, 2,... be discrete points of time during a trial, and let X denote the pling models, it is assumed, perhaps implicitly, that the previous evidence accumulated by the decision-maker at decision-maker knows the difficulty level of a trial. This those points in time. The decision-maker’s state in a trial corresponds to knowledge that the up-probability of an Psychon Bull Rev (2018) 25:971–996 975 observation is u = 0.5 +  when the true state is rising, for fixed candidate policies. To do so, we must first deter- mine the state-transition probabilities under either action and u = 0.5 −  when the true state is falling.However,in (wait/go) from each state for a given set of drifts (Eqs. 6 ecologically realistic situations, the decision-maker may not and 7 below). These state-transition probabilities can then know the difficulty level of the trial in advance. This can be used to compare the wait and go actions in any given be modeled by assuming that the task on a particular trial state using the expected reward under each action in that is chosen from several different difficulties. In the example state. above, it could be that up / down observations come from different sources and some sources are noisier than oth- Computing state-transition probabilities ers. To illustrate the simplest conditions resulting in varying decision boundaries, we model the situation where there Computing the transition probabilities is trivial if one knows are only two sources of observations: an easy source with the up-probability, u, of the process generating the out- 1 1 u ∈ U ={ −  , +  } and a difficult source with e e e 2 2 comes: the probability of transitioning from (t, x) to (t + 1 1 1 u ∈ U ={ −  , +  },where  , ∈[0, ] are the d d d e d 2 2 2 1,x + 1) is u,and to (t + 1,x − 1) is 1 − u.However,when drifts of the easy and difficult stimuli, with  < . Thus, d e each trial is of an unknown level of difficulty, the observed during a difficult trial, u is close to 0.5, while for an easy outcomes (up/down) during a particular decision provide trial u is close to 0 or 1. We assume that these two types information not only about the correct final choice but also of tasks can be mixed in any fraction, with P(U ∈ U ) the about the difficulty of the current trial. Thus, the current probability that the randomly selected drift corresponds to state provides information about the likely next state under a an easy task in the perceptual environment. For now, we wait action, through information about the up-probability, assume that within both of U and U , u is equally likely to e d u. Therefore, the key step in determining the transition prob- be above or below 0.5—i.e., there is equal probability of the abilities is to infer the up-probability, u, based on the current assets rising and falling. In the section titled “Extensions of state and use this to compute the transition probabilities. 
the model” below, we will show how our results generalize As already specified, we model a task that has trials to the situation of unequal prior beliefs about the state of the drawn from two difficulties (it is straightforward to gener- world. alize to more than two difficulties): easy trials with u in the Figure 1a depicts evidence accumulation as a random 1 1 set U ={ −  , +  } and difficult trials with u in the set e e e 2 2 walk in two-dimensional space with time along the x-axis 1 1 U ={ − , + } (note that this does not preclude a zero d d d 2 2 and the evidence accumulated, X ,...,X , based on the 1 t drift condition,  = 0). To determine the transition proba- series of outcomes, +1, +1, −1, +1, along the y-axis. The bilities under the action wait, we must marginalize over the figure shows both the current state of the decision-maker set of all possible drifts, U = U ∪ U : e d at (t, X ) = (4, 2) and their trajectory in this state-space. wait In this current state, the decision-maker has two available p = P(X = x +1|X = x) t +1 t (t,x)→(t +1,x+1) actions: wait or go. As long as they choose to wait they = P(X = x +1|X = x, U = u) · P(U = u|X = x) t +1 t t will make a transition to either (5, 3) or (5, 1), depending u∈U wait wait on whether the next δX outcome is +1or −1. Figure 1b p = 1 − p (1) (t,x)→(t +1,x−1) (t,x)→(t +1,x+1) shows the transition diagram for the stochastic decision pro- where U is the (unobserved) up-probability of the current cess that corresponds to the random walk in Fig. 1a once trial. P(X = x +1|X = x, U = u) is the probability that t +1 t the go action is introduced. Transitions under go take the δX = 1 conditional on X = x and theup-probability being decision-maker to one of the states C or I, and subsequently u; this is simply u (the current evidence level X is irrelevant back to (0, 0) for the next trial. when we also condition on U = u). Allthatremains is to Our formulation of the decision-making problem has calculate the term P(U = u|X = x). stochastic state transitions, decisions available at each state, This posterior probability of U = u at the current state and transitions from any state (t, X) depending only on can be inferred using Bayes’ law: the current state and the selected action. This is there- P(X = x|U = u) · P(U = u) P(U = u|X = x) = fore a Markov decision process (MDP) (Howard, 1960; P(X = x|U =˜ u) · P(U =˜ u) u ˜∈U Puterman, 2005), with states (t, x) and the two dummy (2) states C and I corresponding to the correct and incorrect choice. A policy is a mapping from states (t, x) of this where P(U = u) is the prior probability of the up- MDP to wait/go actions. An optimal policy that maxi- probability being equal to u. The likelihood term, P(X = mizes the average reward per unit time in this MDP can be x|U = u), can be calculated by summing the probabili- determined by using the policy iteration algorithm (Howard, ties of all paths that would result in state (t, x).Weuse the standard observation about random walks that each of the 1960; Puterman, 2005). A key component of this algorithm t +x paths that reach (t, x) contains upward transitions and is to calculate the average expected reward per unit time 2 976 Psychon Bull Rev (2018) 25:971–996 (a) (b) Fig. 1 a Evidence accumulation as a random walk. Gray lines show current trajectory and black lines show possible trajectories if the decision- maker chooses to wait. 
b Evidence accumulation and decision-making as a Markov decision process: transitions associated with the action go are shown in dashed lines, while transitions associated with wait are shown in solid lines. The rewarded and unrewarded states are shown as C and I, respectively (for Correct and Incorrect) t −x downward transitions. Thus, the likelihood is given by where the term P(U = u|X = x) is given by Eq. 5. the summation over paths of the probability of seeing this Equation 6 gives the decision-maker the probability of an number of upward and downward moves: increase or decrease in evidence in the next time step if they choose to wait. (t +x)/2 (t −x)/2 (t +x)/2 (t −x)/2 P(X = x|U = u) = u (1−u) = n u (1−u) . t paths Similarly, we can work out the state-transition probabil- paths (3) ities under the action go. Under this action, the decision- maker makes a transition to either the correct or incorrect Here n is the number of paths from state (0, 0) to state paths state. The decision-maker will transition to the Correct state (t, x), which may depend on the current decision-making if they choose buy and the true state of the world is rising, policy. Plugging the likelihood into (2)gives 1 1 i.e., true u is in U ={ +  , +  }, or if they choose + e d 2 2 (t +x)/2 (t −x)/2 n u (1−u) P(U = u) paths sell and the true state of the world is falling, i.e., true u is in P(U = u|X = x) = . (t +x)/2 (t −x)/2 1 1 n u ˜ (1−˜ u) P(U =˜ u) paths u ˜∈U U ={ −  , −  } (assuming  > 0; see the end of this − e d d 2 2 (4) section for how to handle  = 0). The decision-maker will choose the more likely Some paths from (0, 0) to (t, x) would have resulted in a decision to go (based on the decision-making policy), and alternative–they compare the probability of the unobserved therefore could not actually have resulted in the state (t, x). drift U coming from the set U versus coming from the set Note, however, that the number of paths n is identical in paths U , given the data observed so far. The decision-maker will both numerator and denominator, so can be cancelled. respond buy when P(U ∈ U |X = x) > P(U ∈ U |X = + t − t x) and respond sell when P(U ∈ U |X = x) < P(U ∈ (t +x)/2 (t −x)/2 + t u (1 − u) P(U = u) P(U = u|X = x) =  . t U |X = x). The probability of these decisions being cor- − t (t +x)/2 (t −x)/2 u ˜ (1−˜ u) P(U =˜ u) u ˜∈U rect is simply the probability of the true states being rising (5) and falling respectively, given the information observed so Using Eq. 1, the transition probabilities under the action far. Thus when P(U ∈ U |X = x) > P(U ∈ U |X = x) + t − t wait can therefore be summarized as: the probability of a correct decision is P(U ∈ U |X = x), + t wait wait and when P(U ∈ U |X = x) < P(U ∈ U |X = x) the + t − t p = u · P(U = u|X = x) = 1 − p (t,x)→(t +1,x+1) (t,x)→(t +1,x−1) probability of a correct answer is P(U ∈ U |X = x); over- u∈U − t (6) all, the probability of being correct is the larger of P(U ∈ Psychon Bull Rev (2018) 25:971–996 977 U |X = x) and P(U ∈ U |X = x), meaning that the texts on stochastic dynamic programming such as Howard + t − t state transition probabilities for the optimal decision-maker (1960), Ross (1983) and Puterman (2005). 
The technique for the action go in state (t, x) are: searches for the optimal policy amongst the set of all poli- cies by iteratively computing the expected returns for all go { } p = max P(U ∈ U |X = x), P(U ∈ U |X = x) + t − t states for a given policy (step 1) and then improving the (t,x)→C policy based on these expected returns (step 2). go go p = 1 − p . (7) (t,x)→I (t,x)→C Step 1: Compute values of states for given π Assuming that the prior probability for each state of the world is the same, i.e., P(U ∈ U ) = P(U ∈ U ), + − To begin, assume that we have a current policy, π,which the posterior probabilities satisfy P(U ∈ U |X = x) > + t maps states to actions, and which may not be the optimal P(U ∈ U |X = x) if and only if the likelihoods satisfy − t policy. Observe that fixing the policy reduces the Markov P(X = x|U ∈ U )> P(X = x|U ∈ U ). In turn, this t + t − decision process to a Markov chain. If this Markov chain is inequality in the likelihoods holds if and only if x> 0. allowed to run for a long period of time, it will return an Thus, in this situation of equal prior probabilities, the opti- average reward ρ per unit time, independently of the initial mal decision-maker will select buy if x> 0and sell if state (Howard, 1960; Ross, 1983). However, the short-run go x< 0 so that the transition probability p is equal to expected earnings of the system will depend on the current (t,x)→C P(U ∈ U |X = x) when x> 0and P(U ∈ U |X = x) + t − t state, so that each state, (t, x), can be associated with a rel- when x< 0. ative value, v , that quantifies the relative advantage of (t,x) Note that when  = 0, a situation which we study below, being in state (t, x) under policy π. the sets U and U intersect, with being a member of + − Following the standard results of Howard (1960), the both. This corresponds to the difficult trials having an up- relative value of state v is the expected value over suc- (t,x) probability of for the true state of the world being either cessor states of the following three components: (i) the rising and falling. Therefore, in the calculations above, we instantaneous reward in making the transition, (ii) the rela- need to replace P(U ∈ U |X = x) in the calculation of the + t tive value of the successor state and (iii) a penalty term equal go transition probability p with P(U = +  |X = e t to the length of delay to make the transition multiplied by (t,x)→C 2 1 1 the average reward per unit time. From a state (t, x), under x) + P(U = |X = x) and P(U ∈ U |X = x) with t − t 2 2 1 1 1 action wait, the possible successor states are (t + 1,x + 1) P(U = −  |X = x) + P(U = |X = x). e t t 2 2 2 and (t + 1,x − 1) with transition probabilities given by Eq. 6; under action go, the possible successor states are C Finding optimal actions and I with transition probabilities given by Eq. 7; the delay In order to find the optimal policy, a dynamic programming for all of these transitions is one time step, and no instanta- procedure called policy iteration is used. The remainder of neous reward is received. Both C and I transition directly to this section provides a sketch of this standard procedure (0, 0), with reward R or R , and delay D or D respec- C I C I as applied to the model we have constructed. For a more tively. 
The general dynamic programming equations reduce detailed account, the reader is directed towards standard to the following wait π wait π π p v + p v − ρ if π(t, x) = wait (t,x)→(t +1,x+1) (t +1,x+1) (t,x)→(t +1,x−1) (t +1,x−1) v = (t,x) go go π π p v + p v − ρ if π(t, x) = go C I (t,x)→C (t,x)→I π π π v = R + v − D ρ C C C (0,0) π π π v = R + v − D ρ (8) I I I (0,0) π π The unknowns of the system are the relative values v , v terms will produce an alternative solution to the equations. (t,x) π π So we identify the solutions by fixing v = 0 and inter- and v , and the average reward per unit time ρ . The sys- (0,0) π π preting all other v terms as being values relative to state tem is underconstrained, with one more unknown (ρ )than (0, 0). equations. Note also that adding a constant term to all v We assume all policies considered will eventually go, so that the sys- We will revise this assumption in the section titled “Extensions of the tem is ergodic and the limiting state probabilities are independent of model”below. the starting state. 978 Psychon Bull Rev (2018) 25:971–996 new Step 2: Improve π → π to wait an arbitrarily large time before taking the action go. However due to computational limitations, we limit the So far, we have assumed that the policy, π, is arbitrarily largest value of time in a trial to a fixed value t by forcing max chosen. In the second step, we use the relative values of the decision-maker to make a transition to the incorrect state wait states, determined using Eq. 8, to improve this policy. This at t +1; that is, for any x, p = 1. In the policies max (t ,x)→I max improvement can be performed by applying the principle of computed below, we set t to a value much larger than the max optimality (Bellman, 1957): in any given state on an opti- interval of interest (time spent during a trial) and verified mal trajectory, the optimal action can be selected by finding that the value of t does not affect the policies in the cho- max the action that maximizes the expected return and assuming sen intervals. The code for computing the optimal policies that an optimal policy will be followed from there on. as well as the state-transition probabilities is contained in a When updating the policy, the decision-maker thus Toolbox available on the Open Science Framework (https:// selects an action for a state which maximizes the expecta- osf.io/gmjck/). tion of the immediate reward plus the relative value of the successor state penalized by the opportunity cost, with suc- cessor state values and opportunity cost calculated under Predicted optimal policies the incumbent policy π. In our model, actions need only be selected in states (t, x), and we compare the two possible The theory developed above gives a set of actions (a pol- evaluations for v in Eq. 8. Therefore the decision-maker icy) that optimizes the reward rate. We now use this the- (t,x) new sets π (t, x) = wait if ory to generate optimal policies for a range of decision problems of the form discussed at the beginning of this wait π wait π p v + p v article. The transition probabilities and state values com- (t,x)→(t +1,x+1) (t +1,x+1) (t,x)→(t +1,x−1) (t +1,x−1) go go π π puted in Eqs. 6, 7 and 8 are a function of the set of >p v + p v (9) C I (t,x)→C (t,x)→I up-probabilities (U) and inter-trial delays (D and D ). C I Hence, the predicted policies will also be a function of the and selects go otherwise. Note also that, by Eq. 
8 and the given set of up-probabilities (i.e., the difficulties) and inter- identification v = 0, the relative values of the cor- (0,0) π π rect and incorrect states satisfy v = R − D ρ and trial delays. We show below how changing these variables C C π π leads to a change in the predicted optimal policies and how v = R − D ρ . We therefore see the trade-off between I I choosing to wait, receiving no immediate reward and sim- these policies correspond to decision boundaries that may or may not vary with time based on the value of these ply transitioning to a further potentially more profitable variables. state, and choosing go, in which there is a probability of receiving a good reward but a delay will be incurred. It will go Single difficulty only be sensible to choose go if p is sufficiently (t,x)→C high, in comparison to the average reward ρ calculated We began by computing optimal policies for single diffi- under the current policy π. Intuitively, since ρ is the aver- culty tasks. For the example at the beginning of this article, age reward per time step, deciding to go and incur the this means all rising assets go up during an observation delays requires that the expected return from doing so out- ¯ ¯ weighs the expected opportunity cost Dρ (where D is a period with the same probability, + ,and all falling assets suitably weighted average of D and D ). The new policy go up with the probability − . Figure 2 shows opti- C I new π π can be shown to have a better average reward ρ than ρ mal policies for three different tasks with drifts  = 0.45, (Howard, 1960; Puterman, 2005).  = 0.20 and  = 0, respectively. Panel (a) is a task This policy iteration procedure can be initialized with an that consists exclusively of very easy trials, panel (b) con- arbitrary policy and iterates over steps 1 and 2 to improve sists exclusively of moderately difficult trials and panel (c) new the policy. The procedure stops when the policy π is consists exclusively of impossible (zero drift) trials. The unchanged from π, which occurs after a finite number of inter-trial delay in each case was D = D = 150 (that C I iterations, and when it does so it has converged on an opti- is, the inter-trial delay was 150 times as long as the delay mal policy, π . This optimal policy determines the action between two consecutive up/down observations). The state in each state that maximizes the long-run expected average space shown in Fig. 2 is organized according to number of reward per unit time. samples (time) along the horizontal axis and cumulative evi- For computing the optimal policies shown in this article, dence (X ) along the vertical axis. Each square represents a we initialized the policy to one that maps all states to the possible state and the color of the square represents the opti- action go then performed policy iteration until the algorithm mal action for that state, with black squares standing for go converged. The theory above does not put any constraints and light grey squares standing for wait. The white squares on the size of the MDP—the decision-maker can continue are combinations of evidence and time that will never occur Psychon Bull Rev (2018) 25:971–996 979 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 number of samples number of samples number of samples (a) (b) (c) Fig. 
2 Each panel shows the optimal actions for different points in computations were D = D = 150 and all trials in a task had C I the state space after convergence of the policy iteration. Gray squares the same difficulty. The up-probability for each decision in the task indicate that wait is the optimal action in that state while black was drawn, with equal probability from (a) u ∈{0.05, 0.95},(b) squares indicate that go is optimal. The inter-trial delays for all three u ∈{0.30, 0.70} and (c) u = 0.50 during a random walk (e.g., (t, x) = (1, 0)) and do not always remained close to the maximum simulated time. In order correspond to a state of the MDP. to prevent confusion and exclude this boundary effect from We can observe from Fig. 2 that, in each case, the optimal other effects, all the figures for optimal policies presented policy constitutes a clear decision boundary: the optimal below are cropped at t = 50: simulations were performed decision is to wait until the cumulative evidence crosses a for t ≥ 70, but results are displayed until t = 50. max specific bound. For all values of evidence greater than this In agreement with previous investigations of optimal bound (for the current point of time), it is optimal to guess bounds (Bogacz et al., 2006), computations also showed the more likely hypothesis. In each panel, the bound is deter- that the decision boundaries depended non-monotonically mined by the cumulative evidence, x, which was defined on the task difficulty, with very high drifts leading to nar- above as the difference between number of up and down row bounds and intermediate drifts leading to wider bounds. observations, |n − n |. Note that, in all three cases, the Note that the height of the decision boundary is |n − n |= u d u d decision bound stays constant for the majority of time and 5for  = 0.20 in Fig. 2b, but decreases on making the collapses only as the time approaches the maximum sim- task more easy (as in Fig. 2a) as well as more difficult (as ulated time step, t . We will discuss the reason for this in Fig. 2c). Again, this makes intuitive sense: the height of max boundary effect below, but the fact that decision bounds the decision boundary is low when the task consists of very remain fixed prior to this boundary effect shows that it is optimal easy trials because each outcome conveys a lot of informa- to have a fixed decision bound if the task difficulty is fixed. tion about the true state of the world; similarly, decision In Fig. 2a and b, the optimal policy dictates that the boundary is low when the task consists of very difficult trials decision-maker waits to accumulate a criterion level of because the decision-maker stands to gain more by making evidence before choosing one of the options. In contrast, decisions quickly than observing very noisy stimuli. Fig. 2c dictates that the optimal decision-maker should make a decision immediately (the optimal action is to go Mixed difficulties in state (0, 0)), without waiting to see any evidence. This makes sense because the up-probability for this computa- Next, we computed the optimal policies when a task con- tion is u = ; that is, the observed outcomes are completely tained mixture of two types of decisions with different random without evidence in either direction. So the theory difficulties. 
For the example at the beginning of this article, suggests that the decision-maker should not wait to observe this means some rising assets go up during an observation any outcomes and choose an option immediately, saving period with the probability +  while others go up with time and thereby increasing the reward rate. the probability +  . Similarly, some falling assets go up In panels (a) and (b), we can also observe a collapse of 2 with the probability − while others go up with probabil- the bounds towards the far right of the figure, where the e ity −  . Figure 3 shows the optimal policy for two single boundary converges to |n −n |= 0. This is a boundary effect u d d and arises because we force the model to make a transition difficulty tasks, as well as a mixed difficulty task (Fig. 3c), to the incorrect state if a decision is not reached before the in which trials can be either easy or difficult with equal probability (P(U ∈ U ) = ). The drift of the easy task is very last time step, t (in this case, t = 70). Increasing max max e t moved this boundary effect further to the right, so that it  = 0.20 and the difficult task is  = 0. e d max evidence evidence evidence 980 Psychon Bull Rev (2018) 25:971–996 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Fig. 3 Optimal actions for single and mixed difficulty tasks. The inter- drawn from u ∈{0.30, 0.70}; b Single difficulty task with u = ; c trial intervals used for computing all three policies are D = D = Mixed difficulty task with u ∈{0.30, 0.50, 0.70}, with both easy and C I 150. a Single difficulty task with up-probability for each decision difficult trials equally likely, i.e., P(U ∈ U ) = The optimal policies for the single difficulty tasks We also explored cases with a mixture of decision dif- (Fig. 3a and b) are akin to the optimal policies in Fig. 2. ficulties, but where the difficult decisions had a positive The most interesting aspect of the results is the optimal pol- drift ( > 0). Figure 4 shows optimal policies for the icy for mixed difficulty condition (Fig. 3c). In contrast to same parameters as Fig. 3c, except the drift of the diffi- single difficulty conditions, we see that the decision bound- cult decisions has been changed to  = 0.02, 0.05, and ary under this condition is time-dependent. Bounds are wide 0.10, respectively. The drift for the easy decisions remained at the start of the trial (|n − n |= 4) and narrow down  = 0.20. Bounds still decrease with time when  = 0.02 u d e d as time goes on (reaching |n − n |= 0at t = 44). In and 0.05 but the amount of decrease becomes negligible u d other words, the theory suggests that the optimal decision- very rapidly. In fact, when  = 0.10, the optimal pol- maker should start the trial by accumulating information icy (at least during the first 50 time-steps) is exactly the and trying to be accurate. But as time goes on, they should same as the single difficulty task, with  = 0.20 (compare decrease their accuracy and guess. In fact, one can ana- with Fig. 3a). 
We explored this result using several differ- lytically show that the decision boundaries will eventually ent values of inter-trial intervals and consistently found that collapse to |n − n |= 0 if there is a non-zero probability decision boundaries show an appreciable collapse for only a u d that one of the tasks in the mixture has zero drift ( = 0) small range of decision difficulties and, in particular, when (see Appendix A). one type of decision is extremely difficult or impossible. Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples Fig. 4 Optimal actions for mixed difficulty tasks with different dif- (b) u ∈{0.30, 0.45, 0.55, 0.70},and (c) u ∈{0.30, 0.40, 0.60, 0.70} ficulty levels. Each panel shows mixed difficulty task with up-pro- with equal probability. All other parameters remain the same as in bability for each decision drawn from (a) u ∈{0.30, 0.48, 0.52, 0.70}, computations shown in Fig. 3 above evidence evidence evidence evidence evidence evidence Psychon Bull Rev (2018) 25:971–996 981 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Probability (Correct) Probability (Correct) Probability (Correct) 1 1 1 0.9 0.9 0.9 0.8 0.8 0.8 0.7 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (d) (e) (f) Fig. 5 Optimal actions for a mixture of difficulties when the easy respectively. Panel (c) shows optimal policy in mixed difficulty task task has narrower bounds than the difficult task. The inter-trial delays with up-probability chosen from u ∈{0.05, 0.40, 0.60, 0.95} and for all three computations are D = D = 150. Panels (a)and (b) P(U ∈ U ) = P(U ∈ U ) = . Panels (d–f) show the change in pos- C I e d show optimal policies for single difficulty tasks with up-probability terior probabilities P(U ∈ U |X = x) with time at the upper decision + t of each decision chosen from u ∈{0.05, 0.95} and u ∈{0.40, 0.60}, boundary for conditions (a–c), respectively An intuitive explanation for collapsing bounds in Figs. 3c difficulty tasks and a mixed difficulty task that combines and 4a, b could be as follows: the large drift (easier) task these two difficulties. However, in this case, the two single (Fig. 3a) has wider bounds than the small drift task (Fig. 3b); difficulty tasks are selected so that the bounds for the large with the passage of time, there is a gradual increase in the drift (easy) task (Fig. 5a) are narrower than the small drift probability that the current trial pertains to the difficult task (difficult) task (Fig. 5b), reversing the pattern used in the if the boundary has not yet been reached. Hence, it would set of tasks for Fig. 3. Figure 5c shows the optimal actions make sense to start with wider bounds and gradually nar- in a task where these two difficulty levels are equally likely. row them to the bounds for the more difficult task as one In contrast to Fig. 3c, the optimal bounds for this mixture becomes more certain that the current trial is difficult. 
If this are narrower at the beginning, with |n − n |= 4and then u d explanation is true, then bounds should decrease for a mix- get wider, reaching |n − n |= 6 and then stay constant. u d ture of difficulties only under the condition that the easier Thus, the theory predicts that inter-mixing difficulties does task has wider bounds than the more difficult task. not necessarily lead to monotonically collapsing bounds. In order to get an insight into why the optimal boundary Increasing bounds increases with time in this case, we computed the posterior probability of making the correct decision at the optimal The next set of computations investigated what happens to boundary. An examination of this probability showed that decision boundaries for mixed difficulty task when the eas- although the optimal boundary is lower for the easy task ier task has narrower bounds than the more difficult task. (Fig. 5a) than for the difficult task (Fig. 5b), the posterior Like Fig. 3,Fig. 5 shows optimal policies for two single P(U ∈ U |X = x) at which the choice should be made is + t evidence evidence evidence 982 Psychon Bull Rev (2018) 25:971–996 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Fig. 6 Optimal actions for single and mixed difficulty tasks when inter-trial intervals are reduced to D = D = 50. All other parameters are C I the same as Fig. 3 higher for the easy task (Fig. 5d) than for the difficult task increase and then asymptote towards the wider of the two (Fig. 5e). For the mixed difficulty task, although the opti- bounds. mal boundary increases with time (Fig. 5c), the probability of making a correct choice decreases with time (Fig. 5f). Effect of inter-trial intervals This happens because the posterior probability of the current trial being difficult increases with time. This fits well with The computations so far have focused on how the diffi- the intuitive explanation of time-varying decision bound- culty level (drift) affects the optimal policy. Therefore, all aries given for collapsing bounds. At the start of the trial, computations shown so far used the same inter-trial inter- the decision-maker does not know whether the trial is easy vals (D = D = 150) but varied the drift. However, our C I or difficult and starts with a decision boundary somewhere conclusions about policies in single and mixed difficulty between those for easy and difficult single difficulty tasks. conditions are not restricted to a particular choice of inter- As time progresses and a decision boundary is not reached, trial delay. Figure 6, for example, shows how optimal policy the probability of the trial being difficult increases and the changes when this delay is changed. To generate these poli- decision boundaries approach the boundaries for the diffi- cies, the inter-trial delay was decreased to D = D = C I cult task. Since the decision boundaries for the difficult task 50. All other parameters were the same as those used for ( = 0.10) are wider than the easy task ( = 0.45) in computing policies in Fig. 3. d e Fig. 5, this means that the decision boundaries increase with A comparison of Figs. 3aand 6a, which have the same time during the mixed difficulty task. 
drifts but different inter-trial intervals, shows that the opti- We computed the optimal policies for a variety of mix- mal bounds decrease from |n − n |= 5 when the inter-trial u d ing difficulties and found that bounds increase, decrease delay is 150 to |n − n |= 3 inter-trial delay is reduced to u d or remain constant in a pattern that is consistent with this 50. Intuitively, this is because decreasing the inter-trial inter- intuitive explanation: when the task with smaller drift (the vals alters the balance between waiting and going (Eq. 8), more difficult task) has narrower bounds than the task with making going more favorable for certain states. When the larger drift (as in Fig. 3), mixing the two tasks leads to inter-trial interval decreases, an error leads to a compara- either constant bounds in-between the two bounds, or to tively smaller drop in the reward rate as the decision-maker monotonically decreasing bounds that asymptote towards quickly moves on to the next reward opportunity. There- the narrower of the two bounds. In contrast, when the task fore, the decision-maker can increase their reward rate by with the smaller drift has wider bounds than the task with lowering the height of the boundary to go. A compari- larger drift (as in Fig. 5), mixing the two tasks leads to either son of Figs. 3cand 6c shows that a similar result holds constant bounds in-between the two bounds or bounds that for the mixed difficulty condition: decision boundaries still decrease with time, but the boundary becomes lower when inter-trial delay is decreased. The sawtooth (zigzag) pattern in Fig. 5(d–f) is a consequence of the Thus far, we have also assumed that the inter-trial inter- discretization of time and evidence. For example, moving from left to vals for correct and error responses (D and D , respec- C I right along the boundary in Fig. 5d, the value of evidence oscillates tively) are the same. In the next set of computations, we between |n − n |= 6and |n − n |= 7, leading to the oscillation in u d u d the value of the posterior probability P(U ∈ U |X = x). investigated the shape of decision boundaries when making + t evidence evidence evidence Psychon Bull Rev (2018) 25:971–996 983 Optimal Policy Optimal Policy Optimal Policy 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Fig. 7 Optimal actions remain the same if D + D remain the same. Each panel shows the optimal policy for up-probability drawn from the C p same set as Fig. 3, but for an inter-trial delay of D = 75 for correct guesses and D = 150 for errors C I an error carried an additional time-penalty, D , so that the Prior beliefs about the world delay is D after correct response and D +D after errors. C C p An unintuitive result from previous research (Bogacz First, consider the assumption that both states of the world et al., 2006) is that different combinations of D and D are equally likely. A key question in perceptual decision- C p that have the same sum (D + D ), lead to the same bound- making is how decision-makers combine this prior belief C p ary. So, for example, the optimal boundaries are the same with samples (cues) collected during the trial. 
The effect when both correct and incorrect decisions lead to an equal of prior information on decision-making can be static, delay of 150 time steps as when the correct decisions lead i.e., remain constant during the course of a decision, or to a delay of 75 time steps but the incorrect decisions lead dynamic, i.e., change as the decision-maker samples more to an additional 75 time steps. information. Correspondingly, sequential sampling models Results for computations shown in Fig. 7 indicate that can accommodate the effect of prior in either the starting this property generalizes to the case of mixed difficulties. point if the effect of prior is static or in the drift or thresh- The optimal policy for single and mixed difficulty tasks in old if the effect is dynamic (Ashby, 1983; Ratcliff, 1985; this figure are obtained for up-probability drawn from the Diederich, & Busemeyer, 2006; Hanks et al., 2011). same set as in Fig. 3, but with delays of D = 75 and Experiments with humans and animals investigating D = D + D = 150. Comparing Figs. 3 and 7, one can whether the effect of prior beliefs is constant or changes I C p see that changing the delays has not affected the decision with time, have led to mixed results. A number of recent boundaries at all. This is because even though D = 75 for experiments have shown that shifting the starting point is Fig. 7, D + D was the same as Fig. 3. Moreover, not only more parsimonious with the accuracy and reaction time C p are the boundaries the same for the single difficulty condi- of participants (Summerfield and Koechlin, 2010; Mulder tions (as previously shown), they are also the same for the et al., 2012). However, these experiments only consider a corresponding mixed difficulty conditions. single task difficulty. In contrast, when Hanks et al. (2011) considered a task with a mixture of difficulties, they found that data from the experiment can be better fit by a time- Extensions of the model dependent prior model. Instead of assuming that the effect of a prior bias is a shift in starting point, this model assumes The theoretical model outlined above considers a simplified that the prior dynamically modifies the decision variable— decision-making task, where the decision-maker must choose i.e., the decision variable at any point is the sum of the drift from two equally likely options. We now show how the and a dynamic bias signal that is a function of the prior and above theory generalizes to situations where: (a) the world increases monotonically with time. is more likely to be in one state than the other (e.g., assets We examined this question from a normative are more likely to be falling than rising), and (b) the decision- perspective—should the effect of a prior belief be time- maker can give up on a decision that appears too difficult dependent if the decision-maker wanted to maximize (make no buy or sell recommendation on an asset). In each reward rate? Edwards (1965) has shown that when the relia- case, the normative model illuminates how the sequential bility of the task is known and constant, the optimal strategy sampling models should be adapted for these situations. is to shift the starting point. More recently, Huang, Hanks, evidence evidence evidence 984 Psychon Bull Rev (2018) 25:971–996 25 25 25 20 20 15 15 15 10 10 5 5 5 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) Fig. 
8 Change in optimal policy during single difficulty tasks with increasingly biased prior beliefs. a P(U ∈ U ) = 0.50; b P(U ∈ U ) = 0.70; + + c P(U ∈ U ) = 0.97. For all three computations, up-probability is drawn from u ∈{0.30, 0.70} and the inter-trial intervals are D = D = 150 + C I Shadlen, Friesen and Rao (2012) argued that instead of Figure 8 shows how the optimal policy changes when modeling the data in terms of a sequential sampling model the prior belief changes from both states of the world being with adjustment to starting point or drift, the decisions in equally probable (assets are equally likely to rise and fall) experiments such as Hanks et al. (2011) can be adequately to one state being more probable than the other (assets are described by a POMDP model that assumed priors to be more likely to rise than fall). All policies in Fig. 8 are for sin- distributed according to a (piecewise) Normal distribution gle difficulty tasks where the difficulty (drift) is fixed and and maximized the reward. We will now show that a nor- known ahead of time. mative model that maximizes the reward rate, such as the We can observe that a bias in the prior beliefs shifts the model proposed by Huang et al. (2012), is in fact, consis- optimal boundaries: when the prior probabilities of the two tent with sequential sampling models. Whether the effect of states of the world were the same (P(U ∈ U ) = P(U ∈ prior in such a model is time-dependent or static depends U )), the height of the boundary for choosing each alter- on the mixture of difficulties. As observed by Hanks et al. native was |n − n |= 5(Fig. 8a). Increasing the prior u d (2011), in mixed difficulty situations, the passage of time probability of the world being in the first state to P(U ∈ itself contains information about the reliability of stimuli: U ) = 0.70 reduces the height of the boundary for choosing the longer the trial has gone on, the more unreliable the the first alternative to (n − n ) = 4, while it increases the u d source of stimuli is likely to be and decision-makers should height of the boundary for choosing the other alternative to increasingly trust their prior beliefs. (n −n ) =−6(Fig. 8b). Thus, the optimal decision-maker u d For the MDP shown in Fig. 1b, in any state (t, x),the will make decisions more quickly for trials where the true effect of having biased prior beliefs is to alter the transi- state of the world matches the prior but more slowly when tion probabilities for wait as well as go actions. We can the true state and the prior mismatch. Furthermore, note that see that changing the prior in Eq. 5 will affect the posterior the increase in boundary in one direction exactly matches probability P(U = u|X = x), which, in turn, affects the the decrease in boundary in the other direction, so that the wait transition probability p in Eq. 6. Similarly, change in boundaries is equivalent to a shift in the starting (t,x)→(t +1,x+1) a change in the prior probabilities changes the posteriors point, as proposed by Edwards (1965). Increasing the bias P(U ∈ U |X = x) and P(U ∈ U |X = x) in Eq. 7, in prior further (Fig. 8c) increased this shift in boundaries, + t − t go in turn changing the transition probability p .We with the height of the boundary for choosing the first alter- (t,x)→C argued above that when priors are equal, P(U ∈ U ) = native reduced to (n − n ) = 0when P(U ∈ U ) = 0.97. 
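For the single difficulty case, the equivalence between a biased prior and a shifted starting point can be made explicit. Writing the posterior from Eq. 5 as log odds (our reconstruction; the symbol x_0 is introduced here for the equivalent starting point and does not appear in the original text):

```latex
\log\frac{P(U = \tfrac12 + \epsilon \mid X_t = x)}{P(U = \tfrac12 - \epsilon \mid X_t = x)}
  = \log\frac{P(U \in \mathcal{U}_+)}{P(U \in \mathcal{U}_-)}
    + x \,\log\frac{1 + 2\epsilon}{1 - 2\epsilon},
\qquad
x_0 = \frac{\log\bigl[P(U \in \mathcal{U}_+)/P(U \in \mathcal{U}_-)\bigr]}
           {\log\bigl[(1 + 2\epsilon)/(1 - 2\epsilon)\bigr]}.
```

Because the prior enters only as a constant offset in the log odds, the optimal policy under a biased prior is (to a good approximation, and exactly when x_0 is an integer number of evidence steps) the equal-prior policy translated by x_0 along the evidence axis. This is precisely the equal-and-opposite shift of the two boundaries seen in Fig. 8 and the starting-point account of Edwards (1965).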
+ u d + P(U ∈ U ), the optimal decision-maker should recommend In this case, the decision-maker has such a strong prior (that buy or sell based solely on the likelihoods: i.e., buy when- asset values are rising) that it is optimal for them to choose ever x> 0 and recommend sell whenever x< 0. This will the first alternative (buy an asset) even before making any no longer be the case when the priors are unequal. In this observations. case, the transition probabilities under the action go will be Thus, the optimal policy predicted by the above theory given by the more general formulation in Eq. 7, i.e., buy concurs with shifting the starting point when the task dif- whenever the posterior probability for rising is larger than ficulty is fixed and known ahead of time. Let us now look falling (P(U ∈ U |X = x) > P(U ∈ U |X = x))and sell at the mixed difficulty condition. Figure 9 shows the opti- + t − t otherwise. mal policy for a mixed difficulty task with up-probability evidence evidence evidence Psychon Bull Rev (2018) 25:971–996 985 25 25 20 20 20 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples 1 1 1 25 25 25 20 20 20 0.8 0.8 0.8 15 15 15 10 10 10 0.6 0.6 0.6 5 5 5 0 0 0 0.4 0.4 0.4 −5 −5 −5 −10 −10 −10 0.2 0.2 0.2 −15 −15 −15 −20 −20 −20 −25 0 −25 0 −25 0 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples 1 1 1 25 25 25 20 20 20 0.8 0.8 0.8 15 15 15 10 10 10 0.6 0.6 0.6 5 5 5 0 0 0 0.4 0.4 0.4 −5 −5 −5 −10 −10 −10 0.2 0.2 0.2 −15 −15 −15 −20 −20 −20 0 0 0 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples Fig. 9 Optimal policy during mixed difficulty trials with biased prior row shows optimal policies, the second row shows the posterior prob- beliefs. For all computations, the mixture of drifts involves  = 0.20, ability for the trial being easy given the state and the third row shows = 0and P(U ∈ U ) = . Three different priors are used: the left the posterior probability for the trial having up-probability > 0.50 d e column uses P(U ∈ U ) = 0.52, the middle column uses P(U ∈ given the state. For all three computations, the inter-trial intervals are U ) = 0.55, and the right column uses P(U ∈ U ) = 0.70. The first D = D = 150 + + C I drawn from the set u ∈{0.30, 0.50, 0.70} and three differ- at longer durations (again most evident in the third column ent degrees of prior all biased towards the world being in the of Fig. 9, where it becomes optimal to choose the first alter- first state to varying degrees. native with the passage of time, even when the cumulative Like the single difficulty case, a prior bias that the world evidence is negative). is more likely to be in the first state (asset values are more These optimal policies are in agreement with the dynamic likely to rise) decreases the boundary for the first alternative prior model developed by Hanks et al. (2011), which pro- (buy) and increases the boundary for the second alternative poses that the contribution of prior increases with time. To (sell). However, unlike the single difficulty case, this shift see this, consider the assets example again. Note that the in boundaries is not constant, but changes with time: the prior used for generating the optimal policy in the third col- optimal policies in Fig. 9 are not simply shifted along the umn of Fig. 
9 corresponds to assets being more likely to rise evidence axis (compare with Fig. 8); rather, there are two than fall (P(U ∈ U ) = 0.70). As time goes on, the cumu- components of the change in boundary. First, for all val- lative evidence required to choose buy keeps decreasing ues of time, the distance to the upper boundary (for buy)is while the cumulative evidence required to choose sell keeps same-or-smaller than the equal-prior case (e.g., in the third increasing. In other words, with the passage of time, increas- column of Fig. 9, (n − n ) = 2evenat t = 2), and the dis- ingly larger evidence is required to overcome the prior. This u d tance to the lower boundary (for sell) is same-or-larger than is consistent with a monotonically increasing dynamic prior the equal prior case. Second, the shift in boundaries is larger signal proposed by Hanks et al. (2011). evidence evidence evidence evidence evidence evidence evidence evidence evidence 986 Psychon Bull Rev (2018) 25:971–996 pass The third column in Fig. 9 also shows another interest- immediate reward or penalty: r = 0. The policy itera- ij ing aspect of optimal policies for unequal prior. In this case, tion is carried out in the same way as above, except Eqs. 8 the bias in prior is sufficiently strong and leads to a ‘time- and 9 are updated to accommodate this alternative. varying region of indecision’: instead of the collapsing Kiani and Shadlen (2009) introduced an option similar bounds observed for the equal-prior condition computations to this pass action (they call it “opt-out”) in an experiment show optimal bounds that seem parallel but change mono- conducted on rhesus monkeys. The monkeys were trained tonically with time. So, for example, when P(U ∈ U ) = to make saccadic eye movements to one of two targets that 0.70, it is optimal for the decision-maker to keep waiting indicated the direction of motion of a set of moving dots on for more evidence even at large values of time, provided the the screen (one of which was rewarded). In addition to being current cumulative evidence lies in the grey (wait)region able to choose one of these targets, on a random half of the of the state-space. trials, the monkeys were presented a third saccadic target (a The intuitive reason for this time-varying region of inde- “sure target”) that gave a small but certain reward. This “opt- cision is that, for states in this region, the decision-maker is out” setting is similar to our extended model with a pass neither able to infer if the trial is easy nor able to infer the action with one distinction. Since Kiani and Shadlen (2009) true state of the world. To see this, we have plotted the pos- did not use a fixed-time block paradigm where there was a terior probability of the trial being easy in the second row trade-off between the speed and accuracy of decisions, they of Fig. 9 and the posterior for the first state (rising)being had to explicitly reward the “opt-out” action with a small the true state of the world in the third row. The posterior reward. In contrast, we consider a setting where there is that the trial is easy does not depend on the prior about the an implicit cost of time. Therefore it is sufficient to reduce state of the world: all three panels in second row are iden- the delay for the pass option without associating it with tical. However, the posterior on the true state of the world an explicit reward. 
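The modification described above, adding a pass action with delay D_pass and zero immediate reward, amounts to comparing one more action value at every state. A hypothetical fragment in the notation of the sketch given after Fig. 3 is shown below; D_pass and the variable names are ours.

```python
def best_action_value(q_go, q_wait, rho, D_pass):
    """Extend the wait/go comparison with a 'pass' action: no reward, no
    further evidence, and a (reduced) delay D_pass before the next trial."""
    q_pass = 0.0 - rho * D_pass
    return max(q_go, q_wait, q_pass)
```

Because the next trial is statistically identical, passing only pays off when the expected accuracy of an immediate guess is low and the saving in delay (D_pass versus the full inter-trial interval) is large, which is why the pass region in Fig. 10 appears only late in the trial and near x = 0.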
Kiani and Shadlen (2009) found that the does depend on the prior beliefs: as the prior in favor of the monkeys chose the sure target when their chance of making world being in the first state (rising) increases, the region of the correct decision about motion direction was small; that intermediate posterior probabilities is shifted further down is, when the uncertainty of the motion direction was high. with the passage of time. The ‘region of indecision’ corre- Figure 10 shows the optimal policy predicted by extend- sponds to an area of overlap in the second and third rows ing the above theory to include a pass option. For the single where the posterior probability that the trial is easy is close difficulty task (Fig. 10a), it is never optimal to choose the to 0.5 and the posterior probability that the true state of the pass option. This is because choosing to pass has a cost world is rising is also close to 0.5 (both in black). Hence, associated with it (the inter-trial delay on passing) and no the optimal thing to do is to wait and accumulate more benefit—the next trial is just as difficult, so the same amount evidence. of information would need to be accumulated. More interestingly, Fig. 10b and c show the optimal pol- Low confidence option icy for the mixed difficulty task, with up-probability for each decision chosen from the set u ∈{0.30, 0.50, 0.70}.In So far, we have considered two possible actions at every agreement with the findings of Kiani and Shadlen (2009), time step: to wait and accumulate more information, or to the theory predicts that the pass action is a function of go and choose the more likely alternative. Of course this both evidence and time and is taken only in cases where is not true in many realistic decision-making situations. For the decision-maker has waited a relatively long duration and instance, in the example at the beginning of the article, accumulated little or no evidence favoring either hypoth- the decision-maker may choose to examine the next asset esis. An inspection of the posterior probabilities, P(U ∈ in the portfolio without making a recommendation if they U |X = x), reveals why it becomes optimal to choose + t are unsure about their decision after making a sequence the pass option with the passage of time. It can be seen in of up/down observations. We now show how the theory Fig. 10e and f that for a fixed evidence x, as time increases, outlined above can be extended to such a case: the decision- P(U ∈ U |X = x) decreases (this is in contrast for the + t maker has a third option (in addition to wait and go), which single difficulty case, Fig. 10d). Thus, with the increase is to pass and move to the next trial with a reduced inter- in time, the confidence of the decision-maker in the same trial delay. In this case, the MDP in Fig. 1(b) is changed to amount of cumulative evidence should diminish and the pass include a third option—pass—with a delay D but no expected value of choosing the pass action becomes larger than the expected value of wait or go. The computations also reveal how the optimal policies depend on the incentive provided for the pass option. In This pattern seemed to hold even when we increased t to 100. max Fig. 10b, the inter-trial interval for the pass action is nearly Further research would be required to investigate if this is analytically the case when t →∞. 
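The claim that, for a fixed evidence level, P(U ∈ U+ | X_t = x) falls as time passes in the mixed difficulty task can be verified in a few lines (an illustration only; the values of x and t below are arbitrary).

```python
import numpy as np

def p_plus(t, x, drifts=(0.30, 0.50, 0.70), prior=(0.25, 0.50, 0.25)):
    """Posterior probability that the drift is positive (u > 0.5), given
    cumulative evidence x after t samples (mixture as in Fig. 10b, c)."""
    drifts, prior = np.asarray(drifts), np.asarray(prior)
    n_up, n_down = (t + x) // 2, (t - x) // 2
    post = drifts ** n_up * (1.0 - drifts) ** n_down * prior
    post /= post.sum()
    return post[drifts > 0.5].sum()

# The same cumulative evidence becomes less convincing the longer it took,
# which is why the expected value of 'pass' eventually overtakes 'go'.
for t in (6, 12, 24, 48):
    print(t, round(p_plus(t, x=4), 3))
```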
an eighth of the interval for incorrect decisions while in max Psychon Bull Rev (2018) 25:971–996 987 25 25 25 20 20 20 15 15 15 10 10 10 5 5 5 0 0 0 −5 −5 −5 −10 −10 −10 −15 −15 −15 −20 −20 −20 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (a) (b) (c) 1 1 1 25 25 25 20 20 20 0.8 0.8 0.8 15 15 15 10 10 10 0.6 0.6 0.6 5 5 5 0 0 0 0.4 0.4 0.4 −5 −5 −5 −10 −10 −10 0.2 0.2 0.2 −15 −15 −15 −20 −20 −20 0 0 0 −25 −25 −25 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 number of samples number of samples number of samples (d) (e) (f) Fig. 10 Optimal actions for all states when actions include the pass (b), the inter-trial interval for pass action is 20 time steps while for option. Gray = wait; black = go; red = pass. For all computations, (c) it is 40 time steps. Panels (d–f) show the corresponding posterior = 0.20,  = 0, D = D = 150. (a) The single difficulty case probabilities of a drift > 0.50, P(U ∈ U |X = x), for the conditions e d C I + t with P(U ∈ U ) = 1. For (b)and (c), P(U ∈ U ) = .For (a)and in panels (a–c) e e Fig. 10c the inter-trial interval for the pass action is approx- Next, by restricting the levels of difficulties to two, we imately a fourth of the interval for incorrect decision. Since were able to explore the effect of different difficulties on the all paths to the pass action is blocked by go action in shape of the decision boundaries. Our computations showed Fig. 10c, the theory predicts that decreasing the incentive that optimal decision bounds do not necessarily decrease slightly should result in the optimal decision-maker never in the mixed difficulty condition and may, in fact, increase choosing the pass option. or remain constant. Computations using a variety of differ- ent difficulty levels revealed the following pattern: optimal bounds decreased when difficult trials (in mixed blocks) had Discussion lower optimal bounds than easy trials; they increased when the pattern was reversed, i.e., when difficult trials had higher Key insights optimal bounds than easy trials. In addition to computing optimal boundaries, we also Previous research has shown that when the goal of the computed posterior probabilities for various inferences dur- decision-maker is to maximize their reward rate, it may ing the course of a decision. These computations provided be optimal for them to change their decision boundary insight into the reason for the shape of boundaries under with time. In this article, we have systematically outlined different conditions. Optimal boundaries change with time only in the mixed difficulty condition and not in the sin- a dynamic programming procedure that can be used to compute how the decision boundary changes with time. gle difficulty condition because observations made during Several important results were obtained by using this proce- the mixed difficulty condition provide the decision-maker dure to compute optimal policies under different conditions. two types of information: in addition to providing evidence Firstly, by removing the assumptions about deadlines of about the true state of the world, observations also help the decisions and the cost of making observations, we found decision-maker infer the difficulty level of the current trial. that neither of these were a pre-requisite for an optimal time- At the start of the trial, the difficulty level of the current dependent decision boundary. 
Instead, what was critical, trial is determined by the decision-maker’s prior beliefs— was a sequence of decisions with inter-mixed difficulties. e.g., that both easy and difficult trials are equally likely. So evidence evidence evidence evidence evidence evidence 988 Psychon Bull Rev (2018) 25:971–996 the optimal decision-maker starts with decision boundaries The majority of experiments included in these meta- that reflect these prior beliefs. As the trial progresses, the analyses consider mixed difficulty blocks with a larger decision-maker uses the cumulative evidence as well as the number of difficulty levels than we have considered so time spent to gather this evidence to update the posterior far. For example, one of the studies considered by both on the difficulty level of the trial. They use this posterior Hawkins et al. (2015) and Voskuilen et al. (2016)is to then update the decision boundary dynamically. In cases Experiment 1 from Ratcliff and McKoon (2008), who where the decision boundary for the difficult trials is lower use a motion-discrimination task with motion coherence (higher) than easy trials, the decision-maker can maximize that varies from trial to trial across six different levels their reward rate by decreasing (increasing) the decision (5%, 10%, 15%, 25%, 35%, 50%). It is unclear what the boundary with time. shape of optimal bounds should be for this mixture of diffi- Similarly, the model also provided insight into the rela- culties, especially because we do not know what participants tionship between the shape of optimal decision boundaries were trying to optimize in this study. However, even if par- and priors on the state of the world. When priors are ticipants were maximizing reward rate, we do not know unequal, observations in mixed difficulty trials provide three whether they should decrease their decision boundaries types of information. They can be used to perform the two under these conditions. inferences mentioned above—about the true state of the It is possible to extend the above framework to more world and the difficulty of the trial—but additionally, they than two difficulties and make predictions about the shape can also be used to compute the weight of the prior. Com- of optimal boundaries in such settings. However, one prob- putations showed that it is optimal for the decision-maker lem is that this framework assumes an exact knowledge of to increase the weight of the prior with time, when deci- different drifts, , used in the mixture of difficulties. In sions have inter-mixed difficulties and the decision-maker the experiments considered by Hawkins et al. (2015)and has unequal priors. A possible explanation for this counter- Voskuilen et al. (2016), we do not know the exact values intuitive finding is that the optimal decision-maker should these drifts take since the paradigms used in these stud- consider the reliability of signals when calculating how ies (motion coherence, numerosity judgment, etc.) involve much weight to give the prior. As the number of observa- implicit sampling of evidence. tions increase, the reliability of the evidence decreases and One advantage of the expanded judgment paradigm is the optimal decision-maker should give more weight to the that the experimenter is able to directly observe the drift prior. Note that this is the premise on which Hanks et al. of samples shown to the participant and compare the deci- (2011) base their “dynamic bias” model. 
Our computations sion boundary used by participants with the one predicted show how the dynamic bias signal should change with time by reward-rate maximization. We have recently conducted when the goal of the decision-maker is to maximize the a large series of experiments in which we adopted this reward rate. approach, adopting a very explicit reward structure and creating conditions for which the model predicts that bound- Implications for empirical research aries should change when different difficulty levels are mixed (Malhotra, Leslie, Ludwig, & Bogacz, in press). Using the dynamic programming procedure to predict opti- We found that participants indeed modulated the slope mal policies provides a strong set of constraints for observ- of their decision boundaries in the direction predicted by ing different boundary shapes. In particular, we found that maximization of reward rate. optimal boundaries decreased appreciably under a lim- In order to understand why our findings contrast with ited set of conditions and only if one type of decision is those of Hawkins et al. (2015) and Voskuilen et al. (2016), extremely difficult. This observation is particularly rele- we extended the model to accommodate any number of dif- vant to a number of recent studies that have investigated ficulty levels. Instead of assuming that the up-probability the shape of decision boundaries adopted by participants in comes from the set U ∪ U , we assumed that u ∈ U, e d decision-making experiments. where U is a set of up-probabilities with n different drifts, Hawkins et al. (2015) performed a meta-analysis of { ,..., }. We then attempted to model the set of stu- 1 n reaction-time and error-rate data from eight studies using a dies analyzed by Hawkins et al. (2015) and Voskuilen et al. variety of different paradigms and found that, overall, these (2016). data favored a fixed bounds model over collapsing bounds As mentioned above, one problem is that we do not know models in humans. Similarly, Voskuilen et al. (2016) car- what the actual drift of the evidence was in these studies. Our strategy was to match the observed error rates for a ried out a meta-analysis using data from four numerosity discrimination experiments and two motion-discrimination experiments and found that data in five out of six experi- 6 The authors’ pre-print accepted for publication is available at https:// ments favored fixed boundaries over collapsing boundaries. osf.io/2rdrw/. Psychon Bull Rev (2018) 25:971–996 989 Table 1 Set of studies and drifts used to generate optimal policies Study Paradigm Conditions Distribution Drifts ({ ... }) 1 n { { PHS 05 Motion 0%, 3.2%, 6.4%, Uniform 0, 0.03, 0.05, 12.8%, 25.6%, 51.2%} 0.10, 0.20, 0.40} RTM 01 Distance 32 values Uniform 17 values Range: [1.7, 2.4] cm Range: [0, 0.50] R 07 Brightness {Bright : 2%, 35%, 45%, Uniform {0.05, 0.10, 0.20} Dark:55%, 65%, 98%} RM 08 Motion {5%, 10%, 15%, Uniform {0.04, 0.07, 0.10 25%, 35%, 50%} 0.15, 0.20, 0.30} MS 14 Color {35%, 42%, 46%, Uniform {0, 0.05, 0.10, 0.20} 50%, 54%, 58%, 65%} VRS 16: E1 Numerosity Range: [21, 80] Piecewise {0, 0.02, 0.04, 0.06 Uniform 0.30, 0.32, 0.34, 0.36} VRS 16: E2 Numerosity Range: [3, 98] Approximately {0, 0.05,..., 0.50} Gaussian VRS 16: E3 Numerosity Range: [31, 70] Uniform {0, 0.02,..., 0.20} VRS 16: E4 Numerosity Range: [3, 98] Uniform {0, 0.02,..., 0.48} Notes. 
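Extending the computations to an arbitrary set of drifts requires nothing new: the mixture passed to the dynamic programming sketch given after Fig. 3 simply grows. For instance, a uniform mixture over six difficulty levels of the kind listed in Table 1 for Palmer et al. (2005) could be run as follows (a hypothetical call; optimal_rho is the illustrative function defined earlier, not the authors' code).

```python
# Six equally likely difficulty levels; each non-zero drift d appears as a
# rising (0.5 + d) and a falling (0.5 - d) trial with equal probability.
levels = (0.0, 0.03, 0.05, 0.10, 0.20, 0.40)
drifts, prior = [], []
for d in levels:
    if d == 0.0:
        drifts.append(0.5); prior.append(1.0 / len(levels))
    else:
        drifts += [0.5 - d, 0.5 + d]
        prior += [0.5 / len(levels), 0.5 / len(levels)]
rho = optimal_rho(drifts, prior, D=150, t_max=100)
```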
Each row shows the set of conditions used in the experiment, the distribution of these conditions across trials and the set of drift parameters used to compute optimal policies in Fig. 11. The value given to a condition refers to the motion coherence for the motion discrimination task, to the separation of dots for the distance judgment task, to the proportion of black pixels for the brightness discrimination task, to the percentage of cyan to magenta checkers for the color judgment task and to the number of asterisks for the numerosity judgment task. For the computation VRS 16: E2, the probability of each drift value  was equal to N (; μ, σ ),where N (·) is the probability density of the normal distribution with μ = 0, and standard deviation σ = 0.21 and Z is a normalization factor ensuring that the probabilities add up to 1. The names of studies are abbreviated as follows PHS 05: (Palmer, Huk, & Shadlen, 2005), RTM 01: (Ratcliff, Thapar, & McKoon, 2001), R 07: (Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007), RM 08: (Ratcliff & McKoon, 2008), MS 14: (Middlebrooks & Schall, 2014), VRS 16: (Voskuilen et al., 2016) with E1...E4 standing for Experiments 1 ... 4, respectively. set of difficulties in the original experiment with the error levels used in these experiments with the distribution of rates for the optimal bounds predicted by a corresponding drifts used for our computations. We chose this set of exper- set of drifts (see Appendix B for details). We found that a iments so that they cover the entire range of mixture of range of reasonable mappings between the difficulties used difficulties considered across experiments considered by in an experiment and the set of drifts { ,..., } gave fairly Hawkins et al. (2015) and Voskuilen et al. (2016). 1 n similar shapes of optimal boundaries. Figure 11 shows the optimal policies obtained by using Another problem is that it is unclear how the inter-trial the dynamic programming procedure for each mixture of interval used in the experiments maps on to the inter-trial drifts in Table 1. While the shape of any optimal bound- interval used in the dynamic programming procedure. More ary depends on the inter-trial interval (as discussed above), precisely, in the dynamic programming procedure the inter- we found that the slope of optimal boundaries remained trial interval is specified as a multiple of the rate at which similar for a range of different inter-trial intervals and the the evidence is delivered. However, due to the implicit sam- inset in each figure shows how this slope changes with a pling in the original experiment, we do not know the relation change in inter-trial interval. The insets also compare this between the (internal) evidence sampling rate and the inter- slope (solid, red line) with the flat boundary (dotted, black trial intervals. Therefore, we computed the optimal policies line) and the optimal slope for a mixture of two difficul- for a wide range of different inter-trial intervals. As we ties,  ∈{0, 0.20}, which leads to rapidly decreasing bounds show below, even though the optimal policy changes with a (dashed, blue line). A (red) dot in each inset indicates the change in inter-trial interval, the slope of the resulting opti- value of inter-trial interval used to plot the policies in the mal decision boundaries remain fairly similar across a wide main plot. All policies have been plotted for the same value range of intervals. 
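The discretized Gaussian distribution over drifts used for the computation labelled VRS 16: E2 in Table 1 can be written down directly; a small illustration with the values quoted in the table notes (mu = 0, sigma = 0.21) is given below.

```python
import numpy as np
from scipy.stats import norm

# Drift values 0, 0.05, ..., 0.50 with probabilities proportional to a normal
# density with mean 0 and standard deviation 0.21 (VRS 16: E2 in Table 1).
drift_levels = np.arange(0.0, 0.501, 0.05)
weights = norm.pdf(drift_levels, loc=0.0, scale=0.21)
prior = weights / weights.sum()            # the normalization factor Z
print(dict(zip(np.round(drift_levels, 2), np.round(prior, 3))))
```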
of inter-trial interval used in the computations above, i.e., Table 1 summarizes the conditions used in experiments, D = D = 150, except Fig. 11b, which uses D = D = C I C I the distribution of these conditions across trials and a cor- 300 to highlight a property of optimal policies observed responding set of drifts used in the dynamic programming when a task consists of more than two difficulty levels (see procedure. We also matched the distribution of difficulty below). 990 Psychon Bull Rev (2018) 25:971–996 (a) (b) (c) (d) (e) (f) (h) (i) (g) Fig. 11 Optimal policies for mixture of difficulties used in experi- and compares it to flat boundaries (dotted, black line) and the mixture ments considered by Hawkins et al. (2015) and Voskuilen et al. (2016).  ∈{0.20, 0.50}, which gives a large slope across the range of inter- Insets show the slope of the optimal boundary (measured as tangent trial intervals (dashed, blue line). The dot in the inset along each solid of a line fitting the boundary) across a range of inter-trial intervals (red) line indicates the value of inter-trial interval used to generate the for the mixture of drifts that maps to the experiment (solid, red line) optimal policy shown in the main figure Extending the framework to more than two difficulties evidence as a function of number of samples observed dur- reveals two important results. First, the optimal bounds are ing a trial. Optimal bounds do seem to decrease slightly nearly flat across the range of mixed difficulty tasks used in for some mixed difficulty tasks when they include a sub- these experiments. In order to monitor the long-term trend stantial proportion of “very difficult” trials (e.g., MS_14 for these slopes, we have plotted each of the policies in and VRS_16: E2, E3). However, even in these cases, the Fig. 11 till time step t = 100 (in contrast to t = 50 above). amount of decrease is small (compare the solid, red line In spite of this, we observed very little change in optimal to the dashed, blue line which corresponds to a mixture Psychon Bull Rev (2018) 25:971–996 991 that gives a large decrease) and it would be very diffi- & Maddox, 2003; Myung & Busemeyer, 1989), especially cult to distinguish between constant or decreasing bounds with an increase in the difficulty of trials (Balci et al., 2011; based on the reaction-time distributions obtained from these Starns and Ratcliff, 2012) and with the increase in speed of bounds. Second, for some mixtures of difficulties, the opti- the decision-making task (Simen et al., 2009). To explain mal bounds are a non-monotonic function of time where the this behavior, a set of studies have investigated alternative optimal boundary first increases, then remains constant for objective functions (Bohil & Maddox, 2003; Bogacz et al., some time and finally decreases (see, for example, Fig. 11b). 2006; Zacksenhouse et al., 2010). For example, Bogacz, This non-monotonic pattern occurred only when number of Hu, Holmes and Cohen (2010) found that only about 30% trial difficulties was greater than two. of participants set the boundaries to the level maximizing Clearly these computations and in particular their map- reward rate. 
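The slopes reported in the insets of Fig. 11 summarize each optimal boundary with a single number. One simple way to obtain such a summary, assuming the upper boundary has been extracted as the smallest evidence level at which go is optimal at each time step, is a least-squares line fit; this is an illustrative recipe, not necessarily the exact procedure used for the figure.

```python
import numpy as np

def boundary_slope(upper_bound, t_range=(5, 100)):
    """Fit a straight line to the upper decision boundary (one evidence value
    per time step) and return its slope; flat bounds give a slope near zero."""
    t = np.arange(t_range[0], t_range[1])
    return np.polyfit(t, upper_bound[t_range[0]:t_range[1]], deg=1)[0]

# Example with a synthetic, slowly collapsing boundary.
print(boundary_slope(np.linspace(6.0, 3.0, 120)))
```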
In contrast, the bounds set by the majority of ping onto the original studies have to be interpreted with participants could be better explained by maximization of a caution, due to the difficulties in translating continuous, modified reward rate which includes an additional penalty implicit sampling paradigms to the discrete expanded judg- (in a form of negative reward) after each incorrect trial, ment framework. Nevertheless, our wide-ranging explo- although no such penalty was given in the actual experi- ration of the parameter space (mixtures of difficulties and ment (Bogacz et al., 2010). Analogous virtual penalties for inter-trial intervals) suggests that the optimal boundaries in errors imposed by the participants themselves can be eas- these experiments may be very close to flat boundaries. ily incorporated in the proposed framework by making R In that case, even if participants maximized reward rate more negative (Eq. 8). in these experiments, it may be difficult to identify subtly However, understanding the behavior that maximizes decreasing boundaries on the basis of realistic empirical evi- reward rate is important for several reasons. Firstly, recent dence. Of course, we do not know what variable participants evidence indicates that the decision boundaries adopted by were trying to optimize in these studies. The computations human participants approach reward-rate optimizing bound- above highlight just how crucial an explicit reward structure aries, in single difficulty tasks, provided participants get is in that regard. enough training and feedback (Evans & Brown, 2016). This suggests that people use reward rate to learn the decision Reward-rate maximization boundaries over a sequence of trials. Secondly, the shape of the reward landscape may explain A related point is that many decision-making tasks why people adopt more cautious strategies than warranted (included those considered in the meta-analyses mentioned by maximizing reward rate. In a recent set of experiments, above) do not carefully control the reward structure of the we used an expanded judgment task to directly infer the experiment. Many studies instruct the participant simply to decision boundaries adopted by participants and found that be “as fast and accurate as possible”. The model we consider participants may be choosing decision boundaries that trade in this study is unable to make predictions about the optimal off between maximizing reward rate and the cost of errors shape of boundaries in these tasks, because it is not clear in the boundary setting (Malhotra et al., in press). That what the participant is optimizing. It could be that when is, we considered participants’ decision boundaries on a the goal of the participant is not precisely related to their “reward landscape” that specifies how reward rate varies as performance, they adopt a strategy such as “favor accuracy a function of the height and slope of the decision boundary. over speed” or “minimize the time spent in the experi- We noted that these landscapes were asymmetrical around ment and leave as quickly as possible, without committing the maximum reward rate, so that an error in the “wrong” a socially unacceptable proportion of errors” (Hawkins, direction would incur a large cost. Participants were gen- Brown, Steyvers, & Wagenmakers, 2012). On the other erally biased away from this “cliff edge” in the reward hand, it could also be that when people are given instruc- landscape. 
Importantly, across a range of experiments, par- tions that precisely relate their performance to reward, the ticipants were sensitive to experimental manipulations that cost required to estimate the optimal strategy is too high and modified this global reward landscape. That is, participants people simply adopt a heuristic – a fixed threshold – that shifted their decision boundaries in the direction predicted does a reasonable job during the task. by the optimal policies shown in Fig. 3 when the task More generally, one could question whether people switched from single to mixed difficulties. This happened indeed try to maximize reward rate while making sequential even when the task was fast-paced and participants were decisions and hence the relevance of policies that maximize given only a small amount of training on each task. Thus, even though people may not be able to maximize reward reward rate for empirical research. After all, a number of studies have found that people tend to overvalue accuracy rate, they are clearly sensitive to reward-rate manipulations and set decision boundaries that are wider than warranted and respond adaptively to such manipulations by changing by maximizing reward rate (Maddox & Bohil, 1998; Bohil their decision boundaries. 992 Psychon Bull Rev (2018) 25:971–996 Lastly, the optimal policies predicted by the dynamic Another simplifying assumption is that the decision-maker’s programming procedure above provides a normative target environment remains stationary over time. In more ecolog- for the (learned or evolved) mechanism used by people to ically plausible situations, parameters such as the drift rate, make decisions. Thus, these normative models provide a inter-stimulus interval and reward per decision will vary framework for understanding empirical behavior; if people over time. For example, the environment may switch from deviate systematically from these optimal decisions, it will being plentiful (high expectation of reward) to sparse (low be insightful to understand why, and under what conditions, expectation of reward). In these situations, each trial will they deviate from a policy that maximizes the potential inform the decision-maker about the state of the environment reward and how these alternative objective functions relate and the normative decision-maker should adapt the boundary to reward-rate maximization. from trial-to-trial based on the inferred state. Such an adaptive model was first examined by Vickers (1979), who proposed Assumptions and generalizations a confidence-based adjustment of decision-boundary from trial-to-trial. Similarly, Simen, Cohen and Holmes (2006) We have made a number of assumptions in this study with proposed a neural network model that continuously adjusts the specific aim of establishing the minimal conditions for the boundary based on an estimate of the current reward time-varying decision boundaries and exploring how prop- rate. erties of decision (such as difficulty) affects the shape of Recent experiments have shown that participants indeed decision bounds. respond to environmental change by adapting the gain of Firstly, note that in contrast to previous accounts that each stimulus (Cheadle et al., 2014) or the total amount use dynamic programming to establish optimal decision of evidence examined for each decision (Lee, Newell, & boundaries (e.g., Drugowitsch et al., 2012; Huang & Rao, Vandekerckhove, 2014). Lee et al. 
(2014) found that sequen- 2013), we compute optimal policies directly in terms of tial sampling models can capture participant behavior in evidence and time, rather than (posterior) belief and time. such environments by incorporating a regulatory mecha- There are two motivations for doing this. Firstly, our key nism like confidence, i.e. a confidence-based adjustment of goal here is to understand the shape of optimal decision decision boundary. However, they also found large indivi- boundaries for sequential sampling models which define dual differences in the best-fitting model and in the parame- boundaries in terms of evidence. Indeed, most studies ters chosen for the regulatory mechanism. An approach that which have aimed to test whether decision boundaries col- combines mechanistic models such as those examined by lapse, do so by fitting sequential sampling or accumulator Lee et al. (2014) and normative models such as the one dis- models to reaction time and error data (Ditterich, 2006; cussed above could explain why these individual differences Drugowitsch et al., 2012; Hawkins et al., 2015; Voskuilen occur and how strategies differ with respect to a common et al., 2016). Secondly, we do not want to assume that currency, such as the average reward. the decision-making system necessarily computes poste- rior beliefs. This means that the decision-making process Acknowledgements This research was carried out as part of the that aims to maximize reward rate can be implemented by project ‘Decision-making in an unstable world’, supported by the a physical system integrating sensory input. For an alter- Engineering and Physical Sciences Research Council (EPSRC), Grant native approach, see Drugowitsch et al. (2012), who use Reference EP/1032622/1. The funding source had no other role other dynamic programming to compute the optimal boundaries than financial support. Additionally, RB was supported by Medi- cal Research Council grant MC UU 12024/5 and GM and CL were in belief space and then map these boundaries to evidence supported by EPSRC grant EP/M000885/1. space. All authors contributed to the development of the theory, carrying Next, a key assumption is that policies can be compared out the computations and writing of the manuscript. All authors have on the basis of reward rate. While reward rate is a sensible read and approved the final manuscript. All authors state that there are no conflicts of interest that may inap- scale for comparing policies, it may not always be the eco- propriately impact or influence the research and interpretation of the logically rational measure. In situations where the number findings. of observations are limited (e.g., Rapoport & Burkheimer, 1971; Lee & Zhang, 2012) or the time available for making a decision is limited (e.g., Frazier & Yu, 2007), the decision- maker should maximize the expected future reward rather Open Access This article is distributed under the terms of the than the reward rate. If the number of decisions are fixed and Creative Commons Attribution 4.0 International License (http:// creativecommons.org/licenses/by/4.0/), which permits unrestricted time is not a commodity, then the decision-maker should use, distribution, and reproduction in any medium, provided you give maximize the accuracy. 
In general, the ecological situation appropriate credit to the original author(s) and the source, provide a or the experiment’s design will determine the scale on which link to the Creative Commons license, and indicate if changes were policies can be compared. made. Psychon Bull Rev (2018) 25:971–996 993 Appendix A: Eventually it is optimal to go at zero the decision-maker will choose the action go after τ<  steps, achieving a value of Q (go) and (T +τ,x) In this appendix we show that, under a mild condition, it will incurring an additional waiting cost of ρ τ,or be optimal to guess a hypothesis when the evidence level x the decision-maker will still be waiting until time T + is 0 for sufficiently large t. In other words, the bounds do , achieving value v but incurring an additional (t +,x) eventually collapse to 0. The situation we envisage is one in waiting cost of ρ . which some of the trials in the mixed condition have zero Therefore the value of waiting at (T , 0) is a convex combi- drift, i.e.,  = 0, so that nation of the time-penalized value of these future outcomes, so 1 1 1 1 − p p := P U = >0, P U = +  =P U = −  = . 0 e e 2 2 2 2 π π π π π Q (wait ) ≤max max {Q (go)−ρ τ },v −ρ  . (T ,0) (T +τ,x) (t +,x) 1≤τ<,|x|≤τ We also assume that the decision-maker gets a unit reward (12) for making correct decisions and no reward for making π π incorrect decisions and there is a fixed inter-trial delay D Note that by Eq. 11, v ≤ 1 − ρ D. Also note that (t +,x) between taking a go action and returning to the zero-value Q (go) is the expected instantaneous reward from (T +τ,x) state (0, 0). the action, plus a time penalty of ρ D. We will show We start with some observations. First note that the below that, for any η> 0, we can choose T sufficiently decision-maker could use a policy which always guesses at large that this expected instantaneous reward is less than or 1 π 1 π t = 0. This policy scores on average per trial, and tri- equal to + η. Therefore Q (go) ≤ + η − ρ D. 2 (T +τ,x) 2 als take D time units (since there is no time spent gathering Hence evidence). Hence the average reward per unit time of this guessing policy is . The optimal policy π ˆ will therefore 2D π π π π Q (wait ) ≤max max { +η−ρ (τ +D)},1−ρ D −ρ  . π ˆ (T ,0) 1≤τ< 2 have average reward per unit time, ρ ≥ . Similarly, an 2D oracle policy can guess correctly at time 0, and achieve an For an interval  = 2D average reward per unit time of ; since the performance of π π π π any policy is bounded above by this perfect instant guessing Q (wait ) ≤ max + η − ρ , 1 − 2ρ D − ρ D. (T ,0) π ˆ policy, ρ ≤ .Wehaveshown that Now 1 − 2ρ D ≤ 0byEq. 10 and if we choose η (which is 1 1 1 π ˆ an arbitrary constant) to be such that η< ,then η< ρ ≤ ρ ≤ . (10) 2D 2D D and it follows that Along similar lines, and recalling that we have fixed v = π π π (0,0) Q (wait ) < − ρ D = Q (go). (T ,0) (T ,0) 0, note that the maximum possible reward resulting from choosing a hypothesis is equal to 1, and there will be delay The optimal action at (T , 0) is therefore to go. at least D in transitioning from (t, x) to (0, 0),sofor any π, It remains to show that, if T is sufficiently large, the x and t expected instantaneous reward is bounded by + η.The expected instantaneous reward in any state is equal to the π π v ≤ 1 − ρ D. (11) (t,x) size of the reward times the probability of receiving it. 
Since we assume that the reward size is one unit and decision- We now prove that for a sufficiently large T , the opti- makers receive a reward only for correct decisions, the go malactioninthe state (T , 0) is go. This will be true if the expected instantaneous reward in a state (t, x) is p . (t,x)→C value of taking the action go, in state (T , 0), is larger than From Eq. 7, we know that taking the action wait. If we denote the value of taking go p = max {P(U ∈ U |X = x), P(U ∈ U |X = x)} , action a in state (t, x) under policy π by Q (a), we can + t − t (t,x)→C (t,x) π π write this condition as Q (go) > Q (wait ).Note (T ,0) (T ,0) and in the special case where  = 0, P(U ∈ U |X = x) d + t π π that Q (go) = − ρ D, since the selected hypothesis 1 1 (T ,0) 2 can be replaced by P(U = +  |X = x) + P(U = e t 2 2 is correct with probability and then there is a delay of D |X = x). Recall that each of the paths that reach X = 2 t t t +x t −x before returning to the zero-value state (0, 0). Therefore, we x contain n = upward transitions and n = u d 2 2 π 1 would like to prove that Q (wait ) < − ρ D. downward transitions. From Eq. 5 we have that (T ,0) 2 Now, consider a time window of duration  after T .The n n u d u (1 − u) P(U = u) value of waiting at (T , 0) will depend on one of two future P(U = u | X = x) =  . n n u ˜ (1 −˜ u) P(U =˜ u) u ˜∈U outcomes during this time window. Either: 994 Psychon Bull Rev (2018) 25:971–996 Hence by Eq. 7 the expected instantaneous reward under the action go when x ≥ 0 (and the hypothesis + is selected) is therefore 1 1 1 go p = P U = +  |X = x + P U = |X = x e t t (t,x)→C 2 2 2 n n n n u d u d 1−p 1 1 0 1 1 1 +  −  + p 2 2 2 2 2 2 n n n n n n u d u d u d 1 1 1−p 1 1 1 1 1−p 0 0 +  −  + p + −  + 2 2 2 2 2 2 2 2 t −x 2 x (1 − 4 ) (1 + 2) (1 − p ) + p 0 0 = . t −x 2 x x (1 − 4 ) [(1 + 2) + (1 − 2) ](1 − p ) + 2p 0 0 t −x For fixed x, (1 − 4 ) → 0as t →∞,sothat accuracy levels in the original study with the accuracy lev- the expected reward from going at x converges to as t els for each drift for the simulated decisions. We rejected becomes large. Since the maximization in Eq. 12 is over any given set of drifts that underestimated or overestimated |x|≤ τ< , we can take T sufficiently large that the the empirically observed range of accuracies for the chosen expected instantaneous reward from the action go in any range of inter-trial delays. This left us with a set of drifts, state (T + τ, x) with 1 ≤ τ<  and |x|≤ τ is less than shown in Table 1, that approximately matched the level of + η.Sofor any η we can say the following: for any x,for a accuracies in the original study. sufficiently large t ≥ T , the instantaneous reward for going For example, Fig. 12 shows the accuracies for deci- at (t, x) is less than + η. An identical calculation holds for sions simulated in the above manner by computing optimal x ≤ 0. bounds for two different sets of drifts: {0, 0.03, 0.05, 0.10, 0.20, 0.40} and {0.03, 0.04, 0.06, 0.08, 0.11, 0.15}. Each mixture contains six different difficulties, just like the orig- Appendix B: Mapping experimental conditions inal study conducted by Palmer et al. (2005). We performed to drifts these simulations for a range of inter-trial delays and Fig. 12 shows three such delays. The (yellow) bar on the left of each We now describe how we selected a set of drifts correspond- panel shows the empirically observed range of accuracies. It ing to each study in Table 1. Since these studies do not use is clear that the range of difficulties in Fig. 
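The closing expression of the argument above did not survive typesetting well. In clean notation (our reconstruction from the definitions of n_u and n_d and from Eq. 5, with p_0 the prior probability of a zero-drift trial), the expected instantaneous reward for going at x ≥ 0 is

```latex
p^{\mathrm{go}}_{(t,x)\to C}
 = \frac{(1-4\epsilon_e^2)^{\frac{t-x}{2}}\,(1+2\epsilon_e)^{x}\,(1-p_0) + p_0}
        {(1-4\epsilon_e^2)^{\frac{t-x}{2}}\,\bigl[(1+2\epsilon_e)^{x}+(1-2\epsilon_e)^{x}\bigr](1-p_0) + 2p_0}.
```

For fixed x, the factor (1 − 4ε_e²)^{(t−x)/2} vanishes as t → ∞, so the expected reward from going converges to 1/2, which is the bound used in the proof above.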
Appendix B: Mapping experimental conditions to drifts

We now describe how we selected a set of drifts corresponding to each study in Table 1. Since these studies do not use an expanded judgment paradigm, we do not explicitly know the values of the drift parameter; instead, these studies specify a measurable property of the stimulus, such as motion coherence, that is assumed to correlate with the drift. However, we know how the participants performed in this task – i.e. their accuracy – during each of these coherence conditions. These accuracy levels constrain the possible range of drift rates that correspond to the motion coherences and can be used to infer an appropriate range of drifts.

Specifically, we used the following method to determine whether any given set of drift rates, $\{\epsilon_1, \ldots, \epsilon_n\}$, approximated the conditions for a study (a schematic sketch of this screening loop is given after this appendix): (i) we used the dynamic programming procedure described in the main text to compute the optimal bounds for a mixed difficulty task with difficulties drawn from the given set of drifts and for a range of inter-trial delays, D; (ii) we simulated decisions by integrating noisy evidence to these optimal bounds, with the drift rate of each trial chosen randomly from the given set of drifts; (iii) we determined the accuracy levels, $a_1, \ldots, a_n$, of these simulated decisions; (iv) finally, we compared the accuracy levels in the original study with the accuracy levels for each drift for the simulated decisions. We rejected any given set of drifts that underestimated or overestimated the empirically observed range of accuracies for the chosen range of inter-trial delays. This left us with a set of drifts, shown in Table 1, that approximately matched the level of accuracies in the original study.

For example, Fig. 12 shows the accuracies for decisions simulated in the above manner by computing optimal bounds for two different sets of drifts: {0, 0.03, 0.05, 0.10, 0.20, 0.40} and {0.03, 0.04, 0.06, 0.08, 0.11, 0.15}. Each mixture contains six different difficulties, just like the original study conducted by Palmer et al. (2005). We performed these simulations for a range of inter-trial delays and Fig. 12 shows three such delays. The (yellow) bar on the left of each panel shows the empirically observed range of accuracies. It is clear that the range of difficulties in Fig. 12b considerably underestimates the empirically observed range of accuracies and is therefore not an appropriate approximation of the difficulties used in the original study. On the other hand, the range of difficulties in Fig. 12a captures the observed range of accuracies for a variety of inter-trial delays.

Figure 12 also illustrates that the mapping between drift rates and error rates is complex, since the parameter space is highly multi-dimensional, with accuracy a function of the inter-trial delay as well as the n values in the set $\{\epsilon_1, \ldots, \epsilon_n\}$. In order to choose an appropriate mapping, we explored a considerable range of delays and sets of drifts. While the complexity of this parameter space makes it difficult to be absolutely certain that there is no combination of a set of drifts and delays for which more strongly decreasing boundaries are seen, our observation was that the optimal boundaries shown in Fig. 11 were fairly typical of each study for reasonable choices of parameters. Where more strongly decreasing boundaries were seen, they were (a) still much shallower than the optimal boundaries for the mixture of two difficulties ($\epsilon \in \{0, 0.20\}$, as conveyed in the insets of Fig. 11), and (b) the predicted accuracy levels did not match those that were empirically observed.

[Fig. 12: A comparison of accuracies between decisions simulated from optimal bounds and decisions performed by participants. The two panels show the accuracy (percent correct, y-axis) of 10,000 simulated decisions, plotted against up-probability (x-axis), for two mixed difficulty tasks that each use a mixture of six difficulty levels but differ in the range of difficulties. Panel (a) uses a large range of drifts [0, 0.40], while panel (b) uses a comparatively smaller range [0.03, 0.15]. Squares, circles, and triangles show these accuracies for an inter-trial delay, D, of 100, 200, and 300 time units, respectively. The yellow bar on the left of each panel shows the range of accuracies observed in Experiment 1 of Palmer et al. (2005), which used six different motion coherence levels: {0%, 3.2%, 6.4%, 12.8%, 25.6%, 51.2%}.]
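The following self-contained Python sketch illustrates the screening loop in steps (i)-(iv). It is not the authors' code: a flat boundary stands in for the optimal time-varying boundary (which, in the paper, is computed by the dynamic programming procedure for each drift set and inter-trial delay), and the empirical accuracy range passed in is a placeholder rather than the values of Palmer et al. (2005).

```python
# Schematic sketch of the drift-screening procedure (steps i-iv); all numbers
# and the flat placeholder boundary are illustrative, not the paper's values.

import random

def simulate_decision(eps, bound):
    """Integrate +1/-1 evidence (up-probability 1/2 + eps) to the boundary.
    Returns 1 for a correct decision, 0 otherwise."""
    x = 0
    for t in range(len(bound)):
        x += 1 if random.random() < 0.5 + eps else -1
        if abs(x) >= bound[t]:
            break
    if eps == 0:                          # zero drift: either answer is a guess
        return int(random.random() < 0.5)
    return int((x > 0) == (eps > 0))

def screen_drift_set(drifts, delays, empirical_range, n_trials=5000, tol=0.05):
    emp_lo, emp_hi = empirical_range
    for delay in delays:
        # Step (i): placeholder flat boundary; in the paper the boundary is
        # computed by dynamic programming and so varies with the drift set and
        # with the delay (here, for simplicity, it does not).
        bound = [10] * 1000
        # Steps (ii)-(iii): simulate decisions and record accuracy per drift.
        acc = [sum(simulate_decision(eps, bound) for _ in range(n_trials)) / n_trials
               for eps in drifts]
        # Step (iv): reject drift sets whose accuracy range under- or overshoots
        # the empirically observed range for this delay.
        if abs(min(acc) - emp_lo) > tol or abs(max(acc) - emp_hi) > tol:
            return False
    return True

# Illustrative call with made-up numbers (not the fitted values from Table 1):
print(screen_drift_set([0, 0.03, 0.05, 0.10, 0.20, 0.40],
                       delays=[100, 200, 300],
                       empirical_range=(0.5, 1.0)))
```

In the paper, the step-(i) boundary computation is what couples the accepted drift sets to the inter-trial delay; the placeholder boundary above removes that coupling and is included only to make the control flow of the screening loop concrete.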
References

Ashby, F. G. (1983). A biased random walk model for two choice reaction times. Journal of Mathematical Psychology, 27(3), 277–.
Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision-making criteria: reward rate ultimately beats accuracy. Attention, Perception, & Psychophysics, 73(2), 640–657.
Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bernoulli, J. (1713). Ars conjectandi. Impensis Thurnisiorum, fratrum.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision-making: a formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review, 113(4), 700–765.
Bogacz, R., Hu, P. T., Holmes, P. J., & Cohen, J. D. (2010). Do humans produce the speed–accuracy trade-off that maximizes reward rate? The Quarterly Journal of Experimental Psychology, 63(5), 863–.
Bohil, C. J., & Maddox, W. T. (2003). On the generality of optimal versus objective classifier feedback effects on decision criterion learning in perceptual categorization. Memory & Cognition, 31(2), 181–198.
Busemeyer, J. R., & Rapoport, A. (1988). Psychological models of deferred decision-making. Journal of Mathematical Psychology, 32(2), 91–134.
Cheadle, S., Wyart, V., Tsetsos, K., Myers, N., De Gardelle, V., Castañón, S. H., & Summerfield, C. (2014). Adaptive gain control during human perceptual choice. Neuron, 81(6), 1429–1441.
Deneve, S. (2012). Making decisions with unknown sensory reliability. Frontiers in Neuroscience, 6.
Diederich, A., & Busemeyer, J. R. (2006). Modeling the effects of payoff on response bias in a perceptual discrimination task: Bound-change, drift-rate-change, or two-stage-processing hypothesis. Perception & Psychophysics, 68(2), 194–207.
Ditterich, J. (2006). Stochastic models of decisions about motion direction: behavior and physiology. Neural Networks, 19(8), 981–.
Drugowitsch, J., Moreno-Bote, R., Churchland, A. K., Shadlen, M. N., & Pouget, A. (2012). The cost of accumulating evidence in perceptual decision-making. Journal of Neuroscience, 32, 3612–.
Edwards, W. (1965). Optimal strategies for seeking information: Models for statistics, choice reaction times, and human information processing. Journal of Mathematical Psychology, 2(2), 312–.
Evans, N. J., & Brown, S. D. (2016). People adopt optimal policies in simple decision-making, after practice and guidance. Psychonomic Bulletin & Review, 1–10. doi:10.3758/s13423-016-1135-1
Frazier, P., & Yu, A. J. (2007). Sequential hypothesis testing under stochastic deadlines. In Advances in neural information processing systems (pp. 465–472).
Ghosh, B. K. (1991). A brief history of sequential analysis. Handbook of Sequential Analysis, 1.
Hanks, T. D., Mazurek, M. E., Kiani, R., Hopp, E., & Shadlen, M. N. (2011). Elapsed decision time affects the weighting of prior probability in a perceptual decision task. The Journal of Neuroscience, 31(17), 6339–6352.
Hawkins, G. E., Brown, S. D., Steyvers, M., & Wagenmakers, E.-J. (2012). An optimal adjustment procedure to minimize experiment time in decisions with multiple alternatives. Psychonomic Bulletin & Review, 19(2), 339–348.
Hawkins, G. E., Forstmann, B. U., Wagenmakers, E.-J., Ratcliff, R., & Brown, S. D. (2015). Revisiting the evidence for collapsing boundaries and urgency signals in perceptual decision-making. The Journal of Neuroscience, 35(6), 2476–2484.
Howard, R. A. (1960). Dynamic programming and Markov processes. New York: Wiley.
Huang, Y., Hanks, T., Shadlen, M., Friesen, A. L., & Rao, R. P. (2012). How prior probability influences decision-making: a unifying probabilistic model. In Advances in neural information processing systems (pp. 1268–1276).
Huang, Y., & Rao, R. P. (2013). Reward optimization in the primate brain: a probabilistic model of decision-making under uncertainty. PLoS ONE, 8(1), e53344.
Kiani, R., & Shadlen, M. N. (2009). Representation of confidence associated with a decision by neurons in the parietal cortex. Science, 324, 759–764.
LaBerge, D. (1962). A recruitment theory of simple behavior. Psychometrika, 27(4), 375–396.
Laming, D. R. J. (1968). Information theory of choice-reaction times. London: Academic Press.
Laplace, P.-S. (1774). Mémoire sur les suites récurro-récurrentes et sur leurs usages dans la théorie des hasards. Mémoires de l'Académie Royale des Sciences Paris, 6, 353–371.
Laplace, P.-S. (1812). Théorie analytique des probabilités. Paris: Courcier.
Lee, M. D., Newell, B. R., & Vandekerckhove, J. (2014). Modeling the adaptation of search termination in human decision-making. Decision, 1(4), 223–251.
Lee, M. D., & Zhang, S. (2012). Evaluating the coherence of take-the-best in structured environments. Judgment and Decision Making, 7(4).
Link, S., & Heath, R. (1975). A sequential theory of psychological discrimination. Psychometrika, 40(1), 77–105.
Maddox, W. T., & Bohil, C. J. (1998). Base-rate and payoff effects in multidimensional perceptual categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(6), 1459.
Malhotra, G., Leslie, D. S., Ludwig, C. J., & Bogacz, R. (in press). Overcoming indecision by changing the decision boundary. Journal of Experimental Psychology: General, 146(6), 776.
Middlebrooks, P. G., & Schall, J. D. (2014). Response inhibition during perceptual decision-making in humans and macaques. Attention, Perception, & Psychophysics, 76(2), 353–366.
Moran, R. (2015). Optimal decision-making in heterogeneous and biased environments. Psychonomic Bulletin & Review, 22(1), 38–53.
Mulder, M. J., Wagenmakers, E.-J., Ratcliff, R., Boekel, W., & Forstmann, B. U. (2012). Bias in the brain: a diffusion model analysis of prior probability and potential payoff. The Journal of Neuroscience, 32(7), 2335–2343.
Myung, I. J., & Busemeyer, J. R. (1989). Criterion learning in a deferred decision-making task. The American Journal of Psychology, 1–16.
Palmer, J., Huk, A. C., & Shadlen, M. N. (2005). The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision, 5(5).
Pitz, G. F., Reinhold, H., & Geller, E. S. (1969). Strategies of information seeking in deferred decision-making. Organizational Behavior and Human Performance, 4(1), 1–19.
Pollock, S. M. (1964). Sequential search and detection. Unpublished doctoral dissertation, MIT, Cambridge.
Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. New Jersey: Wiley.
Rapoport, A., & Burkheimer, G. J. (1971). Models for deferred decision-making. Journal of Mathematical Psychology, 8(4), 508–538.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 83, 59–108.
Ratcliff, R. (1985). Theoretical interpretations of the speed and accuracy of positive and negative responses. Psychological Review, 92(2), 212–225.
Ratcliff, R., Hasegawa, Y. T., Hasegawa, R. P., Smith, P. L., & Segraves, M. A. (2007). Dual diffusion model for single-cell recording data from the superior colliculus in a brightness-discrimination task. Journal of Neurophysiology, 97(2), 1756–1774.
Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation, 20(4), 873–922.
Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111(2), 333.
Ratcliff, R., Thapar, A., & McKoon, G. (2001). The effects of aging on reaction time in a signal detection task. Psychology and Aging, 16(2), 323.
Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
Simen, P., Cohen, J. D., & Holmes, P. (2006). Rapid decision threshold modulation by reward rate in a neural network. Neural Networks, 19(8), 1013–1026.
Simen, P., Contreras, D., Buck, C., Hu, P., Holmes, P., & Cohen, J. D. (2009). Reward rate optimization in two-alternative decision-making: empirical tests of theoretical predictions. Journal of Experimental Psychology: Human Perception and Performance, 35(6), 1865.
Starns, J. J., & Ratcliff, R. (2012). Age-related differences in diffusion model boundary optimality with both trial-limited and time-limited tasks. Psychonomic Bulletin & Review, 19(1), 139–145.
Stone, M. (1960). Models for choice-reaction time. Psychometrika, 25(3), 251–260.
Summerfield, C., & Koechlin, E. (2010). Economic value biases uncertain perceptual choices in the parietal and prefrontal cortices. Frontiers in Human Neuroscience, 4.
Thura, D., Cos, I., Trung, J., & Cisek, P. (2014). Context-dependent urgency influences speed–accuracy trade-offs in decision-making and movement execution. The Journal of Neuroscience, 34(49), 16442–16454.
Vickers, D. (1970). Evidence for an accumulator model of psychophysical discrimination. Ergonomics, 13(1), 37–58.
Vickers, D. (1979). Decision processes in visual perception. Academic Press.
Voskuilen, C., Ratcliff, R., & Smith, P. L. (2016). Comparing fixed and collapsing boundary versions of the diffusion model. Journal of Mathematical Psychology, 73, 59–79.
Wald, A. (1945a). Sequential method of sampling for deciding between two courses of action. Journal of the American Statistical Association, 40(231), 277–306.
Wald, A. (1945b). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2), 117–186.
Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19(3), 326–339.
Zacksenhouse, M., Bogacz, R., & Holmes, P. (2010). Robust versus optimal strategies for two-alternative forced choice tasks. Journal of Mathematical Psychology, 54(2), 230–246.
