Introduction

In long-term diseases such as cancer, patients alternate between remission and relapse phases and are monitored over time through non-invasive check-ups such as blood samples [1, 2]. Based on these noisy, indirect measurements of disease markers, practitioners must decide on treatment allocation, sometimes with little knowledge of the process dynamics (e.g. aggressiveness of the relapse), which may differ between patients [3–5]. The long history of medical practice has allowed the definition of milestones to help practitioners in making follow-up decisions, but automated personalized criteria are yet to be defined to improve individual patient follow-up. The ability to monitor patients in the least invasive manner, according to their personal preferences (more check-ups to ensure early relapse detection, less aggressive treatments for better quality of life, etc.), is a crucial step towards better care, but requires fine knowledge of disease dynamics and reliable prediction algorithms. One of the main requirements for such a task is the definition of a universal model adapted to patient-specific parameters that could describe in an exhaustive manner the possible consequences of the practitioner's decisions. Mathematical models have become essential for understanding cancer cell evolution and for designing more effective treatment plans. They have for instance been used to link tumor markers to tumor sizes [6, 7], or to predict the evolution of tumor growth from initial measurements [8–10]. Over the past decade, game theory, and particularly evolutionary game theory, has emerged as one of the most promising approaches for optimizing cancer treatments. In these models, different cell types (e.g., healthy cells, various subclones of tumor cells) compete for resources and space using distinct strategies. Unlike classical game theory, evolutionary game theory does not assume rational choice; instead, cell populations inherit traits that influence their selective fitness. Deterministic mathematical models are then fitted to patient data in order to identify either an equilibrium treatment dose or a treatment strategy that minimizes the tumor burden [11, 12]. Similar approaches, based on a Lotka-Volterra modeling of the competition between resistant and naive cells, have successfully been implemented in the clinic [13–15], with recent theoretical work extending these models to finite, long-term optimal control of cancer cell load [16]. While these models have contributed significantly toward personalized care, they still face limitations. Current approaches often rely on arbitrary thresholds, fall short of enabling complete remission, are unable to predict relapse occurrences, and lack adaptability to individual patient preferences. Additionally, they do not deal with data collection times, which can represent a significant burden on some patients. Models for online adaptive treatment selection should accommodate the continuous evolution of disease, alternating phases of remission and relapse, marker values that span a continuous range, and noisy observations collected at discrete visit intervals. A promising modeling framework is that of Piecewise Deterministic Markov Processes (PDMPs) [17–19]. Indeed, PDMPs are non-diffusive hybrid stochastic processes that can handle both continuous and discrete variables and their interactions in continuous time.
The only source of stochasticity comes from the jumps of the process, which evolves deterministically between jumps. This framework allows for highly adaptable dynamics, such as incorporating Lotka-Volterra equations between jumps, while enabling parameter changes at each jump, a critical feature for modeling distinct phases of a patient's disease trajectory, including the potential for complete remission. PDMPs are simple to simulate, easy to interpret, and their controlled versions allow continuous-time dynamics on continuous (or hybrid discrete and continuous) state spaces with decisions taken in continuous time [20, 21]. This flexibility makes PDMPs particularly suitable for personalized, adaptive treatment strategies, including the choice of sample collection times. Our approach is illustrated on a large dataset concerning a cohort of Multiple Myeloma (MM) patients from the Intergroupe Francophone du Myélome (IFM) 2009 clinical trial [22]. We assume that there is a single cancer marker that remains at a nominal threshold ζ0 throughout any remission phase, and that at patient relapse its level increases exponentially, with multiple possible behaviors, until treatment is administered, or a threshold D is reached and the patient dies. To set up the context, we will assume that the study begins at time t0 = 0 when the patient enters their first remission state, and we will denote by t1, t2, …, their visit dates, defined over time by the practitioner, until some time horizon H is reached, or the patient dies. The time lapse between visits may not be constant, so that different patients may have different visit numbers and dates. More precisely, we will assume that at each visit time, the practitioner may choose to schedule the next visit in either 15, 30 or 60 days. Such a decision may be based on the previous and current marker measurements, which we denote Y0, Y1, …. Note that the exact value of the marker is hidden as measurements are corrupted by noise, and measurements are only collected at visit dates. Together with the next visit date, the practitioner may choose to modify the patient's current treatment, fixing it to one of the two available treatments, a and b, or to no treatment at all, denoted ∅. An example of patient follow-up data is presented in Fig 1a.

Fig 1. Example of patient follow-up data, PDMP model. a) Marker values are measured at each patient visit over a certain period of time. Data from the Intergroupe Francophone du Myélome 2009 clinical trial, courtesy of the Centre de Recherche en Cancérologie de Toulouse. b) PDMP model, representation of the marker level of a patient. The risk function λ controls the time to relapse, while parameters v and v′ control the aggressiveness of the disease and the efficiency of the treatment respectively. https://doi.org/10.1371/journal.pone.0315661.g001

We model the underlying continuous-time dynamics of the patient's health by a controlled PDMP (Xt)0≤t≤H [20, 21, 23–26]. We propose to optimally control the process, that is, to choose online the next treatment and visit date, based on present and past observations and decisions, by minimizing a cost function which is calibrated to balance the burden of deteriorated quality of life under treatment (including hospital visits) with the risk of dying from the disease. Previous work has focused on discretizing this problem in order to solve it approximately through Dynamic Programming (DP) iterations [27].
More specifically, the optimal control problem for the PDMP has first been expressed as a Partially Observed Markov Decision Process (POMDP) [28]. This step is simply done by considering decision dates as stages of the POMDP. Note that the time lapse between decisions is not constant: the continuous-time dynamics is encoded in the specific parametrization of the transition kernel, and the POMDP still has continuous state and observation spaces. The problem is then classically converted into a fully observed Markov Decision Process (MDP) [28, 29] on the belief or filter space. The filter process represents the probability distribution of the hidden values of the patient's current state given the past and present observations. Second, the state space of the controlled PDMP has been discretized, so that an approximation of the filter process could be computed, charging only finitely many states. This approximate filter is called the conditional filter in the sequel. Third, the belief space has been discretized in order to solve the MDP via dynamic programming iterations on a finite space. In the current article, instead of discretizing the state and belief spaces, we use a Monte-Carlo Tree Search approach to (approximately) solve the controlled PDMP problem by simulation. More precisely, we propose an adaptation of the Partially Observed Monte-Carlo Planning (POMCP) algorithm [30], originally designed to solve discrete-time POMDPs with finite state and observation spaces, to the case of controlled PDMPs. The novel challenge is that controlled PDMPs involve continuous time as well as continuous state and observation spaces. We show empirically that this simulation-based approach outperforms the discretization-based approach both in terms of computation time and quality of returned policies. Thus, this approach is promising for providing an automated decision-aiding tool for practitioners. To our knowledge, this is the first method to address this challenging question. Current decision making is typically based on heuristic rules derived from expert clinical knowledge [31–33]. In the Results section, we first present the PDMP model used to describe patient marker data and the characteristics of the long-term cost functions. We then show how the POMCP algorithm can be tuned to control patient trajectories and finally propose several simulation studies to illustrate our approach. In the first simulation study, we evaluate the impact of the parameters of the POMCP algorithm on its performance, showing robustness to the exploration/exploitation tradeoff and to the discretization precision, but a strong dependence on the number of simulations used to explore the tree. In the second simulation study, we compare our approach to the standard dynamic programming resolution strategy and benchmark it against alternative methods. We discuss our findings in the Discussion section. Details on the dataset, a short review of PDMPs and the original POMCP algorithm, as well as the implementation of our strategy, are postponed to the Methods section.

Results

Controlled piecewise deterministic Markov processes form a universal class of versatile models for patient follow-up

Despite the discrete-time acquisition of the marker measurements, we choose to model the dynamics of the patient's health by a continuous-time controlled Piecewise Deterministic Markov Process (PDMP). The formalism of PDMPs is both light and versatile [18–20, 34].
It allows the dynamics of the disease to be described with only three biologically relevant parameters: the disease's risk function λ, which dictates how often the patient is likely to relapse, the aggressiveness of the disease v, which dictates how fast the marker level will increase during relapses, and the treatment efficiency v′, which dictates how fast the marker level decreases under treatment. In the PDMP formalism, this is formulated through an exponential flow Φ, whose slope parameter (v or v′) depends on the patient's condition and treatment, the risk function λ, and a transition kernel Q that dictates how the patient's state evolves at relapses, here preventing the marker value from jumping abruptly when the patient's condition changes. This is illustrated in Fig 1b. We consider a common risk function (identical for all patients) which we allow to depend on the time since the last remission date as well as the cancer marker level. We assume that the aggressiveness of the disease can be patient-dependent. It is either high or low and we model this as two different diseases. This leads us to introduce two different treatments, each efficient for one of those diseases and slowing the progression of the other. It is also an option not to treat for a given period. We introduce three variables m, ζ, u, where the mode m corresponds to the overall condition of the patient (m = 0: remission, m = 1: disease 1, m = 2: disease 2, m = 3: death of the patient), ζ ∈ [ζ0, D] is the level of the marker, where ζ0 is the nominal value and D the death level, and u ≥ 0 is the time since the last change of overall condition (added for technical reasons to deal with non-constant risk functions). The precise definition of the controlled PDMP and its parameters (λ, Φ, Q) is given in the Methods section. The complete state of the patient is thus encoded by s = (m, ζ, u). We denote by X0, …, Xn the process values at the observation dates t0, …, tn. The overall condition of the patient m, the level of the marker ζ and the relapse dates (together with the time u since the last change of condition) are not directly observed and thus cannot be used by the clinician to select a treatment. At each visit of the patient to the medical center, we assume that the practitioner receives a noisy observation of the marker level y = ζ + ϵ, where ϵ is some Gaussian noise. The practitioner also knows the time t since the beginning of the patient follow-up. The complete observation available to the practitioner is thus encoded by ω = (y, t). The practitioner also has access to an indicator that the patient is still alive, as treatment and follow-up stop at the death of the patient. Based on the collection of present and past measurements and decisions, the practitioner selects both a time delay r until the next visit to the medical center and a treatment ℓ to hold until this next visit. Note that in our framework measurements are only made at visit dates. A decision is thus a pair d = (ℓ, r), where ℓ ∈ {∅, a, b}, and r ∈ {15, 30, 60}. Given a fixed arbitrary decision policy, simulating controlled patient trajectories is easy: see Algorithm 1 given in the Methods Section.
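To make this concrete, the following minimal sketch simulates one inter-visit step of such a controlled PDMP in the spirit of Algorithm 1. The risk function, growth and decay rates, noise level and relapse kernel below are illustrative placeholders, not the calibrated quantities of the Methods section, and the risk is frozen at the start of each interval rather than simulated exactly by thinning.

```python
import math
import random

# State s = (m, zeta, u): mode, marker level, time since the last mode change.
# Decision d = (treatment, r): treatment in {None, "a", "b"}, r in {15, 30, 60} days.
ZETA0, D = 1.0, 30.0                 # nominal marker level and death level (placeholders)
V_RELAPSE = {1: 0.04, 2: 0.08}       # growth rate v of each disease (assumed values)
V_TREAT = 0.05                       # decay rate v' under the matching treatment (assumed)

def slope(mode, treatment):
    """Slope of the exponential flow: decay under the matching treatment, growth otherwise."""
    if mode == 0:
        return 0.0                               # remission: marker stays at the nominal level
    if (mode, treatment) in [(1, "a"), (2, "b")]:
        return -V_TREAT                          # efficient treatment: marker decreases
    if treatment is None:
        return V_RELAPSE[mode]                   # no treatment: full disease aggressiveness
    return 0.5 * V_RELAPSE[mode]                 # mismatched treatment only slows progression

def risk(mode, zeta, u):
    """Relapse intensity lambda(s); here only remission can jump (placeholder form)."""
    return 0.002 * (1.0 + 0.01 * u) if mode == 0 else 0.0

def flow(zeta, mode, treatment, dt):
    """Deterministic exponential flow between jumps, truncated to [ZETA0, D]."""
    return min(max(zeta * math.exp(slope(mode, treatment) * dt), ZETA0), D)

def step(state, decision, noise_sd=0.3):
    """Simulate the hidden state and the noisy observation at the next visit."""
    (mode, zeta, u), (treatment, r) = state, decision
    t = 0.0
    while mode != 3 and r - t > 1e-9:
        remaining = r - t
        lam = risk(mode, zeta, u)
        jump_in = random.expovariate(lam) if lam > 0 else float("inf")
        dt = min(jump_in, remaining)
        zeta = flow(zeta, mode, treatment, dt)
        u, t = u + dt, t + dt
        if zeta >= D:
            mode = 3                             # death level reached: absorbing state
        elif jump_in < remaining:
            mode = random.choice([1, 2])         # kernel Q: relapse, marker level unchanged
            u = 0.0
    y = zeta + random.gauss(0.0, noise_sd)       # noisy measurement y = zeta + eps
    return (mode, zeta, u), y
```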
Cost functions encode the diverse impacts of treatment on the patient's quality of life

For the practitioner, controlling the disease is equivalent to choosing the best available treatment as well as the best next visit date in order to minimize its impact on the patient's quality of life along time. Defining this impact is a difficult task as it will typically depend on the treatment's side effects, the number of visits, the burden of living with a disease and the remaining life expectancy. This paper proposes a mathematical definition of the impact of the treatment on quality of life in terms of a cost function that takes into account those different aspects. For a decision d = (ℓ, r) comprising a treatment allocation ℓ and a time to next visit r, and for a current marker level ζ at time tk and a future marker level ζ′ at time tk+1 = tk + r, we define the cost through Eq (1), where Cv is a visit cost, κ is a non-negative scale factor penalizing high marker values, β is a penalty for applying an unnecessary treatment and M is the death cost. This cost function thus takes into account a visit cost, to prevent patients from undergoing too many screening tests, a cost depending on the marker value at the next visit, to encourage treatment and calibrate visit dates, a cost for degradation of quality of life due to treatments, in particular if they are not appropriate, and a cost for dying. Calibrating the cost parameters Cv, κ, β, and M is a very difficult task; these parameters are allowed to be patient-dependent (some patients may even express a wish to stop the follow-up rather than undergo very long and painful treatments), and treatment strategies are bound to depend on their values. When cast as a controlled PDMP with this cost function, the practitioner's problem is mathematically equivalent to solving a (continuous state space) Partially Observable Markov Decision Process (POMDP) [27, 35], whose expected value optimization can be stated as Eq (2), where V is called the optimal policy value and represents the lowest possible expected total cost, Π is the set of admissible policies (yielding decisions depending only on current and past observations), Nπ is the patient-specific number of visits within the time-horizon of the study when using policy π, dn = π(Y0, t0, …, Yn, tn) is the decision (ℓ, r) taken at the n-th visit date tn according to policy π, and (Yn)n≥0 represents the marker observation process for the controlled-PDMP/POMDP. Solving this problem amounts to computing (a good approximation of) the optimal policy value and identifying an admissible policy π* that reaches (a value close to) the minimum.
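Equations (1) and (2) are described only verbally above; a plausible form consistent with that description, using indicator functions for the treatment and death penalties, is sketched below. The exact expressions used in the article may differ, in particular in how the marker level and the unnecessary-treatment penalty enter the cost.

```latex
% Sketch of a cost of the type described in Eq (1): visit cost, marker penalty,
% penalty for an unnecessary treatment, and death cost (assumed indicator form).
\[
c\bigl(d,\zeta,\zeta'\bigr) \;=\; C_v \;+\; \kappa\,(\zeta'-\zeta_0)
  \;+\; \beta\,\mathbf{1}\{\ell \text{ unnecessary}\}
  \;+\; M\,\mathbf{1}\{\zeta' = D\},
  \qquad d=(\ell,r).
\]
% Corresponding value optimization, as described around Eq (2).
\[
V \;=\; \min_{\pi \in \Pi}\;
  \mathbb{E}\!\left[\,\sum_{n=0}^{N_\pi} c\bigl(d_n,\zeta_{t_n},\zeta_{t_{n+1}}\bigr)\right],
  \qquad d_n = \pi(Y_0,t_0,\dots,Y_n,t_n).
\]
```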
Adapted Partially Observed Monte-Carlo Planning is particularly well suited for controlled PDMPs

The Partially Observed Monte-Carlo Planning (POMCP) algorithm [30] is an efficient simulation-based algorithm that has been designed for real-time planning in large finite state-space POMDPs. In this paper we show that even though it has not been designed to handle continuous state and observation spaces, we can adapt it to solve controlled PDMPs, thanks to their efficient simulation property, without resorting to the computation of complex integrals for computing transition probabilities. The objective of POMCP is to reduce the complexity of dynamic programming, which requires the construction of the entire decision tree (including the probabilities of every possible outcome with every possible decision at every future time-point), by sampling the tree in a principled way so as to compute the current optimal action. POMCP is thus an online algorithm, which re-estimates the optimal strategy at each new data acquisition (POMCP does not “forget” previously computed strategies, but updates them using new simulated samples after every new observation is received). The POMCP algorithm relies on two main properties. The first one is the ability to simulate trajectories, so as to progressively build the decision tree and update filters Θ at every intermediate node h of the tree. Recall that a filter is a probability distribution representing the (approximate) distribution of the current hidden state given the observations. The standard POMCP algorithm uses a specific family of simulation-based filters Θp called particle filters, specified below. Filters are used to sample sets of plausible states. The second property is the requirement to provide estimates of the expected value of the policy in the leaves of the current exploration tree, in order to guide exploration and build the decision policy. In the Methods Section we detail the algorithm (Algorithm 2) and show that POMCP is particularly well suited for controlled PDMPs. We simply point out here why trajectory simulation and policy evaluation are particularly efficient in POMCP in the case of a controlled PDMP. Simulation is particularly straightforward with PDMPs [21, 36, 37], requiring only simulating the jump times and exploiting the deterministic behavior between jumps; see Algorithm 1. In our medical framework, it is even simpler since only few jumps are allowed. When little knowledge is available about the underlying process, a classic approach is to resort to particle filters Θp [38]. A particle filter Θp at step n is a discrete uniform probability distribution with finite support Bp (where Bp may have repeated atoms). It is updated at step n + 1 through simulations: states s from Bp are updated through a one-step simulation to a new state s′, and selected to be added to Bp if the simulated observation is close to the true one. As an alternative filter to compare to, we propose to use a conditional filter Θc derived from the exact filter (that is, the conditional distribution of the hidden state given the observations) from [27]. The exact conditional filter is updated through a recurrence formula involving ratios of integrals over the state space. By discretizing the state space, one can construct the approximation Θc of the exact filter. Unlike the particle filter, which has a dynamically changing support with a uniform mass function, this conditional filter has a fixed support (the discretized space) with changing mass functions that are updated through analytical ratios of weighted sums. To estimate the future expected cost at some node of the tree, POMCP requires simulating many full trajectories from the current node to a leaf of the tree. This requires applying an arbitrary strategy to pick actions at every future node for which a decision has not yet been optimized. This arbitrary strategy is called a rollout strategy in the POMCP framework. The most naive rollout strategy consists in selecting decisions uniformly at random from the decision set {∅, a, b} × {15, 30, 60}. We consider instead a mode-based rollout strategy, which consists in choosing action ∅ in mode 0 (no treatment if the simulated patient is in remission), action a in mode 1 and action b in mode 2 (the most efficient treatment if the simulated patient has relapsed), and a fixed next visit delay of 15 days. This rollout strategy, while not being necessarily optimal (depending on the cost function it might be optimal not to treat at the beginning of a relapse, or to treat preventively when in remission), exploits knowledge of the cost function and hence yields better estimates of action costs at time t.
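For concreteness, here is a minimal sketch of these two rollout strategies; the mode encoding and treatment labels follow the notation above, while the function signatures and the use of `None` for the no-treatment action ∅ are illustrative choices.

```python
import random

TREATMENTS = [None, "a", "b"]    # None stands for the no-treatment action (∅)
DELAYS = [15, 30, 60]            # admissible delays until the next visit, in days

def uniform_rollout(_state):
    """Naive rollout: pick a decision uniformly at random from the decision set."""
    return random.choice(TREATMENTS), random.choice(DELAYS)

def mode_based_rollout(state):
    """Mode-based rollout: give the simulated patient the treatment matching its
    (simulated, hence known) mode, and always schedule the next visit in 15 days."""
    mode = state[0]
    treatment = {0: None, 1: "a", 2: "b"}.get(mode)   # mode 3 (death): decision is irrelevant
    return treatment, 15
```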
Note also that this mode-based policy is not applicable for real patients, since their mode is not observed. It is only applicable to simulated patients. This is fine since POMCP's rollout strategy is only used through simulations to estimate costs. POMCP has a number of tuning parameters (number of simulations, number of particles in the filter, exploration vs. exploitation rates) which are described (as well as the impact of varying their values) in the following section.

Following up a patient with adapted POMCP is easy and fast in practice

The previous paragraphs set the grounds for optimizing the long-term follow-up of patients. In practice, we will assume that a patient enters the follow-up study once they reach the remission phase after an initial round of treatment. The practitioner may hence assume that their current state is known, i.e. s0 = (0, ζ0, 0), and the initial value of both the particle and conditional filters is the Dirac mass at s0. The initial observation is ω0 = (ζ0, t0). The adapted POMCP algorithm is run to obtain the optimal decision d0, which the practitioner can use (if they decide to) to allocate treatment and decide on the next visit date t1. At visit n, the patient will come back for some new marker measurement, so that the n-th observation value ωn = (yn, tn) is obtained. The practitioner will have access to their full history, hn = 〈ω0d0ω1d1…dn−1ωn〉, as well as their last belief filter, Θc or Θp. An initial update of the filter is performed, either using the recursion formula for the conditional filter from its previous value and ωn, or by particle filtering through rejection sampling for the particle filter from its previous value and ωn. The adapted POMCP algorithm is then run to obtain the optimal current decision dn, which the practitioner can use (or not) to allocate treatment and decide on the next visit date tn+1. This is illustrated in Fig 2.

Fig 2. Practice of patient follow-up. a) At each new visit the patient has a new marker measurement, and the practitioner receives a new observation ωn = (yn, tn). b) The filter is updated with the new observation, either through particle rejection sampling (particle filter) or via a recursion formula (conditional filter). c) The decision tree is partially explored via simulation through an adapted POMCP algorithm using the updated filter. The algorithm returns the optimal decision dn, a combination of a time to next visit (defining tn+1) and treatment to allocate (influencing yn+1). https://doi.org/10.1371/journal.pone.0315661.g002
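The follow-up loop illustrated in Fig 2 can be sketched as follows. The interfaces `patient`, `pomcp_plan` and `update_filter` are hypothetical stand-ins for the data acquisition, Algorithm 2 and the filter recursions of the Methods section, and the default horizon is only a placeholder.

```python
ZETA0 = 1.0   # nominal marker level, as in the earlier sketch (placeholder value)

def follow_up(patient, pomcp_plan, update_filter, horizon=2400):
    """Online follow-up loop of Fig 2: plan with POMCP, observe at the next visit,
    update the belief filter, and repeat until the horizon or the patient's death.
    `patient(decision)` returns the next noisy marker value and a survival indicator,
    `pomcp_plan(belief)` runs the adapted POMCP and returns a decision (treatment, delay),
    `update_filter(belief, decision, observation)` applies either the conditional
    recursion or particle rejection sampling."""
    t = 0
    belief = {(0, ZETA0, 0.0): 1.0}              # Dirac mass at the known initial state s0
    observations, decisions = [(ZETA0, t)], []   # omega_0 = (zeta_0, t_0)
    while t < horizon:
        decision = pomcp_plan(belief)            # d_n = (treatment, time to next visit)
        decisions.append(decision)
        y, alive = patient(decision)             # noisy marker measured at the next visit
        t += decision[1]
        observations.append((y, t))
        if not alive:
            break                                # treatment and follow-up stop at death
        belief = update_filter(belief, decision, (y, t))
    return decisions, observations
```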
Adapted POMCP can be tuned to outperform dynamic programming

The simulation study presented in this section has been conducted by exploiting real data obtained from the Centre de Recherche en Cancérologie de Toulouse (CRCT). Multiple myeloma (MM) is the second most common haematological malignancy in the world and is characterised by the accumulation of malignant plasma cells in the bone marrow. Classical treatments are based on chemotherapies, which, if appropriate, act fast and efficiently bring MM patients to remission in a few weeks. However almost all patients eventually relapse more than once and the five-year survival rate is around 50%. We have obtained data from the Intergroupe Francophone du Myélome 2009 clinical trial [22], which has followed 748 French MM patients from diagnosis to their first relapse on a standardized protocol for up to six years. At each visit a blood sample was obtained to evaluate the amount of monoclonal immunoglobulin protein in the blood, a marker of the disease progression. An example of a patient dataset is given in Fig 1. Based on these data, we calibrated our PDMP model as described in the Methods section, and we performed simulations to evaluate the performance of the POMCP strategy in selecting the combination of treatment and next visit date at each time point of the trajectories (these time points being themselves selected by the algorithm). The performance of the approach was measured by a Monte-Carlo estimate of the expectation and confidence interval of its value, as well as the runtime of the online computation of a complete trajectory. For each configuration of the disease parameters, 500 simulations were performed to estimate these values. Codes and parameters are available at https://github.com/acleynen/pomcp4pdmp [39].

Study 1: Impact of the parameters' values on POMCP's performance. We evaluated the impact of the values of 6 parameters on the performance: (i) the filter chosen (conditional or particle), (ii) the rollout procedure chosen, (iii) the exploitation/exploration tradeoff parameter α′, (iv) the number nsearch of simulations in the online POMCP procedure, (v) the number K of initial states to sample from at each of the nsearch simulations, and (vi) the internal POMCP precision parameter used to select particles in the particle filter. Those parameters are described at length in the Methods Section; see Algorithm 2. The GitHub page [39] contains tables with results for all sets of parameter values that were tested. In this section we describe the most important results. We performed all parameter comparisons for both the conditional and the particle filters. The conditional filter is a discrete probability distribution on a finite fixed support Bc of size 184, with 81 states in condition m = 0, 31 states in condition m = 1, 71 states in condition m = 2 and one state in condition m = 3. The choice of these states is discussed in [27]. To adapt this filter to the POMCP environment, at each iteration n we start by randomly sampling K states s from Bc with the distribution given by the conditional filter. For the particle filter, this number K directly corresponds to the number of particles in the filter, hence the size of Bp. Note that for the conditional filter the support Bc does not change over time, whereas for the particle filter Bp keeps the same size but possibly contains different states at each iteration.

Mode-based rollout outperforms naive rollout. We found that the uniform rollout procedure (selecting decisions randomly) produced very poor results compared to the mode-based rollout procedure and hence we only present results for the mode-based rollout policy here.

POMCP is robust to the exploration/exploitation trade-off. We found that the exploitation/exploration trade-off parameter had little influence on the overall performance, with the exception of extreme values (α′ = 0.99, almost no exploration, and α′ = 0.2, almost no exploitation, both leading to poorer performance). We also tried several adaptive strategies to select the value of α′ depending on the confidence in the belief (measured in terms of entropy), which did not improve the results. The following results are therefore discussed for a fixed value of α′ = 0.5, but the reader may refer to Tables 4 and 5 in S1 File for additional results on these parameters.
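Before turning to the remaining parameters, the sketch below makes the role of K concrete: sampling K root states from the conditional filter's fixed support with its current weights, and the rejection-sampling refresh of the particle support Bp described in the previous section. The tolerance `tol` stands for the POMCP precision parameter, and the attempt cap and fallback are implementation conveniences, not part of the original algorithm.

```python
import random

def sample_from_conditional(support, weights, K):
    """Draw K root states from the conditional filter (fixed support Bc, varying weights)."""
    return random.choices(support, weights=weights, k=K)

def particle_filter_update(particles, decision, observation, simulate_step, tol=1.0):
    """Rejection-sampling update of the particle support Bp: resample a particle, simulate
    one step (e.g. with the `step` function sketched earlier), and keep the new state if the
    simulated observation falls within `tol` of the true measurement."""
    y_true = observation[0]
    new_particles, attempts = [], 0
    while len(new_particles) < len(particles) and attempts < 100 * len(particles):
        s = random.choice(particles)                # Bp is a uniform distribution over its atoms
        s_new, y_sim = simulate_step(s, decision)   # one-step simulation of the hidden PDMP
        if abs(y_sim - y_true) <= tol:
            new_particles.append(s_new)
        attempts += 1
    return new_particles or particles               # fall back if rejection keeps failing
```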
Increasing the number of exploratory simulations yields the best performance gain. For a fixed number of sampled belief states K and a fixed precision, increasing the number of simulations nsearch improved the performance of the algorithm while decreasing the variance in the simulations for both filter types (see the top left panel of Fig 3 for K = 500). In the case of the conditional filter, the runtime increases linearly, from 10³ seconds per trajectory with nsearch = 100 simulations to 10⁴ seconds per trajectory with nsearch = 1000. In the case of the particle filter, the runtime is 3 times as long for 100 simulations since, when nsearch < K, additional simulations must be performed to compute the particle filter at time n + 1. The difference decreases to only 1.2 times the runtime of the conditional filter for nsearch = 1000.

Fig 3. Impact of POMCP parameters on the estimated value function. Top left: increasing the number of simulations for filters with 500 initial states improves the average trajectory costs. Top right: increasing the number of atoms in the filter improves the performance of the particle filter but not the conditional filter. https://doi.org/10.1371/journal.pone.0315661.g003

The particle filter requires a large belief state to achieve high performance. As expected, for a fixed number of exploratory simulations nsearch and a fixed precision, increasing the number K of particles in the particle filter led to a tremendous improvement for the particle filter, together with a significantly decreased variance in the trajectory costs, while it had no impact on the conditional filter (see the top right panel of Fig 3 for nsearch = 500). Similarly, K does not have an impact on the runtime of the conditional filter, while it leads to an exponential increase in the runtime of the particle filter.

POMCP for controlled PDMPs requires one additional tuning: the precision of the tree observation nodes. The bottom two panels of Fig 3 illustrate the impact of the precision for a fixed filter size (K = 500) and two different numbers of simulations (nsearch = 500, left, and nsearch = 1000, right). For a smaller number of simulations, decreasing the precision improves the results up to a certain value, beyond which they worsen again. This tendency is still observed for the particle filter when nsearch increases, while the performance of the conditional filter is optimal with the loosest precision. Those results are the consequence of two factors. First, as detailed in the Methods Section, each simulation creates a new node in the exploration tree, where a node is a set of potential future trajectories with their estimated costs. When the precision is very fine, each simulation produces a different future observation and hence the estimation relies on one-step forward simulations, which may miss future events. On the other hand, when the precision is very loose, each simulation will build on the previous simulation to explore a step forward, yielding very long trees with few branches. This will also miss the variability of different outcomes. The second factor comes from the way the filters are constructed. Particle filters are based on the comparison of simulations with observations. When the precision is very loose, almost all simulations will be accepted, creating a strong bias in the belief of the current state, which then propagates from time point to time point.
As the conditional filter update does not rely on simulations, and hence not on the precision, there is no propagation of uncertainty from step to step, and when the number of exploratory simulations nsearch is large enough to guarantee some diversity in the tree exploration, estimating the cost of each decision from longer trajectories will provide better results. Finally, one may note that the gain in using conditional filters is mostly apparent in extreme parameter scenarios, for instance with a very low number of particles K, a very high precision, etc. Provided the user has enough computing budget, both filters tend to provide very similar results.

Study 2: Adapted POMCP outperforms the dynamic programming approach. In this study we compare the results of five resolution strategies calibrated with their optimal parameters on biologically relevant outcomes: the death rate, the Progression-Free Survival (PFS) time, that is, the time from study entry to the first relapse, the time spent under treatment, the number of visits to the hospital and the cost. Those quantities were normalized so that they range between 0 and 1, and such that an optimal result is 0. To do so, the death rate was normalized so that a random treatment strategy yields 1 (here 5% of patients); the PFS was transformed as 1-(PFS/H) (where we recall that H is the study horizon), so that a patient who does not relapse has a normalized PFS equal to 0; the time spent under treatment was normalized by H; the number of visits was normalized using the fact that, over the horizon, a visit every 15 days produces 160 visits, whereas a visit every 60 days produces 40 visits; and finally the cost was normalized using v0, the best approximation of the optimal value obtained through discretizations in [27], and Cthreshold, the average cost of the threshold strategy. Here again, we simulated 500 trajectories with each strategy under the same cost parameters. The results are summarized in the radar plot of Fig 4, and additional visual information on average trajectory cost is given in the barplots. In the radar plot representation, a perfect strategy should coincide with the inner circle.

Fig 4. a) Radar plot comparing performances of 4 solution strategies on death rate, progression-free survival (PFS), time spent under treatment, average number of visits per patient, and average trajectory cost. An optimal strategy would lie on the inner circle. b) Barplot of trajectory cost for 500 simulations under three main strategies: POMCP with particle filter, POMCP with conditional filter and Dynamic Programming on discretized processes. https://doi.org/10.1371/journal.pone.0315661.g004

The compared strategies are the discretization/DP approach (DP) from [27], which relies on exact resolution by dynamic programming of the discretized POMDP, the adapted POMCP with the conditional filter (POMCP-Conditional), the adapted POMCP with the particle filter (POMCP-Particles), a threshold strategy which assigns treatment when the marker reaches a given threshold (Threshold), and a random strategy agnostic to the marker values (Random). The threshold strategy was set to treat when the marker exceeds 5 and until it returns to the nominal value. The decision-maker starts with treatment a, and adapts the treatment if the marker does not decrease by the next visit. Visits are set every 60 days.
The random strategy randomly selects a treatment (including not treating) and a next visit date at each new time point, regardless of the marker value. Interestingly, the combination of POMCP with the conditional filter yields the lowest average trajectory cost and the shortest average time spent under treatment, by slightly increasing the number of visits and reducing PFS compared to the DP approach. However, the particle-based POMCP approach (relying fully on simulations with no other exploitation of the underlying model) yields costs almost as good as the previous two approaches, with an increased number of visits but the longest progression-free survival time. Importantly, out of 500 simulations, one of the trajectories ended with a patient dying. Not surprisingly, the threshold strategy has a higher cost than the other approaches as it does not aim at maintaining the marker value as close as possible to the nominal value: the trajectories are therefore highly penalized by the delay in starting treatment. One might still note that the lower number of visits does not compensate for the delay in treatment, and that this threshold strategy still leads to more time under treatment for the patients.
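To summarize how the Fig 4 axes are obtained, the sketch below collects the normalizations described above. The linear rescaling of the cost between v0 and Cthreshold is an assumption (the exact formula is not reproduced in the text), and the default horizon and reference values are placeholders.

```python
def normalized_metrics(death_rate, pfs, treat_time, n_visits, cost,
                       H=2400.0, death_rate_random=0.05, v0=1.0, c_threshold=2.0):
    """Normalize the five Study 2 outcomes to [0, 1], with 0 being optimal.
    H, v0 and c_threshold are placeholders; H is taken from the quoted visit counts
    (160 visits every 15 days, 40 visits every 60 days, i.e. a 2400-day horizon)."""
    return {
        "death rate": death_rate / death_rate_random,     # the random strategy maps to 1
        "PFS": 1.0 - pfs / H,                             # no relapse over the horizon maps to 0
        "time under treatment": treat_time / H,
        "visits": (n_visits - 40.0) / (160.0 - 40.0),     # between 40 and 160 visits
        "cost": (cost - v0) / (c_threshold - v0),         # assumed linear rescaling
    }
```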
The overall condition of the patient m, the level of the marker ζ and the relapse dates (together with the time u since the last change of condition) are not directly observed and thus cannot be used by the clinician to select a treatment. At each visit of the patient to the medical center, we assume that the practitioner receives a noisy observation of the marker level y = ζ + ϵ where ϵ is some Gaussian noise. The practitioner also knows the time t since the beginning of the patient follow-up. The complete observation available to the practitioner is thus encoded by ω = (y, t). The practitioner also has access to an indicator that the patient is still alive as treatment and follow-up stop at the death of the patient. Based on the collection of present and past measurements and decisions, the practitioner selects both a time delay r until the next visit to the medical center and a treatment ℓ to hold until this next visit. Note that in our framework measurements are only made at visit dates. A decision is thus a pair d = (ℓ, r), where ℓ ∈ {∅, a, b}, and r ∈ {15, 30, 60}. Given a fixed arbitrary decision policy, simulating controlled patient trajectories is easy: see Algorithm 1 given in the Methods Section. Cost functions encode the diverse impacts of treatment on the patient’s quality of life For the practitioner, controlling the disease is equivalent to choosing the best available treatment as well as the best next visit date in order to minimize its impact on the patient’s quality of life along time. Defining this impact is a difficult task as it will typically depend on the treatment’s side effects, the number of visits, the burden of living with a disease and the remaining life expectancy. This paper proposes a mathematical definition of the impact of the treatment on quality of life in terms of a cost function that takes into account those different aspects. For a decision d = (ℓ, r) comprising a treatment allocation ℓ and a time to next visit r, and for a current marker level ζ at time tk, and future marker level ζ′ at time tk+1 = tk + r, we define (1) where Cv is a visit cost, κ is non-negative scale factor penalizing high marker values, β is a penalty for applying an unnecessary treatment and M is the death cost. This cost function thus takes into account a visit cost, to prevent patients from undergoing too many screening tests, a cost depending on the marker value at the next visit, to encourage treatment and calibrate visit dates, a cost for degradation of quality of life due to treatments, in particular if they are not appropriate, and a cost for dying. Calibrating cost parameters CV, κ, β, and M is a very difficult task, which is allowed to be patient-dependent (some patients may even express a wish to stop the follow-up rather than undergo very long and painful treatments), and treatment strategies are bound to be parameters-dependent. 
When cast as a controlled PDMP with this cost function, the practitioner’s problem is mathematically equivalent to solving a (continuous state space) Partially Observable Markov Decision Process (POMDP) [27, 35], which expected value optimization can be stated as (2) where V is called the optimal policy value and represents the lowest possible expected total cost, Π is the set of admissible policies (yielding decisions depending only on current and past observations), Nπ is the patient-specific number of visits within the time-horizon of the study when using policy π, dn = π(Y0, t0, …, Yn, tn) is the decision (ℓ, r) taken at the n-th visit date tn according to policy π, and (Yn)n≥0 represents the marker observation process for the controlled-PDMP/POMDP. Solving this problem amounts to computing (a good approximation of) the optimal policy value and identifying an admissible policy π* that reaches (a value close to) the minimum. Adapted Partially Observed Monte-Carlo Planning is particulartly well suited for controlled PDMPs The Partially Observed Monte-Carlo Planning (POMCP) algorithm [30] is an efficient simulation-based algorithm that has been designed for real-time planning in large finite state-space POMDPs. In this paper we show that even-though it has not been designed to handle continuous state and observation spaces, we can adapt it to solve controlled PDMPs, thanks to their efficient simulation property, without resorting to the computation of complex integrals for computing transition probabilities. The objective of POMCP is to reduce the complexity of dynamic programming, which requires the construction of the entire decision tree (including the probabilities of every possible outcome with every possible decision at every future time-point), by sampling the tree in a principled way so as to compute the current optimal action. POMCP is thus an online algorithm, which re-estimates the optimal strategy at each new data acquisition (POMCP does not “forget” previously computed strategies, but updates them using new simulated samples after every new observation is received). The POMCP algorithm relies on two main properties. The first one is the ability to simulate trajectories, so as to progressively build the decision tree and update filters Θ at every intermediate node h of the tree. Recall that a filter is a probability distribution representing the (approximate) distribution of the current hidden state given the observations. The standard POMCP algorithm uses a specific family of simulation-based filters Θp called particle filters specified below. Filters are used to sample sets of plausible states. The second property is the requirement to provide estimates of the expected value of the policy in leaves of the current exploration tree, in order to guide exploration and build the decision policy. In the Methods Section we detail the algorithm (Algorithm 2) and show that POMCP is particularly well suited for controlled PDMPs. We simply point out here why trajectories simulation and policy evaluation are particularly efficient in POMCP, in the case of a controlled PDMP. Simulation is particularly straightforward with PDMPs [21, 36, 37], requiring only to simulate the jump times and exploit the deterministic behavior between jumps, see Algorithm 1. In our medical framework, it is made even more simple since only few jumps are allowed. When little knowledge is available about the underlying process, a classic approach is to resort to particle filters Θp [38]. 
A particle filter Θp at step n is a discrete uniform probability distribution with finite support Bp (where Bp may have repeated atoms). It is updated at step n + 1 though simulations: states s from Bp are updated through a one-step simulation to a new state s′, and selected to be added to Bp if the simulated observation is close to the true one. As an alternative filter to compare to, we propose to use a conditional filter Θc derived from the exact filter (that is the conditional distribution of the hidden state given the observations) from [27]. The exact conditional filter is updated through a recurrence formula involving ratios of integrals over the state space. By discretizing the state space, one can construct the approximation Θc of the exact filter. Unlike the particle filter that has a dynamically changing support with a uniform mass function, this conditional filter has a fixed support (the discretized space) with changing mass functions that are updated through analytical ratios of weighted sums. To estimate the future expected cost at some node of the tree, POMCP requires to simulate many full trajectories from the current node to a leaf of the tree. This requires to apply an arbitrary strategy to pick actions at every future nodes for which a decision has not yet been optimized. This arbitrary strategy is called a rollout strategy in the POMCP framework. The most naive rollout strategy consists in uniformly randomly selecting decisions from the decision set {∅, a, b} × {15, 30, 60}. We consider instead a mode-based rollout strategy, which consists in choosing action ∅ in mode 0 (no treatment if the simulated patient is in remission), action a in mode 1 and b in mode 2 (most efficient treatment if the simulated patient has relapsed) and a fixed next visit date of 15 days. This rollout strategy, while not being necessarily optimal (depending on the cost function it might be optimal not to treat at the beginning of a relapse, or to treat preventively when in remission), exploits knowledge of the cost function, hence yields better estimates of action costs at time t. Note also that this mode-based policy is not applicable for real patients, since their mode is not observed. It is only applicable to simulated patients. This is fine since POMCP’s rollout strategy is only used through simulations to estimate costs. POMCP has a number of tuning parameters (number of simulations, number of particles in the filter, exploration vs exploitation rates) which are described (as well as the impact of varying their values) in the following section. Following up a patient with adapted POMCP is easy and fast in practice The previous paragraphs set the grounds for optimizing the long-term follow up of patients. In practice, we will assume a patient will enter the follow-up study once they enter the remission phase after an initial round of treatment. The practitioner may hence assume that their current state is known, i.e. s0 = (0, ζ0, 0) and the initial value of both the particle and conditional filters is the Dirac mass at s0. The initial observation is ω0 = (ζ0, t0). The adapted POMCP algorithm is run to obtain the optimal decision d0, which the practitioner can use (if they decide to) to allocate treatment and decide on the next visit date t1. At visit n, the patient will come back for some new marker measurement, so that the n-th observation value ωn = (yn, tn) is obtained. 
The practitioner will have access to their full history, hn = 〈ω0d0ω1d1…dn−1ωn〉 as well as their last belief filter, or . An initial update of the filter is performed, either using the recursion formula for from and ωn, or by particle filtering through rejection sampling for from and ωn. The adapted POMCP algorithm is then ran to obtain the optimal current decision dn, which the practitioner can use (or not) to allocate treatment and decide on the next visit date tn+1. This is illustrated in Fig 2. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Practice of patient follow-up. a) At each new visit the patient has a new marker measurement, and the practitioner receives a new observation ωn = (yn, tn). b) The filter is updated with the new observation, either through particle rejection sampling (particle filter) or via a recursion formula (conditional filter). c) The decision tree is partially explored via simulation through an adapted POMCP algorithm using the updated filter. The algorithm returns the optimal decision dn, combination of a time to next visit (defining tn+1) and treatment to allocate (influencing yn+1). https://doi.org/10.1371/journal.pone.0315661.g002 Adapted POMCP can be tuned to outperform dynamic programming The simulation study presented in this section has been conducted by exploiting real data obtained from the Centre de Recherche en Cancérologie de Toulouse (CRCT). Multiple myeloma (MM) is the second most common haematological malignancy in the world and is characterised by the accumulation of malignant plasma cells in the bone marrow. Classical treatments are based on chemotherapies, which, if appropriate, act fast and efficiently bring MM patients to remission in a few weeks. However almost all patients eventually relapse more than once and the five-year survival rate is around 50%. We have obtained data from the Intergroupe Francophone du Myélome 2009 clinical trial [22] which has followed 748 French MM patients from diagnosis to their first relapse on a standardized protocol for up to six years. At each visit a blood sample was obtained to evaluate the amount of monoclonal immunoglobulin protein in the blood, a marker for the disease progression. An example of patient dataset is given in Fig 1. Based on these data, we calibrated our PDMP model as described in the Methods section, and we performed simulations to evaluate the performance of the POMCP strategy to select the combination of treatment and next visit date at each time point of the trajectories (these time-points being themselves selected by the algorithm). The performance of the approach was measured by a Monte-Carlo estimate of the expectation and confidence interval of its value as well as the runtime of the online computation of a complete trajectory. For each disease parameters’ configuration 500 simulations were performed to estimate these values. Codes and parameters are available at https://github.com/acleynen/pomcp4pdmp [39]. Study 1: Impact of the parameters’ values on POMCP’s performance. We evaluated the impact of the value of 6 parameters on the performance: (i) the filter chosen (conditional or particle), (ii) the rollout procedure chosen, (iii) the exploitation/exploration tradeoff parameter ;′, (iv) the number nsearch of simulations in the online POMCP procedure, (v) the number K of initial states to sample from at each of the nsearch simulations, and (vi) the internal POMCP precision parameter to select particles in the particle filter. 
Those parameters are described at length in the Methods Section, see Algorithm 2. The Github page [39] contains tables with results for all sets of parameter values that were tested. In this section we describe the most important results. We performed all parameter comparisons for both the conditional and the particle filters. The conditional particle filter is a discrete probability distribution on a finite fixed support Bc of size 184 with 81 states in condition m = 0, 31 states in condition m = 1, 71 states in condition m = 2 and one state in condition m = 3. The choice of these states is discussed in [27]. To adapt this filter to the POMCP environment, at each iteration n we start by randomly sampling K states s from Bc with the distribution given by . For the particle filter, this number K directly corresponds to the number of particles in the filter, hence the size of Bp. Note that for the conditional filter the support Bc does not change over time, whereas for the particle filter Bp keeps the same size but possibly contains different states at each iteration. Mode-based rollout outperforms naive rollout. We found that the uniform rollout procedure (selecting decisions randomly) produced very poor results compared to the mode-based rollout procedure and hence we only present results for the mode-based rollout policy here. POMCP is robust to the exploration/exploitation trade-off. We found that the exploitation/exploration trade-off parameter had little influence on the overall performance, with the exception of extreme values (α′ = 0.99, almost no exploration, and α′ = 0.2, almost no exploitation, both leading to poorer performance). We also tried several adaptive strategies to select the value of α′ depending on the confidence in the belief (measured in terms of entropy) which did not improve the results. The following results are therefore discussed for a fixed value of α′ = 0.5, but the reader may refer to the Tables 4 and 5 in S1 File for additional results on these parameters. Increasing the number of exploratory simulations yields the best performance gain. For a fixed number of sampled belief states K and precision , increasing the number of simulations nsearch improved the performance of the algorithm while decreasing the variance in the simulations for both filter types (see the top left panel of Fig 3 for K = 500, ). In the case of the conditional filter, the runtime increases linearly, from 103 seconds per trajectory with nsearch = 100 simulations to 104 seconds per trajectory with nsearch = 1000. In the case of the particle filter, the runtime is 3 times as much for 100 simulations since when nsearch < K additional simulations must be performed to compute the particle filter at time n + 1. The difference decreases to only 1.2 times the runtime of the conditional filter for nsearch = 1000. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Impact of POMCP parameters on the estimated value function. Top left: increasing the number of simulations for filters with 500 initial states improves the average trajectory costs. Top right: increasing the number of atoms in the filter improves the performance of the particle filter but not the conditional filter. https://doi.org/10.1371/journal.pone.0315661.g003 The particle filter requires a large belief state to achieve high performance. The particle filter requires a large belief state to achieve high performance. 
As expected, for a fixed number of exploratory simulations nsearch and precision , increasing the number K of particles in the particle filter led to a tremendous improvement for the particle filter together with a significantly decreased variance in the trajectory costs, while it had no impact on the conditional filter (see top right panel of Fig 3 for nsearch = 500, ). Similarly, K does not have an impact on the runtime of the conditional filter, while it leads to an exponential increase in the runtime of the particle filter. POMCP for controlled PDMPs requires one additional tuning: the precision of the tree observation nodes. The bottom two panels of Fig 3 illustrate the impact of the precision for a fixed filter size (K = 500) and two different numbers of simulations (nsearch = 500, left, and nsearch = 1000, right). For a smaller number of simulations, decreasing the precision improves the result up to , and then worsens them again. This tendency is still observed for the particle filter when nsearch increases, while the performance of the conditional filter is optimal with the loosest precision, . Those results are the consequence of two factors: as detailed in the Methods Section, each simulation creates a novel node in the tree exploration, where a node is a set of potential future trajectories with their estimated costs. When the precision is very fine, each simulation produces a different future observation and hence the estimation relies on one-step forward simulations which may miss future events. On the other hand, when the precision is very loose, each simulation will build on the previous simulation to explore a step forward, yielding very long trees with few branches. This will also miss the variability of different outcomes. The second factor comes from the way the filters are constructed. Particle filters are based on the comparison of simulations with observations. When the precision is very loose, almost all simulations will be accepted, creating a strong bias in the belief of the current state, that will propagate from time-point to time-point. As the conditional filter update does not rely on simulations, hence neither on precision, there is no propagation of uncertainty from step to step, and when the number of exploratory simulations nsearch is large enough to guarantee some diversity in the tree exploration, estimating the cost of each decision from longer trajectories will provide better results. Finally, one may note that the gain in using conditional filters is mostly apparent in extreme parameter scenarios, for instance with very low number of particles K, with very high precision rates, etc. Provided the user has enough computing budget, both filters tend to provide very similar results. Study 2: Adapted POMCP outperforms the dynamic programming approach. In this study we compare the results of five resolution strategies calibrated with their optimal parameters on biological relevant outcomes: the death rate, the Progression-Free Survival (PFS) time, that is, the time from study entry to the first relapse, the time spent under treatment, the number of visits to the hospital and the cost. Those quantities were normalized so that they range between 0 and 1, and such that an optimal result is 0. 
Study 2: Adapted POMCP outperforms the dynamic programming approach. In this study we compare the results of five resolution strategies calibrated with their optimal parameters on biologically relevant outcomes: the death rate, the Progression-Free Survival (PFS) time, that is, the time from study entry to the first relapse, the time spent under treatment, the number of visits to the hospital, and the cost. Those quantities were normalized so that they range between 0 and 1, and such that an optimal result is 0. To do so, the death rate was normalized so that a random treatment strategy yields 1 (here 5% of patients); the PFS was transformed as 1 − (PFS/H) (where we recall that H is the study horizon), so that a patient who does not relapse has normalized PFS equal to 0; the time spent under treatment was normalized by H; the number of visits was normalized linearly between the two extremes over the horizon, a visit every 15 days producing 160 visits and a visit every 60 days producing 40 visits; and finally the cost was normalized with respect to v0, the best approximation of the optimal value obtained through discretizations in [27], and Cthreshold, the average cost of the threshold strategy. Here again, we simulated 500 trajectories with each strategy under the same cost parameters. The results are summarized in the radar plot of Fig 4, and additional visual information on the average trajectory cost is given in the barplots. In the radar plot representation, a perfect strategy would coincide with the inner circle.

Fig 4. a) Radar plot comparing performances of 4 solution strategies on death rate, progression-free survival (PFS), time spent under treatment, average number of visits per patient, and average trajectory cost. An optimal strategy would be the inner circle. b) Barplot of trajectory cost for 500 simulations under three main strategies: POMCP with particle filter, POMCP with conditional filter and Dynamic Programming on discretized processes. https://doi.org/10.1371/journal.pone.0315661.g004

The compared strategies are the discretization/DP approach (DP) from [27], which relies on exact resolution by dynamic programming of the discretized POMDP, the adapted POMCP with the conditional filter (POMCP-Conditional), the adapted POMCP with the particle filter (POMCP-Particles), a threshold strategy which assigns treatment when the marker reaches a given threshold (Threshold), and a random strategy agnostic to the marker values (Random). The threshold strategy was set to treat when the marker exceeds 5 and until it returns to a nominal value. The decision-maker starts with treatment a, and adapts the treatment if the marker does not decrease at the next visit. Visits are set every 60 days. The random strategy randomly selects a treatment (including not treating) and a next visit date at each new timepoint, regardless of the marker value. Interestingly, the combination of POMCP with the conditional filter yields the lowest average trajectory cost and the shortest average time spent under treatment, while slightly increasing the number of visits and reducing the PFS compared to the DP approach. However, the particle-based POMCP approach (relying fully on simulations with no other exploitation of the underlying model) yields a cost almost as good as the previous two approaches, with an increased number of visits but the longest progression-free survival time. Importantly, out of 500 simulations, one of the trajectories ended with a patient dying. Not surprisingly, the threshold strategy has a higher cost than the other approaches, as it does not aim at maintaining the marker value as close as possible to the nominal value: the trajectories are therefore highly penalized by the delay in starting treatment. One might still note that the lower number of visits does not compensate for the delay in treatment, and that this threshold strategy still leads to more time under treatment for the patients.
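For concreteness, the normalization of the five outcomes can be sketched as follows. The linear rescalings used for the visit count and the cost are assumptions consistent with the anchors stated above (40 to 160 visits over the horizon; v0 and Cthreshold for the cost), not formulas reproduced from the text.

def normalize_outcomes(death_rate, pfs, treatment_time, n_visits, cost,
                       H, random_death_rate, v0, c_threshold):
    """Map the five outcomes to [0, 1], with 0 as the ideal value (sketch)."""
    return {
        "death rate": death_rate / random_death_rate,   # random strategy maps to 1
        "PFS": 1.0 - pfs / H,                           # no relapse maps to 0
        "time under treatment": treatment_time / H,
        "visits": (n_visits - 40.0) / (160.0 - 40.0),   # assumed linear rescaling
        "cost": (cost - v0) / (c_threshold - v0),       # assumed linear rescaling
    }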
Discussion

In this paper, a mechanistic PDMP model for cancer evolution and treatment has been presented. While illustrated here on a Multiple Myeloma dataset, our approach is directly implementable in any other long-term disease monitoring framework, whether for different types of cancers or for chronic diseases such as asthma, diabetes, etc. In particular, the framework is easily adaptable to different schemes of marker sampling times, from high frequency (such as in diabetes monitoring) to low frequency (such as cancer follow-up, cardiovascular diseases, etc). PDMPs are very flexible tools that make it possible to model routine screening data with very few parameters, each with a biological meaning. In particular, they extend state-of-the-art modeling of tumor growth through Lotka-Volterra-based equations by allowing abrupt changes in the parameters. When embedded in a control framework, PDMPs usually suffer from the need to compute intractable integrals and thus resort to several layers of approximations with a heavy computational burden. In our case, there are two major difficulties in solving the optimization problem for the controlled PDMP. The first one is related to the partial observations of the process, since the practitioner only observes some noisy measurement of the marker at visit dates, and not the overall condition of the patient nor the relapse dates. The second one comes from the continuous state space and continuous-time dynamics of the process, which prevent direct use of exhaustive exploration solution strategies such as dynamic programming [40]. In a previous work [27], we dealt with those difficulties by defining an equivalent fully observable Markov decision process on an enlarged state space through the use of conditional filters, the conditional distributions of the hidden process given the observations. We then discretized the state space of the original process, in order to obtain finite-support filters, and discretized again these finite-support approximate filters to obtain finitely many (belief) states. The fully discretized model can then be solved by dynamic programming. Here we investigated another original solution approach exploiting filter objects under a different (simulation-based) dimension reduction strategy. We show that the inherent generator function of the PDMP can be exploited to make use of simulation-based solution strategies such as POMCP with excellent performance. Provided the number of simulations is large enough (either to explore the outcome space, or to construct consistent belief states), this approach can even outperform discretization approaches that exploit the knowledge of the underlying model. We have also proposed to combine both approaches, using discretization-based conditional filters and a simulation-based solution strategy, resulting in a more robust algorithm (in particular less sensitive to the choice of POMCP parameters and with more stable variance), but with little performance gain.
The main advantage of the discretization/DP approach is that solutions are pre-computed for all new patients. This is especially useful under the assumption that all patients have the same dynamics with the same parameters. This leads to an explicit strategy directly implementable by the practitioner, who receives blood test results and can directly be assisted by a corresponding treatment strategy. Its main drawback is that the model presented here is at maximum complexity for such an approach. In particular, it would become intractable if one wanted to take into account more disease markers or more modes and treatments. Conversely, the simulation-based approach can extend to any complexity provided simulations can be performed easily and fast. In addition, one of the main advantages of simulation-based approaches such as the one presented here is that cost parameters can be modified for each new patient, yielding a patient-based procedure closer to precision medicine. This remains quite theoretical, as in practice calibrating cost parameters is a very difficult task, but with experience practitioners may be able to encode personal preferences, such as a shorter life with better quality, or a longer life at the price of more treatments. On the other hand, upon receiving a new blood sample result, the algorithm has to be rerun, taking a few hours on a high-performance computer. In this work, we have adapted the POMCP algorithm to solve a controlled PDMP with a known model, too complex to be solved through exact dynamic programming. The model presented here shows how PDMPs can address the limitations of current state-of-the-art approaches, but it still faces two major limitations that we aim to tackle in future work. The first is that while we allow for several types of relapses, we do not model the acquisition of treatment resistance and its effect on the dynamics. A more realistic model would include a dependency of both the treatment efficacy and the relapse intensity on the number of previous treatments. The second limitation is that we made the assumption that the patient-disease model was known, which is a daring assumption. Therefore, one next step of our approach is to extend it by considering an unknown model and applying Reinforcement Learning methods [41]. A fundamental problem in Reinforcement Learning is the difficulty of deciding whether to select actions in order to learn a better model of the environment, or to exploit current knowledge about the rewards and effects of actions [42]. This is especially true in disease control problems [43, 44].

Methods

Datasets, parameters and code availability

In order to propose a simulation study as realistic as possible, we have used real data to infer the parameters of the design. The data come from the follow-up of 748 multiple myeloma patients registered in the 2009 IFM clinical trial described in the Results Section. The protocol was approved by the institutional ethics committee at the coordinating center (Purpan Hospital, Toulouse, France). All the patients provided written informed consent to have data from their medical records used in research. An example of data is given in Fig 1. From these data, we opted for the exponential form of the dynamics in the disease states, with boundaries ζ0 = 1 and D = 40 for the remission and death levels, simply calibrated as the minimal and maximal values in the data set.
Then, as described in the Results Section, three parameters had to be calibrated, and all hyperparameters are explicitly given in the GitHub repository [39]. For the risk function λ, we chose to distinguish the standard relapse (from remission to a disease state) from the therapeutic escape (from a disease state under appropriate treatment to the other disease state). We then further separate the risk by disease and treatment, so that for any treatment ℓ and any state s = (m, ζ, u), the risk decomposes by disease and treatment as specified in Table 1.

Table 1. Risk function of the controlled continuous time PDMP. https://doi.org/10.1371/journal.pone.0315661.t001

For the standard relapse, the risk μi for disease i was chosen as a piecewise linear increasing function, calibrated such that the risk of relapsing increases until some duration τ1 (the average occurrence time of standard relapses), then remains constant, and further increases between τ2 and τ3 years (to model late-relapsing or non-relapsing patients). This function and the corresponding density are illustrated in Fig 5 for disease b.

Fig 5. Risk and density functions for standard relapse from remission condition to disease b condition (similar shapes for standard relapse to disease a). https://doi.org/10.1371/journal.pone.0315661.g005

For the therapeutic escape, we chose to fit a Weibull survival distribution, parameterized to account for a higher relapse risk when the marker decreases. We arbitrarily fixed one of its parameters and calibrated the other so that only about 5% of patients experience a therapeutic escape. The aggressiveness of the disease and the efficiency of the treatment, v and v′, may depend on both the treatment ℓ and the mode m; the specific values are given in Table 2.

Table 2. Flow Φ of the controlled continuous time PDMP. https://doi.org/10.1371/journal.pone.0315661.t002

After setting aside patients that do not relapse (about 20%), we first estimated remission and relapse times using maximal slope difference, and then fitted exponential regression models on each segment. We then clustered patients based on their relapse coefficient and chose the number of clusters using the slope heuristic on the residual sum of squares. We obtained two groups (22% and 78% of the relapsing patients, respectively) and used the group averages as the corresponding aggressiveness values. We then computed the average of the treatment parameters for each group. The data do not contain patients relapsing under treatment, and therefore we could not estimate v for therapeutic escapes or inappropriate treatments. Because we assume the aggressiveness in those circumstances should be smaller than under a standard relapse, we chose smaller values for these parameters. By separating the risk λ by disease, the kernel function Q is automatically fitted, since we assume that the marker level does not jump at relapses, and the mode is selected by the risk clock leading to the jump, whichever rings first, see Table 3.

Table 3. Markov transition kernel Q for the controlled PDMP. https://doi.org/10.1371/journal.pone.0315661.t003

Finally, we arbitrarily selected a centered Gaussian distribution with noise parameter σ² = 1 for the observation process.
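As an illustration of the shape just described, a piecewise linear standard-relapse risk can be sketched as below. The breakpoints tau1, tau2, tau3 and the slopes are placeholders, not the calibrated values behind Table 1.

def standard_relapse_risk(u, tau1=1.0, tau2=3.0, tau3=5.0, slope1=0.5, slope2=0.3):
    """Piecewise linear risk in the time u since remission started (sketch):
    increases up to tau1, stays constant until tau2, increases again until tau3,
    then stays constant (late or non-relapsing patients)."""
    if u <= tau1:
        return slope1 * u
    if u <= tau2:
        return slope1 * tau1
    if u <= tau3:
        return slope1 * tau1 + slope2 * (u - tau2)
    return slope1 * tau1 + slope2 * (tau3 - tau2)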
We resorted to an extensive simulation study to select cost parameters that seemed reasonable (very few patients dying over our study horizon, on average not more than a fourth of the follow-up time spent under treatment, etc). We arbitrarily fixed the visit cost CV to 1. We then fixed the death cost M to 110 so that, with visits every 15 days and an early relapse, a patient would rather die than spend the entire horizon under treatment with numerous visits. We then fixed β = 0.1 so that the penalty of applying an unnecessary treatment would count as 1.5, 3 or 6 times the visit cost, depending on the choice of the next visit date r. Finally, we selected κ = 1/6 from extensive simulations so that for low marker observations (typically when it is hard to decide between relapse and remission under high noise) it might be preferable to wait for new data acquisition rather than treat by default. All codes and parameters are available at https://github.com/acleynen/pomcp4pdmp [39].

Reminder on controlled PDMPs

Here is a description of how to simulate a trajectory of a controlled PDMP between two consecutive visits to the medical center. The controlled PDMP parameters are λ, the disease risk function (distribution of the duration until the next jump, i.e. condition change), Q, the Markov kernel defining the stochastic transition to the state reached after the next jump, and the parameters of the exponential deterministic behavior of the marker between two jumps, defined from the current mode and treatment applied; for the sake of unified notation, the latter are denoted below by a single rate v.

Algorithm 1 Simulation of a trajectory between two consecutive decision times of a controlled PDMP.
1: procedure SimulatePDMP(m, ζ, u, ℓ, r)
2:  t ← 0
3:  
4:  while t < r do
5:   S ∼ λ
6:   S ← min{S, t*(m, ζ, u, ℓ)}
7:   if t + S > r then
8:    return m, ζexp(v(r − t)), u + (r − t)
9:   else
10:    t ← t + S
11:    ζ ← ζexp(vS)
12:    u ← u + S
13:    m ∼ Q(⋅|m, ζ, u, ℓ)
14:    u ← 0
15:  

Procedure SimulatePDMP takes as input an initial position Xt = s = (m, ζ, u), with mode m, marker level ζ and time since the last jump u, and a decision d = (ℓ, r), with treatment ℓ to be applied for a duration r until the next visit to the medical center, and returns the state Xt+r = s′ = (m′, ζ′, u′) of the process at time t + r given that treatment ℓ was applied. At line 5, S ∼ λ means that S is sampled from the distribution with risk function λ, i.e. with survival function P(S > s) = exp(−∫₀ˢ λ(m, ζexp(vt), u + t, ℓ) dt). At line 6, t*(m, ζ, u, ℓ) is the (deterministic) time to reach either the nominal value ζ0 or the death level D from the current point (m, ζ, u) (if no change of condition or treatment occurs). The third variable u, representing the time since the last jump, allows the transitions described in SimulatePDMP to be Markovian, i.e. independent of the previous transitions.
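A runnable sketch of this procedure is given below. The jump-time sampler, transition kernel and flow-rate function are assumed interfaces standing in for λ, Q and the mode- and treatment-specific rate v; they are not the implementation of [39].

import math

def simulate_pdmp(m, zeta, u, ell, r, sample_jump_time, kernel, flow_rate,
                  zeta0=1.0, D=40.0):
    """Sketch of Algorithm 1: simulate the controlled PDMP between two
    consecutive decision times under treatment ell for a duration r."""
    t = 0.0
    while t < r:
        v = flow_rate(m, ell)                  # exponential slope in the current mode
        S = sample_jump_time(m, zeta, u, ell)  # candidate jump time, S ~ lambda
        # deterministic time t* to reach the nominal level zeta0 (marker decreasing)
        # or the death level D (marker increasing) along the exponential flow
        if v > 0:
            t_star = math.log(D / zeta) / v
        elif v < 0:
            t_star = math.log(zeta0 / zeta) / v
        else:
            t_star = math.inf
        S = min(S, t_star)
        if t + S > r:                          # no further jump before the next visit
            return m, zeta * math.exp(v * (r - t)), u + (r - t)
        t += S
        zeta *= math.exp(v * S)
        u += S
        m = kernel(m, zeta, u, ell)            # new mode drawn from Q
        u = 0.0
    return m, zeta, u

For instance, sample_jump_time could invert the survival function given above, and kernel could implement the transition rules of Table 3.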
Reminder on Partially Observed Monte-Carlo Planning

In this section we give a description of the original POMCP algorithm. A Python implementation of the POMCP algorithm can be found here: https://github.com/GeorgePik/POMCP. The algorithm is called iteratively at each observation step n and requires several inputs in order to output a decision dn to be used by the operator. The inputs include a set of possible decisions (denoted A), a history hn = 〈ω0d0ω1 … dn−1ωn〉 of successive observations and decisions up to step n, including the current observation ωn, an approximate (particle) filter Θp(hn) with support Bp(hn), a simulator of the state and observation at step n + 1 together with their cost given a state s and decision d at step n, a stopping criterion Timeout and an arbitrary Rollout strategy to provide a heuristic evaluation of a history, whenever needed.

Algorithm 2 Original POMCP algorithm [30].
1: procedure POMCP(hn)
2:  repeat
3:   if hn = ∅ then
4:    
5:   else
6:    s ∼ Θp(hn)
7:   Simulate(s, hn)
8:  until Timeout()
9:  d* ← argmin_d V(hnd)
10:  V(hn) ← min_d V(hnd)
11:  return d*
12: procedure Rollout(s, h)
13:  if s.t = H then
14:   return 0
15:  d ∼ πrollout(h)
16:  
17:  return c + Rollout(s′, hdω)
18: procedure Simulate(s, h)
19:  if s.t ≥ H then return 0
20:  if then
21:   for all d ∈ A do
22:    
23:   C ← Rollout(s, h)
24:   return C
25:  
26:  
27:  C ← c + Simulate(s′, hd*ω)
28:  Bp(h) ← Bp(h) ∪ {s}
29:  N(h) ← N(h) + 1
30:  N(hd*) ← N(hd*) + 1
31:  
32:  return C

The POMCP algorithm involves several data structures. Simulated states s = (m, ζ, u). Decisions d = (ℓ, r), belonging to a finite decision space A, preferably small. Observations ω = (y, t); the original algorithm assumes that they belong to a finite observation space Ω of limited size. Histories, which represent sequences of decisions and observations of variable lengths: they are the concatenation of the sequence of past observations/decisions hn = 〈ω0d0ω1d1…dn−1ωn〉 plus an arbitrary sequence of future observations/decisions, built using the rollout strategy and a simulation model of the POMCP (line 28, h ← hd*ω). Finally, a tree data structure rooted at the initial history hn; each node of the tree corresponds to an extended history, ended by either a decision or an observation. Each simulation step creates a novel node in the tree, and histories h are attached to the tree by appending the corresponding novel decision/observation to the parent node's history. In addition, we also attach to each node (i) integer numbers N(h) or N(hd), where N(h) corresponds to the number of times the history h has been simulated and N(hd) to the number of times d has been selected after h was encountered, (ii) real numbers V(hd) corresponding to an estimate of the cost value of hd, that is, the expected sum of future costs obtained if the optimal policy is applied after h has been encountered and d selected, until the final decision step, and (iii) Θp(h), called a particle filter, which is a discrete uniform distribution on a set Bp(h) of states s compatible with the current history h. When Simulate is called with a history h that already belongs to the tree, it updates the values of N(h) and Bp(h) as well as the values N(hd) and V(hd) for all the successor nodes hd. It is a property of the algorithm that whenever a history belongs to the tree, its successor nodes do as well. When Simulate is called with a history h which does not yet belong to the tree (as is the case initially for hn), it appends h as well as all its successor nodes hd to the tree and initializes their values N(h), Bp(h), N(hd) and V(hd). Procedure Simulate is based on a generator function that generates a successor (hidden) state s′, an observation ω and an immediate cost c from a decision d applied in the current (hidden) state s. Repeated calls to this generator are used to progressively expand the tree.
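The per-node quantities above can be represented, for instance, by the following data structures; the class and field names are illustrative, not those of the authors' implementation.

from dataclasses import dataclass, field

@dataclass
class ObservationNode:
    """Node for a history ending with an observation."""
    N: int = 0                                     # number of times h was simulated
    B: list = field(default_factory=list)          # particle set B_p(h)
    children: dict = field(default_factory=dict)   # decision d -> DecisionNode

@dataclass
class DecisionNode:
    """Node for a history extended by a decision d."""
    N: int = 0                                     # number of times d was selected after h
    V: float = 0.0                                 # estimated cost value V(hd)
    children: dict = field(default_factory=dict)   # (binned) observation -> ObservationNode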
Simulation sequences and updates are performed, starting from hn, until the Timeout() function requires to stop (generally, after an arbitrary number nsearch of trajectories have been simulated or a fixed amount of time has been spent). Then, the decision d* which minimizes V(hnd*) is applied to the real-world system, and a real-life observation ω ∈ Ω is obtained. The new real-world history becomes hn+1 = hnd*ω and the tree is pruned, so that the new tree is rooted in hn+1. The interest of pruning, instead of starting with an empty tree in hn+1, is to exploit past simulations in the computation of the next decision. POMCP proposes strategies to select the input rollout and filters when the user has no knowledge of the process. A typical rollout strategy may simply involve selecting the decisions randomly from the set A. The following particle filter update procedure, included in procedure Simulate, is suggested. If h = ∅, sample s from an arbitrary initial belief state; in this initial step of the algorithm, we thus simulate random particles from an arbitrary belief. If h ≠ ∅, sample s ∼ Θp(h), where Θp(h) is the uniform discrete distribution on the finite nonempty set Bp(h); indeed, if h ≠ ∅, this means that procedure Simulate(s′, h) has already been called at least once for some s′ and thus Bp(h) ≠ ∅ (Bp(h) contains at least s′). Sample a transition (s′, ω′, c) from the generator applied to (s, d*); this step is performed in line 27 of the POMCP algorithm. If |ω − ω′| = 0, update Bp(hd*ω) ← Bp(hd*ω) ∪ {s′}. Particle filters may pose problems of convergence, particle deprivation or even failure of the algorithm (as explained in Section "Adapted POMCP algorithm to the case of controlled PDMPs"). In their paper, [30] mentioned particle reinvigoration as a means of dealing with particle deprivation. Later, [45] proposed a generalization of the POMCP algorithm (POMCPOW) to continuous state, action and observation spaces, which uses weighted beliefs but relies on an explicitly known observation function (which we do not assume). [46] further extend POMCPOW by exploiting analytical knowledge of the model to reduce the number of necessary particles in the filter. However, this comes at the price of increased computational complexity. In practice, we propose a simple adaptation of the original filter of [30] to handle continuous observations and particle deprivation.

Adapted POMCP algorithm to the case of controlled PDMPs

It seems natural to apply a POMCP algorithm to optimize a PDMP control strategy in the context of disease control. Indeed, PDMPs have natural generator functions, since algorithm SimulatePDMP can be naturally augmented with an observation simulator and a cost function in order to obtain a generator function as described above. We propose three adaptations of the original POMCP algorithm to exploit the particular framework of controlled PDMPs.

Rollout policies. The original POMCP algorithm describes the possible rollout policies as admissible policies, meaning that action choices should only depend on the history of past actions and observations. Instead, we exploit here the PDMP generator, which provides both hidden states and observations. This makes it possible to design rollout policies exploiting hidden states instead of noisy observations. In our medical framework, one may build interesting rollout policies exploiting the hidden mode of the disease to provide good heuristics to the simulation part of POMCP. In practice, we propose to compare two rollout policies: the (admissible) uniform policy, which selects decisions uniformly at random from A, and the (non-admissible) mode policy, which selects the treatment adapted to the current hidden mode. The mode policy, being based on the full observation of the process, is likely to underestimate the real cost of an optimal control policy, which is a useful property for the convergence of a heuristic search method [47]. In our simulation study we observed that a mode-based rollout policy can be particularly efficient.
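A sketch of these two rollout policies is given below. The mapping from hidden mode to treatment in the mode policy is an illustrative assumption (no treatment in remission, the matching treatment in each disease mode), not the exact rule used in [39].

import random

DECISIONS = [(ell, r) for ell in (None, "a", "b") for r in (15, 30, 60)]

def uniform_rollout_policy(history):
    """Admissible uniform policy: any (treatment, next-visit delay) at random."""
    return random.choice(DECISIONS)

def mode_rollout_policy(state):
    """Non-admissible mode policy: uses the hidden mode m provided by the PDMP
    generator; assumed mapping: no treatment in remission (m = 0), treatment a
    in disease a (m = 1), treatment b in disease b (m = 2)."""
    m, zeta, u = state
    treatment = {0: None, 1: "a", 2: "b"}.get(m, None)
    return treatment, random.choice((15, 30, 60))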
Observation space. The time, state and observation spaces of the disease control PDMP model are continuous. This means that the probability of simulating exactly the same history h twice is zero if we apply the POMCP procedure as such. Thus, the tree depth and the size of the filter supports Bp(h) may never exceed 1. This is particularly problematic, since the POMCP convergence proof only holds when Θ(h) is close to the true empirical belief state, which (approximately) holds when the size of the support Bp(h) tends to +∞. Indeed, the POMCP procedure in [30] requires that Bp(h) contains at least K particles when h is non-empty. In practice, for the controlled PDMP case, observations are made of pairs (y, t) of a continuous-valued observed marker level and the discrete time of the current decision. Therefore, we discretize the observation space into a set of contiguous intervals and group together observations belonging to the same interval. The continuous nature of the model also prevents the exact computation of the filter, hence we resort to the use of particle or conditional filters. The construction of the former is slightly adapted from the initial POMCP algorithm to fit our model. Indeed, as the true process still produces continuous-valued observations, the last action of the particle filter update procedure is modified to update Bp(hd*ω) with s′ only if |ω − ω′| is below a tolerance chosen by the user, typically of the same magnitude as the discretization precision. It may still happen that this procedure selects only states with the wrong mode m at step n (i.e. Θp(hn), which is only an approximation of the true filter, does not contain the true hidden mode m), and hence cannot generate compatible states at step n + 1, due to the diverging dynamics of the process in the different modes. This is not a simple matter of statistical accuracy, but a very practical problem. When this happens, we say that Bp(hn+1) is deprived of particles and we cannot go on applying POMCP to hn+1. This problem of particle deprivation can be mitigated by a few modifications of the standard particle filter construction. First, assume that Bp(hn+1) is empty or too small, whatever the number of simulations of sn ∼ Bp(hn) followed by a call to the generator we perform. Then we may go back in the history and resimulate sn−1 ∼ Bp(hn−1), followed by two successive calls to the generator: a first call under decision dn−1 and, provided the simulated observation is compatible with ωn, a second call under decision dn, hoping that the resulting observation is now compatible with ωn+1. A particle sn+1 is then added to Bp(hn+1) whenever the two above compatibility conditions are met. Second, assume that Bp(hn+1) is non-empty but still too small (|Bp(hn+1)| ≪ K) after the previous step has been applied a large number of times. We may then perform particle reinvigoration by resampling particles from Bp(hn+1) and duplicating them. Finally, when everything fails, we may perform a large number of sampled transitions from Bp(hn) (e.g. 1000 × K) and keep the K particles s′ among the generated samples (s′, ω′, c′) with minimal distance |ωn+1 − ω′|, for some arbitrary distance definition.
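The tolerance-based acceptance and the brute-force fallback can be sketched as follows. Here generator() is an assumed interface returning (next state, observation, cost), and the names and the max_draws default are illustrative choices.

import random

def compatible(omega_real, omega_sim, tol):
    """Accept a simulated observation if its marker value is within tol of the
    real one, at the same decision time."""
    (y, t), (y_sim, t_sim) = omega_real, omega_sim
    return t == t_sim and abs(y - y_sim) <= tol

def rebuild_particles(B_prev, decision, omega_real, generator, K, max_draws=None):
    """Brute-force fallback: simulate many transitions from the previous belief
    and keep the K particles whose simulated observation is closest to the real
    observation omega_real."""
    max_draws = max_draws or 1000 * K
    candidates = []
    for _ in range(max_draws):
        s = random.choice(B_prev)
        s_next, omega_sim, _cost = generator(s, decision)
        candidates.append((abs(omega_real[0] - omega_sim[0]), s_next))
    candidates.sort(key=lambda pair: pair[0])
    return [s for _, s in candidates[:K]]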
In our experimental studies, we applied these three modifications in turn, whenever needed, until we obtained belief states Bp(hn) of cardinality at least K.

Filters. We propose to modify the original particle filter of POMCP to incorporate the conditional filter. The algorithm is hence modified as follows: starting from an initial arbitrary belief filter for h0 = ∅ as in the original POMCP, when the new history becomes hn+1 = hnd⋆ω, we compute the new filter as a deterministic function of the previous filter and of d⋆ and ω (see [27] for its specific form), and we sample K particles from it to generate a set of plausible hidden states. Depending only on the current belief and the new observation (and not on simulations), this filter does not suffer from the propagation of approximations and particle deprivation. Moreover, the computational burden of the simulations is replaced by the computation of weighted sums, which are particularly efficient in matrix programming languages.
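As an illustration of this last point, a conditional-filter step can be reduced to a few matrix operations. The transition matrix and observation likelihood below are assumed inputs standing in for the exact update of [27].

import numpy as np

def conditional_filter_step(weights, transition_matrix, obs_likelihood, K, rng=None):
    """One update of the conditional filter on its fixed support: propagate the
    weights through the (decision-dependent) transition matrix, reweight by the
    likelihood of the new observation, renormalize, then sample K particles."""
    if rng is None:
        rng = np.random.default_rng()
    w = obs_likelihood * (transition_matrix.T @ weights)
    w = w / w.sum()
    particle_idx = rng.choice(len(w), size=K, p=w)   # indices into the fixed support
    return w, particle_idx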
Supporting information

S1 File. https://doi.org/10.1371/journal.pone.0315661.s001 (PDF)

TI - A Monte-Carlo planning strategy for medical follow-up optimization: Illustration on multiple myeloma data JO - PLoS ONE DO - 10.1371/journal.pone.0315661 DA - 2024-12-19 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/a-monte-carlo-planning-strategy-for-medical-follow-up-optimization-LgN5qLlLCE SP - e0315661 VL - 19 IS - 12 DP - DeepDyve ER -