Towards Improving Health Decisions with Reinforcement Learning

Finale Doshi-Velez

Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course


TRANSCRIPT

Page 1:

Towards Improving Health Decisions with Reinforcement Learning

Finale Doshi-Velez

Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course

Page 2:

Our Lab: ML Towards Effective, Interpretable Health Interventions

Predicting and Optimizing Interventions in ICU (Wu et al. 2015; Ghassemi et al. 2017; Peng 2018; Raghu 2018; Gottesman 2018; Gottesman 2019)

HIV Management Optimization (Parbhoo et al. 2017; Parbhoo et al. 2018)

Depression Treatment Optimization (Hughes et al. 2016; Hughes et al. 2017; Hughes et al. 2018; Pradier 2019)

Page 3:

Our Lab: ML Towards Effective, Interpretable Health Interventions

Predicting and Optimizing Interventions in ICU (Wu et al. 2015; Ghassemi et al. 2017; Peng 2018; Raghu 2018; Gottesman 2018; Gottesman 2019)

HIV Management Optimization (Parbhoo et al. 2017; Parbhoo et al. 2018)

Depression Treatment Optimization (Hughes et al. 2016; Hughes et al. 2017; Hughes et al. 2018; Pradier 2019)

Today: How can reinforcement learning help solve problems in healthcare?

Page 4:

Our Lab: ML Towards Effective, Interpretable Health Interventions

Predicting and Optimizing Interventions in ICU (Wu et al. 2015; Ghassemi et al. 2017; Peng 2018; Raghu 2018; Gottesman 2018; Gottesman 2019)

HIV Management Optimization (Parbhoo et al. 2017; Parbhoo et al. 2018)

Depression Treatment Optimization (Hughes et al. 2016; Hughes et al. 2017; Hughes et al. 2018; Pradier 2019)

Today: How can reinforcement learning help solve problems in healthcare?

Focus: Situations that require a sequence of decisions

Page 5:

Challenges in the Health Space

● The data are typically available only in batch
   ● No control over the clinician policy!

● The data give very partial views of the process
   ● Measurements, confounders missing
   ● Intents missing

● Success is not always easy to quantify

BUT: We still want to extract as much from these data as we can!

Page 6:

Problem Set-Up

[Diagram: the agent-world loop. The agent sends an action to the world; the world returns an observation and a reward; s_t is the agent's internal state.]
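To make the loop concrete, here is a minimal sketch of the interaction the diagram describes; the `env` and `agent` interfaces (reset/step/act/observe) are hypothetical stand-ins, not code from the talk.

```python
def run_episode(env, agent, max_steps=100):
    """Roll out one episode of the agent-world loop from the diagram:
    the agent picks an action, the world returns an observation and
    reward, and the agent updates its internal state s_t."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)               # agent -> world: action
        obs, reward, done = env.step(action)  # world -> agent: observation, reward
        agent.observe(obs, reward)            # agent updates its state s_t
        total_reward += reward
        if done:
            break
    return total_reward
```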

Page 7:

Solutions: Train Model/Value Function

Solves the long-term problem (e.g. Ernst 2005; Parbhoo 2014; Marivate 2015), often in simulation/simplified settings.

[Diagram: a graphical model unrolled over three time steps. At each step, a hidden state emits the observations (CD4, viral load, mutations), and the action is the drug combination given.]

Rewards:

If V_t > 40:  r_t = -0.7 log V_t + 0.6 log T_t - 0.2 |M_t|
Else:         r_t = 5 + 0.6 log T_t - 0.2 |M_t|
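Transcribed directly into code, the reward looks as follows; a minimal sketch, assuming V_t is the viral load, T_t the CD4 count, and |M_t| the number of mutations (the variable names are ours, not from the talk).

```python
import math

def hiv_reward(viral_load, cd4_count, num_mutations):
    """Reward from the slide: penalize viral load and mutation count,
    reward CD4 count, with a +5 bonus once the virus is suppressed
    (viral load at or below 40)."""
    base = 0.6 * math.log(cd4_count) - 0.2 * num_mutations
    if viral_load > 40:
        return -0.7 * math.log(viral_load) + base
    return 5.0 + base
```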

Page 8:

Solutions: Nonparametric

Use the full patient history to predict immediate outcomes (e.g. Bogojeska 2012), but often ignore long term effects.

f(CD4, etc., drug) = I(v < 400 in 21 days)

[Diagram: the patient space, with a nonparametric predictor f mapping a patient's history and drug to whether the viral load drops below 400 within 21 days.]

Page 9:

Our insight: These approaches have complementary strengths!

[Diagram: patients scattered in the patient space.]

Page 10:

[Diagram: the patient space.]

Patients in clusters may be best modeled by their neighbors.

Our insight: These approaches have complementary strengths!

Page 11:

[Diagram: the patient space.]

Patients without neighbors may be better modeled with a parametric model.

Our insight: These approaches have complementary strengths!

Page 12:

New Solution: Ensemble the Predictors

h( POMDP action, kernel action, patient statistics ) = actual action

[Diagram: an ensemble h maps the two experts' suggested actions, plus patient statistics, to the clinician's actual action.]
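Read as code, the ensemble is a gating function trained to match the clinician's actual action; below is a minimal sketch, assuming a pretrained, hypothetical `gate` model that scores how reliable the neighbor-based expert is for this patient.

```python
def ensemble_action(patient_stats, pomdp_action, kernel_action, gate):
    """Mixture-of-experts over two policies: `gate` maps patient
    statistics (e.g. distance to neighbors, history length) to the
    probability that the kernel (neighbor) expert should be trusted."""
    p_kernel = gate(patient_stats)  # hypothetical trained gating model
    return kernel_action if p_kernel > 0.5 else pomdp_action
```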

Page 13:

Application to HIV Management

● 32,960 patients from the EuResist database; 3,000 held out for testing.

● Observations: CD4s, viral loads, mutations

● Actions: 312 common drug combinations (from 20 drugs)

Approach                 DR Reward (doubly robust estimate)
Random Policy            -7.31 ± 3.72
Neighbor Policy           9.35 ± 2.61
Model-Based Policy        3.37 ± 2.15
Policy-Mixture Policy    11.52 ± 1.31
Model-Mixture Policy     12.47 ± 1.38

Parbhoo et al., AMIA 2017; Parbhoo et al., PLoS Medicine 2018

Page 14:

Application to HIV Management

● 32,960 patients from the EuResist database; 3,000 held out for testing.

● Observations: CD4s, viral loads, mutations

● Actions: 312 common drug combinations (from 20 drugs)

Approach                 DR Reward
Random Policy            -7.31 ± 3.72
Neighbor Policy           9.35 ± 2.61
Model-Based Policy        3.37 ± 2.15
Policy-Mixture Policy    11.52 ± 1.31
Model-Mixture Policy     12.47 ± 1.38

Parbhoo et al., AMIA 2017; Parbhoo et al., PLoS Medicine 2018


Extension: Putting the mixing in the model.

Page 15:

Application to HIV Management

● 32,960 patients from the EuResist database; 3,000 held out for testing.

● Observations: CD4s, viral loads, mutations

● Actions: 312 common drug combinations (from 20 drugs)

Approach                 DR Reward
Random Policy            -7.31 ± 3.72
Neighbor Policy           9.35 ± 2.61
Model-Based Policy        3.37 ± 2.15
Policy-Mixture Policy    11.52 ± 1.31
Model-Mixture Policy     12.47 ± 1.38

Parbhoo et al., AMIA 2017; Parbhoo et al., PLoS Medicine 2018

Page 16:

And: Our hypothesis was correct! The model is used when neighbors are far.

[Figure: kernel vs. POMDP usage as a function of 2nd-quantile neighbor distance and of history length.]

Page 17:

Application to Sepsis Management

● Cohort of 15,415 patients with sepsis from the MIMIC dataset (same as Raghu et al. 2017); contains vitals and some lab tests.

● Actions: focus on vasopressors and fluids, used to manage circulation.

● Goal: reduce 30-day mortality; rewards based on probability of 30-day mortality:

Peng et al., AMIA 2018

r(o, a, o') = -log[ f(o') / (1 - f(o')) ] + log[ f(o) / (1 - f(o)) ]

where f(o) is the estimated probability of 30-day mortality under observation o.
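In code, this reward is the decrease in the log-odds of 30-day mortality between consecutive observations; a minimal sketch, assuming f returns the risk model's mortality probability for an observation.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sepsis_reward(f_obs, f_next_obs):
    """Reward from the slide: r(o, a, o') = logit(f(o)) - logit(f(o')),
    i.e. positive when the estimated 30-day mortality risk drops."""
    return logit(f_obs) - logit(f_next_obs)
```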

Page 18:

Minor Adjustment: Values, not Models

h( LSTM+DDQN action, kernel action, patient statistics ) = actual action

Generative models hard to build → LSTM+DDQN

LSTM+DDQN suggests never-taken actions → hard cap.

Page 19:

Application to Sepsis Management

Page 20:

Application to Sepsis Management

Page 21:

Application to Sepsis Management

Page 22:

Just the start: statistical methods have high variance

Gottesman et al. 2018, Raghu et al. 2018

Page 23:

And select non-representative cohorts

Page 24:

And select non-representative cohorts

More issues arise with poor representation choices and poor reward functions.

Page 25:

How can we increase confidence in our results?

[Diagram: the agent-world loop, annotated with four levers for increasing confidence: Reward, Representation, Off-policy Eval, and Checks++.]

Page 26:

Off-Policy Evaluation

[Diagram: the agent-world loop, with Off-policy Eval highlighted among Reward, Representation, and Checks++.]

Page 27:

Off-Policy Evaluation

Core question: Given data collected under some behavior policy π_b, can we estimate the value of some other evaluation policy π_e?

Three main kinds of approaches:
• Importance-sampling: reweight current data (high variance)
• Model-based: build model with current data, simulate (high bias)
• Value-based: apply value evaluation to current data (high bias)

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)
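As a reference point, here is a minimal sketch of the ordinary importance-sampling estimator built from these weights; `pi_e` and `pi_b` are assumed to be callables returning action probabilities, and discounting is omitted for brevity.

```python
import numpy as np

def is_ope(trajectories, pi_e, pi_b):
    """Ordinary importance-sampling estimate of pi_e's value from data
    logged under pi_b. Each trajectory is a list of (s, a, r) tuples."""
    returns = []
    for traj in trajectories:
        rho = 1.0
        for s, a, _ in traj:
            rho *= pi_e(s, a) / pi_b(s, a)  # per-step importance ratio
        returns.append(rho * sum(r for _, _, r in traj))
    return float(np.mean(returns))
```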

Page 28:

Off-Policy Evaluation

Core question: Given data collected under some behavior policy π_b, can we estimate the value of some other evaluation policy π_e?

Three main kinds of approaches:
• Importance-sampling: reweight current data (high variance)
• Model-based: build model with current data, simulate (high bias)
• Value-based: apply value evaluation to current data (high bias)

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

Page 29:

Stitching to Increase Sample Sizes

Importance sampling-based estimators suffer because most importance weights get small very fast:

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

One way to ameliorate the issue: “stitch” trajectories with zero weight to get more non-zero weight trajectories.

[Diagram: a sequence s1 → s2 → s3 → s4 desired under π_e, overlapping two real weight-0 sequences.]

(Sussex et al, ICMLWS 2018)
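A greedy sketch of the stitching idea (not the exact algorithm of Sussex et al.): chain logged transitions whose states approximately match and whose actions agree with π_e, so the assembled trajectory has non-zero weight. A deterministic `pi_e` and the `state_match` predicate are assumptions for illustration.

```python
def stitch_trajectory(transitions, pi_e, start_state, state_match, horizon=10):
    """Assemble a non-zero-weight trajectory for pi_e by stitching
    together logged transitions. `transitions` is a flat list of
    (s, a, r, s_next) tuples pooled from all logged trajectories."""
    stitched, current = [], start_state
    for _ in range(horizon):
        candidates = [t for t in transitions
                      if state_match(t[0], current) and t[1] == pi_e(current)]
        if not candidates:
            break  # nothing to stitch from here; stop early
        s, a, r, s_next = candidates[0]  # greedy: take the first match
        stitched.append((s, a, r, s_next))
        current = s_next
    return stitched
```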

Page 30:

Stitching to Increase Sample Sizes

Importance sampling-based estimators suffer because most importance weights get small very fast:

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

One way to ameliorate the issue: “stitch” trajectories with zero weight to get more non-zero weight trajectories.

[Diagram: a sequence s1 → s2 → s3 → s4 desired under π_e, assembled by stitching two real weight-0 sequences into one non-zero-weight sequence.]

Page 31:

Stitching to Increase Sample Sizes

Importance sampling-based estimators suffer because most importance weights get small very fast:

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

One way to ameliorate the issue: “stitch” trajectories with zero weight to get more non-zero weight trajectories.

[Diagram: a sequence s1 → s2 → s3 → s4 desired under π_e, assembled by stitching two real weight-0 sequences into one non-zero-weight sequence.]

Page 32:

Off-Policy Evaluation

Core question: Given data collected under some behavior policy π_b, can we estimate the value of some other evaluation policy π_e?

Three main kinds of approaches:
• Importance-sampling: reweight current data (high variance)
• Model-based: build model with current data, simulate (high bias)
• Value-based: apply value evaluation to current data (high bias)

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

Page 33:

Better Models: Mixtures help again!

(Gottesman et al, ICML 2019)

[Diagram: a rollout alternating between well-modeled areas, where simulation is accurate, and poorly modeled areas, where simulation is inaccurate and real transitions are used instead.]

We use RL to bound the long-term accuracy of the value estimate.

Page 34:

Better Models: Mixtures help again!

[Diagram: a rollout alternating between well-modeled areas, where simulation is accurate, and poorly modeled areas, where simulation is inaccurate and real transitions are used instead.]

We use RL to bound the long-term accuracy of the value estimate.

Page 35:

Bound on the Quality

|g_T - ĝ_T| ≤ L_r ∑_{t=0}^{T} γ^t ∑_{t'=0}^{t-1} (L_t)^{t'} ε_t(t - t' - 1) + ∑_{t=0}^{T} γ^t ε_r(t)

Here |g_T - ĝ_T| is the total return error; the first term is the error due to state estimation, and the second term is the error due to reward estimation.

Closely related to bound in - Asadi, Misra, Littman. “Lipschitz Continuity in Model-based Reinforcement Learning.” (ICML 2018).

Page 36:

Estimating Errors

Parametric: ε̂_{t,p} ≈ max Δ(x_{t'+1}, f_t(x_{t'}, a))

Nonparametric: ε̂_{t'}(x_{t'}, a)

[Figure: illustrations of the parametric (worst-case model discrepancy) and nonparametric (local, per-point) error estimates over x.]

Page 37:

Toy Example

[Figure: a toy example showing the parametric model's fit and the possible actions.]

Page 38:

Example with HIV Simulator

We use RL to bound the long-term accuracy of the value estimate.

Page 39:

Better Models: Designed for Evaluation

Main objective: find a model that will minimize error in individual treatment effects,

E_{s0}[ (V^π(s0) - V̂^π(s0))^2 ],

rather than only the error in the average effect, (E_{s0}[V^π(s0)] - E_{s0}[V̂^π(s0)])^2, where the value function is estimated via trajectories from an approximated model M. Question: Can we do better than just optimizing M for p(M|data)?

Show this can be optimized via a transfer-learning type objective:

L(M) = ∑_{n,t} l(M, n, t) + ∑_{n,t} ρ_{nt} l(M, n, t) + ...

where the first sum is the “on-policy” loss and the second is the “reweighted for π_e” loss.

(Liu et al, NIPS 2018)
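A sketch of this objective in code, under the assumption that the per-step model losses l(M, n, t) and cumulative importance ratios ρ_{nt} have already been computed as arrays; the trailing "+ ..." terms from the slide are omitted.

```python
import numpy as np

def evaluation_aware_loss(step_losses, rho):
    """Transfer-style model-learning objective from the slide.
    step_losses[n, t] = l(M, n, t), the model's loss on step t of
    trajectory n; rho[n, t] = importance ratio toward pi_e.
    Returns the 'on-policy' loss plus the 'reweighted for pi_e' loss."""
    on_policy = step_losses.sum()
    reweighted = (rho * step_losses).sum()
    return on_policy + reweighted  # "+ ..." terms on the slide omitted
```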

Page 40:

Better Models: Designed for Evaluation

Main objective: find a model that will minimize error in individual treatment effects,

E_{s0}[ (V^π(s0) - V̂^π(s0))^2 ],

rather than only the error in the average effect, (E_{s0}[V^π(s0)] - E_{s0}[V̂^π(s0)])^2, where the value function is estimated via trajectories from an approximated model M. Question: Can we do better than just optimizing M for p(M|data)?

Show this can be optimized via a transfer-learning type objective:

L(M) = ∑_{n,t} l(M, n, t) + ∑_{n,t} ρ_{nt} l(M, n, t) + ...

where the first sum is the “on-policy” loss and the second is the “reweighted for π_e” loss.

Page 41:

Checking the reasonableness of our policies

[Diagram: the agent-world loop, with Checks++ highlighted among Reward, Representation, and Off-policy Eval.]

Page 42:

Some Basic Digging

Page 43:

Positive Evidence: Reproducing across sites (robust to covariate shift)

Our HIV results hold across two distinct cohorts.

[Figure: results on the EuResist database and the Swiss HIV Cohort.]

Page 44:

Positive Evidence: Check importance weights, variances

Sepsis: results hold with different control variates

Page 45:

Ask the Experts

Page 46:

Asking the Doctors

● HIV: Checking against standard of care:

● As well as three expert clinicians:

Page 47:

Asking the Doctors

● HIV: Checking against standard of care:

● As well as three expert clinicians:

What’s the best way to “ask the doctors”?

Page 48:

Detour: Summarizing a Treatment Policy

How can we best communicate a treatment policy to a clinical expert? Formalize as the following game:

Us: Present expert with some state-action pairs

Expert: Predict the agent’s action in a new state, s’

Our Goal: choose the state-action pairs so the expert predicts the best.

(Lage et al, IJCAI 2019)

Page 49:

Example 1: Gridworld

Given: [example state-action pairs] … What happens in states like: [a new state]?

Page 50:

Example 2: HIV Simulator

Given: [example state-action pairs] … What happens in states like: [a new state]?

Page 51:

Finding: Humans use different methods in different scenarios

[Figure: results on HIV and Gridworld.]

Page 52:

…and it’s important to account for it!

[Figure: accuracy on Gridworld and HIV when examples are selected as predicted by the model vs. at random.]

Page 53:

Offering Options

Page 54:

In Progress: Displaying Diverse Alternatives

(Masood et al, IJCAI 2019)

If policies can’t be statistically differentiated, share all the options.

[Figure: a navigation task from Start to Goal with several alternative paths.]

Page 55:

In Progress: Displaying Diverse Alternatives

(Masood et al, IJCAI 2019)

If policies can’t be statistically differentiated, share all the options.

[Figure: a navigation task from Start to Goal with several alternative paths.]

Page 56:

Applied to Hypotension Management

Page 57:

Applied to Hypotension Management

Example for a single decision point

Page 58:

Reward Design

[Diagram: the agent-world loop, with Reward highlighted among Representation, Off-policy Eval, and Holistic Checks.]

Page 59:

In Progress: IRL to Identify Rewards

(Lee and Srinivasan et al, IJCAI 2019; Srinivasan, in submission)

Page 60:

Going Forward

RL in the health space is tricky, but has potential in several settings. Let’s

● Think holistically about how RL can provide value in a human-agent system.

● Be careful with analyses but not turn away from messy problems!

Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course

Page 61:

Going Forward

RL in the health space is tricky, but has potential in several settings. Let’s

● Think holistically about how RL can provide value in a human-agent system.

● Be careful with analyses but not turn away from messy problems!

Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course

Page 62:

Modeling Improvement #2: Personalizing to patient dynamics

Assume that there exists some small latent vector that would allow us to personalize to the patient’s dynamics (HiP-MDP).

Killian et al. NIPS 2017; Yao et al. ICML LLARLA workshop 2018

[Diagram: two transition models sharing structure but differing in the latent parameter: T(s'|s,a,θ) for one patient and T(s'|s,a,θ') for another.]

Page 63:

Modeling Improvement #2: Personalizing to patient dynamics

Assume that there exists some small latent vector that would allow us to personalize to the patient’s dynamics (HiP-MDP).

Killian et al. NIPS 2017; Yao et al. ICML LLARLA workshop 2018

[Diagram: two transition models sharing structure but differing in the latent parameter: T(s'|s,a,θ) for one patient and T(s'|s,a,θ') for another.]

Consider two planning approaches:
1. Plan given T(s'|s,a,θ)
2. Directly learn a policy a = π(s,θ)
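A sketch of the two approaches in code (the interfaces are hypothetical): a shared dynamics model conditioned on the per-patient latent vector θ supports planning, while a policy can instead condition on θ directly.

```python
import numpy as np

def hip_mdp_transition(s, a, theta, dynamics):
    """Approach 1: one shared dynamics model, personalized by the
    small latent vector theta, predicts s' for use in planning.
    `dynamics` is a hypothetical learned model; s, a, theta are
    assumed to be vectors."""
    return dynamics(np.concatenate([s, a, theta]))

def personalized_action(s, theta, policy):
    """Approach 2: skip the model and learn a = pi(s, theta) directly."""
    return policy(np.concatenate([s, theta]))
```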

Page 64:

Modeling Improvement #2: Personalizing to patient dynamics

Results with a (simple) HIV simulator

Killian et al. NIPS 2017; Yao et al. ICML LLARLA workshop 2018

Page 65:

Off-policy Evaluation Challenges: Sensitive to Algorithm Choices

Page 66:

Off-policy Evaluation Challenges: Sensitive to Algorithm Choices

Sepsis: Neural networks definitely not calibrated...

Page 67:

Off-policy Evaluation Challenges: Sensitive to Algorithm Choices

kNN is more calibrated

Calibration helps

Page 68:

In Progress: Displaying Diverse Alternatives

(Masood et al, IJCAI 2019)

If policies can’t be statistically differentiated, give plausible alternatives.

Page 69:

SODA-RL Applied to Hypotension Management

Quantitative Results: Safety and quality are important to consider