Concurrent Markov Decision Processes
Mausam, Daniel S. Weld
University of Washington, Seattle
Planning
[Figure: the agent-environment loop. The planner receives percepts from the environment, asks "What action next?", and issues actions back to the environment.]
Motivation
- Two features of real world planning domains:
  - Concurrency (widely studied in the Classical Planning literature)
    - Some instruments may warm up
    - Others may perform their tasks
    - Others may shut down to save power
  - Uncertainty (widely studied in the MDP literature)
    - All actions (pick up the rock, send data, etc.) have a probability of failure
- Need both!
Probabilistic Planning
- Probabilistic planning is typically modeled as a Markov Decision Process.
- Traditional MDPs assume a “single action per decision epoch”.
- Solving Concurrent MDPs in the naïve way incurs exponential blowups in running times.
Outline of the talk
- MDPs
- Concurrent MDPs
- Present sound pruning rules to reduce the blowup
- Present sampling techniques to obtain orders-of-magnitude speedups
- Experiments
- Conclusions and Future Work
Markov Decision Process
- S : a set of states, factored into Boolean variables
- A : a set of actions
- Pr : S × A × S → [0,1], the transition model
- C : A → R, the cost model
- γ : the discount factor, γ ∈ (0,1]
- s0 : the start state
- G : a set of absorbing goals
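As a concrete reading of this tuple, a minimal container might look like the sketch below (the Python field names and types are illustrative, not from the talk):

    from dataclasses import dataclass
    from typing import Callable, Dict, Set, Tuple

    State = Tuple[bool, ...]   # a state is an assignment to the Boolean variables
    Action = str

    @dataclass
    class MDP:
        states: Set[State]                                          # S
        actions: Set[Action]                                        # A
        transition: Callable[[State, Action], Dict[State, float]]   # Pr(s' | s, a)
        cost: Callable[[Action], float]                             # C(a)
        gamma: float                                                # discount factor, in (0, 1]
        s0: State                                                   # start state
        goals: Set[State]                                           # absorbing goals G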
GOAL of an MDP
- Find a policy (S → A) which:
  - minimises the expected discounted cost of reaching a goal
  - for an infinite horizon
  - for a fully observable Markov decision process
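In symbols, a standard statement of this objective (not copied from the slide) is:

    \pi^{*} \;=\; \arg\min_{\pi : S \to A} \;
      \mathbb{E}\Big[\, \textstyle\sum_{t \ge 0} \gamma^{t}\, C\big(\pi(s_t)\big) \;\Big|\; s_0,\ \pi \Big]

with cost accruing only until an absorbing goal in G is reached.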
Bellman Backup
- Define J*(s), the optimal cost, as the minimum expected cost to reach a goal from state s.
- Given an estimate Jn of the J* function, back up Jn at state s to calculate a new estimate Jn+1 as follows:
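The equation on this slide was an image that did not survive extraction; the standard backup it refers to, in the notation of the following slides (Qn+1, Ap(s)), is:

    Q_{n+1}(s,a) \;=\; C(a) \;+\; \gamma \sum_{s' \in S} \Pr(s' \mid s, a)\, J_n(s'),
    \qquad
    J_{n+1}(s) \;=\; \min_{a \in Ap(s)} Q_{n+1}(s,a)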
Value Iteration
- Perform Bellman updates at all states in each iteration (sketched below).
- Stop when costs have converged at all states.
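A compact sketch of this loop, assuming the MDP container above and a helper applicable(s) that returns Ap(s); the convergence tolerance epsilon is an illustrative choice:

    def value_iteration(mdp, applicable, epsilon=1e-4):
        """Repeat Bellman backups at every state until the costs converge."""
        J = {s: 0.0 for s in mdp.states}
        while True:
            residual = 0.0
            for s in mdp.states:
                if s in mdp.goals:
                    continue                      # absorbing goals keep zero cost
                J_new = min(
                    mdp.cost(a) + mdp.gamma * sum(p * J[s2]
                        for s2, p in mdp.transition(s, a).items())
                    for a in applicable(s))
                residual = max(residual, abs(J_new - J[s]))
                J[s] = J_new
            if residual < epsilon:                # converged at all states
                return J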
Bellman Backup
[Figure: backup at state s. Each applicable action a1, a2, a3 in Ap(s) gets a value Qn+1(s,a) computed from the successor estimates Jn, and Jn+1(s) is the minimum of these Q-values.]
RTDP Trial
[Figure: one step of an RTDP trial. The backup at state s picks the greedy action amin = a2, the minimizer of Qn+1(s,a) over Ap(s); the trial then follows sampled outcomes of greedy actions until it reaches the Goal.]
Real Time Dynamic Programming
(Barto, Bradtke and Singh ’95)
- Trial: simulate the greedy policy; perform Bellman backups on the visited states (sketched below).
- Repeat RTDP trials until the cost function converges.
- Anytime behaviour
- Only expands the reachable state space
- Complete convergence is slow
- Labeled RTDP (Bonet & Geffner ’03)
  - Admissible, if started with an admissible cost function
  - Monotonic; converges quickly
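A sketch of one trial under the same assumed interfaces as the earlier snippets: greedy action selection with a backup at each visited state, then sampling the next state from the greedy action's outcome distribution.

    import random

    def rtdp_trial(mdp, applicable, J, max_steps=1000):
        """Simulate the greedy policy from s0, performing a Bellman backup
        at every visited state (one RTDP trial)."""
        s = mdp.s0
        for _ in range(max_steps):
            if s in mdp.goals:
                break
            best_a, best_q = None, float("inf")
            for a in applicable(s):               # greedy backup at s
                q = mdp.cost(a) + mdp.gamma * sum(
                    p * J.get(s2, 0.0) for s2, p in mdp.transition(s, a).items())
                if q < best_q:
                    best_a, best_q = a, q
            J[s] = best_q
            outcomes = mdp.transition(s, best_a)  # sample the next state
            s = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
        return J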
Concurrent MDPs
Redefining the Applicability function
- Ap : S → P(P(A))
- Inheriting mutex definitions from Classical planning (a code sketch follows the examples):
  - Conflicting preconditions:
    a1 : if p1 set x1
    a2 : if ¬p1 set x1
  - Conflicting effects:
    a1 : set x1 (pr=0.5)
    a2 : toggle x1 (pr=0.5)
  - Interfering preconditions and effects:
    a1 : if p1 set x1
    a2 : toggle p1 (pr=0.5)
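A sketch of the mutex test under an assumed action representation: each action carries a set of (variable, required value) precondition literals and the set of variables its effects may touch. These attribute names are not from the talk.

    from collections import namedtuple

    Act = namedtuple("Act", ["name", "preconds", "effect_vars"])

    def mutex(a1, a2):
        # Conflicting preconditions: the same variable required with opposite values.
        if any((v, not val) in a2.preconds for (v, val) in a1.preconds):
            return True
        # Conflicting effects: both actions may change the same variable.
        if a1.effect_vars & a2.effect_vars:
            return True
        # Interfering preconditions and effects: one action reads what the other writes.
        reads1 = {v for v, _ in a1.preconds}
        reads2 = {v for v, _ in a2.preconds}
        return bool(reads1 & a2.effect_vars or reads2 & a1.effect_vars)

    # First example pair above: conflicting preconditions (p1 vs. not p1).
    a1 = Act("a1", {("p1", True)},  {"x1"})
    a2 = Act("a2", {("p1", False)}, {"x1"})
    assert mutex(a1, a2)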
Concurrent MDPs (contd)
- Ap(s) = { Ac ⊆ A |
  - all actions in Ac are individually applicable in s, and
  - no two actions in Ac are mutex }
- ⇒ The actions in Ac don’t interact with each other; hence the combined transition function factors over the individual actions, as sketched below.
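The equation that followed “Hence,” is missing from the extraction. One standard way to write the combined transition under this non-interaction assumption (apply the actions in any order; the order is irrelevant because no two are mutex) is:

    \Pr\big(s' \mid s, A_c = \{a_1,\dots,a_k\}\big)
      \;=\; \sum_{s_1,\dots,s_{k-1}} \Pr(s_1 \mid s, a_1)\, \Pr(s_2 \mid s_1, a_2)\cdots \Pr(s' \mid s_{k-1}, a_k)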
Concurrent MDPs (contd)
- Cost Model
  - C : P(A) → R
  - Typically, C(Ac) < Σ_{a ∈ Ac} C({a})
    - Time component
    - Resource component
  - (If C(Ac) = Σ_{a ∈ Ac} C({a}), then the optimal sequential policy is also optimal for the concurrent MDP.)
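As one illustrative instantiation of the time/resource split (the max/sum forms and the weight symbols are assumptions; only the 0.2/0.8 weights appear later, in the experiments):

    C(A_c) \;=\; w_t \cdot \max_{a \in A_c} \Delta(a) \;+\; w_r \cdot \sum_{a \in A_c} \rho(a),
    \qquad w_t = 0.2,\ \ w_r = 0.8

where Δ(a) is the duration and ρ(a) the resource usage of action a. Because concurrent actions share the time term, such a cost sits strictly below the sum of the single-action costs, matching the condition C(Ac) < Σ_{a ∈ Ac} C({a}) above.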
Bellman Backup (Concurrent MDP)
[Figure: backup at state s where Ap(s) contains a1, a2, a3, {a1,a2}, {a1,a3}, {a2,a3} and {a1,a2,a3}. Every applicable combination gets a Q-value from the successor estimates Jn, and Jn+1(s) is the minimum; enumerating all combinations causes an exponential blowup in the cost of a Bellman backup.]
Outline of the talk
- MDPs
- Concurrent MDPs
- Present sound pruning rules to reduce the blowup
- Present sampling techniques to obtain orders-of-magnitude speedups
- Experiments
- Conclusions and Future Work
Combo skipping (a proven sound pruning rule)
- If ⌈Jn(s)⌉ < γ^(1-k) · Qn(s,{a1}) + func(Ac, γ), then prune Ac for state s in this backup (⌈·⌉ denotes an upper bound, ⌊·⌋ a lower bound, and k = |Ac|).
- Use Qn(s, Aprev) as the upper bound of Jn(s).
- Choose a1 as the action with maximum Qn(s,{a1}) to obtain maximum pruning.
- Skips a combination only for the current iteration. ☹
Combo elimination (a proven sound pruning rule)
- If ⌊Q*(s,Ac)⌋ > ⌈J*(s)⌉, then eliminate Ac from the applicability set of state s.
- Use Qn(s,Ac) as a lower bound of Q*(s,Ac).
- Use J*sing(s) (the optimal cost of the single-action MDP) as an upper bound of J*(s).
- Eliminates the combination Ac from the applicable list of s for all subsequent iterations (a code sketch follows). ☺
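A sketch of combo-elimination folded into a concurrent backup. The names are assumptions: q_value(s, Ac) stands for the concurrent Q-computation (a lower bound on Q*(s,Ac) when the value function is initialized admissibly) and j_upper[s] holds an upper bound on J*(s), such as J*sing(s).

    def backup_with_elimination(s, combos, q_value, j_upper):
        """Concurrent Bellman backup over Ap(s) = combos, permanently eliminating
        combinations whose Q lower bound exceeds the upper bound on J*(s)."""
        best = float("inf")
        for Ac in list(combos):          # iterate over a copy so we can remove
            q = q_value(s, Ac)
            if q > j_upper[s]:
                combos.remove(Ac)        # Ac can never be optimal at s: drop it for good
                continue
            best = min(best, q)
        return best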
Pruned RTDP
- RTDP with modified Bellman backups:
  - Combo-skipping
  - Combo-elimination
- Guarantees:
  - Convergence
  - Optimality
Experiments
- Domains
  - NASA Rover domain
  - Factory domain
  - Switchboard domain
- Cost function
  - Time component: 0.2
  - Resource component: 0.8
- State variables: 20-30
- Avg(Ap(s)): 170 - 12287
Speedups in Rover domain
Stochastic Bellman Backups
- Sample a subset of combinations for a Bellman backup.
- Intuition: actions with low Q-values have a high likelihood of being in the optimal combination.
- Sampling distribution (a sketch follows this slide):
  - (i) Calculate all single-action Q-values.
  - (ii) Bias towards choosing combinations containing actions with low Q-values.
  - Also include the best combinations for this state from the previous iteration (memoization).
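A sketch of the biased sampling in step (ii). The softmax-style weighting and the temperature parameter are illustrative choices rather than the talk's exact distribution; memoized best combinations from the previous iteration can simply be appended to the returned sample.

    import math
    import random

    def sample_combinations(single_q, combos, num_samples, temperature=1.0):
        """Pick num_samples action combinations for a stochastic Bellman backup,
        favouring combinations built from actions with low single-action Q-values."""
        def weight(Ac):
            # Lower total single-action Q-value => larger sampling weight.
            return math.exp(-sum(single_q[a] for a in Ac) / temperature)
        weights = [weight(Ac) for Ac in combos]
        return random.choices(combos, weights=weights, k=num_samples)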
Sampled RTDP
- Non-monotonic
- Inadmissible
- ⇒ Convergence and optimality not proven.
- Heuristics:
  - Complete backup phase (labeling)
  - Run Pruned RTDP with the value function from Sampled RTDP (after scaling)
Speedup in the Rover domain
Close to optimal solutions

Problem   J*(s0) (S-RTDP)   J*(s0) (Optimal)   Error
Rover1    10.7538           10.7535            <0.01%
Rover2    10.7535           10.7535            0
Rover3    11.0016           11.0016            0
Rover4    12.7490           12.7461            0.02%
Rover5    7.3163            7.3163             0
Rover6    10.5063           10.5063            0
Rover7    12.9343           12.9246            0.08%
Art1      4.5137            4.5137             0
Art2      6.3847            6.3847             0
Art3      6.5583            6.5583             0
Fact1     15.0859           15.0338            0.35%
Fact2     14.1414           14.0329            0.77%
Fact3     16.3771           16.3412            0.22%
Fact4     15.8588           15.8588            0
Fact5     9.0314            8.9844             0.56%
Speedup vs. Concurrency
Varying the num_samples
Contributions
- Modeled Concurrent MDPs
- Sound, optimal pruning methods
  - Combo-skipping
  - Combo-elimination
- Fast sampling approaches
  - Close to optimal solutions
  - Heuristics to improve optimality
- Our techniques are general and can be applied to any algorithm: VI, LAO*, etc.
Related Work
- Factorial MDPs (Meuleau et al. ’98; Singh & Cohn ’98)
- Multiagent planning (Guestrin, Koller & Parr ’01)
- Concurrent Markov Options (Rohanimanesh & Mahadevan ’01)
- Generate, test and debug paradigm (Younes & Simmons ’04)
- Parallelization of sequential plans (Edelkamp ’03; Nigenda & Kambhampati ’03)
Future Work
- Find error bounds and prove convergence for Sampled RTDP
- Concurrent Reinforcement Learning
- Modeling durative actions (Concurrent Probabilistic Temporal Planning)
  - Initial results: Mausam & Weld ’04 (AAAI Workshop on MDPs)
Concurrent Probabilistic Temporal Planning (CPTP)
- Concurrent MDP → CPTP
- Our solution (AAAI Workshop on MDPs):
  - Model CPTP as a Concurrent MDP in an augmented state space.
  - Present admissible heuristics to speed up the search and manage the state-space blowup.