CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 11(1), 1–20 (1999)

An adaptive load distribution algorithm for resolving bursty workload

QIN LU1† AND SAU-MING LAU2∗

1 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: csluqin@camp.polyu.edu.hk)
2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong (e-mail: smlau@cse.cuhk.edu.hk)

∗ Correspondence to: S.-M. Lau, The Chinese University of Hong Kong, Shatin, Hong Kong.
† This work was done when Qin Lu was associated with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong.

Received 22 April 1997; accepted 7 May 1998
SUMMARY
Most existing dynamic load distribution (LD) algorithms assume a fairly stable task arrival pattern. With this assumption, single task assignments are adequate to provide reasonably good performance. They are, however, inadequate when tasks arrive in bursts. In this paper, we propose an LD algorithm based on batch task assignments. The algorithm is tailored to systems subject to bursty workload. The key to this algorithm is the dynamic negotiation of the amount of workload to be transferred between a sender–receiver pair. Dynamic negotiations ensure the algorithm’s adaptive behavior, thus allowing task congestions to be resolved quickly. Consequently, CPU utilization can be increased and average task response time reduced substantially. The dynamic negotiations are conducted by the GR Protocol, which also avoids processor thrashing and state waggling – two undesirable phenomena that commonly exist in LD algorithms.
Copyright 1999 John Wiley & Sons, Ltd.
1. INTRODUCTION
In a distributed computing system (DCS) where processing nodes are connected by a
local area network, it is very likely that some nodes have higher transient workload than
others[1,2]. It is desirable for such workload imbalance to be smoothed out by making
use of the spare computing capacity of those idle or lightly loaded nodes, so that the
average task response time can be minimized. A dynamic load distribution (LD) algorithm
achieves this objective by making task distribution decisions based on current system load
information[3,4]. Existing LD algorithms commonly use single task assignments. That is,
at most one task is transferred from a sender to a receiver during each negotiation session.
The performance of most existing dynamic LD algorithms is measured with somewhat
stable task arrival patterns. For example, the exponential distribution is commonly used to
specify task interarrivals[3,5–8]. In such situations, single task assignments are adequate
to provide reasonably good performance.
In some situations, however, a large number of tasks arrive at the system in a relatively
short time span. Such bursty arrival patterns commonly exist for applications showing
periodicity, such as daily clearance in commercial applications and assignment submission
periods in academic environments. As another example, the amount of Internet access
(Web surfing in particular) commonly reaches its peak during lunch hours, causing a
lot of CGI programs to be executed in Web servers. When bursty task arrivals occur,
severe task congestions appear and the performance of the system degrades significantly.
Algorithms using single task assignments are too slow to react to such congestion because
many negotiation sessions are needed to offload the congested tasks to idle or lightly loaded
nodes. Besides, the large amount of negotiation activity may further deteriorate the system performance through its CPU overhead and network bandwidth consumption.
In this paper, we propose a new LD algorithm called GR.batch. This algorithm is based on batch task assignments. That is, multiple tasks can be relocated during a single sender–receiver negotiation session. The key to this algorithm is the dynamic negotiation of the amount of workload to be transferred between a sender–receiver pair. The dynamic negotiations result in the adaptive behavior of the algorithm, ensuring good performance even for systems with bursty workload. We have developed the Guarantee and Reservation Protocol (GR Protocol) to conduct the dynamic negotiations. The core of this protocol is the virtual workload model, which yields fast dispersal of congested tasks and at the same time avoids processor thrashing. Processor thrashing is the situation in which a receiver node is
flooded with too many incoming tasks transferred from busy nodes. Without proper control
mechanisms, the performance improvement achieved by workload distribution may be
cancelled out[5,7]. On the other hand, we pay special attention to the inherent performance
limitation due to static thresholds, which are commonly used in LD algorithms for workload
measurements. In brief, a small threshold is favored for lightly loaded systems, while a large
threshold is preferred for heavily loaded systems. A mechanism called begging negotiations
is proposed to rectify such performance limitations.
The rest of the paper is organized as follows. Section 2 discusses the design issues of
dynamic LD algorithms and some related work. Section 3 presents the GR.batch algorithm.
Section 4 describes the simulation model and explains how bursty workload is characterized. Simulation results and performance evaluations are also presented. Section 5 is the
conclusion.
2. BACKGROUND STUDY
In [9], Casavant and Kuhl proposed a classification scheme for LD algorithms. With static
algorithms, task assignment decisions are made a priori during compilation time and are
to remain unchanged during runtime. In contrast, dynamic algorithms use current system
load information for runtime assignments of tasks to processing nodes.
An early work on static algorithms by Stone[10] used a graph theoretic Max Flow/Min
Cut algorithm to find an assignment solution which minimizes total execution and communication costs. His work was continued by V. M. Lo in [11], where task execution
concurrency is improved by introducing interference costs to Stone’s model. Other graph-theoretic studies include [12] by Shen and Tsai and [13] by Chen and Yur. The static
assignment problem was formulated as state-space search. Heuristic-based solutions were
proposed. The queuing theoretic approach is also commonly used in static algorithms. In
[14], Ni and Hwang derived a probabilistically optimal task assignment algorithm. However, the neglect of task transfer overhead and communication delays limited the practical
value of their algorithm.
It is now commonly agreed that despite the higher runtime complexity, dynamic algorithms can potentially provide better performance than static algorithms. An important
category of dynamic algorithms under Casavant and Kuhl’s taxonomy is the dynamic
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
LD ALGORITHM FOR BURSTY WORKLOAD
3
suboptimal heuristic-based algorithms. The majority of existing dynamic LD algorithms
fall into this category[3,4,7,8,15–18]. This is by no means accidental. Due to the intrinsic unavailability of accurate and timely global state information in a distributed system,
targeting for suboptimal performance and the use of heuristic-based algorithms are quite
reasonable.
The major components of a dynamic LD algorithm are: (i) the location policy, referring to
the strategy used to pair up task senders and receivers; (ii) the information policy, referring
to the strategy used to disseminate load information among processing nodes; and (iii) the
transfer policy, referring to the strategy used to determine whether task transfer activities
are needed by a node. We discuss these three policies below.
Most location policies proposed are based on polling. Polling involves the transmission
of a short query message from a sender node (or a receiver node) to a selected target
node in an attempt to obtain a task transfer consent[3,4,6,8]. Polling is also responsible
for load information dissemination because load information of the polling node can be
either piggybacked on the polling message or be deduced from the semantics of the polling
protocol. A polling session can be started either by a heavily loaded node in an attempt
to locate a lightly loaded node, or vice versa. The polling session in the former case is
said to be sender-initiated and the latter receiver-initiated[4,6]. If both approaches are
being used simultaneously, the location policy is said to be symmetrically initiated[8].
Eager et al. showed that sender-initiated location policies do not provide consistent performance improvement over the entire range of system loads[6]. A similar phenomenon
is found for receiver-initiated policies. In particular, sender-initiated algorithms provide
better performance than receiver-initiated algorithms when the system is lightly loaded,
whereas receiver-initiated algorithms are preferred when the system is heavily loaded.
In [8], Shivaratri and Krueger proposed an adaptive symmetrically initiated location policy. With this location policy, sender-initiated pollings are being used only when the system
is lightly loaded, whereas receiver-initiated pollings are conducted whenever appropriate.
This location policy shows performance advantages over a wide range of system loads.
An important location policy design issue is to avoid processor thrashing. Processor
thrashing refers to the situation when a receiver node is flooded with too many incoming
tasks transferred from busy nodes, so that the receiver itself becomes overloaded inadvertently[7]. In [18], Shivaratri and Singhal used a simple method to prevent processor
thrashing. In this method, when a receiver agrees to accept a task from a sender, the receiver raises a ‘task-in-transit’ flag so that it will not reply positively to the pollings of other
busy nodes. The ‘task-in-transit’ flag is lowered upon the receipt of the remote task from
the original sender. The receiver can then accept another transfer request from other busy
nodes. A similar approach has been used by Mirchandaney et al. in [19]. Such methods are
unnecessarily conservative as the receiver may have large enough spare processing capacity to serve multiple busy nodes simultaneously. Furthermore, this method may establish
faulty knowledge in some busy nodes so that subsequent polling activities are directed away
from the receiver node, which may still have spare capacity after the original negotiation.
In [7], we showed that the GR Protocol is highly effective in avoiding processor thrashing.
Besides, it is less conservative as multiple simultaneous negotiation sessions are allowed.
Existing transfer policies are usually threshold-based – a node is regarded as busy (and
thus a potential task sender) if its load exceeds a certain threshold value. In contrast, a
node is regarded as lightly loaded (and thus a potential task receiver) if its load is below
the threshold[3,8,19,20]. However, the use of a single fixed threshold value leads to state
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
4
Q. LU AND S.-M. LAU
waggling. State waggling refers to the situation when a node switches between the busy
and the lightly loaded states frequently. A node with state waggling is highly unstable, so
that it conducts a substantial amount of unnecessary task transfer activities[5]. Alternative
approaches are to use two-level thresholds[7], or to measure the load difference between two
nodes to determine whether task transfer is needed between them[21]. An algorithm design
issue closely related to transfer policy is workload index, which refers to the quantitative
measurement of how busy a node is. The most commonly used workload index is the
number of tasks residing in a node.
3. THE GR.batch ALGORITHM
The GR.batch algorithm is polling-based. It adopts the symmetrically initiated location policy proposed by Shivaratri and Krueger in [8] to pair up sender–receiver partners. This location policy has been briefly discussed in Section 2. As in most polling-based LD algorithms, GR.batch employs a polling limit to control the number of retry polls that a node can make[3,4,8].∗

∗ In [7], we introduced an early version of the GR.batch algorithm, with particular emphasis on reducing processor thrashing. In that early version, a stable workload pattern was assumed and fixed double-threshold values were used for all processing nodes. Most importantly, that early version did not have the capability to resolve large amounts of task congestions, not to mention simultaneous bursty workload in multiple nodes. Besides, begging negotiations, which have a crucial impact on the performance of our new algorithm, had not been introduced at all.
3.1. The GR Protocol
The aim of the GR Protocol is to derive a reasonable batch size so that congested tasks
on sender nodes can be dispersed as fast as possible, while processor thrashing is avoided.
Incorporated into the GR Protocol are polling, reply and task transfer messages.
Each invocation of the GR Protocol is associated with two quantities, guarantee value
(gur) and reservation value (res). gur is the amount of workload that a sender is willing to
send away. res is the amount of workload that a receiver is willing to accept. The usage of
these quantities is discussed in detail in Section 3.3. We first explain the basic idea of the
GR Protocol.
• Sender side: A polling message produced by a sender contains the quantity gur.
After the polling message has been sent out, the workload of the sender is artificially
reduced by an amount equal to that of gur. By doing so, the sender pretends that the
workload is being catered for by the intended receiver. The sender virtually becomes
less busy. This method prevents overdrafting the sender node. Overdrafting occurs
if the sender keeps regarding itself as being heavily loaded, and replies positively to
too many receiver-initiated pollings.
• Receiver side: Upon receiving the sender’s polling message, the receiver derives
the value of res. Factors considered include gur from the polling message and the
receiver’s current workload. res is piggybacked on the reply message sent from
the receiver to the sender. Besides, res serves as a ‘quota’ to reserve the receiver’s
processing capacity on behalf of the sender. This is done by artificially elevating the
receiver’s workload by an amount of res. This method prevents the receiver from replying positively to too many sender-initiated pollings and becoming flooded. That is, processor thrashing is avoided.
• Sender side: Upon receiving the receiver’s reply message, the sender derives the
batch size (amount of workload to send to the receiver). Factors considered include
gur, res and the sender’s current workload. After the task batch is composed, the
workload of the sender is adjusted according to the previously reduced value (gur)
and the actual batch size.
• Receiver side: Similarly, the artificially elevated workload of the receiver is adjusted
when the task batch arrives.
As a symmetrically initiated location policy is used, a receiver can also start a polling
session. A receiver-initiated polling message carries the quantity res. The derivation of res
on the receiver side, in this case, is entirely internal to the receiver. That is, the receiver has
no gur value to consider during the process.
In order to disperse the workload of heavily loaded nodes as quickly as possible and to
fully utilize the spare capacities in lightly loaded nodes, we allow concurrent sessions of
the GR Protocol to be conducted in any single node. We define the aggregate reservation
value of node Pi , denoted by Ri , as the sum of res values under negotiation. We define the
aggregate guarantee value of Pi , denoted by Gi , as the sum of gur values under negotiation.
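To make this bookkeeping concrete, the sketch below (in Python; the class and method names are our own illustration, not part of the paper) tracks a node’s aggregate guarantee and reservation values across concurrent GR Protocol sessions and derives the virtual workload introduced in Section 3.2.

class GRNode:
    """Sketch of per-node GR Protocol bookkeeping (illustrative names)."""

    def __init__(self, queued_workload=0.0):
        self.Q = queued_workload   # workload of tasks residing on the node
        self.R = 0.0               # aggregate reservation value (sum of res under negotiation)
        self.G = 0.0               # aggregate guarantee value (sum of gur under negotiation)

    def send_poll(self, gur):
        # Sender side: pretend the guaranteed workload is already catered for by
        # the intended receiver, so the node does not over-commit ("overdraft") itself.
        self.G += gur

    def grant_reservation(self, res):
        # Receiver side: reserve capacity on behalf of the sender by artificially
        # elevating the workload, which avoids processor thrashing.
        self.R += res

    def finish_as_sender(self, gur, actual_batch_workload):
        # Replace the provisional guarantee by the workload actually sent away.
        self.G -= gur
        self.Q -= actual_batch_workload

    def finish_as_receiver(self, res, actual_batch_workload):
        # Replace the provisional reservation by the workload actually received.
        self.R -= res
        self.Q += actual_batch_workload

    def virtual_workload(self):
        # VW_i = Q_i + R_i - G_i, the workload index of Section 3.2 (equation (1)).
        return self.Q + self.R - self.G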
3.2. Workload measurement
The GR.batch algorithm uses the virtual workload, denoted by VWi, as its workload index to measure quantitatively how busy a node Pi is. VWi is defined as
\[
VW_i = Q_i + R_i - G_i \tag{1}
\]
where Qi is the total workload of the tasks residing on Pi; Ri and Gi are the aggregate reservation value and aggregate guarantee value of Pi, respectively.
In addition to considering the amount of workload residing in a node, virtual workload
also reflects the workload of those negotiation sessions in progress. This makes virtual
workload a more realistic workload measurement. Virtual workload, together with the GR
Protocol, avoids processor thrashing effectively. This method is less conservative than the
task-in-transit flag approach used in [18,19] because multiple negotiation sessions can be
conducted simultaneously.
In the GR.batch algorithm, the workload state of a node is defined in terms of its virtual workload. The definition of the three workload states is given in Table 1. Tup and Tlow are called the upper and the lower thresholds, respectively. Double thresholds are used to avoid state waggling. The values of Tup and Tlow are algorithm design parameters.

Table 1. Definitions of workload states based on virtual workload

Workload state    Meaning             Definition
L-load            Lightly loaded      VWi ≤ Tlow
N-load            Normally loaded     Tlow < VWi ≤ Tup
H-load            Heavily loaded      VWi > Tup
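A minimal sketch of the double-threshold classification in Table 1 follows; the threshold values are taken from Table 3 purely for illustration.

T_LOW = 5    # lower threshold T_low (illustrative value from Table 3)
T_UP = 50    # upper threshold T_up (GR.batch value from Table 3)

def workload_state(vw, t_low=T_LOW, t_up=T_UP):
    """Classify a node's virtual workload according to Table 1."""
    if vw <= t_low:
        return "L-load"   # lightly loaded: potential task receiver
    elif vw <= t_up:
        return "N-load"   # normally loaded
    else:
        return "H-load"   # heavily loaded: potential task sender

# A node between the two thresholds is N-load, so it neither sends nor
# receives under the basic protocol; this hysteresis avoids state waggling.
assert workload_state(20) == "N-load"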
3.3. Negotiation on batch size
In Section 3.1, the use of gur and res for task batch size negotiations is discussed briefly.
The detailed procedure is described in this section.
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
6
Q. LU AND S.-M. LAU
1. Sender side
The strategy used to derive a guarantee value gur on the sender side has a profound effect
on the performance of GR.batch. A large gur means that the sender involved can negotiate
with only a few potential receiver nodes. The amount of workload distribution would be
limited. In contrast, a small gur limits the amount of workload to be transferred in a single
task batch, and the advantages of batch assignments are lost. gur is derived using the
following rationale.
Rationale: Upon the arrival of a task batch having a total workload α, receiver Pr should not be promoted to the H-load state. That is, ¬(VWr + α ⇒ H-load). □

According to Table 1, H-load appears when a node’s virtual workload is greater than Tup. We obtain the following inequality easily:
\[
VW_r + \alpha \le T_{up} \quad\Longrightarrow\quad \alpha \le T_{up} - VW_r \tag{2}
\]
Again from Table 1, the virtual workload of a receiver node (in the L-load state) must have a value in the closed interval [0, Tlow]. Thus, the two bounding values of VWr are VWr = 0 (smallest) and VWr = Tlow (largest). Substituting into (2):
\[
\alpha \le
\begin{cases}
T_{up} & \text{if } VW_r = 0 \\
T_{up} - T_{low} & \text{if } VW_r = T_{low}
\end{cases}
\tag{3}
\]
The second case in (3) gives a tighter bound. To take a less optimistic approach, sender Ps expects that the workload that Pr will agree to receive is at most (Tup − Tlow). Hence, gur should not be greater than this value.∗ Thus, the value of gur is given by
\[
gur = T_{up} - T_{low} \tag{4}
\]
∗ Ps may predict its future workload and thus guarantee more. This is particularly useful in real-time systems or when bursty or excessive task arrivals are expected by Ps in a short time span.
2. Receiver side
When the guarantee value gur arrives, Pr derives the corresponding reservation value res. By neglecting local task arrivals, Pr tries to accept the largest possible batch size. The following equation is thus obtained from inequality (2):
\[
\alpha = T_{up} - VW_r \tag{5}
\]
Obviously, Pr should not reserve its processing capacity by an amount greater than what Ps wants to offload. We therefore have
\[
res = \min
\begin{cases}
\alpha = T_{up} - VW_r \\
gur = T_{up} - T_{low}
\end{cases}
\tag{6}
\]
Since the largest value of VWr for Pr is Tlow, the smallest value of α is (Tup − Tlow). Thus, gur is always at least as stringent as α, and (6) can be simplified to res = gur = Tup − Tlow. For receiver-initiated pollings, we take res = (Tup − VWr) because gur is simply unavailable. Thus,
\[
res =
\begin{cases}
T_{up} - T_{low} & \text{for sender-initiated pollings} \\
T_{up} - VW_r & \text{for receiver-initiated pollings}
\end{cases}
\tag{7}
\]
3. Sender side
In the next step, Ps determines the largest amount of workload to be transferred to Pr. This amount is called the maximally allowed batch size and is denoted by θ. θ is derived by using the following rationale.

Rationale: After removing a task batch having a total workload θ, sender Ps should not be lowered to the L-load state. That is, ¬(VWs − θ ⇒ L-load). □

Again from Table 1, L-load appears when a node’s virtual workload is smaller than or equal to Tlow. We obtain the following inequality easily:
\[
VW_s - \theta > T_{low} \quad\Longrightarrow\quad \theta < VW_s - T_{low} \tag{8}
\]
Finally, we have the following two additional rationales to control θ:

Rationale: After transferring a task batch having a total workload θ from Ps to Pr, Ps should not have a lower virtual workload than that of Pr. □

This is expressed as below:
\[
VW_s - \theta \ge VW_r' + \theta \quad\Longrightarrow\quad \theta \le \frac{VW_s - VW_r'}{2} \tag{9}
\]
where VWr′ is the virtual workload of Pr as piggybacked on the reply message from Pr to Ps in response to Ps’s polling.
Rationale: The expected total queuing time of a task batch having workload θ, if executed on Pr, should be less than that when executed on Ps. □

If the task batch is executed in Ps, the expected queuing time of the first task in the task batch is VWs − θ, assuming a simple M/M/1 model[22];∗ that of the second task is VWs − θ + S1, where S1 is the time needed for Ps to finish the first task in the task batch; and that of the ith task is VWs − θ + S1 + S2 + · · · + Si−1. Assuming that the mean workload imposed by the tasks is S, the total queuing time of the task batch is
\[
\sum_{i=1}^{b} \bigl[ VW_s - \theta + (i-1)S \bigr] \tag{10}
\]
where b = θ/S is the expected number of tasks in the task batch. On the other hand, if the task batch is executed in Pr, the expected queuing time of the first task is COST(θ) + VWr′, where COST(θ) denotes the delay experienced by the task batch due to CPU overhead and communication delays. We deliberately neglect the detailed specification of COST() so that different cost models can be applied. The expected queuing time of the second task when executed in Pr is COST(θ) + VWr′ + S1; and that of the ith task is COST(θ) + VWr′ + S1 + S2 + · · · + Si−1. The total queuing time of the b tasks is thus
\[
\sum_{i=1}^{b} \bigl[ COST(\theta) + VW_r' + (i-1)S \bigr] \tag{11}
\]
∗ So far, we have intentionally used ‘workload’ as a generic term without giving a precise definition. In subsequent discussions, workload refers to CPU time needed.
Following the last rationale and using (10) and (11), we have
\[
\sum_{i=1}^{b} \bigl[ COST(\theta) + VW_r' + (i-1)S \bigr]
\;<\;
\sum_{i=1}^{b} \bigl[ VW_s - \theta + (i-1)S \bigr]
\quad\Longrightarrow\quad
\theta + COST(\theta) < VW_s - VW_r' \tag{12}
\]
With high enough available network bandwidth so that the transfer cost is reasonably small, inequality (12) can be simplified to θ < VWs − VWr′, which is less constraining than (9).

Since Ps should not transfer more workload than the receiver has agreed to accept, we have another constraint on θ:
\[
\theta \le res \tag{13}
\]
Combining inequalities (8), (9), (12) and (13),
\[
\theta = \min
\begin{cases}
VW_s - T_{low} \\[2pt]
\dfrac{VW_s - VW_r'}{2} \\[2pt]
VW_s - VW_r' - COST(\theta) \\[2pt]
res
\end{cases}
\tag{14}
\]
From (7), the largest possible value of res is Tup (receiver-initiated negotiation with VWr = 0). It follows from (14) that the largest θ is Tup. In other words, Tup is the upper bound on a batch size.

After the maximally allowed batch size θ has been determined, Ps has to select the tasks to be included in the final task batch according to a task batch composition (TBC) scheme. This scheme can be formulated according to different performance objectives[23]. In this paper, we use a First-In-First-Select policy for task batch composition. More detailed discussions on other TBC schemes can be found in [23].
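The whole negotiation can be condensed into a few functions. The sketch below is our own paraphrase of equations (4), (7) and (14); the constant cost argument merely stands in for COST(θ), whose exact form the paper deliberately leaves open.

def guarantee(t_up, t_low):
    # Equation (4): the sender guarantees at most T_up - T_low.
    return t_up - t_low

def reservation(t_up, t_low, vw_r, sender_initiated=True):
    # Equation (7): reservation granted by the receiver.
    return t_up - t_low if sender_initiated else t_up - vw_r

def max_batch_size(vw_s, vw_r, res, t_low, cost=0.0):
    # Equation (14): the batch must not push the sender into L-load, must not
    # leave the sender less loaded than the receiver, must pay for its transfer
    # cost, and must not exceed the receiver's reservation.
    return min(vw_s - t_low,
               (vw_s - vw_r) / 2.0,
               vw_s - vw_r - cost,
               res)

# Example: heavily loaded sender (VW_s = 80) and lightly loaded receiver
# (VW_r' = 3), with T_low = 5, T_up = 50 and negligible transfer cost.
gur = guarantee(50, 5)                 # 45
res = reservation(50, 5, 3)            # 45 (sender-initiated session)
theta = max_batch_size(80, 3, res, 5)  # min(75, 38.5, 77, 45) = 38.5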
3.4. Begging negotiations
From the above discussions, we know that the upper threshold Tup bounds the batch size.
For systems subject to bursty workload, a large Tup ensures better system performance by
allowing large task batches. However, this is only part of the story.
3.4.1. Background
Study [24] showed that the value of Tup affects the performance of an LD algorithm
differently across the spectrum of the system workload, as illustrated in Figure 1. We can
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
9
LD ALGORITHM FOR BURSTY WORKLOAD
Mean
Task
Response
Time
Large T up
Small Tup
System workload
Figure 1.
Effect of Tup on performance of LD algorithms across system workload spectrum
see that, with a lightly loaded system, a small Tup provides better performance. In contrast,
a large Tup is better for heavily loaded systems.
When the system is lightly loaded, most nodes contain only a few tasks. If Tup is large,
it is difficult for those relatively busy nodes to leave the N-load state and reach the H-load
state. In other words, only a few nodes are eligible to act as task senders. Thus workload
distribution is very rare. The spare processing capacity cannot be used to further improve
mean task response time and is wasted. On the contrary, when Tup is small, there are more
sender nodes. Thus, workload distribution and performance improvement are more likely.
This explains why a lightly loaded system is in favor of a small Tup .
When the system is heavily loaded, the probability that a node having spare processing
capacity (i.e. a potential task receiver) exists in the system is relatively small. In this
situation, it is difficult for sender-initiated pollings to be successful. The use of a large Tup
helps to suppress the number of nodes in the H-load state, thus reducing the amount of
sender-initiated pollings. Consequently, unnecessary CPU overhead is avoided. Only those
nodes with substantially heavy workload are eligible to act as a task sender. This explains
why a heavily loaded system is in favor of a large Tup .
We choose a large Tup to ensure good system performance for both heavily loaded
systems and systems subject to bursty workload. We rectify Tup ’s conflicting behavior by
developing begging negotiations. Begging negotiations are used when the system is lightly
loaded, in which case receiver-initiated pollings dominate.
3.4.2. Begging negotiations
The use of begging negotiations is based on the following rationale.
Rationale: If a polling receiver fails to identify a heavily loaded node, it begs other normally
loaded or even lightly loaded nodes to feed it with some workload. By doing so, begging
negotiations encourage workload distribution which otherwise is impossible. 2
Suppose a lightly loaded node Pr attempts to locate a heavily loaded sender Ps . Pr
invokes the location policy (receiver-initiated component) to identify Ps for sending the
polling message. Pr first selects a node in the H-load state by studying its own local
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
10
Q. LU AND S.-M. LAU
Table 2. Properties of the three LD algorithms evaluated
Transfer
mode
Begging
negotiation
Based on virtual
workload, V Wi
Batch
Used
as above
Based on the number
of application tasks
residing in a node
as above
Single
Not used
Not used
Algorithm
Location policy
Workload index
GR.batch
Shivaratri and Krueger’s
location policy is modified to
incorporate the GR Protocol
as above
Original Shivaratri and
Krueger location policy
GR.batch.nb
SK.single
workload state table (refer to [8] for the maintenance of this table by the location policy).
If this fails, a node in the N-load state is then picked. If this still fails, a node in the L-load
state is picked. If either one of the latter two cases happens, the system is considered lightly
loaded and begging negotiation starts.
When in a begging negotiation, an additional begging flag is piggybacked to each
receiver-initiated polling message. Upon receiving a polling message, Ps examines its
current workload state. Normally, Ps is eligible to offload its workload only if it is in
the H-load state. However, if the begging flag is set, Ps is allowed to behave as a task
sender even if it is in the N-load state or in the L-load state. In doing so, the batch size
negotiation procedure discussed in Section 3.3 still holds. This ensures that task transfers
are conducted only when it is beneficial to do so. Note that equation (14), which governs
the maximally allowed batch size θ, does not depend on the value of Tup . Thus the use of
begging negotiations and a large Tup does not affect the validity of the way θ is derived.
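A minimal sketch of the receiver-initiated target selection with begging is given below, assuming (as in [8]) that each node keeps a local table of the last known workload states of its peers; the helper name and data layout are our own.

def pick_poll_target(state_table):
    """Pick a polling target for a receiver-initiated session.

    state_table maps peer node ids to their last known workload states
    ('H-load', 'N-load' or 'L-load').  Returns (target, begging_flag).
    """
    for wanted_state, begging in (("H-load", False), ("N-load", True), ("L-load", True)):
        candidates = [node for node, state in state_table.items() if state == wanted_state]
        if candidates:
            # If no H-load node is known, the system is considered lightly loaded
            # and the begging flag is piggybacked on the polling message.
            return candidates[0], begging
    return None, False

# A polled node that would normally decline (N-load or even L-load) may act as
# a sender when the begging flag is set; the batch-size negotiation of
# Section 3.3 still bounds what it actually transfers.
target, beg = pick_poll_target({"P1": "N-load", "P2": "L-load"})  # -> ("P1", True)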
4. SIMULATION EXPERIMENTS AND PERFORMANCE EVALUATIONS
The performance of the GR.batch algorithm is evaluated using simulation experiments.
To provide comparative results, two other dynamic LD algorithms are also studied. These
two algorithms are labeled GR.batch.nb and SK.single:
• GR.batch.nb: This algorithm is identical to GR.batch except that begging negotiations
are not used. Comparing this algorithm with GR.batch reveals the significance of
using begging negotiations.
• SK.single: This algorithm uses single task assignments. It employs Shivaratri and
Krueger’s location policy. It is a highly successful LD algorithm when the system is
subject to ordinary workload fluctuations[8].
Table 2 summarizes the properties of the three algorithms evaluated.

Table 2. Properties of the three LD algorithms evaluated

GR.batch – Location policy: Shivaratri and Krueger’s location policy, modified to incorporate the GR Protocol; Workload index: based on virtual workload, VWi; Transfer mode: batch; Begging negotiation: used.
GR.batch.nb – Location policy: as above; Workload index: as above; Transfer mode: batch; Begging negotiation: not used.
SK.single – Location policy: original Shivaratri and Krueger location policy; Workload index: based on the number of application tasks residing in a node; Transfer mode: single; Begging negotiation: not used.
4.1. Processing node model
We assume an asynchronous distributed system with no shared memory support. Message
passing is the only communication mechanism available between processing nodes. We
also assume that the network is fault-free and strongly connected. Messages are received
in the order sent.
The processing node model is shown in Figure 2. Each node consists of three task queues:
Local Queue, Remote Queue and Preemption Queue. When a task arrives locally to a
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
11
LD ALGORITHM FOR BURSTY WORKLOAD
Local Queue
Local task arrivals
FCFS
Tasks assigned to remote nodes
CPU
Remote Nodes
Negotiation /
Task Transfer
Messages
Task completions
Remote Queue
Load Distribution
Algorithm
FCFS
Tasks assigned
from
remote nodes
Preemption
LD algorithm
overhead tasks
Preemption Queue
FCFS
FCFS First-Come-First-Serve
Figure 2. A processing node
node, it joins the Local Queue at the tail and waits to be served by the CPU in FIFO
discipline. Tasks in the Local Queue are candidates for remote transfer when the node
becomes busy. A LD algorithm can select task(s) in the Local Queue for remote transfer in
any arbitrary order it finds appropriate. For simplicity, we use the FIFO discipline in this
paper.
Remotely assigned tasks join the Remote Queue of the receiver node at the tail. From
there, they are also served by the CPU in the FIFO discipline. Tasks in the Local Queue have higher priority than tasks in the Remote Queue for the CPU; this prevents local performance from being sacrificed. The Remote Queue separates remotely transferred tasks from the local tasks in the Local Queue, so that task reassignments are not allowed. This prevents a task from being continuously shuffled around the system without making actual progress.
While many studies on dynamic LD algorithms failed to account for CPU overhead
generated by LD algorithms themselves, we perceive that modeling such overhead at
the appropriate level of complexity is essential[21]. Therefore, the Pre-emption Queue is
included to allow ‘overhead tasks’ created by LD algorithms to wait for the CPU in FCFS
discipline. As overhead tasks are usually small and time critical, they always pre-empt the
application task in execution (if there is one), regardless of whether that application task has
been dispatched from the Local Queue or the Remote Queue. The pre-empted application
task can resume its execution only when all pending overhead tasks have been completed,
including those generated during the course of its suspension.
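The dispatching rule described above can be sketched as follows (our own simplified illustration, not the authors’ simulator): overhead tasks pre-empt application work, local tasks take priority over remotely received ones, and each queue is served First-Come-First-Serve.

from collections import deque

class ProcessingNode:
    """Simplified three-queue node of Figure 2 (illustration only)."""

    def __init__(self):
        self.preemption_queue = deque()  # LD algorithm overhead tasks
        self.local_queue = deque()       # locally arriving application tasks
        self.remote_queue = deque()      # tasks assigned from remote nodes

    def next_task(self):
        # Overhead tasks are small and time critical: they run first and
        # pre-empt any application task in execution.
        if self.preemption_queue:
            return self.preemption_queue.popleft()
        # Local tasks have priority over remotely transferred tasks so that
        # local performance is not sacrificed.
        if self.local_queue:
            return self.local_queue.popleft()
        if self.remote_queue:
            return self.remote_queue.popleft()
        return None  # node is idle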
4.2. Workload characterization
We characterize two kinds of workload for each node: background workload and bursty
workload.
4.2.1. Background workload
Background workload exists regardless of the existence of bursty workload. It represents
ordinary utilization of processing nodes, in contrast to the transient property of bursty workload. We use independent exponential distributions with mean 1/λi to model background task inter-arrivals. The mean arrival rates of the n nodes in the system (i.e. λ1, λ2, . . . , λn) themselves follow a lognormal distribution specified by (λBG, σBG), where λBG and σBG are
the mean and standard deviation, respectively. This distribution reflects that some nodes
are subject to higher background workload than others. The standard deviation, σBG , is a
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
12
Q. LU AND S.-M. LAU
Figure 3.
Bursty task arrival patterns having the same A and T but different σ
quantitative measurement of the workload imbalance, referred to as the imbalance factor.
The total CPU time needed by a task is called its service demand. Service demands in a
node Pi have an independent exponential distribution with mean Si . In our simulations, we
assume that Si = 1, ∀i ∈ 1 . . . n. This allows task response times to be reported in units of
mean task service demand.
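A sketch of how this background workload could be generated (our illustration; the paper gives no generator code): per-node mean arrival rates are drawn from a lognormal distribution whose mean and standard deviation are (λBG, σBG), and inter-arrival times and service demands are exponential.

import math
import random

def background_arrival_rates(n, mean_rate=0.50, std_dev=0.01):
    """Per-node mean arrival rates drawn from a lognormal distribution whose
    mean and standard deviation are (lambda_BG, sigma_BG)."""
    # Convert the desired mean/std of the lognormal into the parameters of
    # the underlying normal distribution.
    sigma2 = math.log(1.0 + (std_dev / mean_rate) ** 2)
    mu = math.log(mean_rate) - 0.5 * sigma2
    return [random.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(n)]

def next_interarrival(rate):
    # Background task inter-arrivals are exponential with mean 1/lambda_i.
    return random.expovariate(rate)

def service_demand(mean_demand=1.0):
    # Service demands are exponential with mean S_i (S_i = 1 in the simulations).
    return random.expovariate(1.0 / mean_demand)

rates = background_arrival_rates(n=30)  # one rate per node, nearly equal here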
4.2.2. Bursty workload
Bursty workload is modeled by a task arrival pattern which follows a time series of the Gaussian distribution:
\[
\Lambda(t) = A \sum_{m=0}^{\infty} e^{-\frac{1}{2}\left(\frac{t - mT}{\sigma}\right)^2} \tag{15}
\]
where Λ(t) is the task arrival rate at time t; A is the peak arrival rate, referred to as the burst amplitude; T is the time gap between successive task bursts (and thus 1/T is called the burst frequency); and σ is the burst duration which controls the dispersiveness of the Gaussian distribution.

Figure 3. Bursty task arrival patterns having the same A and T but different σ
The intuitive meanings of these parameters are illustrated in Figure 3, where burst arrival
patterns of different σ’s are shown. A small σ value causes a sudden sharp increase in the
task arrival rate, while a large σ causes a more bell-shaped pattern. Note that with the same
A and T , a large σ results in more total task arrivals than a small σ does over the same time
span. By increasing either the burst duration or the burst frequency or both, successive task
bursts merge gradually until a prolonged elevation of workload is formed.
The purpose of using this bursty workload model is to enable systematic and comprehensive evaluations of the algorithms. The applicability and correctness of the GR.batch algorithm are not in any way hindered by this model. The periodicity, in particular, is not a necessary requirement.
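The arrival-rate profile of equation (15) is straightforward to evaluate. The sketch below (our illustration, truncating the infinite sum to bursts within a finite horizon) uses the parameter values of the first set of experiments.

import math

def bursty_arrival_rate(t, A=50.0, T=500.0, sigma=1.0, horizon=2000.0):
    """Equation (15): Gaussian task bursts of amplitude A every T time units."""
    rate = 0.0
    m = 0
    while m * T <= horizon:   # truncate the infinite sum to a finite horizon
        rate += A * math.exp(-0.5 * ((t - m * T) / sigma) ** 2)
        m += 1
    return rate

# Near a burst centre (t = 500) the rate approaches A; midway between bursts
# it is essentially zero, leaving only the background workload.
peak = bursty_arrival_rate(500.0)    # ~50 tasks per unit time
quiet = bursty_arrival_rate(250.0)   # ~0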
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
13
LD ALGORITHM FOR BURSTY WORKLOAD
Table 3. Simulation parameters

Lower threshold, Tlow: 5
Upper threshold, Tup: 15 (SK.single), 50 (GR.batch and GR.batch.nb)
Polling limit, PL: ⌈n/3⌉
Number of nodes, n: 30
Mean background task arrival rate, λBG: 0.50
Std. dev. of background task arrival rates, σBG: 0.01
Mean service demand, Si, ∀i ∈ 1 . . . n: 1
Channel bandwidth: 10 Mbits/s
CPU overhead for each polling operation: 3 ms
Negotiation message size: 40 bytes
CPU overhead per Kbyte of code length, ∆: 3 ms per Kbyte
Mean code length, l̄i = l̄j, for all 1 ≤ i, j ≤ n: 10 Kbytes
4.3. Algorithm costs and simulation parameters
Many studies on dynamic LD algorithms assumed that the execution and communication
overhead due to load distribution activities is negligible. As Kremien and Kramer[21] pointed out, this assumption does not reflect runtime situations realistically. In our simulation experiments, both CPU overhead and network bandwidth consumption due to load
distribution activities (negotiations and task transfers) are always accounted for.
Negotiation message overhead: A token-ring network having a bandwidth of 10 Mbit/s is assumed.∗ Sending or receiving a negotiation message (polling and reply) incurs a CPU overhead of 3 ms. All negotiation messages are assumed to have a fixed size of 40 bytes.

∗ As we are interested in the relative performance of the algorithms, we use a token-ring network for its simplicity. In fact, the use of different network models does not have significant observable effect on the comparative results, unless the network bandwidth is severely limited.
Task transfer overhead: For task transfers, the amount of CPU overhead relates to the code length of the task(s) to be transferred and is modeled as follows:
\[
\text{Task transfer CPU overhead} = \Delta \cdot \sum_{j \,\in\, \text{task batch}} l_j
\]
where ∆ is the CPU overhead imposed per Kbyte task code to be transferred; lj is the code length of task j in Kbytes. In the case of single task assignments, the above model degenerates into ∆ · lj, where j is the task to be transferred. In the simulation experiments, a typical value of ∆ is 3 ms per Kbyte. Thus, for a task batch having a total code length of 15 Kbytes, the amount of CPU overhead is 45 ms. Within a node Pi, lj has an independent exponential distribution with mean l̄i. This distribution models the variations in task code lengths. For simplicity, we assume that all nodes share the same mean code length, that is, l̄i = l̄j, for all 1 ≤ i, j ≤ n.
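A small sketch of this cost model, with the constants taken from Table 3 (our own illustration):

import random

DELTA = 3.0              # CPU overhead in ms per Kbyte of code transferred
MEAN_CODE_LENGTH = 10.0  # mean task code length in Kbytes (Table 3)

def task_code_length(mean_length=MEAN_CODE_LENGTH):
    # Code lengths are exponentially distributed with mean l_bar.
    return random.expovariate(1.0 / mean_length)

def transfer_cpu_overhead(batch_code_lengths, delta=DELTA):
    # Task transfer CPU overhead = delta * sum of code lengths in the batch.
    return delta * sum(batch_code_lengths)

# Example from the text: a batch with 15 Kbytes of code costs 45 ms of CPU time.
assert transfer_cpu_overhead([15.0]) == 45.0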
Simulation parameters: The parameters used in the simulations are summarized in Table 3.
Note that the polling limit P L relates to the number of nodes in the system. Obviously,
when there are more nodes, more polling sessions are needed before a node can locate
an appropriate transfer partner successfully. One-third of the total number of nodes is a
reasonable value based on experience from previous studies[3,6,8,19]. A system size of 30 nodes
is large enough for meaningful evaluations while the time needed for running simulation
experiments can be kept reasonable.
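For illustration, the polling limit used here works out as follows (a minimal sketch):

import math

def polling_limit(n):
    # Polling limit PL: one third of the nodes, rounded up.
    return math.ceil(n / 3)

print(polling_limit(30))  # 10 retry polls per negotiation session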
The mean background task arrival rate, λBG , equals 0.5 task per unit time. Since the
mean service demand is 1, λBG = 0.5 results in a system workload of 50%. Such a
medium background workload provides representative performance results so that any
performance anomalies due to extremely light or extremely heavy workload can be avoided.
The imbalance factor, σBG , is deliberately set to a small value, 0.01, so that the nodes are
virtually subject to the same background workload. This allows us to focus on a bursty
workload.
For a bursty workload, we select the following parameter values: burst amplitude A = 50,
burst period T = 500 and burst duration σ = 1. This combination avoids overlapping
successive task bursts (see Figure 4(a)), and yet the burst amplitude is large enough to
study the algorithm’s ability to resolve congestions due to bursty workload. The large burst
period also makes the time between successive task bursts long enough to reveal a system
where only background workload exists (between successive bursts).
4.4. Simulation experiments I
To extensively evaluate the performance of the three LD algorithms, we have run numerous
simulations with different parameters. We select one set of representative results here for
detailed analysis. Other results are summarized in the next subsection. Among the 30
nodes in the system, only one is subject to bursty task arrivals. This reveals clearly the
ability of the algorithms in resolving task congestions, without worrying about the mutual
effect of the bursty nodes on each other. The node subject to bursty arrivals is referred to
as the bursty node. In the simulations, we created two task bursts at times 325 and 825,
respectively.
Figure 4 shows the performance comparisons between the three algorithms. Figures 4(a)
to (c) give the statistical results collected from the bursty node alone. Figures 4(d) to (f) are
statistics collected from the whole system, including the bursty node.
4.4.1. Performance at the bursty node
Resolving task congestions: Figure 4(a) shows the instantaneous task queue length (Local
Queue plus Remote Queue) of the bursty node. The two task bursts are reflected as two
sudden elevations in the queue length. In both task bursts, the longest queue length attained
by GR.batch is a couple of times smaller than that of the other two algorithms, and so is
the time needed for the congested tasks to be resolved. For example, the longest queue
length in the case of GR.batch in the second task burst is around 44% and 27% of that of
SK.single and GR.batch.nb, respectively. Less than 50 time units are needed to lower the
queue length with GR.batch, while 90 and 125 time units are needed with SK.single and
GR.batch.nb, respectively. In other words, the time needed for GR.batch to resolve task
congestions is around 56% and 40% of that of SK.single and GR.batch.nb, respectively.
Since the total processing capacity of the bursty node is a constant, regardless of the
LD algorithm running on it, the only reason for GR.batch’s higher ability in resolving task
congestions is that the algorithm can offload the congested tasks from the bursty node to
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
15
Instantaneous Queue Length
LD ALGORITHM FOR BURSTY WORKLOAD
140
120
GR.batch
GR.batch.nb
SK.single
100
80
60
40
20
0
300
400
500
600
700
800
900
1000
Time
12
GR.batch
GR.batch.nb
SK.single
10
Average Queue Length
Average Task Response Time
(a)
14
8
6
4
2
0
0
200
400
600
800
Time
1000
1200
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
1400
GR.batch
GR.batch.nb
SK.single
0
200
400
1200
1400
4.5
4
5
4
3
2
GR.batch
GR.batch.nb
SK.single
1
3.5
3
2.5
2
1.5
GR.batch
GR.batch.nb
SK.single
1
0.5
0
0
0
200
400
600
800
Time
1000
1200
1400
0
200
(d)
Instantaneous Queue Length
1000
(c)
6
Resp. Time Std. Dev.
Average Task Response Time
(b)
600
800
Time
400
600
800
Time
1000
1200
1400
(e)
7
6
GR.batch
GR.batch.nb
SK.single
5
4
3
2
1
0
300
400
500
600
700
800
900
1000
Time
(f)
Figure 4. Performance comparisons between the three LD algorithms under bursty workload:
A = 50, T = 500, σ = 1 (other simulation parameters are shown in Table 3): (a) bursty node –
instantaneous queue length; (b) bursty node – average task response time; (c) bursty node – average
queue length; (d) system wide – average task response time; (e) system wide – task response time std.
dev.; (f) system wide – instantaneous queue length
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
16
Q. LU AND S.-M. LAU
other nodes more efficiently. This gives GR.batch the highest percentage of tasks remotely
executed among the three algorithms – GR.batch’s value in this metric is 5.2 and 6.5 times
those of SK.single and GR.batch.nb, respectively.
Improved task response times: Figure 4(b) shows the variations of the average task response
time at the bursty node. Note that GR.batch always provides the best task response time.
Before the first task burst, the three algorithms provide almost identical performance. This
is because, during this period, there are only a few LD activities, so that the impact due
to the algorithm’s performance deviations is negligible. In fact, the system during this
period is only subject to background workload, which is characterized by the distribution
(λBG = 0.5, σBG = 0.01). By assuming the simple M/M/1 model, the theoretical mean
1
= 2. The slightly better mean task response time obtained
task response time is 1−λ
BG
from the simulations (as shown in Figure 4(b)) can be attributed to the LD activities. The
simulation results agree with the theoretical value roughly.
The sudden elevation in the mean task response time after a task burst is due to the
prolonged queueing time at the bursty node. This is reflected by the corresponding sudden
elevations in the (accumulated) average task queue lengths at the bursty node, as shown in
Figure 4(c). We further observe that, in the case of GR.batch, the task bursts only have a
very slight effect on the average queue length, in contrast to the other two algorithms. The
relatively stable average queue length provided by GR.batch is attributed to its ability to
initiate a larger amount of remote executions in a shorter time span, as discussed earlier.
This ability resolves the task congestions at the bursty node more quickly. The shorter
average queue length, and thus the shorter queueing time, are the primary reasons for
GR.batch’s performance advantage over the other two algorithms.
Note that, when comparing GR.batch and SK.single, there is a divergence between the
magnitude of improvement over mean task response time and the magnitude of reduction
in average queue length. The reduction in the queue length (by at least 35%) only results
in a slight reduction in the average task response time. This divergence is explained by
the fact that a larger CPU overhead is required by GR.batch to handle the large amount of
remote task transfers to resolve the congestions. Despite this divergence, GR.batch does
provide an improvement of over 30% to the system-wide task response time. This is further
discussed in subsection 4.4.3.
4.4.2. Evaluation of begging negotiations
We observe from Figure 4(b) that the mean task response time in the case of GR.batch.nb
is much higher than that of the other two algorithms. For example, after the second task
burst, the mean task response time of GR.batch.nb is 2.5 times that of GR.batch (12.5 vs.
5.1). This is due to GR.batch.nb’s relatively poor ability in initiating remote executions,
so that its task queue length is significantly higher than that of the other two algorithms,
as shown in Figures 4(a) and 4(c). A close examination of Figure 4(a) reveals that the
highest instantaneous queue length reached by GR.batch.nb is 3.8 times that of GR.batch
(126 vs. 33). GR.batch.nb can reach such a long queue length because the bursty node is
not regarded as a potential sender until the queue length is longer than the large Tup (value
50). In other words, GR.batch.nb is too slow to react to bursty arrivals, resulting in more
accumulated tasks and thus higher queue length.
Note also that, once the queue length of GR.batch.nb drops below 40, it decreases much
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
17
LD ALGORITHM FOR BURSTY WORKLOAD
more slowly than before. For example, in the second task burst, GR.batch.nb needs only 5
time units (time 824 to 829) to reduce the queue length from 126 to 38, approximately 18
tasks per unit time. However, it takes a further 89 time units (until time 918) to reduce the
queue length to 9, approximately 0.3 tasks per unit time. This property is due to the lack of
begging negotiations, so that the accumulated tasks in the bursty node are processed locally
after the queue length drops below Tup .
Since the only difference between GR.batch and GR.batch.nb is that the latter does not
use begging negotiations, we can conclude that begging negotiations are critical to the
success of the GR.batch algorithm. To summarize, a large Tup ensures efficient resolution
of task congestions due to bursty workload, while begging negotiations compensate for the
performance drawback due to the use of a large Tup .
4.4.3. Overall system performance evaluation
Figure 4(d) shows the variations of the system wide average task response time. It can be
seen that the average task response time in the case of GR.batch is around 2, while that
of SK.single is around 3. In other words, in terms of mean task response time, GR.batch
provides a performance improvement of around 30% over SK.single. This is also reflected
in Figure 4(f), where GR.batch consistently provides the lowest instantaneous queue length
throughout the entire simulation period. Since the total available processing capacity of the
entire system is identical regardless of the LD algorithms used, we can further infer that the
system’s processing capacity has been more fully utilized when GR.batch is employed. In
other words, the spare processing capacity of those lightly loaded nodes is better utilized.
This is made possible by GR.batch’s better ability to offload surplus workload from the
bursty node to the potential task receivers.
Another advantage of GR.batch is that it provides lower response time standard deviation
than the other two algorithms, as shown in Figure 4(e). In other words, GR.batch provides
better predictability on the task response time. Obviously, the major factor contributing to
this property is that the surplus workload in the bursty node is offloaded to other nodes and thus processed more quickly, so that the tasks’ waiting time (the time they wait before being processed) is reduced and evened out.
4.5. Simulation experiments II
In the second set of simulation experiments, we focus on the effect of different bursty workload parameters (A, T, σ) on the performance of the LD algorithms. To assist in visualizing the performance comparisons, we define the improvement factor as the percentage improvement of mean task response time given by GR.batch over that of SK.single. (The GR.batch.nb algorithm is no longer of interest to us.) That is,
\[
\text{Improvement Factor} = \frac{RT(\text{SK.single}) - RT(\text{GR.batch})}{RT(\text{SK.single})} \times 100\%
\]
A positive improvement factor denotes a performance improvement, while a negative value implies a degradation.
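The improvement factor is simply a relative difference of the two measured mean response times; a minimal sketch (the example numbers echo the roughly 3 versus 2 time units reported in Section 4.4.3):

def improvement_factor(rt_sk_single, rt_gr_batch):
    """Percentage improvement of GR.batch over SK.single in mean task response time."""
    return (rt_sk_single - rt_gr_batch) / rt_sk_single * 100.0

print(improvement_factor(3.0, 2.0))  # ~33%, consistent with Section 4.4.3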
We have run numerous simulation experiments with different combinations of bursty
workload parameters. Figure 5 depicts a summary of the simulation results. A projection
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
18
Figure 5.
Q. LU AND S.-M. LAU
Improvement of mean task response time by GR.batch over SK.single with different bursty
workload parameters
line corresponding to an improvement factor of zero is drawn for each diagram. Those data
points ‘above’ this projection line correspond to situations where GR.batch performs better
than SK.single. Those points ‘below’ the projection line correspond to situations where
GR.batch performs worse than SK.single.
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
LD ALGORITHM FOR BURSTY WORKLOAD
19
Figure 5(a) with burst duration 0.1 shows a rather flattened surface with an improvement
factor of around 30%, except with very small burst periods. Within this flattened area,
GR.batch is insensitive to variations of the bursty workload parameters. Figures 5(a) to (d)
show the performance of GR.batch under progressively increasing burst durations from 0.1
to 5. It is obvious that the smaller the burst duration, the larger the number of points ‘above’
the projection line. In other words, GR.batch is particularly suitable for bursty workload
with small burst durations.
We further observe from Figures 5(a) to (d) that, as the burst duration increases, the flattened area in Figure 5(a) deforms into a curved area. With higher burst durations, GR.batch’s
improvement factor drops towards increasing burst amplitude and towards decreasing burst
period, i.e. towards increasing bursty workload. When the bursty workload is heavy enough
so that the capacity of batch assignments is exceeded, the higher CPU overhead incurred
by GR.batch (due to the more complex negotiation protocols and the higher percentage
of remote executions) cannot be justified. In general, however, GR.batch does provide
significant performance improvement over most bursty workload situations.
5. CONCLUSION
An adaptive load distribution algorithm named GR.batch is proposed for distributed systems subject to bursty workload. The key to this new algorithm is the use of batch task assignments, i.e. each sender–receiver negotiation session allows a number of tasks to be relocated. This is in contrast to the commonly used single task assignments. The carefully designed GR Protocol ensures dynamic negotiations on appropriate batch sizes, so that task congestions and workload imbalance can be resolved quickly. As a result, task response times
can be substantially reduced. An additional advantage of the GR Protocol is that processor
thrashing can be avoided.
We also discussed the inherent performance limitations of static threshold values used
for workload state measurement. In brief, a small threshold is favored for lightly loaded
systems, while a large threshold is preferred for heavily loaded systems. The use of begging
negotiations is proposed to rectify such performance limitations. Begging negotiations
allow the use of a large threshold value without sacrificing the performance of a lightly
loaded system. The use of a large threshold, in the meantime, ensures that batch assignments
effectively resolve task congestions when the system is subject to bursty workload.
The GR.batch algorithm has been extensively evaluated using simulations. We have
found that the algorithm provides performance improvement of around 30% (over
SK.single) in most cases, in terms of mean task response time. The key is that GR.batch
can resolve task congestions at bursty nodes more efficiently in a shorter time span with
fewer negotiation sessions.
REFERENCES
1. J. R. Lyle and C. Lu, ‘Load balancing from a UNIX shell’, Proc. 13th Conf. Local Computer
Networks, Oct. 1988, pp. 181–183.
2. M. M. Theimer and K. A. Lantz, ‘Finding idle machines in a workstation-based distributed
system’, IEEE Trans. Softw. Eng., 15(11), (1989).
3. D. L. Eager and E. D. Lazowska, ‘Adaptive load sharing in homogeneous distributed systems’,
IEEE Trans. Softw. Eng., 12(5), (1986).
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)
20
Q. LU AND S.-M. LAU
4. Y. T. Wang and R. J. T. Morris, ‘Load sharing in distributed systems’, IEEE Trans. Comput.,
34(3), (1985).
5. I. Ahmad, A. Ghafoor and K. Mehrotra, ‘Performance prediction of distributed load balancing
on multicomputer systems’, Proc. Supercomputing ’91, 1991, pp. 830–839.
6. D. L. Eager and E. D. Lazowska, ‘A comparison of receiver-initiated and sender-initiated
adaptive load sharing’, Perform. Eval., 6, 53–68 (1986).
7. C. Lu and S.-M. Lau, ‘An adaptive algorithm for resolving processor thrashing in load distribution’, Concurrency: Pract. Exp., 7(7), 653–670 (1995).
8. N. G. Shivaratri and P. Krueger, ‘Two adaptive location policies for global scheduling algorithms’, Proc. 10th Int. Conf. Distributed Computing Systems, May 1990, pp. 502–509.
9. T. L. Casavant and J. G. Kuhl, ‘A taxonomy of scheduling in general-purpose distributed
computing systems’, IEEE Trans. Softw. Eng., 14(2), 141–154 (1988).
10. H. S. Stone, ‘Multiprocessor scheduling with the aid of network flow algorithms’, IEEE Trans.
Softw. Eng., 3, 85–93 (1977).
11. V. M. Lo, ‘Heuristic algorithms for task assignment in distributed systems’, IEEE Trans. Comput., 37(11), 1384–1397 (1988).
12. C.-C. Shen and W.-H. Tsai, ‘A graph matching approach to optimal task assignment in distributed
computing systems using a minimax criterion’, IEEE Trans. Comput., 34(3), 197–203 (1985).
13. G.-H. Chen and J.-S. Yur, ‘A branch-and-bound-with-underestimates algorithm for the task
assignment problem with precedence constraint’, Proc. Int. Workshop Modeling, Analysis and
Simulation of Comput. and Telecomm. Syst., Feb. 1994.
14. L. M. Ni and K. Hwang, ‘Optimal load balancing in a multiple processor system with many job
classes’, IEEE Trans. Softw. Eng., 11(5), 491–496 (1985).
15. M. D. Feng and C. K. Yuen, ‘Dynamic load balancing on a distributed system’, Proc. 6th IEEE
Symposium on Parallel and Distributed Processing, Dallas, TX, 1994, pp. 318–325.
16. L. M. Ni, C. W. Xu and T. B. Gendreau, ‘Drafting algorithm – a dynamic process migration
protocol for distributed systems’, Proc. 5th Int. Conf. Distributed Computing Systems, IEEE,
1985, pp. 539–546.
17. N. G. Shivaratri, P. Krueger and M. Singhal, ‘Load distributing for locally distributed systems’,
IEEE Comput., 25(12), 33–44 (1992).
18. N. G. Shivaratri and M. Singhal, ‘A load index and a transfer policy for global scheduling tasks
with deadlines’, Concurrency: Pract. Exp., 7(7), 671–688 (1995).
19. R. Mirchandaney, D. Towsley and J. Stankovic, ‘Adaptive load sharing in heterogeneous systems’, J. Parallel Distrib. Comput., 9(4), 331–346 (1990).
20. J. A. Stankovic and I. S. Sidhu, ‘An adaptive bidding algorithm for processes, clusters, and
distributed groups’, Proc. 4th Int. Conf. Distributed Computing Systems, May 1984, pp. 13–18.
21. O. Kremien and J. Kramer, ‘Methodical analysis of adaptive load sharing algorithms’, IEEE
Trans. Parallel Distribut. Syst., 3(6), 747–760 (1992).
22. L. Kleinrock, Queueing Systems, Vol. 1: Theory, Wiley, New York, 1975.
23. C. Lu and S.-M. Lau, ‘An adaptive load balancing algorithm for heterogeneous distributed
systems with multiple task classes’, Proc. 16th Int. Conf. Distributed Computing Systems, Hong
Kong, May 1996.
24. C. Lu and S.-M. Lau, ‘A performance study on dynamic load balancing algorithms’, Department
of Computer Science and Engineering, The Chinese University of Hong Kong, Unpublished
report, June 1995.
Copyright 1999 John Wiley & Sons, Ltd.
Concurrency: Pract. Exper., 11, 1–20 (1999)