CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(8), 821–836 (AUGUST 1997)

Parallel join for IBGF partitioned relational databases

M. BOZYIGIT, S. A. MOHAMMED∗ AND M. AL-TAYYEB†
Information & Computer Science Department, King Fahd University of Petroleum & Minerals, PO Box 1619, Dhahran 31261, Saudi Arabia

SUMMARY
This study is concerned with a parallel join operation in which the subject relations are partitioned according to an interpolation based grid file (IBGF) scheme. The partitioned relations and directories are distributed over a set of independently accessible external storage units, together with the partitioning control data. The join algorithms, executed by a mesh type parallel computing system, handle uniformly as well as nonuniformly partitioned relations. Each processor locates and retrieves the data partitions it is to join at each step of the join process, in synchronisation with the other processors. The approach is found to be feasible, as the speedup and efficiency results obtained by simulation are consistent with theoretical bounds. The algorithms are tuned to join-key distributions, so that effective load balancing is achieved during the actual join. © 1997 by John Wiley & Sons, Ltd.

1. INTRODUCTION

Join is a frequently used database operation. It is also the most expensive when compared with other database operations such as selection and projection, since, unlike those operations, it involves more than one relation. Sequential join methods such as nested loops, hybrid hash and join index are well established, but they can be slow, especially when large databases are involved[1,2]. Emerging parallel and distributed computing systems have, during the last two decades, created an important incentive for efficient database operations[3,4]. Parallel join algorithms, such as parallel hybrid hash and join index, partition the relations just before the join operation. This makes a real-time join costlier.
Also, frequent deletions, additions or updates must not require complete reorganisation of the database. Frequent join operations will be more efficient if the involved relations, once partitioned, remain unchanged and ready for immediate access. A parallel join operation is expected to perform well if the involved relations are properly organised[5,6]. The individual processors should, independently and in parallel, be able to detect and interpret the data allocated to them for local joins, and then co-operate with the others to achieve the global join. This is the main reason for choosing an IBGF based relation organisation: it imposes a structured partitioning on the database, yet allows the database to be maintained dynamically[7,8]. The resultant decomposition is also suitable for parallel processing, as the distribution of disjoint data partitions over a parallel system can be formulated. This study is just a beginning, as it excludes implementation of the developed framework on a real distributed memory parallel system.

∗ Currently with Saudi ARAMCO oil company, Dhahran.
† Currently with the University of Melbourne, Australia.

CCC 1040–3108/97/080821–16 $17.50 © 1997 by John Wiley & Sons, Ltd. Received 18 April 1994. Revised 13 May 1996.

Various parallel computer interconnection schemes, such as shuffle exchange, hypercube, ring, pipeline and mesh, can be considered for a parallel join[1,9,10]. In this study the emphasis is on a mesh type architecture with a multi-unit external storage system. There are important commercially available mesh type parallel systems, such as DAP, Paragon XP/S, Cray T3D, J-Machine and Tera Computer[11]. Architectures such as ring and hypercube can be considered as special forms of the mesh type[11], and are expected to perform equally well.
Given a multi-unit disk storage system, the issue of how to map partitions to storage units so as to maximise concurrency during a join is closely related to efficiency[12]. The partition mapping is governed by the size and topology of the storage subsystem and the parallel computing system, as well as by the relation characteristics.

The remainder of the paper covers the following topics. Section 2 introduces relation partitioning according to IBGF. The layout of the parallel mesh architecture used as a testbed for the simulation is presented in Section 3. The parallel join algorithms for uniform and nonuniform (skewed) relations are presented in Sections 4 and 5, respectively. Finally, the results obtained by simulation are introduced in Section 6, followed by conclusions in Section 7.

2. RELATION PARTITIONING

An IBGF based data organisation is basically an indexing scheme. The details of an IBGF are not within the scope of this study; however, its main features are discussed to provide the basis for its use in the parallel join under study.

A relation is viewed as a file of n tuples and d attributes. It is partitioned into buckets, based on d-dimensional hashing. The size of a bucket (partition) is normally imposed by the physical characteristics of the available disk storage and memory, and is therefore a matter of granularity. It is possible for a relation to fit into one bucket, if the relation is relatively small and the physical characteristics of the external storage and main memory are appropriate.

Basically, the value of each attribute of a tuple is mapped to a real number in the half-open interval [0, 1) by scaling, using the domain range. The decimal point is then ignored, retaining as many bits as necessary to its right. The number of bits retained for the mapping is a function of the partition and relation sizes, as in dynamic hashing[13].
For example, consider a 2-attribute tuple t : (100, 25), where the values of attributes 1 and 2 are 100 and 25, respectively (refer to Table 1 for frequently used symbols and abbreviations). Also, let the domain ranges for attributes 1 and 2 be 200 and 100, in that order. First, scale the tuple by mapping it to a floating point binary number:

t : (100, 25) −→ (100/200, 25/100) −→ (0.1, 0.01) in binary

Then, retain one bit to the right of the decimal point, ignoring the decimal point, to provide the co-ordinates of the tuple in binary. Thus, the final mapping, after skipping the intermediate steps, can be represented as

t : (100, 25) −→ (1, 0)

Table 1. List of main symbols used in the paper

Ai      value of attribute i
ai      axis (attribute) interval label, on axis i, in binary
d       number of attributes of a relation
f       skew factor of a nonuniform relation
IBGF    interpolation based grid file
K       total number of partitions in a uniform relation, 2^k
Ki      total number of axis intervals on axis i, 2^ki
ki      partition level of axis i
k       sum of partition levels, Σ ki over all i
LAIL    largest axis interval label
ℓ       partition size
m       number of rows of a mesh
n       number of columns of a mesh
pij     processor (i, j) in a mesh
p       partition label
q       partition level at the time of creation of a partition
Ri      relation i
r       number of partitions in a join class
s       number of external storage units
sb      scaled binary
t       tuple
u       relation size, in number of tuples
αi      minimum value of attribute i
∆i      maximum domain interval of attribute i
−→      implies a mapping

This mapping will be referred to as the scaled-binary (sb) form from now on.

The procedure starts with one partition, which is dubbed partition level 0 (zero), along all axes. The partitioning is conducted along all the attribute axes in a round-robin fashion. An insertion in excess of the partition capacity along an attribute axis causes splitting along that axis.
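The scaling and bit-retention steps above can be sketched as follows. This is an illustrative sketch only: the function name and argument layout are ours, not the paper's, and αi is taken as the domain minimum, as in Table 1.

```python
def scaled_binary(tuple_vals, alphas, deltas, levels):
    """Map a tuple to its axis interval labels (scaled-binary form):
    each attribute value A_i is scaled into [0, 1) using the domain
    minimum alpha_i and domain range Delta_i, then the first k_i bits
    to the right of the binary point are kept as the interval label."""
    coords = []
    for A, alpha, delta, k in zip(tuple_vals, alphas, deltas, levels):
        x = (A - alpha) / delta            # scale into [0, 1)
        coords.append(int(x * (1 << k)))   # keep k bits after the point
    return coords

# The paper's example: t : (100, 25), domain ranges 200 and 100, one bit kept
print(scaled_binary((100, 25), (0, 0), (200, 100), (1, 1)))  # [1, 0]
```

Retaining more bits (a higher partition level ki) simply refines the same co-ordinates, which is what each split does.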
Each split increments the partition level by one along the subject axis. In the uniform relation case, all these partitions will eventually be filled to their full capacity (the bucket size). The next partition level after level 0 is 1, at which point the number of partitions is 2 (= 2^1). The second split results in four partitions (= 2^2). The partitioning procedure ends with a partition level of ki along attribute axis i, where i ∈ (d − 1). A partition level of ki implies 2^ki axis (or partition) intervals. Therefore, for a d-attribute uniform relation, there will be a total of 2^k partitions, where k = k0 + k1 + . . . + kd−1.

In the nonuniform relation case some partitions may be empty, contrary to the uniform case, where all the partitions are of equal size. The algorithm developed for nonuniform relations should avoid bucket allocations for empty partitions. Hence, a directory structure is required to keep a record of the existing partitions. Such a directory structure can be viewed as a binary tree whose terminal nodes constitute data partitions and whose internal nodes constitute directory or subdirectory partitions.

A partition label can be computed from the corresponding axis intervals. The tuples in the same bucket produce the same partition label. To generalise, let t be a tuple represented by t : (A0^t, A1^t, A2^t, . . . , Ad−1^t), let αi = min over j ∈ (u − 1) of Ai^j, and let the domain interval ∆i = max over j ∈ (u − 1) of (Ai^j − αi), for i ∈ (d − 1). A tuple-to-co-ordinate (axis interval) mapping can then be written as

t : (A0^t, A1^t, A2^t, . . . , Ad−1^t) −→ (a0, a1, a2, . . . , ad−1)

[Figure 1, panels (a)–(c): (a) one partition, 2 tuples in 1 partition; (b) two partitions, 3 tuples in 2 partitions; (c) four partitions, 4 tuples in 3 partitions.]
[Figure 1, panel (d): sixteen partitions, 12 tuples in six partitions. Caption: Partitioning by tuple insertion: a new tuple will cause a split if the partition it maps to is full (ti indicates tuple i, pi indicates partition i, and Ai indicates attribute i).]

Here ai is the axis interval of the partition to which t is mapped on axis i; it is the scaled-binary (sb) representation of (Ai^t − αi)/∆i. The maximum number of tuples that can map to a partition is limited to its size, ℓ. Conversely, given the co-ordinates of a partition in a d-dimensional system, there is a unique partition label p, which can be computed as

p = Σ_{i=0..d−1} 2^i Σ_{j=0..ki−1} aij 2^(ki−j−1)        (1)

where ki is the partition level and aij is the jth bit of the axis interval label on axis i. These partition labels will be used to take decisions about partition-to-processor mapping during the parallel join operation.

The upper bound on the total number of partitions is the product of the upper bounds on the numbers of axis intervals, which is equal to 2^k, where k is the sum of the partition levels, i.e. k0 + k1 + . . . + kd−1. For the uniform relation case, the number of partitions is exactly 2^k; for the nonuniform case, 2^k is only an upper bound.

As an example, assume that there are four tuples in a 2-attribute relation R, and that the bucket (partition) size is 2 tuples. During successive tuple processing, each bucket overflow causes a split and thus doubles the number of intervals (or increments the partition level by 1), extending the binary interval labels by another bit. Figure 1(a) shows a single partition after insertion of the first two tuples. At this point the partition level is 0 and the number of partition intervals is 1 (= 2^0) along both attribute axes A1 and A2. Each insertion requires scaling, as discussed before, on both attributes.
Figure 1(b) shows the situation after the insertion of a third tuple, which causes a partition split. The fourth insertion causes the formation of four partitions, one of which is empty, as shown in Figure 1(c). Figure 1(d) shows the state after the insertion of 12 tuples; only the nonempty partitions are shown. If the relation were uniform, all 16 partitions would be shown. Note that the partition labels are computed by equation (1). This example demonstrates a nonuniform case, which requires the implementation of a multiple level directory structure to avoid storage allocations for empty partitions[13]. The directories are treated like data files, except that the number of tuples per bucket will be larger, as a directory tuple is normally shorter than a data tuple.

3. PARALLEL ARCHITECTURE

The parallel system used in this study is an m × n mesh topology with s multi-storage units, where m and n are powers of 2, as shown in Figure 2. The basic system architecture is typical of some commercially available parallel systems, and the computation model is appropriate for an MIMD (multiple instruction multiple data) parallel architecture[11]. Each processor, Pij, has a local memory and four buffered communication channels managed by a communication processor. Apart from modular growth and simple pipelining, there is no special restriction on the topology used: the mesh interconnection network can be replaced by a ring, hypercube, or any other parallel system, and the crossbar switch, proposed as the storage-to-processor interconnection network, can also be replaced by other interconnection schemes.

In the given topology, the boundary processors are Pi1 and P1j, where i ∈ m and j ∈ n. They are special processors, connected to a multi-unit external storage system by an interconnection network, as shown in Figure 2. The size of the interconnection network, in terms of the number of switches, indicates the degree of I/O concurrency.
The processors share the switches (or the storage units) if the number of switches is less than n or m. Proper concurrency control protocols are employed so that conflicting I/O requests are minimised and the load is balanced[14,15]. This is part of the parallel join algorithm, which must be aware of the data distribution over the external storage units and of the parallel system topology.

[Figure 2. A parallel computing system: a 4 × 4 mesh of processors P11–P44 connected to multi-disk units through a crossbar switch.]

4. PARALLEL JOIN ALGORITHM FOR UNIFORM RELATIONS

The main parameters that need to be known about the physical structure of a given relation, before actually accessing data, are its partition levels and its uniform/nonuniform status. If, at the completion of the tuple insertion process, the partition count is equal to 2^k, where k is the total partition level, then the relation is uniform; otherwise it is nonuniform. Once the relations are IBGF partitioned, the directory structure allows any attribute of any relation to be chosen as a join attribute.

In this study, each join is subject to the overhead imposed by the parallel algorithm only; the directory maintenance cost incurred by tuple insertions or deletions is excluded. A study concerned with the total cost would also have to consider the directory maintenance cost per update as well as the initial partitioning cost.

For clarity, let R1 and R2 be two equal cardinality relations with attributes of equal domains, partitioned into an equal number of axis intervals during the partitioning process. For the join of R1 and R2, it is sufficient to match the partitions in the corresponding intervals on the join axes. For example, the partitions from interval i on a join attribute of R1 are matched with the partitions in interval i on the corresponding join attribute of R2.
Let the set of partitions in a common axis interval form an interval class; then there is a direct mapping from a partition to its interval class. Given the binary representation of a partition p as b(k−1) b(k−2) . . . b1 b0, the interval class label on axis i, Ii, to which the partition belongs is easily computed as

Ii = Σ_{j=0..ki−1} 2^j b(i+dj)        (2)

For a uniform relation, the number of partitions in each interval class on axis i is equal to 2^(k−ki); this many partitions map to the same interval class. The reverse mapping, i.e. the computation of the partition label from the axis interval labels, was given by equation (1).

For the simple two-relation case, the join involves matching the corresponding interval classes. If the number of interval classes in R1 is greater than that in R2, then more than one interval class in R1 must be matched with a single interval class in R2. Let the term join class denote compatible interval classes in the different relations to be joined. The tuples in a join class can only match the tuples in the corresponding (compatible) join class. In the uniform relation case, a join class consisting of one interval class belongs to the smaller relation, while a join class consisting of a set of interval classes belongs to the larger relation. The join class concept is generalised so that it covers uniform as well as nonuniform relations of any size and domain, and it forms the basis of the parallel join algorithm.

The parallel computing system (Figure 2) used for the join operation can be viewed as an extended pipeline system in which the partitions move along the horizontal and vertical processor pipelines.
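The interval-class mapping of equation (2), and its inverse, amount to bit interleaving: bit (i + d·j) of the partition label holds bit j of the interval label on axis i. The sketch below follows that indexing; the function names and the exact bit conventions are our reading of the equations, not code from the paper.

```python
def partition_label(intervals, levels):
    """Interleave the bits of the axis interval labels I_i into a
    partition label p, the inverse of equation (2): bit (i + d*j) of p
    is bit j of I_i, where d is the number of attributes."""
    d = len(levels)
    p = 0
    for i, (I, k) in enumerate(zip(intervals, levels)):
        for j in range(k):
            p |= ((I >> j) & 1) << (i + d * j)
    return p

def interval_label(p, i, levels):
    """Equation (2): I_i = sum over j of 2^j * b_{i+d*j}, where b_t is
    the t-th bit of the partition label p."""
    d = len(levels)
    return sum(((p >> (i + d * j)) & 1) << j for j in range(levels[i]))

# Round trip for a 2-attribute relation with k0 = k1 = 2
p = partition_label((2, 1), (2, 2))
print(p, interval_label(p, 0, (2, 2)), interval_label(p, 1, (2, 2)))  # 6 2 1
```

Because the mapping is pure bit manipulation, each processor can derive its interval classes locally, without consulting the others.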
A partition, having reached a processor, remains resident until its join with compatible partitions from the other relation is complete.

PROGRAM UniformJoin;  /* executed by processor i */
BEGIN
  /* Initialise */
  Input Relation Data;
  Input Computing System Data;
  Build Join Classes;
  WHILE (Join is not complete)
  BEGIN
    Compute Current Partitions to Join;  /* schedule partitions */
    Input Scheduled Partitions Along Row and Column;
    Pipeline Partitions Along Row and Column;
    Join Scheduled Partitions;
    IF (the processor is on boundary row)  /* schedule next partition from R2 */
    BEGIN
      WHILE (Current join class is not complete)
      BEGIN
        Input Next Scheduled Partition;
        Pipeline Partition Along Column;
        Join Scheduled Partition;
      END
    END;
    IF (the processor is on boundary column)  /* schedule next R1 partition */
    BEGIN
      WHILE (Current join class is not complete)
      BEGIN
        Input Next Scheduled Partition;
        Pipeline Partition Along Row;
        Join Scheduled Partition;
      END
    END
  END
END

Figure 3. Summary of the parallel join algorithm (in Pascal-like notation)

There are no important provisions regarding the operating system. It is assumed that there is a small kernel at each node which boots the nodes up, enables each processor to load the algorithms, and handles partition input/output as well as the data routing along the vertical and horizontal pipelines, as commanded by the parallel join algorithm.

The join classes of R1 are handled by the vertical pipelines and those of R2 by the horizontal ones. They are balanced over the column and row processors, respectively. The excess partitions, if any, after equal allocation, are allocated one per processor until exhausted. This simple heuristic minimises the completion time of the join, as there cannot be better load balancing. Note that each processor is able to deduce its own partitions and its own join classes independently of the other processors; this is important for the parallelism.
As previously mentioned, the labelling scheme allows the processors to carry out the unique mapping using bit manipulation of the partition, axis interval and join class labels. A concise form of the algorithm is given in Figure 3. In summary, a processor is able to infer, from the available partitioning information, the partitions to input, the partitions to join, the partitions to pipeline, and the pipeline (row or column) to choose for a send (or receive). Any attribute can be a join attribute, as the partitioning covers all the attributes.

5. NONUNIFORM RELATIONS

For a nonuniform relation, the number of partitions is less than the theoretical maximum (2^k); only a subset of the possible data partitions actually exists. Some partitions, which would otherwise exist in the uniform case, are treated as being collapsed into one partition. A set of empty adjacent partitions in a continuous range of an axis interval is treated as being embedded into the smallest-label nonempty partition adjacent to that range. Similarly, some axis intervals may not exist; they too are embedded in the preceding adjacent ones[7]. For example, the unoccupied axis intervals i + 1, i + 2, . . . , i + j would be embedded in the occupied axis interval i. With the implementation of a directory structure, storage is not allocated for the empty partitions.

Despite the more complicated partitioning process, the join class concept developed for the uniform relation case is also maintained for the nonuniform case. However, the computation of the compatible join classes and their constituent partitions is not straightforward. The simplest method is to use equation (2) to find the compatible join classes from the partition labels, taking embedding into consideration. For example, in Figure 1(d), partition p0 embeds p4, p8 and p12 on axis A1; it also extends over axis intervals 0 and 1.
The highest axis interval label covered by the last partition in a join class is taken as the basis for computing the compatible join classes. Two join classes are compatible if the highest axis interval label in one is equal to that of the other. For example, in Figure 1(d), the highest axis interval label covered by join class 0 (J0) is 1. J0 would be compatible with J0 of another relation only if its highest axis interval were also 1; if they are not compatible, they are forced to be so. The compatible join classes are therefore computed during each join operation, according to the criterion just mentioned. After this, the join process continues in the same way as in the uniform case.

Given the binary representation of a partition label, the largest axis interval label (LAIL) covered can be computed. If the binary representation of a partition is b(k−1) b(k−2) . . . b1 b0, then the LAIL on axis i, hi, can be computed as

hi = Σ_{j=0..q−1} 2^j b(i+dj) + 2^(ki−q) − 1        (3)

where i is the attribute axis label, bt is the tth bit of the partition label, and q is the partition level at the time of the partition's creation. The smallest axis interval label, h′i, of a partition is equal to that of the uniform case. Thus, the coverage of a partition is equal to (hi − h′i), in terms of the number of axis intervals. This information allows the common LAIL for the partitions in a join class to be computed, which is in turn used to compute the compatible join classes.

Figure 4 shows a nonuniform relation with nine existing partitions (out of the 32 that would exist had the relation been uniform). Tables 2 and 3 show the partitioning information compiled using the mapping details discussed above. The scheme used allows us to treat the join of nonuniform relations as the general case, of which the join of uniform relations is a special case.
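Equation (3) can be transcribed directly as below. This is only a transcription of the formula as printed; the convention that bt is the t-th bit of p counting from the least significant end is our assumption, and no claim is made beyond the arithmetic of the equation itself.

```python
def lail(p, i, d, k_i, q):
    """Largest axis interval label on axis i, transcribing equation (3):
    h_i = sum_{j=0..q-1} 2^j * b_{i+d*j} + 2^(k_i - q) - 1, where b_t is
    the t-th bit of partition label p and q is the partition level at
    the time the partition was created."""
    known = sum(((p >> (i + d * j)) & 1) << j for j in range(q))
    return known + (1 << (k_i - q)) - 1

# e.g. d = 2, k_0 = 3: partition 0 created at q = 3 gives 0 + 2^0 - 1 = 0;
# partition 1 created at q = 1 gives 1 + 2^2 - 1 = 4
print(lail(0, 0, 2, 3, 3), lail(1, 0, 2, 3, 1))  # 0 4
```

The term 2^(ki − q) − 1 accounts for the interval range a partition created early (small q) continues to cover after later splits.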
The parameters such as the partition levels, the number of partitions in each join class, and the axis intervals covered by each partition are essential for each processor in the system to perform the correct join.

[Figure 4. Partitioned state of a nonuniform relation with two attributes (pi indicates the ith partition; binary numbers indicate axis interval labels)]

The partitioning information for each relation, having been recorded upon completion of the partitioning, is input by the boundary processors and pipelined or broadcast through the horizontal and vertical processor pipelines during the join operation. In fact, for a wraparound mesh system, the communication complexity of one-to-all broadcasting is logarithmic in the system size[11]. A parallel partition input protocol is devised to handle input at the pipeline boundaries: the boundary processors of the vertical pipelines read in the join classes of one relation, and those of the horizontal pipelines read in the join classes of the other relation, as in the uniform case.

The uniform join algorithm, given in Figure 3, can also be used for the nonuniform case, with a new interpretation of the join classes. In particular, the initialising and mapping routines need to consider the directory structure that implements the partitioned nonuniform relations.

Table 2. Partitioning details of the nonuniform relation shown in Figure 4, along attribute A1

Existing partition, p:           0  1  2  3  4  7  8  15  16
Partition level at creation, q:  3  1  1  2  2  2  2   3   2
Largest axis interval covered:   0  7  3  5  7  7  1   1   7
Table 3. Join class details of the nonuniform relation shown in Figure 4, along attribute A1

Join class label   Partitions covered   Axis intervals covered       LAIL
0                  0, 2, 8              (0), (0,1,2,3), (0,1)        (0)
1                  2, 8, 16             (0,1,2,3), (0,1), (1)        (1)
2                  2, 4                 (0,1,2,3), (2,3)             (2,3)
4                  1, 3                 (4,5,6,7), (4,5)             (4,5)
6                  1, 7, 15             (4,5,6,7), (6,7), (6,7)      (6,7)

The algorithm allows an even distribution of partitions. If the number of partitions in a join class (r) is not a multiple of the row and/or column sizes, some of the processors would remain idle. For example, let the number of excess partitions of a join class be µ (= r mod m); the last cycle of the join then involves only µ of the m row processors. If m is a multiple of µ, repeating a copy of the excess partitions on the m/µ row processors provides exact load balancing. The join of each partition of R1 then involves (r/n)/(m/µ) partitions from R2, instead of r/n.

6. PERFORMANCE

6.1. Simulation

To study the performance of the algorithm, the entire system (the operational behaviour of the parallel system and the algorithms) was simulated. The simulation consists of two main modules. Module-1 creates a directory structure for the IBGF by mapping the tuples to partitions. The depth of a directory tree depends on the skewedness of the relation and the bucket sizes. Following the construction of the directory tree, the partitions are distributed over the multiple storage units according to the mapping function. For uniform relations, the binary representation of a partition identification is sufficient to map it optimally to a storage unit. During the join operation, the partitions can be retrieved from the storage units using the same mapping function. For the nonuniform relation case, the directory tree is traversed in post-order, and the partitions are assigned to the storage units evenly. At the same time, an index is created for the partition allocation, to be used during the join operation.
In all cases, a partition is assigned to one storage unit only, i.e. no duplication is allowed.

Module-2 of the simulation package models the join operation. For each relation it handles the partition flow and manages the join on the system. Note that the partitions of a join class of one relation are joined with the partitions in the compatible join class(es) of the other.

The simulation assumes 56 byte data tuples, an 8 ms external storage access time, a 16 Mbit per second effective data transmission rate, and a 4 µs tuple join time. The size of a partition is computed so that the time it takes to retrieve two tuples is equal to the time it takes to join them. This restriction is used to simplify the synchronisation required to overlap I/O with processing time. In our case, the partition size is computed to be 73 tuples. The processing and data transmission are synchronised using event-based virtual clocks, where each action is represented by an event.

The uniform relations are formed such that the tuple attribute values show uniform frequency in a particular domain. The nonuniform relations are formed using a skew factor (f). For the sake of simplicity, the attributes of the relations are assumed to be numeric, targeting numeric databases.

The main performance metrics used in this study are speedup and efficiency. The speedup is equal to the sequential join time divided by the parallel join time; the efficiency is the speedup divided by the number of processors[11]. The parameters whose effects on efficiency and speedup were studied are the system size, the relation size, the number and size of partitions, the number and size of interval classes, the number and size of join classes, the number of storage units, the number of storage-to-processor (s-p) access paths, and the degree of data skew for nonuniform relations.
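The two metrics can be computed as below; the numbers in the example are illustrative, not results from the paper.

```python
def speedup_efficiency(t_seq, t_par, processors):
    """Speedup = sequential join time / parallel join time;
    efficiency = speedup / number of processors."""
    speedup = t_seq / t_par
    return speedup, speedup / processors

# e.g. a join taking 640 s sequentially and 12 s on 64 processors
s, e = speedup_efficiency(640.0, 12.0, 64)
print(round(s, 2), round(e, 3))  # 53.33 0.833
```

An efficiency near 1 means the mesh is almost fully utilised; pipelining head and tail effects, idle processors and I/O contention all push it below 1.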
6.2. Results on uniform relations

Let r and s be the numbers of partitions in the join compatible classes of relations R1 and R2, respectively, and let m and n be the numbers of rows and columns of the parallel system. If r < m and s < n, then mn − rs processors will remain idle. In the ideal situation, r and s are multiples of m and n, in which case all the processors are kept busy during the join operation. Otherwise, in the last iteration, mn − r′s′ processors will remain idle, where r′ = (r mod m) ≠ 0 and s′ = (s mod n) ≠ 0.

Figure 5 shows the effect of the number of partitions on the overall efficiency. The smoother parts of the lower curve correspond to situations where the number of partitions per join class remains fixed but the number of join classes increases; in this case the pipelining proceeds join class by join class. The results show that the pipelining is more sensitive to the number of partitions than to the number of join classes. However, the flow of the join classes can be synchronised so that the processors do not remain idle in between. In this case the efficiency shows a continuous upward trend, because of the reduced pipelining effect, as shown by the upper curve. In general, a performance increase with relation size is to be expected, as the parallelism of pipeline processing is better exploited by relatively longer data streams.

Figure 6 shows the effect of the system size on efficiency. The number of partitions in each relation is 1024. A 16-fold increase in the mesh size produces a 13-fold speedup. However, the efficiency drops from 93% to 77%, due to the long pipelining effect, especially at the beginning of the first join class (head effect) and the end of the last one (tail effect).

The stability of the system is tested by a simultaneous proportional increase in the system and relation sizes. The ratio of relation size to system size is kept fixed at 64, while their respective sizes are increased by factors of 4, 16 and 64.
The observed efficiency was 87.5 ± 0.3%. These results show that the system or relation size has no adverse effect on the efficiency: the drop in efficiency was only 0.6%, while the increase in system size (in number of processors) and relation size (in number of partitions) was 64-fold. This indicates that the overhead incurred by long pipelining can be outweighed by continuous input streams of larger join classes, making the head and tail effects of the pipelining negligible.

[Figure 5. Effect of relation size on efficiency: efficiency (%) versus number of partitions (1000–4000), for simple and synchronised pipelining.]

[Figure 6. Effect of system size (number of processors) on efficiency: efficiency (%) and speedup (in number of processors) versus number of processors (10–60).]

[Figure 7. Effect of the number of storage units: speedup (for 64 processors) and efficiency (%) versus number of disk units (10–30).]

The number of storage units is another factor, an increase in which is expected to increase the parallelism of the input, and thus the throughput. An increase in the degree of data distribution over a large number of storage units reduces contention on the individual storage units and the interconnection network, allowing increased parallelism, reflected in the speedup and efficiency as shown in Figure 7. The tests were made on relations of 1024 partitions and a system of 64 processors, under varying numbers of storage units. However, no performance improvement is expected beyond some optimum number of storage units. Similarly, experiments showed a proportional efficiency increase with higher storage-to-processor (s-p) connectivity under a fixed number of storage units, which strengthens the scalability of the algorithm.
Results on nonuniform relations

Contrary to the uniform case, the sizes of interval classes and join classes may vary considerably in the nonuniform case, and performance varies accordingly. The degree of nonuniformity can be represented by a skew in attribute values, as mentioned before. For the same number of tuples, a highly skewed relation will have more join classes of differing sizes than a moderately skewed one; a high skew implies a large number of join classes, many with fewer partitions.

To test the effect of skewness, an experiment was carried out with two relations of varying skewness, represented by a skew factor f. The factor f is allowed to vary from 0.1 to 0.9, where a value of 0.5 indicates a uniform relation and a value below or above 0.5 indicates a left or right inclination, respectively. Table 4 shows the efficiency as affected by f.

Table 4. Effect of nonuniformity (f = skew factor) on efficiency (%)

             f of R1
f of R2    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
  0.1     29.0  41.4  58.7  61.8  67.9  45.5  33.4  26.1  18.7
  0.2     40.9  34.8  46.4  53.2  60.9  42.0  35.9  29.5  26.4
  0.3     58.6  45.0  40.9  47.6  59.6  40.4  36.9  33.4  33.4
  0.4     62.2  53.0  46.2  43.8  60.3  41.1  40.3  41.5  43.7
  0.5     65.0  64.4  63.1  62.4  70.2  62.7  63.7  64.8  75.9
  0.6     48.0  42.6  43.9  44.1  61.5  40.8  44.6  51.2  62.2
  0.7     32.4  34.9  37.0  42.2  61.0  45.9  35.8  42.0  59.9
  0.8     26.1  29.3  34.5  40.2  61.5  50.8  40.9  27.8  31.6
  0.9     18.5  26.4  36.4  45.5  66.1  62.1  58.1  32.1  22.2

The highest efficiency is experienced by the uniform relations, where f1 = f2 = 0.5. The smallest efficiency is experienced when the data skew is at its extreme ends, that is f = 0.1 or f = 0.9. Table 4 is not symmetrical, as R1 and R2 may not give the same size join classes at the same f values. The number of storage units and s-p connections show similar trends to those of the uniform case.
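One simple way to model such a skew factor is a linear tilt of tuple mass over the join classes, with f = 0.5 giving equal class sizes. This generator is hypothetical; the paper does not specify the one used in the experiments.

```python
# Hypothetical skew model (not from the paper): distribute tuple mass over
# k join classes. f = 0.5 -> uniform; f < 0.5 tilts mass to the left
# (lower-labelled classes), f > 0.5 to the right.
def class_weights(f: float, k: int) -> list[float]:
    tilt = 2.0 * (f - 0.5)            # in [-0.8, 0.8] for f in [0.1, 0.9]
    raw = [1.0 + tilt * (2.0 * i / (k - 1) - 1.0) for i in range(k)]
    total = sum(raw)
    return [w / total for w in raw]   # normalised fractions, summing to 1
```

Under high skew most classes receive little mass, consistent with the observation above that highly skewed relations produce many join classes with fewer partitions.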
An increase in the size of the interconnection network or the number of storage units, under fixed application conditions, is not expected to improve performance beyond a turning point past which further parallelism is not possible.

The uniform algorithm determines the available partition labels from the partition levels. The same algorithm cannot deduce this information for nonuniform relations: for these, the partition level can only indicate the range of the partition labels. In highly skewed relations this range could be very large, because most of the partitions in the range are either empty or have low occupancies. For example, the range of partition labels for an eight-partition uniform relation with f = 0.5 is 0 to 7, whereas for an equal-size relation with f = 0.9 the range can be 0 to 63, an eight-fold difference. The nonuniform algorithm can handle this variance at the expense of directory maintenance overhead incurred as part of the partitioning process, which is excluded from the performance computations.

If the parallel join algorithm developed for the uniform case is applied to the nonuniform case, the processors that take part in the join of one join class may remain idle during the join of another, or vice versa. The cumulative processor idle time may then reach a level at which parallel processing is no longer an advantage, since the uniform algorithm reads all possible partitions, most of which may be nonexistent. For example, applying the uniform algorithm to the join of nonuniform relations with f = 0.3 or f = 0.7 incurred an efficiency decline 700 times greater than that of the nonuniform algorithm, whereas for f = 0.5 the ratio was nearly 1.

7. CONCLUSION

This study is about exploiting a parallel join operation in relational databases where the relations are organised according to an IBGF scheme.
The relations, with uniform or nonuniform key distributions, are partitioned and stored over a multi-unit external storage medium, along with various partitioning parameters. The parallel algorithms for both uniform and nonuniform relations are implemented, assuming that each processor hosts an instance of the kernel algorithm. From the partition level and the join class labels, an instance of the kernel algorithm is able to determine the partitions it should input for the current join. For any attribute, it knows to or from which pipeline to send or receive a particular partition.

The simulation results have shown that the entire approach is feasible, with good scalability characteristics. An increase in the system or relation sizes is reflected in higher efficiency and speedup, and a balanced distribution of data and processing shows a similar trend. The results also show that, for the best performance, uniform and nonuniform relations must be handled by their respective parallel algorithms. The nonuniform algorithm compensates for the inherent imbalance, but it cannot eliminate the effect of the skew factor completely: the uniform algorithm applied to uniform relations still performs better than the nonuniform algorithm applied to nonuniform relations.

The formation of an IBGF directory tree and the partition distribution costs are excluded from the time complexity (cost) of the parallel join algorithms. These costs are considered negligible for relations where transient changes are either infrequent or nonexistent. Otherwise, especially in nonuniform cases, it may be difficult to ignore the effect of dynamic directory maintenance and relation re-partitioning. The simulation results, however, create a basis for further research on dynamic directory maintenance and on testing the algorithms on real parallel systems.
ACKNOWLEDGEMENTS

The authors would like to acknowledge, in particular, the contribution of the author of [7] on IBGF-based relation partitioning. Acknowledgement is also due to the King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia. We extend our sincere thanks to the anonymous reviewers for their helpful comments.

REFERENCES

1. D. DeWitt and R. Gerber, 'Multiprocessor hash-based join algorithms', Proceedings of VLDB, Stockholm, 1985.
2. P. Valduriez, 'Join indices', ACM Trans. Database Syst., 12, (2), (1987).
3. M. Negri and G. Pelagatti, 'Distributed join: a new algorithm for joining relations', ACM Trans. Database Syst., 16, (4), (1991).
4. F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers, Inc., 1992.
5. M. Bozyigit, S. A. Mohammed and M. Tayyeb, 'Parallel join algorithms for multicomputer systems', Proceedings of the 7th International Symposium on Computer & Information Sciences (ISCIS), Antalya, Nov. 1992.
6. M. C. Murphy and D. Rotem, 'Multiprocessor join scheduling', IEEE Trans. Knowl. Data Eng., 5, (2), (1993).
7. M. Ouksel, 'The interpolation based grid file: a multidimensional dynamic order-preserving partition scheme', Proc. ACM SIGMOD Symp. on Principles of Database Systems, Portland, Oregon, March 1985.
8. E. A. Ozkarahan and C. H. Bozsahin, 'Join strategies using dataspace partitioning', New Gener. Comput., (June 1988).
9. D. Bitton, H. Boral and D. J. DeWitt, 'Parallel algorithms for the execution of relational database operations', ACM Trans. Database Syst., 8, (3), (1983).
10. J. Richardson, H. Lu and K. Mikkilineni, 'Design and evaluation of parallel pipelined join algorithms', SIGMOD Management of Data Conference, ACM, 1987.
11. V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings Publishing Company, Inc., 1994, p. 597.
12. E. R. Omiecinski and E. T. Lin, 'Hash-based and index-based join algorithms for cube and ring connected multicomputers', IEEE Trans. Knowl. Data Eng., 1, (1), (1989).
13. R. J. Enbody and H. C. Du, 'Dynamic hashing schemes', ACM Comput. Surv., 20, (2), (1988).
14. K. A. S. Abdul-Ghaffar and A. El Abbadi, 'Optimal disk allocation for partial match queries', ACM Trans. Database Syst., 18, (1), (1993).
15. E. Rahm, 'Empirical performance evaluation of concurrency and coherency control protocols for database sharing systems', ACM Trans. Database Syst., 18, (2), (1993).
