CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(10), 967–973 (OCTOBER 1997) Linear array for spelling correction STEFKA FIDANOVA Center of Information & Computer Technology, Bulgarian Academy of Science, Acad. G. Bonchev str., bl.25A, Sofia 1113, Bulgaria SUMMARY This paper introduces a linear array for spelling correction using 15 processors. Many architectures have been proposed to solve similar string correction problems such as speech recognition or nucleic acid sequence computation. It is known that the hypercube, de Bruijn and grid networks contain a Hamiltonian path, a path which contains all the vertices of the network. The execution time of spelling correction on all of these networks is equal. 1997 by John Wiley & Sons, Ltd. 1. INTRODUCTION The increasing use of computer systems for the handling of textual data has led to a great deal of interest in software for the detection and correction of spelling errors[1]. As a matter of fact, 80% of all errors found in English texts are of the following types[2]: • • • • one letter is wrong one letter is missing one extra letter is inserted two adjacent characters are transposed. These errors are respectively referred to as substitution, deletion, insertion and transposition errors. These errors are mainly due to keyboard mistyping. Even detecting an erroneous word is quite an easy task; proposing suitable corrections generally requires complex processing. The correction process consists in comparing the target word with words from a dictionary and in keeping only those words which have the best likeness. The computational difficulty lies in calculating this likeness in order to propose good corrections. The method we use to determine likeness provides good results compared with other techniques[2–4]. The major drawback of the algorithm comes from the amount of calculation required. With dictionaries of a few hundred thousand words, the computation time on conventional personal computers and on currently available workstations can be very long and unacceptable for real-time spelling correction. Many architectures have been proposed to solve similar string correction problems such as speech recognition[5–8] or nucleic acid sequence comparison[9]. The result of the comparison process is a sequence of words which seem probable candidates for the correct word. The hypercube, de Bruijn network and grids contains a Hamiltonian path (a path that consists of all nodes only one time). We can execute spelling correction on all these networks. CCC 1040–3108/97/100967–07 $17.50 1997 by John Wiley & Sons, Ltd. Received 20 January 1996 Revised 8 April 1996 968 S. FIDANOVA 1.1. Brief presentation of the topologies • Linear array. The vertices of a linear network with length n are labeled by the elements of the set {0, 1, . . . , n − 1} and there is an edge between vertex i and the vertices (i − 1) and (i + 1) if these vertices exist. • Hypercube. The d-dimensional hypercube, denoted H(d), is a graph of order 2d whose vertex set consists of all binary words of length d. There exists an edge between any two nodes that binary representation differs in exactly one position. • De Bruijn. A de Bruijn graph of order 2d , denoted by B(2, d), is a digraph whose vertex set is the set of all words of length d over {0, 1}. There is an arc from any vertex x1 , x2 , . . . , xd to vertices x2 , x3 , . . . , xd , α for α equal to 0 or 1. The paper is organized as follows: Section 2 presents the method used for spelling correction and Section 3 presents its implementation on a linear array. Section 4 presents the system architecture. 2. STRING COMPARISON ALGORITHM The basis we use for comparing two strings of characters is a measure of distance between these two strings. This distance is the number of differences between two strings. Our algorithm is based on comparing a target string with strings from a dictionary. The words in the dictionary are sorted into groups by length. Let z be a measure of the differences that exist between two words. If n is the length of the target word and for some word from the dictionary z = 0, then the target word is correct. If for all words with length n z 6= 0, then only those for which z ≤ 2 are considered next. The valid word is from this subset if the target word contains only one wrong letter or two adjacent characters are transposed. Subsequently the target word is compared with words with length n − 1 and n + 1. The valid word is from this subset if there is one missing letter or one extra letter is inserted. The strings to be compared must have equal length. First the shorter string is completed with a special character (zero) on the left side and the two strings are compared. After that the shorter string is compared with the special character (zero) on the right side and the two strings are compared. Only those words for which z < n2 are considered next. In this way it is unnecessary for all the words from the dictionary to be checked. Comparing two words of n characters represents n steps. With a dictionary of p words and where p1 , p2 , p3 are words with length n, n − 1, n + 1, respectively, the total amount of calculation for a correction process is (p1 + 2p2 + 2p3 )n < 2np. 2.1. Pseudocode description of the algorithm Let a = a(1) . . . a(n) and b = b(1) . . . b(n) be two strings: s=0 1 s(i) = 0 if(a(i) = b(i)) then s(i) = 0 else s(i) = 1 s = s + s(i) i=i+1 go to 1 LINEAR ARRAY FOR SPELLING CORRECTION 969 s is the distance between the strings a and b. Figure 1 is a graphical representation of the computation of the difference from the test word syctolic and the reference word systolic. Figure 1. Computation of the difference between the words The total edit distance is then equal to 1. There is one wrong letter. The computation of distance requires one comparing and one addition, i.e. two elementary operations. Comparing two words of n characters represents 2n operations. With a dictionary of p words the total amount of calculation for a correction process is equal to 2np operations. The complete English dictionary represents about 700,000 words when different forms are considered. If a word of ten characters has been detected as erroneous it will be necessary to compare this string with all the words in the dictionary whose lengths are nearly similar. Taking words ranging from eight to 12 characters implies comparing the string with approximately 300,000 target strings. The total number of operations is then equal to 10 million. Real-time spelling correction is thus computionally too expensive for conventional machines, especially when large dictionaries are involved. Similar string correction problems are speech recognition and nucleic acid sequence comparison. In a nucleic acid sequence there are some million different strings. This problem is very important in medicine for early diagnosis of some illnesses. 3. PARALLEL IMPLEMENTATION ON A LINEAR ARRAY This section indicates how the algorithm described above can be implemented on a linear array. Let X = (x1 , x2 , . . . , xn ) and Y = (y1 , y2 , . . . , yn ) be two strings to compare. Consider an array of n processors connected as indicated in Figure 2. A single cell is shown in Figure 3. Processor i reads data xi and yi at time step n − i, calculates value an , the elementary distance calculation, reads data z produced by its neighbor processor i + 1, and propagates 970 S. FIDANOVA Figure 2. Linear array its result z + an to processor i − 1. An array cycle consists of the elementary distance calculation, that is to say the data acquisition, the comparison, the addition and the data propagation. On such an array, parallel execution of a two string comparison requires n cycles since the difference z is updated by each processor and every processor can run concurrently. Since a complete comparison lasts n cycles and since a processor is used for one cycle each processor is available during the other cycles for other tasks. Thus it is possible to start a new comparison on each cycle. Pipelining the string comparison in this way allows one to increase considerably the efficiency of the network since, once the array has been initialized, one comparison result is produced every cycle. The speed-up in comparison with a sequential machine is given by (p1 + 2p2 + 2p3 )n ≈n Γ= n + p1 + 2p2 + 2p3 with p n. Compared to the Lavenier’s method described in [2] this method is faster and uses a smaller number of processors. The last approximation is valid in the applications since the number of references can reach hundreds of thousands of items as compared to a maximum of 20 characters in a string. Figure 3. Organization of the cell LINEAR ARRAY FOR SPELLING CORRECTION 971 Our algorithm uses n + p1 + 2p2 + 2p3 < n + p time steps on n processors. Lavenier’s algorithm[2] uses 2n + p − 2 time steps on n2 processors. One time step consists of two elementary operations in our algorithm and nine elementary operations in Lavenier’s algorithm. It is known that the interconnection networks as rings, de Bruijn networks, grids and hypercubes contain a Hamiltonian path. Consequently we can execute our algorithm on all the networks which contain more than n processors. 4. SYSTEM ARCHITECTURE Spelling correction implies processing character strings which are not very long. The average length of the words found in the English dictionary is about eight characters. Only a few words are longer than 15 characters and require significant computational resources. For these exceptional words, the correction process can be done on conventional machines. A linear array of 15 processors seems a good compromise for this specific application. To be able to process strings of different lengths on a fixed 15 processor array, strings shorter than 15 characters are completed with special characters (zeros). Erroneous words are always corrected by words whose lengths are similar within a range of ±1 characters. The linear array only does comparisons between two strings. Each erroneous string must be compared with a large set of references. The data attached to the erroneous string are considered as constant values during the whole correction process. Consequently, to perform its local computation, a processor i needs to receive on each network cycle a distance from its neighbor i + 1 and a character xki corresponding to the ith character of the kth reference. Theoretically, the cost associated with insertion and deletion errors depends on the nature of the characters. In practice, they are identical and can be replaced by constant values. As the characters of the erroneous string do not move during a complete dictionary comparison, the memory size is then equal to two registers, one for the character of the erroneous string and other for the result of the comparison. The basic calculations are performed by the processors of the computation array zi = ai + zi−1 where zi−1 is the result using the neighboring processor and ai is the result of comparing the ith characters of the reference and erroneous strings. Figure 4 sketches the processing element architecture. The IN input stores data, z coming from the other processor into a dedicated I/O register file data; xik is the ith character of the reference string. LO is two registers of local memory, one for the ith character of the erroneous string and the other for the result of the comparison. As the arithmetic operations involved are very limited (only addition), one specific unit is implemented, namely an adder. COM is the comparison unit. The memory that holds the dictionary is divided into blocks. The words with the same length are in the same block. We use separate memory (blocks) for result. Figure 5 shows the system architecture. The processor counts the characters of the word. If the number of characters is more than 15, then the processor corrects the word sequentially. If the number of characters is less than 15, then the word is corrected by linear array. 972 S. FIDANOVA Figure 4. Internal processing element architecture Memory1 stores the vocabulary, and consists of 16 blocks. The first 15 blocks are blocks for words of length from 1 to 15 characters. The 16th block is for words with length more than 15. Memory2 stores the result set of the correct word. The processors of the array are activated by microcommands which control actions such as I/O and LO register selection. These microcommands are specified by an instruction which is received from the outside and decoded. Processor array and reference memory operate synchronously since they each execute one instruction every machine cycle. One instruction specifies actions to be realized concurrently on all these units. In that sense, the processor array can be considered to have an SIMD execution mode: all the cells are executing the same instruction at the same time. Figure 5. System architecture LINEAR ARRAY FOR SPELLING CORRECTION 973 5. CONCLUSION A linear array for spelling corrections is described in this paper. Compared to the Lavenier method[2] this method is faster and uses a smaller number of processors. We have studied the execution of spelling correction on the hypercube, de Bruijn, linear and grid networks. If there are real processors we can know the real time of the algorithm. This problem is very important in medicine for early diagnosis of some illnesses. VLSI (very large scale integration) technology permits construction of cheap specialized architectures. Our architectures are very regular and are extremely well suited to VLSI implementation. The regularity can be exploited to help the designer during the different design steps. The system architecture is highly suitable for direct implementation on VLSI chips. ACKNOWLEDGEMENT The author is financially supported by the BG MSHE YS25. REFERENCES 1. P. A. V. Hall and G. R. Dowling, ‘Approximate string matching’, Comput. Surv., 12, 381–402 (1980). 2. D. Lavenier, ‘An integrated 2D systolic array for spelling correction’, J. Integration, 15, 97–111 (1993). 3. M. Gokhale, B. Holmes, A. Kosper, D. Lopresti, S. Lucas, R. Nimich and P. Olsen, ‘SPLASH: A reconfigurable linear logic array’, Technical Report SRC–TR–90–012, Supercomputing Research Center, Maryland, USA, April 1990. 4. R. Lowarance and R. A. Wagner, ‘An extension of the string to string correction problem’, J. Assoc. Comput. Mach., 22, (2), 177–183 (1975). 5. J. P. Banatre, P. Frison and P. Quinton, ‘A network for the detection of words in continuous speech’, in J. P. Gray (Ed.), VLSI81 Int. Conf., Academic Press, 1981. 6. D. J. Burr, B. D. Ackland and N. Weste, ‘Array configurations for dynamic time warping’, IEEE Trans. Acoust. Speech Signal Process., ASSP–332, (1), 119–127 (1984). 7. F. Charot, P. Frison and P. Quinton, ‘Systolic architectures for connected speech recognition’, IEEE Trans., ASSP–34, (4), 765–779 (1986). 8. P. Frison and P. Quinton, ‘An integrated systolic machine for speech recognition’, in P. Beertolazzi and F. Luccio (Eds.), VLSI:Algorithms and Architectures, North-Holland, Amsterdam, 1985, pp. 175–186. 9. D. P. Lopresti, ‘P–NAC: A systolic array for comparing nucleic acid sequences’, Computer, 20, 98–99 (1987).

1/--страниц