close

Вход

Забыли?

вход по аккаунту

?

278

код для вставкиСкачать
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(10), 967–973 (OCTOBER 1997)
Linear array for spelling correction
STEFKA FIDANOVA
Center of Information & Computer Technology, Bulgarian Academy of Science, Acad. G. Bonchev str., bl.25A,
Sofia 1113, Bulgaria
SUMMARY
This paper introduces a linear array for spelling correction using 15 processors. Many architectures have been proposed to solve similar string correction problems such as speech recognition
or nucleic acid sequence computation. It is known that the hypercube, de Bruijn and grid networks contain a Hamiltonian path, a path which contains all the vertices of the network. The
execution time of spelling correction on all of these networks is equal. 1997 by John Wiley &
Sons, Ltd.
1. INTRODUCTION
The increasing use of computer systems for the handling of textual data has led to a great
deal of interest in software for the detection and correction of spelling errors[1]. As a matter
of fact, 80% of all errors found in English texts are of the following types[2]:
•
•
•
•
one letter is wrong
one letter is missing
one extra letter is inserted
two adjacent characters are transposed.
These errors are respectively referred to as substitution, deletion, insertion and transposition errors. These errors are mainly due to keyboard mistyping. Even detecting an
erroneous word is quite an easy task; proposing suitable corrections generally requires
complex processing.
The correction process consists in comparing the target word with words from a dictionary
and in keeping only those words which have the best likeness. The computational difficulty
lies in calculating this likeness in order to propose good corrections. The method we use to
determine likeness provides good results compared with other techniques[2–4].
The major drawback of the algorithm comes from the amount of calculation required.
With dictionaries of a few hundred thousand words, the computation time on conventional personal computers and on currently available workstations can be very long and
unacceptable for real-time spelling correction.
Many architectures have been proposed to solve similar string correction problems such
as speech recognition[5–8] or nucleic acid sequence comparison[9]. The result of the
comparison process is a sequence of words which seem probable candidates for the correct
word.
The hypercube, de Bruijn network and grids contains a Hamiltonian path (a path that
consists of all nodes only one time). We can execute spelling correction on all these
networks.
CCC 1040–3108/97/100967–07 $17.50
1997 by John Wiley & Sons, Ltd.
Received 20 January 1996
Revised 8 April 1996
968
S. FIDANOVA
1.1. Brief presentation of the topologies
• Linear array. The vertices of a linear network with length n are labeled by the
elements of the set {0, 1, . . . , n − 1} and there is an edge between vertex i and the
vertices (i − 1) and (i + 1) if these vertices exist.
• Hypercube. The d-dimensional hypercube, denoted H(d), is a graph of order 2d
whose vertex set consists of all binary words of length d. There exists an edge
between any two nodes that binary representation differs in exactly one position.
• De Bruijn. A de Bruijn graph of order 2d , denoted by B(2, d), is a digraph whose
vertex set is the set of all words of length d over {0, 1}. There is an arc from any
vertex x1 , x2 , . . . , xd to vertices x2 , x3 , . . . , xd , α for α equal to 0 or 1.
The paper is organized as follows: Section 2 presents the method used for spelling
correction and Section 3 presents its implementation on a linear array. Section 4 presents
the system architecture.
2. STRING COMPARISON ALGORITHM
The basis we use for comparing two strings of characters is a measure of distance between
these two strings. This distance is the number of differences between two strings.
Our algorithm is based on comparing a target string with strings from a dictionary.
The words in the dictionary are sorted into groups by length. Let z be a measure of the
differences that exist between two words. If n is the length of the target word and for some
word from the dictionary z = 0, then the target word is correct. If for all words with length
n z 6= 0, then only those for which z ≤ 2 are considered next. The valid word is from
this subset if the target word contains only one wrong letter or two adjacent characters are
transposed. Subsequently the target word is compared with words with length n − 1 and
n + 1. The valid word is from this subset if there is one missing letter or one extra letter
is inserted. The strings to be compared must have equal length. First the shorter string is
completed with a special character (zero) on the left side and the two strings are compared.
After that the shorter string is compared with the special character (zero)
on the right side
and the two strings are compared. Only those words for which z < n2 are considered
next. In this way it is unnecessary for all the words from the dictionary to be checked.
Comparing two words of n characters represents n steps. With a dictionary of p words and
where p1 , p2 , p3 are words with length n, n − 1, n + 1, respectively, the total amount of
calculation for a correction process is (p1 + 2p2 + 2p3 )n < 2np.
2.1. Pseudocode description of the algorithm
Let a = a(1) . . . a(n) and b = b(1) . . . b(n) be two strings:
s=0
1 s(i) = 0
if(a(i) = b(i)) then s(i) = 0
else s(i) = 1
s = s + s(i)
i=i+1
go to 1
LINEAR ARRAY FOR SPELLING CORRECTION
969
s is the distance between the strings a and b. Figure 1 is a graphical representation of the
computation of the difference from the test word syctolic and the reference word systolic.
Figure 1.
Computation of the difference between the words
The total edit distance is then equal to 1. There is one wrong letter.
The computation of distance requires one comparing and one addition, i.e. two elementary operations. Comparing two words of n characters represents 2n operations. With a
dictionary of p words the total amount of calculation for a correction process is equal to
2np operations.
The complete English dictionary represents about 700,000 words when different forms
are considered. If a word of ten characters has been detected as erroneous it will be
necessary to compare this string with all the words in the dictionary whose lengths are
nearly similar. Taking words ranging from eight to 12 characters implies comparing the
string with approximately 300,000 target strings. The total number of operations is then
equal to 10 million.
Real-time spelling correction is thus computionally too expensive for conventional machines, especially when large dictionaries are involved.
Similar string correction problems are speech recognition and nucleic acid sequence
comparison. In a nucleic acid sequence there are some million different strings. This
problem is very important in medicine for early diagnosis of some illnesses.
3. PARALLEL IMPLEMENTATION ON A LINEAR ARRAY
This section indicates how the algorithm described above can be implemented on a linear
array.
Let X = (x1 , x2 , . . . , xn ) and Y = (y1 , y2 , . . . , yn ) be two strings to compare. Consider
an array of n processors connected as indicated in Figure 2.
A single cell is shown in Figure 3.
Processor i reads data xi and yi at time step n − i, calculates value an , the elementary
distance calculation, reads data z produced by its neighbor processor i + 1, and propagates
970
S. FIDANOVA
Figure 2.
Linear array
its result z + an to processor i − 1. An array cycle consists of the elementary distance
calculation, that is to say the data acquisition, the comparison, the addition and the data
propagation.
On such an array, parallel execution of a two string comparison requires n cycles since
the difference z is updated by each processor and every processor can run concurrently.
Since a complete comparison lasts n cycles and since a processor is used for one cycle each
processor is available during the other cycles for other tasks. Thus it is possible to start a
new comparison on each cycle.
Pipelining the string comparison in this way allows one to increase considerably the
efficiency of the network since, once the array has been initialized, one comparison result
is produced every cycle. The speed-up in comparison with a sequential machine is given
by
(p1 + 2p2 + 2p3 )n
≈n
Γ=
n + p1 + 2p2 + 2p3
with p n.
Compared to the Lavenier’s method described in [2] this method is faster and uses a
smaller number of processors. The last approximation is valid in the applications since the
number of references can reach hundreds of thousands of items as compared to a maximum
of 20 characters in a string.
Figure 3. Organization of the cell
LINEAR ARRAY FOR SPELLING CORRECTION
971
Our algorithm uses n + p1 + 2p2 + 2p3 < n + p time steps on n processors. Lavenier’s
algorithm[2] uses 2n + p − 2 time steps on n2 processors. One time step consists of
two elementary operations in our algorithm and nine elementary operations in Lavenier’s
algorithm.
It is known that the interconnection networks as rings, de Bruijn networks, grids and
hypercubes contain a Hamiltonian path. Consequently we can execute our algorithm on all
the networks which contain more than n processors.
4. SYSTEM ARCHITECTURE
Spelling correction implies processing character strings which are not very long. The
average length of the words found in the English dictionary is about eight characters. Only
a few words are longer than 15 characters and require significant computational resources.
For these exceptional words, the correction process can be done on conventional machines.
A linear array of 15 processors seems a good compromise for this specific application.
To be able to process strings of different lengths on a fixed 15 processor array, strings
shorter than 15 characters are completed with special characters (zeros). Erroneous words
are always corrected by words whose lengths are similar within a range of ±1 characters.
The linear array only does comparisons between two strings.
Each erroneous string must be compared with a large set of references. The data attached
to the erroneous string are considered as constant values during the whole correction
process. Consequently, to perform its local computation, a processor i needs to receive on
each network cycle a distance from its neighbor i + 1 and a character xki corresponding to
the ith character of the kth reference.
Theoretically, the cost associated with insertion and deletion errors depends on the nature
of the characters. In practice, they are identical and can be replaced by constant values. As
the characters of the erroneous string do not move during a complete dictionary comparison,
the memory size is then equal to two registers, one for the character of the erroneous string
and other for the result of the comparison.
The basic calculations are performed by the processors of the computation array
zi = ai + zi−1
where zi−1 is the result using the neighboring processor and ai is the result of comparing
the ith characters of the reference and erroneous strings.
Figure 4 sketches the processing element architecture. The IN input stores data, z coming
from the other processor into a dedicated I/O register file data; xik is the ith character of
the reference string. LO is two registers of local memory, one for the ith character of the
erroneous string and the other for the result of the comparison. As the arithmetic operations
involved are very limited (only addition), one specific unit is implemented, namely an
adder. COM is the comparison unit.
The memory that holds the dictionary is divided into blocks. The words with the same
length are in the same block. We use separate memory (blocks) for result. Figure 5 shows
the system architecture.
The processor counts the characters of the word. If the number of characters is more
than 15, then the processor corrects the word sequentially. If the number of characters is
less than 15, then the word is corrected by linear array.
972
S. FIDANOVA
Figure 4.
Internal processing element architecture
Memory1 stores the vocabulary, and consists of 16 blocks. The first 15 blocks are blocks
for words of length from 1 to 15 characters. The 16th block is for words with length more
than 15. Memory2 stores the result set of the correct word.
The processors of the array are activated by microcommands which control actions such
as I/O and LO register selection. These microcommands are specified by an instruction
which is received from the outside and decoded.
Processor array and reference memory operate synchronously since they each execute
one instruction every machine cycle. One instruction specifies actions to be realized concurrently on all these units. In that sense, the processor array can be considered to have an
SIMD execution mode: all the cells are executing the same instruction at the same time.
Figure 5.
System architecture
LINEAR ARRAY FOR SPELLING CORRECTION
973
5. CONCLUSION
A linear array for spelling corrections is described in this paper. Compared to the Lavenier
method[2] this method is faster and uses a smaller number of processors. We have studied
the execution of spelling correction on the hypercube, de Bruijn, linear and grid networks.
If there are real processors we can know the real time of the algorithm. This problem is
very important in medicine for early diagnosis of some illnesses.
VLSI (very large scale integration) technology permits construction of cheap specialized
architectures. Our architectures are very regular and are extremely well suited to VLSI
implementation. The regularity can be exploited to help the designer during the different
design steps. The system architecture is highly suitable for direct implementation on VLSI
chips.
ACKNOWLEDGEMENT
The author is financially supported by the BG MSHE YS25.
REFERENCES
1. P. A. V. Hall and G. R. Dowling, ‘Approximate string matching’, Comput. Surv., 12, 381–402
(1980).
2. D. Lavenier, ‘An integrated 2D systolic array for spelling correction’, J. Integration, 15, 97–111
(1993).
3. M. Gokhale, B. Holmes, A. Kosper, D. Lopresti, S. Lucas, R. Nimich and P. Olsen, ‘SPLASH: A
reconfigurable linear logic array’, Technical Report SRC–TR–90–012, Supercomputing Research
Center, Maryland, USA, April 1990.
4. R. Lowarance and R. A. Wagner, ‘An extension of the string to string correction problem’, J.
Assoc. Comput. Mach., 22, (2), 177–183 (1975).
5. J. P. Banatre, P. Frison and P. Quinton, ‘A network for the detection of words in continuous
speech’, in J. P. Gray (Ed.), VLSI81 Int. Conf., Academic Press, 1981.
6. D. J. Burr, B. D. Ackland and N. Weste, ‘Array configurations for dynamic time warping’, IEEE
Trans. Acoust. Speech Signal Process., ASSP–332, (1), 119–127 (1984).
7. F. Charot, P. Frison and P. Quinton, ‘Systolic architectures for connected speech recognition’,
IEEE Trans., ASSP–34, (4), 765–779 (1986).
8. P. Frison and P. Quinton, ‘An integrated systolic machine for speech recognition’, in P. Beertolazzi and F. Luccio (Eds.), VLSI:Algorithms and Architectures, North-Holland, Amsterdam,
1985, pp. 175–186.
9. D. P. Lopresti, ‘P–NAC: A systolic array for comparing nucleic acid sequences’, Computer, 20,
98–99 (1987).
Документ
Категория
Без категории
Просмотров
1
Размер файла
76 Кб
Теги
278
1/--страниц
Пожаловаться на содержимое документа