INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS, VOL. 24, 93-109 (1996)
A CNN UNIVERSAL CHIP IN CMOS TECHNOLOGY†
S. ESPEJO, R. CARMONA, R. DOMÍNGUEZ-CASTRO AND A. RODRÍGUEZ-VÁZQUEZ
Centro Nacional de Microelectrónica, Universidad de Sevilla, Edificio CICA, C/Tarfia s/n, E-41012 Sevilla, Spain
SUMMARY
This paper describes the design of a programmable cellular neural network (CNN) chip with added functionalities
similar to those of the CNN universal machine. The prototype contains 1024 cells and has been designed in a 1.0 μm,
n-well CMOS technology. Careful selection of the topology and design parameters has resulted in a cell density of
31 cells mm^-2 and around 7-8 bits accuracy in the weight values. Adaptive techniques have been employed to ensure
accurate external control and system robustness against process parameter variations.
1. INTRODUCTION
Massively parallel analogue processing systems are natural candidates in those application fields where
nature has proven their outstanding capabilities.[1,2] In particular, the processing front-end of biological
vision systems (retina) has inspired the development of highly parallel computation algorithms,
representing a potential breakthrough in artificial vision applications. However, the practical use of these
algorithms is conditional on their efficient and feasible physical implementation. In this context the cellular
neural network (CNN) paradigm represents an interesting alternative, with a demonstrated wide range
of applications and particularly well suited for monolithic IC realizations. Also, the recent extension
towards the definition of a programmable analogic array computer, the CNN universal machine, has
opened up many new application fields which can be handled through spatial and temporal task
sequencing controlled by a stored programme.[10]
A key feature of CNNs is their potential for high operation speed in the processing of array signals.
However, this does not manifest itself if the CNNs are realized in the form of software on conventional
computers, but only if they are realized as VLSI chips. The few CNN chips that have been reported so far
in the literature differ in their complexity and functionality. Most of them include only a limited number of
cells, about 30,[11-14] while a few feature complexity levels up to 1024 cells. These chips also vary
significantly in their functional capabilities: some of them have fixed function, while others are
electrically controllable. Recently, Harrer and Nossek reported a chip with some of the functionalities
of a CNN universal machine, based on a discrete-time CNN model. This paper reports the first CNN
universal machine chip based on continuous-time CNN models, and describes the major trends, obstacles
and solutions adopted after careful analysis of many alternatives. An additional feature of the reported chip
is its capability of being initialized either in electrical or optical form.
Section 2 contains general considerations about the mathematical CNN model used in this prototype,
implementation trends and extended capabilities towards the CNN universal machine. Section 3 is dedicated
to the cell architecture, from the selected synapse topology to the operating mode of the analogue and
digital circuitry, while Section 4 describes the system architecture and control strategies. Finally, Section 5
presents some examples of experimental and simulated results and Section 6 provides a summary of the
paper.
† Part of this research has been reported in the Proceedings of the 1994 IEEE International Workshop on Cellular Neural Networks
and Their Applications held in Rome.
CCC 0098-9886/96/010093-17
© 1996 by John Wiley & Sons, Ltd.
Received 15 January 1995
Revised 6 April 1995
2. GENERAL CONSIDERATIONS
2.1. Mathematical model
The presented chip, whose microphotograph is shown in Figure 1, uses the so-called full signal range
(FSR) CNN model, with the non-linear loss term
$$ g(x) = \begin{cases} m(x+1) - 1, & x < -1 \\ x, & |x| \le 1 \\ m(x-1) + 1, & x > 1 \end{cases} \qquad (1) $$
first proposed by the authors in Reference 14 and discussed in detail in Reference 18. Cell state variables
x^i(t) in this model are restricted to the same variation interval as input and output variables, [-1, 1] (after
normalization). This differs significantly from Chua and Yang's model, where the interval for state
variables is larger and its amplitude varies with the values of the templates. In the new model the state and
output variables of each cell merge for all practical purposes, the block that realizes the output non-linearity is discarded and the complexity of the associated circuitry decreases. Consequently, it yields larger
cell density and smaller power consumption than the original model, both prerequisites to increase complexity
in the production of CNN chips. In addition, it is insensitive to uniform variations in CNN coefficients, an
important feature in a VLSI context, where process parameters are always subject to variations. Finally, for
a given time constant the processing speed is approximately doubled in common applications.
Figure 1. Photomicrograph of the complete chip
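The effect of the loss term in equation (1) can be illustrated numerically. The following Python sketch integrates one isolated cell with forward Euler, assuming a state equation of the form tau*dx/dt = -g(x) + a*x + d. That form, the function names and every numerical value (slope m = 50, time step, self-feedback weight) are assumptions chosen for illustration and are not taken from the chip.

```python
def fsr_loss(x, m=50.0):
    """Piecewise-linear loss term g(x) of equation (1) with a finite slope m."""
    if x > 1.0:
        return m * (x - 1.0) + 1.0
    if x < -1.0:
        return m * (x + 1.0) - 1.0
    return x


def simulate_cell(x0, a_self, offset=0.0, tau=1.0, dt=0.01, steps=2000, m=50.0):
    """Forward-Euler integration of one isolated cell (assumed state equation).

    tau * dx/dt = -g(x) + a_self * x + offset
    """
    x = x0
    for _ in range(steps):
        dxdt = (-fsr_loss(x, m) + a_self * x + offset) / tau
        x += dt * dxdt
    return x


if __name__ == "__main__":
    # A small positive initial state grows and is held just above +1.
    print(simulate_cell(0.1, a_self=2.0))
    # A small negative initial state ends just below -1.
    print(simulate_cell(-0.1, a_self=2.0))
```

With a self-feedback weight of 2 the state settles at (m-1)/(m-2), slightly outside the unit interval; as m grows this tends to the saturation level, which is why state and output variables coincide in practice.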
2.2. Implementation trends
Most problems found in the electronic implementation of CNN systems relate to the high number of
pixels (cells) required in practical applications, in particular for image processing. Given the limitations on
chip dimensions (about 1 cm x 1 cm) imposed by technological and economical constraints (mainly the
yield of the process), the high cell count translates into a requirement of high area efficiency in the cell
circuitry design. Cell power dissipation must also be kept within some limits, imposed either by the thermal
properties of the package or by battery life specifications in the case of portable systems.
Another important objective is to maximize the accuracy in the emulation of the mathematical CNN
model. This consideration is particularly important considering the strong area-efficiency requirement,
owing to the inverse relationship between mismatch errors and device area. Although a compromise must
finally be accepted between the two objectives, the achievable error and area figures depend strongly on the
particular circuit blocks employed in the implementation. For this reason a detailed analysis of mismatch
effects as a function of area for alternative circuit structures was necessary.
Robustness of system behaviour against process parameter variations is also a necessary condition,
particularly when using standard digital technologies for which process parameter windows are commonly
large.
Finally, the control of the analogue chip must be robust and as simple as possible. External setting of
internal analogue parameters required to programme a particular application must be accurate and simple
and hence independent of process parameter values.
2.3. Functionality and characteristics
The functionality of the chip is similar to that of a CNN universal chip. Its central capability is that of
a completely programmable CNN: controllable feedback, control and offset coefficients. In addition, every
cell is equipped with a programmable digital gate, digital memory and other I/O and control circuitry as
shown in Figure 2.
The CNN network is constructed on a rectangular grid, is assumed uniform and its neighbourhood radius
is always unity, i.e. connectivity between cells is limited to adjacent neighbours in vertical, horizontal and
diagonal directions.
Figure 2. Schematic architecture of one cell of the prototype (labelled blocks: programmable CNN cell, sensor, LLM, logic unit, multiplexer; labelled control signals: APR, LPR, CNN control, LLM-address control, LLM-I/O)
Strong emphasis has been placed on low area and power consumption in the implementation of the cells,
as well as on accuracy. Extensive analysis has been performed to optimize the architecture and basic
building blocks against statistical on-die parameter variations (mismatch). Strong effort is made to obtain
robustness against wafer-level parameter variations. For this purpose many of the internal variables are
automatically tuned on chip.
The final topology forecasts weight and offset accuracies in the range of 7-8 bits, while area
requirements allow a complete 32 x 32 cell system to be integrated in a 7.7 mm x 6.8 mm prototype using a
1.0 μm, n-well, digital technology. In particular, cell area is about 171 μm x 187 μm. Array dimensions are
5.5 mm x 6.0 mm. The remaining area is dedicated to boundary cells, weight adaptation stages, memories
for template coefficients and local-logic-unit truth tables, adaptive bias stages, I/O circuitry and bonding
pads.
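The area figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the cell density and the array dimensions from the stated cell size; the variable and function names are illustrative, and which of the two cell dimensions is taken as the width is an assumption.

```python
CELL_WIDTH_UM = 171.0    # cell size quoted in the text (orientation assumed)
CELL_HEIGHT_UM = 187.0
CELLS_PER_SIDE = 32      # 32 x 32 array
DIE_MM = (7.7, 6.8)      # prototype dimensions quoted in the text


def cell_density_per_mm2(width_um, height_um):
    """Number of cells that fit in one square millimetre."""
    area_mm2 = (width_um / 1000.0) * (height_um / 1000.0)
    return 1.0 / area_mm2


def array_size_mm(width_um, height_um, cells_per_side):
    """Width and height of the square cell array in millimetres."""
    return (cells_per_side * width_um / 1000.0,
            cells_per_side * height_um / 1000.0)


density = cell_density_per_mm2(CELL_WIDTH_UM, CELL_HEIGHT_UM)
array_w, array_h = array_size_mm(CELL_WIDTH_UM, CELL_HEIGHT_UM, CELLS_PER_SIDE)
array_fraction = (array_w * array_h) / (DIE_MM[0] * DIE_MM[1])

if __name__ == "__main__":
    print(f"density: {density:.1f} cells/mm^2")
    print(f"array: {array_w:.2f} mm x {array_h:.2f} mm")
    print(f"array share of die area: {array_fraction:.0%}")
```

The result agrees with the stated density of about 31 cells per square millimetre and with array dimensions of roughly 5.5 mm x 6.0 mm.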
Chip management is completely digital, facilitating its control and communications. No external
analogue signals are required for chip control or references. Analogue weights are specified and stored in
digital form. For each template value, an adaptive stage transforms the digital code to an analogue voltage,
which is then transmitted to the network. This methodology results in weight independence of particular
process parameter values, as well as accurate external control. The quantization error on the coefficients is
lower than the expected statistical error of the analogue multipliers. Hence it is not relevant.
Every cell includes a photosensitive device, which allows the system to be optically initialized. These
devices are CMOS compatible and incorporate a tuning scheme for automatic adaptation to different
illumination conditions.[20] Electrical initialization is also possible, while the output image is always
downloaded in electrical form. Input and output images are assumed to be binary in every case. Electrical
image uploading and downloading are realized row by row through 32 I/O bonding pads.
The digital circuitry at each cell includes a 4 bit static memory (LLM), a completely programmable two-input digital gate (LLU) and initialization and control circuitry for many different operations. The 4 bit
memory at each cell allows the network to store four complete images. Two additional 'read-only
memories' with fixed +1 and -1 values are also available.
In addition to the CNN processing capability and the programmable digital gate, every cell can perform
the following operations, which are generally realized simultaneously at every cell in the array:
(a) storing an electrically loaded image into a specific memory location; this operation is realized
sequentially, row by row
(b) storing an optically captured image into a specific memory location
(c) storing the result y(∞) of a CNN process into a specific memory location
(d) storing the result of a particular digital operation between two images into a specific memory location
(e) moving the content of one memory location to another
(f) using a specific memory location as input u of the cell
(g) using a specific memory location as initial condition x(0) of the cell
(h) downloading a particular memory location through the I/O bonding pads; this operation is realized
sequentially, row by row
The programme control functionality is based on a large static digital memory located at the periphery of
the cell array. This memory is used to store up to eight complete sets of coefficients. Each set specifies all
the parameters required to define the CNN operation (APR) and the truth table of the programmable digital
gate (LPR). The content of each set of coefficients is described in Table I.
The combined APR and LPR memory can be viewed as a digital memory of eight words of 160 bits
(each describing a complete set of coefficients). The loading process is performed through an 8 bit data bus
on a sequential schedule controlled by two complementary address buses, one indicating the particular 8 bit
data being loaded and the other indicating the set of coefficients. After the internal memory has been
loaded, the eight sets of coefficients can be used in any order, any number of times.
The state and input values of boundary cells and the truth table of the local logic unit are grouped in a
single 8 bit word. Hence the memory can be viewed as a group of 20 blocks of 8 x 8 bit RAM memories.
Each of the 20 blocks corresponds to a particular coefficient and contains eight different values of 8 bits.
Each different value corresponds to a different 'set' of coefficients.
Table I. Information content of one set of coefficients in the joined APR and LPR static memory
Data description        Symbol  Count  Bits  Values     Increment
Feedback coefficients   a       9      8
Control coefficients    b       9      8
Offset term             d       1      8
Boundary cell state     x_b     1      2     -1, 0, +1
Boundary cell input     u_b     1      2     -1, 0, +1
LLU truth table         TT      1      4
3. CELL CIRCUITRY
3.1. Multiplier selection
Programmable scaling blocks or multipliers are the key block in a programmable CNN cell, owing to the
large number of them required (one per coefficient).
We have selected the multiplier shown in Figure 3.[21] This structure is a fully differential, four-quadrant
multiplier with high linearity and presents an excellent area/accuracy figure. The four transistors are
identical and operate in the triode (ohmic) region. Nodes I_op and I_on are connected to the differential input
of the cells in the neighbourhood. The state variable of the cell is represented differentially by signals V_xp
and V_xn. The differential weight signal is composed of signals V_pp and V_pn. The scaled signal, in current
form, is differential as well and given by the difference of the currents flowing out of the multiplier from
nodes I_op and I_on. Analysis of the structure in Figure 3 yields
$$ I_{op} - I_{on} = 2\beta (V_{pp} - V_{pn})(V_{xp} - V_{xn}) \qquad (2) $$
where β is the large-signal transconductance of one transistor (β = μ_0 C_ox W/2L).
Note that all transistors are n-channel, resulting in high area efficiency (no wells required). Both input
signals (state variable and weight) are given as voltages, while output is represented as current. These
characteristics facilitate the following three operations: (a) the distribution of the state variable of each cell
to the multipliers used to implement the different coefficients (one per neighbour); (b) the distribution of
the programmed coefficient values to every cell in the array (weights are invariant from cell to cell, since
the network is uniform); (c) the summation of the contributions coming into every cell from the different
cells in the neighbourhood.
The high input impedance of the state variable input nodes of the multiplier avoids the use of on-cell
low-impedance buffers. On the other hand, the weight signals must be driven by buffers with an extremely
low output impedance, since the already low input impedances of a high number of multipliers (one per
cell) are connected in parallel. The input nodes of the cells must have an extremely low input impedance,
since the output impedance of the multipliers is low and a number of multipliers (one per cell in the
neighbourhood) are also connected in parallel at the input node of every cell.
Figure 3. Analogue four-quadrant multiplier used in the prototype
The common mode current at the output of the multiplier is given by
$$ I_{op} + I_{on} = -\beta (V_{pp} - V_{pn})^2 \qquad (3) $$
which is small and depends not on the state variable but only on the weight value. This allows elimination
of the common mode signal in a preprocessing step, storing its value in analogue memories and avoids the
use of continuous-time common mode feedback circuitry within the cell.
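Equations (2) and (3) can be checked against a simple transistor-level model. The Python sketch below is a minimal illustration under several assumptions that are not stated in the text: a square-law triode model, a topology in which the state voltages drive the gates, the weight voltages drive the sources and all four drains sit at the same cell input voltage, and weight voltages balanced around that voltage. All names and numerical values are hypothetical.

```python
def triode_current(beta, v_gate, v_source, v_drain, v_t):
    """Drain-to-source current of an n-channel device in triode (square-law model)."""
    return beta * ((v_gate - v_t - v_source) ** 2 - (v_gate - v_t - v_drain) ** 2)


def multiplier_outputs(beta, v_xp, v_xn, v_pp, v_pn, v_node, v_t):
    """Currents flowing out of the multiplier at its two output nodes.

    Assumed topology: each output node collects two drains; gates are driven
    by the state voltages and sources by the weight voltages.
    """
    i_op = -(triode_current(beta, v_xp, v_pp, v_node, v_t)
             + triode_current(beta, v_xn, v_pn, v_node, v_t))
    i_on = -(triode_current(beta, v_xn, v_pp, v_node, v_t)
             + triode_current(beta, v_xp, v_pn, v_node, v_t))
    return i_op, i_on


# Illustrative operating point (all values are assumptions).
BETA, V_T, V_NODE = 2e-5, 0.8, 1.0
V_PP, V_PN = 1.1, 0.9          # weight voltages, balanced around V_NODE
V_XP, V_XN = 3.2, 2.8          # differential state variable

i_op, i_on = multiplier_outputs(BETA, V_XP, V_XN, V_PP, V_PN, V_NODE, V_T)
differential = i_op - i_on     # compare with equation (2)
common_mode = i_op + i_on      # compare with equation (3)

if __name__ == "__main__":
    print(differential, 2 * BETA * (V_PP - V_PN) * (V_XP - V_XN))
    print(common_mode, -BETA * (V_PP - V_PN) ** 2)
```

Under these assumptions the differential current is bilinear in the weight and state signals, while the sum of the two currents changes only with the weight, which is the property exploited to remove the common mode in a preprocessing step.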
The input capacitance at the state variable nodes of the multiplier can be used as integrating capacitors,
as explained later.
3.2. Cell architecture and operating mode: analogue circuitry
Figure 4 shows the analogue core of the cell, excluding the multipliers. The architecture is fully
differential. Contributions from all neighbours are added at the two low-impedance current input nodes of the cell. This is
accomplished by connecting the output nodes I_op and I_on of the corresponding multipliers in the
neighbouring cells to these two input nodes, respectively. Correspondingly, multipliers implementing the scaled
replicas of the state variable of this cell, with output directed towards the cells in the neighbourhood, have
their input nodes V_xp and V_xn connected to the nodes with the same name in Figure 4, whose voltages
represent the differential state variable of the cell.
Switches labelled Lo are used to set the cell in open- or closed-loop configuration. That is, while the
signal Lo is high (switches are off), no dynamic evolution occurs.
Diodes in the figure are actually realized by a p-n-p vertical device and a diode-connected n-channel
transistor as shown in Figure 5. Their function is to implement the outer segments of the non-linear loss
term in equation (1).[22] The central piece of equation (1) is realized by redefining the central feedback
coefficient a^i_i as a^i_i - 1. Each rail of the state variable is limited to the interval [V_ref - V_Tn(V_ref), V_ref + V_EB],
where V_EB is the 'on-voltage' of the emitter-base junction of the vertical p-n-p device and V_Tn(V_ref) is the
threshold voltage of the diode-connected n-channel transistor. The aspect ratio of this device is chosen to
be very high to obtain a sharp non-linearity.
Integrating capacitors C at nodes V_xp and V_xn are realized by parasitic capacitors, corresponding to the
sum of the gate capacitance of the multipliers, as mentioned earlier. Since all the transistors in the
multipliers are permanently in strong inversion, this capacitance is quite linear. In addition, area efficiency
is higher than with the usual two-poly structures in analogue technologies.
Figure 4. Schematic diagram of the analogue core of the cell. Multipliers are not included
Figure 5. Diodes and feedthrough compensation capacitor realization in Figure 4
Only nine analogue multipliers (not shown in Figure 4) are included in each cell. These multipliers
generate the nine scaled contributions (one for each neighbour) of the cell input or state variable signal. In
a first step the control contributions are generated by connecting the multipliers' input nodes to a pair of
voltage levels representing the input value u^i of the cell and setting the weight signals to the values
corresponding to the control template. In this manner, each cell receives the sum of the control
contributions from its neighbourhood and stores it in an analogue memory. In a second step the same
multipliers are used to generate the feedback contributions. Weight signals are set to the values
corresponding to the feedback template, initial state values x^i(0) are introduced in the integrating capacitors
and the network is allowed to perform its dynamic evolution. This strategy halves the number of multipliers
in every cell. Furthermore, the input-stage offset is stored together with the control contributions and
cancelled during the dynamic evolution. A structure similar to that of the multiplier is used to generate the
offset term contribution.
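The two-step reuse of the nine multipliers can be mimicked in software. The Python sketch below is a behavioural illustration only: it first evaluates and stores the control contributions, then reuses the same neighbourhood-sum routine with the feedback template during a forward-Euler evolution. The hard clamp of the state to [-1, 1] (the large-slope limit of the loss term), the unit time constant, the boundary value, the template orientation and all names are assumptions, not details of the circuit.

```python
def neighbourhood_sum(template, image, boundary=0.0):
    """Weighted sum over the 3 x 3 neighbourhood of every cell.

    Plays the role of the nine multipliers; cells outside the array take the
    value `boundary`.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    inside = 0 <= ii < rows and 0 <= jj < cols
                    value = image[ii][jj] if inside else boundary
                    acc += template[di + 1][dj + 1] * value
            out[i][j] = acc
    return out


def run_cnn(a_template, b_template, offset, u, x0, dt=0.05, steps=400):
    """Two-step operation: store control terms, then evolve with feedback."""
    # Step 1: weights set to the control template; the result is memorized.
    stored = neighbourhood_sum(b_template, u)
    x = [row[:] for row in x0]
    # Step 2: the same routine is reused with the feedback template.
    for _ in range(steps):
        feedback = neighbourhood_sum(a_template, x)
        for i in range(len(x)):
            for j in range(len(x[0])):
                dxdt = -x[i][j] + feedback[i][j] + stored[i][j] + offset
                # Hard limiter: assumed large-slope limit of the loss term.
                x[i][j] = max(-1.0, min(1.0, x[i][j] + dt * dxdt))
    return x


if __name__ == "__main__":
    a = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]   # self-feedback only
    b = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # each cell sees its own input
    u = [[1.0, -1.0], [-1.0, 1.0]]
    x0 = [[0.0, 0.0], [0.0, 0.0]]
    print(run_cnn(a, b, 0.0, u, x0))
```

In this toy case every cell saturates to the sign of its own input; the point of the sketch is only that one set of weighted sums serves both the control and the feedback template.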
The required analogue memories are implemented by the p-channel transistors in Figure 4 with gates
driven by switches labelled A,,,. The n-channel transistors located in the lower part of the figure provide the
necessary bias shifting to allow the memories to store either positive or negative currents. Most of the
remaining circuitry in Figure 4 is used to eliminate the common mode component of the differential
current. An adaptive feedthrough cancellation technique is used which involves the capacitor C,,,. This
capacitor is implemented by a shorted p-channel transistor as shown in Figure 5.
For the sake of clarity, let us describe a typical analogue operation of the cell, step by step:
1. Set the state variable (nodes V_xp and V_xn) of every cell i to the saturated value (binary images are
assumed) corresponding to the cell input u^i. This is done by the inverters driven by local data lines
(see Figure 6 below) D and D̄ and through the switches labelled R at the output of these inverters in
Figure 4. Excess current from the inverters is taken care of by the voltage limiter.
2. Connect the control template weight voltages to the nine multipliers in every cell. This step is
performed simultaneously with the previous one. During this process, signals A,, c,, and R are low,
while c, and Lo are high. This part of the process ends with the rising edge of A,, which results in
storing the control contributions from the cells in the neighbourhood in the analogue memories. A
sign inversion is present in this process, which is internally solved using opposite-sign control weights
at every cell. A significant feedthrough error from the memories will be present. However, its
dominant component corresponds to a common mode error in the stored differential current. This is
cancelled in next steps.
3. Set the state variable (nodes V_xp and V_xn) of every cell i to the saturated value (binary images are
assumed) corresponding to the initial conditions x^i(0). This is again done by the inverters driven by D
and D̄.
4. Connect the feedback template weight voltages to the nine multipliers in every cell. This step is
performed simultaneously with the previous one.
5. This step begins with the rising edge of c,, and the falling of c,. As a result, the common mode of
the currents going into the cell, which corresponds to the common mode of the control contributions
(stored in the memories), plus the feedthrough error of the memories, plus the common mode of the
feedback contributions at the initial point, flows through the two p-channel transistors enabled by the
switch controlled by cme,which are 'diode connected' by the switches controlled by c,. An additional
bias shifting is provided by the n-channel transistors connected through the n-channel switches
controlled by cme.
6. After a short settling time, signal c, rises, storing the common mode signal in the associated p-channel transistors. Introduced feedthrough error represents a significant common mode error. Unlike
in fully differential linear structures, common mode errors are significant owing to the non-linear
nature of the system, since the limits of the state variable signal range do not move with the common
mode voltage. The rather particular switching configuration, involving three switches, results in strong
feedthrough attenuation. Nevertheless, the remnant error levels are still the limiting factor for system
accuracy. Thus an identical stage is replicated out of the cell array, which reproduces this feedthrough
error and, using a common mode adaptive loop, modifies the voltage at the gate of the p-channel
transistors through capacitor C,,,. Even though different cells have different control and feedback
contributions owing to different input u^i and initial conditions x^i(0), this is irrelevant because the
common mode of the multipliers being used depends only on the weight values (uniform throughout
the array) and not on the signal levels. This scheme results in an extremely accurate cancellation of
the common mode.
7. Signal Lo falls, closing the loop of the cells. However, the network remains 'tied' to the initial
conditions until signal R rises in the next step. This is done to avoid charge redistribution between
the nodes at the sides of the Lo switches, which would result in a large error in the initial conditions.
This is due to two facts. First, after the switches controlled by c, are turned off, the impedance at
these nodes is high and hence the voltage at these nodes settles close to the bias supplies. Second, the
capacitance at the drain of the p-channel transistors is quite high, since these transistors must necessarily be wide due to the high current levels that may flow into the cell under some combinations of
weights and signals. The technique described at the beginning of the paragraph avoids this problem.
8. The final step consists of raising signal R, which releases the network, allowing its evolution.
3.3. Cell architecture and operating mode: digital circuitry
The digital circuitry of the cell is composed of three major blocks, shown in Figures 6-8.
Figure 6 shows the local logic memory. It consists of four flip-flops, used as a 4 bit memory and has the
combined functionality of the LLM and LAM of a general CNN universal machine. A 'strong inverter'
driven by signal A, is used to force the 'weak' inverters in the flip-flops. Any of the four bits can be
addressed at any time using global signals m1, m2, m3 and m4. Signals W and R(i)† are used to control the
writing procedure, which can be time multiplexed row by row when an image is being loaded electrically or
completely parallel if it is being loaded in optical form. Signals D and D̄ are used to initialize the cells (see
Figure 4) and are obtained from one of the flip-flops or directly from the power supplies when a uniform
+1 or -1 input or initial condition is desired (signals m+ and m-). Using the parasitic capacitance in the
initialization circuitry as an intermediate memory, the content of a given flip-flop or a uniform +1 or -1
can be moved to any memory location.
† In what follows, i and j are used to denote rows and columns, respectively.
Figure 6. Local logic memory: merged functionality of LLM and LAM
Figure 7. Local logic unit: LLU
Figure 8. Local communication and control unit: LCCU
Figure 7 illustrates the implementation of the local logic unit (LLU), which is a two-input digital
operator with fully programmable truth table, transmitted through global signals D00, D01, D10 and D11. The
two input signals of the LLU are always taken from memories 1 and 2.
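A two-input gate with a programmable truth table is easy to model in software. The Python sketch below treats the four global signals as a lookup table indexed by the two memory bits and applies it to every cell of two binary images. The helper names and the convention that the first index refers to memory 1 and the second to memory 2 are assumptions made for illustration.

```python
def make_truth_table(d00, d01, d10, d11):
    """Build the lookup table from the four programmable output bits."""
    return {(0, 0): d00, (0, 1): d01, (1, 0): d10, (1, 1): d11}


def local_logic_unit(truth_table, bit1, bit2):
    """Output of one cell's gate for the bits held in memories 1 and 2."""
    return truth_table[(bit1, bit2)]


def apply_to_images(truth_table, image1, image2):
    """Apply the same gate at every cell, as the array does in parallel."""
    return [[local_logic_unit(truth_table, p, q) for p, q in zip(row1, row2)]
            for row1, row2 in zip(image1, image2)]


AND_TABLE = make_truth_table(0, 0, 0, 1)
XOR_TABLE = make_truth_table(0, 1, 1, 0)

if __name__ == "__main__":
    first = [[0, 1], [1, 1]]
    second = [[1, 1], [0, 1]]
    print(apply_to_images(AND_TABLE, first, second))
    print(apply_to_images(XOR_TABLE, first, second))
```

Because the table has four independent entries, any of the sixteen Boolean functions of two inputs can be programmed this way.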
Figure 8 describes the local communication and control unit (LCCU). It is a digital multiplexer which
selects the origin of the signal to be stored in one of the digital memories. Possible sources are the
photosensor (F), the external I/O signal (EIO(j)) corresponding to a particular cell column j, the cell
output obtained from node V_xn with the help of a simple inverter (a bipolar to unipolar converter, B-U)
and the output of the logic unit (LO). In addition, by selecting two signal paths simultaneously, the
multiplexer permits downloading of the cell output (E and C) on a row-by-row schedule, as well as reading
of the output of the photosensors (E and F) or the local logic unit (E and L). The content of any memory
location can also be read by first using it as an initial cell condition. In this manner the output of the cell
contains the memory information (before the evolution is allowed) readable from UC.
Figure 9. Photomicrograph of one cell of the FSR CNN universal chip
This concludes a brief description of the cell architecture and its operating mode. A final observation
concerns the lay-out strategy. Over 50 global signals must be connected to every cell, including weight
voltages, biasing, analogue adaptive signals, digital signals and control. For this reason, cell routing is
realized using only polysilicon, diffusion and the first metal layer. Global signals are transmitted using the
second metal layer over the transistors and local routing area. This approach results in an area saving which
can be estimated from the fact that almost the whole surface of the cell is covered by metal 2 global lines,
as shown in Figure 9.
4. SYSTEM ARCHITECTURE AND CONTROL
4.1. Architecture
The system architecture is represented schematically in Figure 10. As mentioned previously, system
operation relies on a number of adaptive stages to tune electrical variables, compensate for inaccuracies
and perform automatic weight adjustment.
Bias and tuning stages generating analogue reference voltages are located at every corner of the chip
area and interconnected. This is intended to produce biasing levels which correspond to the average value
of the electrical parameters in the die. This is an important consideration given the large area of the
prototype and the unavoidable spatial variation of the process parameters. Other miscellaneous circuitry,
such as digital gates used to generate some control signals and the feedthrough adaptive stage, are located
on the left side of the array.
Figure 10. Schematic system architecture (32 I/O cells, bias-adaptive stages at the four corners, adaptive stages for the weights, SRAM)
A digital decoder placed on the right side of the cell array is used to generate the 32 signals R(i) required
for the row-by-row I/O protocol. The decoder includes an 'all-enabled' signal, since signals R(i) must all
be high during normal parallel operations and during optical image uploading from the photosensors.
The 32 I/O cells located at the top of the chip include input and output digital buffers as well as the
circuitry required to multiplex the input and output signals through the same 32 lines EIO(j).
The circuitry located at the bottom of the cell array can be divided into two large sections. The first,
located below, consists of a set of 20 SRAM blocks, each with eight 8 bit words. The second, located
above, contains 10 adaptive stages. A more detailed block diagram of this circuitry is shown in Figure 11.
Considered as a whole, the 20 SRAM blocks constitute an SRAM memory of eight 160 bit words. Each
of these words contains all the information required to describe a CNN network and a truth table of the
LLU. Hence the system stores up to eight complete programmes which, after being loaded in a pre-initialization step, can be selected any number of times in any order. If each of these complete
programmes is considered as an instruction of the analogue processor, the content of the SRAM memory
can be interpreted as the microcode defining the set of instructions available. This microcode is user-programmable in an initialization process.
Figure 11. Programme memories (APR and LPR) architecture (A-template RAM blocks, B-template RAM blocks, offset term RAM block, and boundary conditions and LLU truth table RAM block, each 8 x 8; buffers and inverters; A or B weight voltages; programme selection for reading and writing; data input D<0:7>)
The 160 bits of each instruction are distributed as follows. Each coefficient of the feedback and control
templates is codified by a group of 8 bits, seven plus sign, as described in Table I. This accounts for
144 bits storing the 18 weights of the feedback and control templates. The next 8 bits codify the value of
the offset term, again seven plus sign. Finally, the last 8 bits are used to store three different parameters: the
state variable of the boundary cells (2 bits), the input to the boundary cells (2 bits) and the truth table of
the LLU (4 bits). The state and input of the boundary cells can take the values -1, 0 or +1. A zero value
results in the absence of boundary conditions.
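The 160 bit layout described above can be made concrete with a small packing routine. The Python sketch below builds one instruction as 20 bytes: 18 template weights and the offset in sign-magnitude form, followed by one byte holding the boundary state, the boundary input and the truth table. The position of the sign bit, the 2 bit codes chosen for -1, 0 and +1, the field order inside the last byte and all names are hypothetical, since the text does not specify them.

```python
# Hypothetical 2 bit codes for the boundary values (not given in the text).
BOUNDARY_CODES = {0: 0b00, 1: 0b01, -1: 0b10}


def encode_coefficient(level):
    """8 bit sign-magnitude code: 7 bit magnitude plus sign (sign in MSB, assumed)."""
    if not -127 <= level <= 127:
        raise ValueError("coefficient level must fit in seven bits plus sign")
    sign = 1 if level < 0 else 0
    return (sign << 7) | abs(level)


def pack_instruction(a_template, b_template, offset, x_boundary, u_boundary, truth_table):
    """Pack one complete set of coefficients into 20 bytes (160 bits).

    a_template and b_template are flat sequences of nine integer levels.
    """
    if len(a_template) != 9 or len(b_template) != 9:
        raise ValueError("each template needs nine coefficients")
    words = [encode_coefficient(v) for v in list(a_template) + list(b_template) + [offset]]
    last = ((BOUNDARY_CODES[x_boundary] << 6)
            | (BOUNDARY_CODES[u_boundary] << 4)
            | (truth_table & 0xF))
    words.append(last)
    return bytes(words)


if __name__ == "__main__":
    word = pack_instruction([0, 0, 0, 0, 127, 0, 0, 0, 0],
                            [0, 0, 0, 0, -5, 0, 0, 0, 0],
                            offset=0, x_boundary=1, u_boundary=-1,
                            truth_table=0b1000)
    print(len(word) * 8, word.hex())
```

Eight such 20 byte words would fill the programme memory, one per selectable set of coefficients.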
Nine of the 10 adaptive stages are identical and tune the weight voltages of the feedback or control
templates. Bear in mind that the feedback and control templates are not used simultaneously by the
network. The input to the adaptive stages is connected to the corresponding feedback or control coefficient
by global signals A and B. In this manner, switches are avoided in the analogue weight signal paths. Given
the extremely low impedance required from the analogue weight signals, switches would produce large
errors if inserted in the weight signal paths. The last adaptive stage differs slightly from the others and
tunes the value of the offset term.
4.2. Weight control
The use of either analogue or digital programmability presents advantages and disadvantages. Even for a
reduced number of bits, digitally programmable multipliers require much larger areas than common
analogue multipliers. In addition, a large number of control lines is required, which typically results in a
dominant area requirement. On the other hand, analogue multipliers are sensitive to process parameter
variations, making accurate setting of the coefficients difficult. In addition, on-chip storage of the analogue
weight values requires analogue memories, which generally present time degradation problems.
The disadvantages of analogue programmability relate to the control and storage of weight values and
their dependence on process parameters, while digitally programmed multipliers require large areas and an
excessive number of control lines. We have used a hybrid approach in this design, based on the use of
analogue programmable multipliers within the cells and digital control from the exterior of the network.
This combines the advantages of analogue and digital programmability as summarized in Table II.16,23
The use of analogue programmable multipliers within the cells provides higher area efficiency and a low
number of control lines, while the external digital control facilitates the control of the weights and their on-chip storage. The analogue weight signal is generated from the digital word using an adaptive loop, which
involves a linear D/A converter and an analogue multiplier identical with those used within the cells. The
adaptive control eliminates the dependence of the weight values on process parameters.
This hybrid control strategy relies on the proper behaviour of the adaptive stages used to generate the
analogue weight voltages from digital words. The use of multipliers based on MOS transistors in the ohmic
region requires that some of the input terminals be driven by low-impedance nodes. We have used the high-impedance input nodes of the multipliers as the signal input to avoid the use of low-impedance buffers
within the cells. As a result, the weight signals must be introduced through the low-impedance nodes of the
multipliers and hence be driven by low-impedance nodes. Furthermore, since the weight signals drive all
the cells in the array (1024 cells in our system), the weight signal buffers must have extremely low output
impedance and high output current capability.
Table II. Simplified comparison of different alternatives for programmable CNNs
Programmability alternatives for CNNs
                              Analogue    7-8 bits digital   Hybrid
Effective resolution          7-8 bits    7-8 bits           7-8 bits
Area consumption              Low         Very high          Low
Number of internal signals    Low         Very high          Low
Power consumption             Variable    High               Variable
Process variation effects     High        Low                Low
Design effort                 High        Low                Very high
External weight control       Difficult   Simple             Simple
Global linearity              Difficult   Simple             Difficult
On-chip weight storage        Difficult   Simple             Simple
The architecture of one adaptive stage is illustrated schematically in Figure 12 and can be briefly
described as follows. Two low-impedance input stages, identical with those used in the cells, are driven
by one multiplier, also identical with those used in the cells. The differential output current of the
multiplier is made single-ended by a p-channel current mirror. The resulting single-ended current is
compared with the current generated by a D/A converter with current form output. The current generated
by the D/A converter always flows in the same direction and its value is 2pI_sat, where p is the digitally
programmed weight in the range [0,4] and I_sat is the current corresponding to the saturation level of one
of the rails of the differential state variable of the cell. The input signals of the multiplier correspond to
the voltage saturation levels of the state variable, i.e. the limits of the signal range and are generated by
the same 'diodes' as described in Figure 5 and used in the cells to limit the voltage range of the state
variable.
The weight signals of the multipliers are driven by two analogue buffers, whose output voltages are the
weight signals transmitted to every cell in the system. During a precalibration step controlled by φ_of, these
buffers are disconnected from the multiplier and the differential weight signals are shorted at the input of
the multiplier. This has the effect of setting the common mode voltage V,, to the voltage V , at the input of
Figure 12. Architecture of a weight adaptive stage
the low-impedance input stages. The current flowing out of the p-channel current mirror is due to the offset
of the low-impedance input stages and the current mirror itself. This error is stored in the p-channel
transistors with gates driven by switches controlled by
which constitute the analogue memories. After
this calibration step the output of the current memories is added to the difference of the output signals of
the analogue multiplier and the D/A converter and the resulting current is integrated at the input nodes of
the analogue buffers. The output of these buffers controls the weight signals of the multiplier, thus closing
a feedback loop which settles after the analogue weight signals have the correct, adapted value. Output
impedance is extremely low owing to the feedback loop. Also, the offset of the buffers is irrelevant, since
their effect is included in the loop.
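The behaviour of this loop can be illustrated with a purely behavioural, discrete-time sketch. It is not a circuit model: the function name, the gain factor `k` standing in for the process-dependent multiplier gain, the normalised units and the sign convention of the currents are all assumptions introduced here. The sketch integrates the difference between the D/A current and the replica multiplier current, and shows that the settled multiplier output tracks the D/A current whatever the value of `k`, provided the stored calibration current matches the offset.

```python
def settle_weight(p, k, i_sat=1.0, v_sat=1.0, offset=0.0, stored_offset=0.0,
                  loop_gain=0.2, steps=400):
    """Integrate the current error until the multiplier output matches the D/A current.

    p             digitally programmed weight magnitude
    k             multiplier gain factor (process dependent, hypothetical parameter)
    offset        error current of the input stages and current mirror
    stored_offset correction held in the analogue memories after precalibration
    Returns the settled weight signal and the resulting multiplier output current.
    """
    i_dac = 2.0 * p * i_sat            # reference current from the D/A converter
    w = 0.0                            # weight signal at the buffer output
    for _ in range(steps):
        i_mult = k * w * v_sat         # replica multiplier output
        error = i_dac - (i_mult + offset - stored_offset)
        w += loop_gain * error         # integration at the buffer input node
    return w, k * w * v_sat


# Same programmed weight, three different (hypothetical) process corners.
for k in (0.8, 1.0, 1.3):
    w, i_eff = settle_weight(p=1.5, k=k, offset=0.05, stored_offset=0.05)
    print(f"k={k}: weight signal={w:.4f}, multiplier output={i_eff:.4f}")
```

The weight signal itself changes with `k`, but the multiplier output current does not, which is the sense in which the adaptive control removes the dependence on process parameters. Setting `stored_offset` to zero in the sketch leaves a residual error equal to `offset`, which is what the precalibration step is meant to remove.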
The sign of the programmed weight is introduced by swapping the connections of the input and output
nodes of the buffers with respect to the adaptive core of the circuitry. The signals driving the cells have no
switch in their path, to avoid output impedance degradation.
The common mode of the weight signals is cancelled with the help of two resistors. These resistors can
be implemented with polysilicon, n-well regions or MOS transistors in the ohmic region. Mismatch among
these resistors has little effect on the behaviour of the multipliers, which are almost insensitive to the
common mode of the weight signals.
The D/A converter is implemented by a binary-weighted array of n-channel transistors in saturation.
Errors of the D/A converter are not cancelled and hence the unit transistor must be carefully chosen.
All current sources in Figure 12 are realized by MOS transistors in saturation. Cascode structures are
actually used for these current sources and for the p-channel current mirror.
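Because the loop does not correct D/A errors, unit-transistor mismatch translates directly into weight error. The following sketch gives a rough feel for this under stated assumptions: each branch of a 7-bit binary-weighted array is modelled as a number of parallel unit sources, and every unit carries an independent Gaussian relative error. The 2% standard deviation, the normalised unit current and the function names are hypothetical and are not taken from the design.

```python
import random


def build_branches(n_bits, i_unit, sigma_rel, rng):
    """Branch `bit` is modelled as 2**bit parallel unit sources, each with an
    independent Gaussian relative current error of standard deviation sigma_rel."""
    branches = []
    for bit in range(n_bits):
        units = [i_unit * (1.0 + rng.gauss(0.0, sigma_rel)) for _ in range(2 ** bit)]
        branches.append(sum(units))
    return branches


def dac_current(code, branches):
    """Sum the currents of the branches whose bit is set in `code`."""
    total = 0.0
    for bit, branch in enumerate(branches):
        if (code >> bit) & 1:
            total += branch
    return total


def worst_case_error_lsb(branches, i_unit):
    """Largest deviation from the ideal output over all codes, in units of one LSB."""
    n_bits = len(branches)
    return max(abs(dac_current(code, branches) - code * i_unit) / i_unit
               for code in range(2 ** n_bits))


rng = random.Random(0)                      # fixed seed for repeatability
ideal = build_branches(7, 1.0, 0.0, rng)    # 7 magnitude bits, no mismatch
mismatched = build_branches(7, 1.0, 0.02, rng)
print("worst-case error (LSB):", worst_case_error_lsb(mismatched, 1.0))
```

In this model the relative error of a branch shrinks as more units are placed in parallel, so the choice of unit transistor is a trade-off between matching and area.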
A final important issue is the extremely high power dissipation of the buffers. Note that most of the
current flowing in the system actually originates at the buffers of the adaptive stages, since the state
variables of the cells drive capacitive nodes. The amount of current required from the output of one buffer
depends on the value of the programmed weight. For this reason the implementation of the buffers,
illustrated at the top right of Figure 12, includes a weight-dependent bias current controlled by the most
significant bits of the digital word encoding the weight value. A fixed bias current is permanently present
to ensure proper functioning of the buffer. An additional power dissipation reduction is achieved by
maintaining the bias current of the buffer generating the positive rail of the weight signal fixed to the
minimum weight-independent level.
5. RESULTS
The testing of the prototype has been affected by an unexpectedly low yield of the technology and the
reduced sample set (30 units). Optical and electrical I/O functionalities, internal (cell level) memory,
digital and data-transfer operations, adaptive stage behavior and global control and memory functionalities
have been separately confirmed on reduced subsets of samples. The analogue processing circuitry has also
been verified to the limits allowed by the faulty digital shell, which masks the internal analogue behavior.
Random catastrophic faults have also been observed on the analogue circuitry of isolated cells within every
sample.
The verification of the different functionalities of the prototype chip is complete, although none of the
samples is completely free of isolated defects. These results validate the design approach and the employed
circuit techniques.
As an example, Figure 13 illustrates simulated results of the analogue processing circuitry, using
extracted net lists from the lay-out of a reduced 4 x 4 cell network employed during the design process for
verification. The use of this reduced network is imposed by simulation constraints on CPU time and
memory. The example corresponds to a diagonal connected component detection process. The wave-forms
in the lower part of the figure, spatially distributed like the cells in the array, correspond to the positive rail
of the state variables of the different cells. The two sets of wave-forms in the upper part of the figure
correspond to selected control lines (right) and to the four I/O data lines of the network, one per column
of cells. Serial input and output processes take place at a clock frequency of 10 MHz. Thus the loading or
downloading of a complete image in the real prototype (32 x 32) takes 3.2 µs. The image-processing time
is strongly dependent on the particular CNN application, ranging from less than 1 µs to several
microseconds.
Figure 13. Result example: diagonal connected component detection in a 4 x 4 network
Figure 14 shows a microphotograph and measured results of the static transfer characteristic of the
adaptive stages (output weight signals versus time-swept digital code). The settling time of the adaptive
stages is in the range of 0.5 µs, which represents an increase in image-processing time of 1 µs due to the
sequential use of the control and feedback templates.
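The timing figures quoted in this section follow from simple arithmetic, sketched below. The sketch assumes that the 32 x 32 prototype, like the 4 x 4 test network, has one serial I/O line per column, so that a full image needs one clock cycle per row; the function names are illustrative only.

```python
CLOCK_HZ = 10e6      # serial I/O clock frequency from the text
ROWS = 32            # rows in the prototype array
SETTLE_S = 0.5e-6    # adaptive-stage settling time from the text


def image_transfer_time(rows, clock_hz):
    """With one serial line per column, an image needs `rows` clock cycles."""
    return rows / clock_hz


def template_overhead(settle_s, templates_used=2):
    """The control and feedback templates are applied one after the other,
    so the adaptive stages settle once for each of them."""
    return templates_used * settle_s


load_s = image_transfer_time(ROWS, CLOCK_HZ)
overhead_s = template_overhead(SETTLE_S)
print(f"image transfer: {load_s * 1e6:.1f} us, template overhead: {overhead_s * 1e6:.1f} us")
```

Under these assumptions the transfer time is 32 cycles of 100 ns and the template overhead is two settling periods, consistent with the values given in the text.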
Figure 14. Photomicrograph (a) and measured output of an adaptive stage: (b) differential weight voltages; (c) difference of weight voltages
6. CONCLUSIONS
This paper describes the design of a programmable cellular neural network (CNN) chip with added
functionalities similar to those of the CNN universal machine. The prototype contains 1024 cells and has
been designed and manufactured in a 1.0 µm, n-well CMOS technology. A modified CNN algorithm which
is significantly advantageous from an implementation point of view has been employed.2’ This, together
with careful selection of the circuit topology and design parameters has resulted in a cell density of
31 cells mm⁻² and approximately 7 bits accuracy in the weight values. Adaptive techniques have been
employed to ensure accurate external control and system robustness against process parameter variations.
Measurements of the prototype validate the circuit techniques employed for the implementation of the
different functionalities required by a CNN universal chip. We are currently finishing a redesign of the
prototype using a high-reliability process and yield-oriented lay-out techniques, which will surely result in
completely operative units.
ACKNOWLEDGEMENT
The research of Ricardo Carmona has been partially supported by Iberdrola S.A. under contract
INDES-94/377.
REFERENCES
1. J. M. Zurada, Introduction to Artificial Neural Systems, West, 1992.
2. C. Mead and M. Ismail (eds), Analogue VLSI Implementation of Neural Systems, Kluwer, Dordrecht, 1989.
3. M. M. Gupta and G. K. Knopf (eds), Neuro-Vision Systems, Principles and Applications, IEEE, New York, 1994.
4. L. O. Chua and L. Yang, 'Cellular neural networks: theory', IEEE Trans. Circuits and Systems, CAS-35, 1257-1272 (1988).
5. L. O. Chua and L. Yang, 'Cellular neural networks: applications', IEEE Trans. Circuits and Systems, CAS-35, 1273-1290 (1988).
6. L. O. Chua and T. Roska, 'The CNN Paradigm', IEEE Trans. Circuits and Systems I, CAS-40, 147-156 (1993).
7. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Budapest, December 1990.
8. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Munich, December 1992.
9. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994.
10. T. Roska and L. O. Chua, 'The CNN universal machine: an analogic array computer', IEEE Trans. Circuits and Systems II, CAS-40, 163-173 (1993).
11. J. M. Cruz and L. O. Chua, 'A CNN chip for connected component detection', IEEE Trans. Circuits and Systems, CAS-38, 812-817 (1991).
12. H. Harrer and J. Nossek, 'An analog implementation of discrete-time cellular neural networks', IEEE Trans. Neural Networks, NN-3, 466-476 (1992).
13. P. Kinget and M. S. J. Steyaert, 'A programmable analogue cellular neural network CMOS chip for high speed image processing', IEEE J. Solid-State Circuits, SC-30, 235-243 (1995).
14. A. Rodríguez-Vázquez, S. Espejo, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, 'Current-mode techniques for the implementation of continuous-time and discrete-time cellular neural networks', IEEE Trans. Circuits and Systems II, CAS-40, 132-146 (1993).
15. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, 'Smart-pixel cellular neural networks in analog current-mode CMOS technology', IEEE J. Solid-State Circuits, SC-29, 895-905 (1994).
16. S. Espejo, 'VLSI design and modeling of CNNs', Ph.D. Dissertation, University of Seville, 1994.
17. H. Harrer, J. A. Nossek, T. Roska and L. O. Chua, 'Measurement results of the DTCNN-universal chip', Proc. 4th Int. Conf. on Microelectronics for Neural Networks and Fuzzy Systems, Turin, September 1994, pp. 95-99.
18. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro and R. Carmona, 'A VLSI-oriented continuous-time CNN model', Int. J. Cir. Theor. Appl., 24, (1996), to appear.
19. M. J. M. Pelgrom, A. C. J. Duinmaijer and A. P. G. Welbers, 'Matching properties of MOS transistors', IEEE J. Solid-State Circuits, SC-24, 1433-1440 (1989).
20. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, 'An analogue design technique for smart-pixel CMOS chips', Proc. Eur. Solid-State Circuits Conf., Seville, September 1993, pp. 78-81.
21. N. I. Khachab and M. Ismail, 'Linearization techniques for nth-order sensor models in MOS VLSI technology', IEEE Trans. Circuits and Systems, CAS-38, 1439-1449 (1991).
22. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro and R. Carmona, 'Convergence and stability of the FSR CNN model', Proc. Third IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994, pp. 411-416.
23. S. Espejo, R. Domínguez-Castro, A. Rodríguez-Vázquez and R. Carmona, 'Weight-control strategy for programmable CNN chips', Proc. Third IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994, pp. 405-410.