INTERNATIONAL JOURNAL OF CIRCUIT THEORY AND APPLICATIONS, VOL. 24, 93-109 (1996)

A CNN UNIVERSAL CHIP IN CMOS TECHNOLOGY†

S. ESPEJO, R. CARMONA, R. DOMÍNGUEZ-CASTRO AND A. RODRÍGUEZ-VÁZQUEZ

Centro Nacional de Microelectrónica, Universidad de Sevilla, Edificio CICA, C/Tarfia s/n, E-41012 Sevilla, Spain

SUMMARY

This paper describes the design of a programmable cellular neural network (CNN) chip with added functionalities similar to those of the CNN universal machine. The prototype contains 1024 cells and has been designed in a 1.0 µm, n-well CMOS technology. Careful selection of the topology and design parameters has resulted in a cell density of 31 cells mm⁻² and around 7-8 bits accuracy in the weight values. Adaptive techniques have been employed to ensure accurate external control and system robustness against process parameter variations.

1. INTRODUCTION

Massively parallel analogue processing systems are natural candidates in those application fields where nature has proven their outstanding capabilities.¹,² In particular, the processing front-end of biological vision systems (retina) has inspired the development of highly parallel computation algorithms, representing a potential breakthrough in artificial vision applications. However, the practical use of these algorithms is conditional on their efficient and feasible physical implementation. In this context the cellular neural network (CNN) paradigm represents an interesting alternative, with a demonstrated wide range of applications and particularly well suited for monolithic IC realizations. Also, the recent extension towards the definition of a programmable analogic array computer, the CNN universal machine, has opened up many new application fields which can be handled through spatial and temporal task sequencing controlled by a stored programme.¹⁰

A key feature of CNNs is their potential for high operation speed in the processing of array signals.
However, this does not manifest itself if the CNNs are realized in the form of software on conventional computers, but only if they are realized as VLSI chips. The few CNN chips that have been reported so far in the literature differ in their complexity and functionality. Most of them include only a limited number of cells, about 30,¹¹⁻¹⁴ while a few feature complexity levels up to 1024 cells. These chips also vary significantly in their functional capabilities: some of them have fixed function, while others are electrically controllable. Recently, Harrer and Nossek reported a chip with some of the functionalities of a CNN universal machine, based on a discrete-time CNN model. This paper reports the first CNN universal machine chip based on continuous-time CNN models, and describes the major trends, obstacles and solutions adopted after careful analysis of many alternatives. An additional feature of the reported chip is its capability of being initialized either in electrical or optical form.

Section 2 contains general considerations about the mathematical CNN model used in this prototype, implementation trends and extended capabilities towards the CNN universal machine. Section 3 is dedicated to the cell architecture, from the selected synapse topology to the operating mode of the analogue and digital circuitry, while Section 4 describes the system architecture and control strategies. Finally, Section 5 presents some examples of experimental and simulated results and Section 6 provides a summary of the paper.

† Part of this research has been reported in the Proceedings of the 1994 IEEE International Workshop on Cellular Neural Networks and Their Applications held in Rome.

CCC 0098-9886/96/010093-17 © 1996 by John Wiley & Sons, Ltd. Received 15 January 1995. Revised 6 April 1995.

2. GENERAL CONSIDERATIONS

2.1. Mathematical model

The presented chip, whose microphotograph is shown in Figure 1, uses the so-called full signal range (FSR) CNN model, whose non-linear loss term is

$$g(x) = \begin{cases} m(x+1)-1, & x < -1 \\ x, & |x| \le 1 \\ m(x-1)+1, & x > 1 \end{cases} \qquad (1)$$

first proposed by the authors in Reference 14 and discussed in detail in Reference 18. Cell state variables $x_i(t)$ in this model are restricted to the same variation interval as input and output variables, $[-1, 1]$ (after normalization). This differs significantly from Chua and Yang's model, where the interval for state variables is larger and its amplitude varies with the values of the templates. In the new model the state and output variables of each cell merge for all practical purposes, the block that realizes the output non-linearity is discarded and the complexity of the associated circuitry decreases. Consequently, it yields larger cell density and smaller power consumption than the original model, both prerequisites for increasing complexity in the production of CNN chips. In addition, it is insensitive to uniform variations in CNN coefficients, an important feature in a VLSI context, where process parameters are always subject to variations. Finally, for a given time constant the processing speed is approximately doubled in common applications.

Figure 1. Photomicrograph of the complete chip

2.2. Implementation trends

Most problems found in the electronic implementation of CNN systems relate to the high number of pixels (cells) required in practical applications, in particular for image processing. Given the limitations on chip dimensions (about 1 cm × 1 cm) imposed by technological and economical constraints (mainly the yield of the process), the high cell count translates into a requirement of high area efficiency in the cell circuitry design. Cell power dissipation must also be kept within some limits, imposed either by the thermal properties of the package or by battery life specifications in the case of portable systems.
Another important objective is to maximize the accuracy in the emulation of the mathematical CNN model. This consideration is particularly important given the strong area-efficiency requirement, owing to the inverse relationship between mismatch errors and device area. Although a compromise must finally be accepted between the two objectives, the achievable error and area figures depend strongly on the particular circuit blocks employed in the implementation. For this reason a detailed analysis of mismatch effects as a function of area for alternative circuit structures was necessary.

Robustness of system behaviour against process parameter variations is also a necessary condition, particularly when using standard digital technologies, for which process parameter windows are commonly large. Finally, the control of the analogue chip must be robust and as simple as possible. External setting of the internal analogue parameters required to programme a particular application must be accurate and simple and hence independent of process parameter values.

2.3. Functionality and characteristics

The functionality of the chip is similar to that of a CNN universal chip. Its central capability is that of a completely programmable CNN: controllable feedback, control and offset coefficients. In addition, every cell is equipped with a programmable digital gate, digital memory and other I/O and control circuitry as shown in Figure 2. The CNN network is constructed on a rectangular grid, is assumed uniform and its neighbourhood radius is always unity, i.e. connectivity between cells is limited to adjacent neighbours in vertical, horizontal and diagonal directions.

Figure 2. Schematic architecture of one cell of the prototype
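The unity-radius neighbourhood described above can be made concrete with a short sketch. The function name and the choice to simply clip the neighbourhood at the array edge are assumptions made for illustration; on the chip, edge cells see dedicated boundary cells instead.

```python
def neighbours(row, col, rows=32, cols=32):
    """Return the in-array cells within radius 1 of (row, col), centre included.

    Illustrative only: the neighbourhood is clipped at the array edge,
    whereas the chip supplies boundary cells around the array.
    """
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                cells.append((r, c))
    return cells


if __name__ == "__main__":
    print(len(neighbours(5, 5)))   # interior cell
    print(len(neighbours(0, 0)))   # corner cell
```

An interior cell therefore couples to nine positions (itself plus eight neighbours), which is why nine coefficients per template are needed.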
Strong emphasis has been placed on low area and power consumption in the implementation of the cells, as well as on accuracy. Extensive analysis has been performed to optimize the architecture and basic building blocks against statistical on-die parameter variations (mismatch). Strong effort has also been made to obtain robustness against wafer-level parameter variations. For this purpose many of the internal variables are automatically tuned on chip. For the final topology, weight and offset accuracies in the range of 7-8 bits are predicted, while area requirements allow a complete 32 × 32 cell system to be integrated in a 7.7 mm × 6.8 mm prototype using a 1.0 µm, n-well, digital technology. In particular, cell area is about 171 µm × 187 µm. Array dimensions are 5.5 mm × 6.0 mm. The remaining area is dedicated to boundary cells, weight adaptation stages, memories for template coefficients and local-logic-unit truth tables, adaptive bias stages, I/O circuitry and bonding pads.

Chip management is completely digital, facilitating its control and communications. No external analogue signals are required for chip control or references. Analogue weights are specified and stored in digital form. For each template value, an adaptive stage transforms the digital code to an analogue voltage, which is then transmitted to the network. This methodology makes the weights independent of particular process parameter values and provides accurate external control. The quantization error on the coefficients is lower than the expected statistical error of the analogue multipliers and is therefore not relevant.

Every cell includes a photosensitive device, which allows the system to be optically initialized. These devices are CMOS compatible and incorporate a tuning scheme for automatic adaptation to different illumination conditions.²⁰ Electrical initialization is also possible, while the output image is always downloaded in electrical form. Input and output images are assumed to be binary in every case.
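The area figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Figures quoted in the text.
CELL_W_UM, CELL_H_UM = 171.0, 187.0   # cell dimensions in micrometres
CELLS_PER_SIDE = 32                   # 32 x 32 array
DIE_W_MM, DIE_H_MM = 7.7, 6.8         # prototype die dimensions

array_w_mm = CELLS_PER_SIDE * CELL_W_UM / 1000.0
array_h_mm = CELLS_PER_SIDE * CELL_H_UM / 1000.0
cell_area_mm2 = (CELL_W_UM * CELL_H_UM) / 1.0e6
cell_density = 1.0 / cell_area_mm2               # cells per square millimetre
array_fraction = (array_w_mm * array_h_mm) / (DIE_W_MM * DIE_H_MM)

if __name__ == "__main__":
    print(f"array: {array_w_mm:.2f} mm x {array_h_mm:.2f} mm")
    print(f"density: {cell_density:.1f} cells/mm^2")
    print(f"array share of die area: {array_fraction:.0%}")
```

The computed array size and density agree with the 5.5 mm × 6.0 mm and 31 cells mm⁻² figures stated in the paper; the remaining die area is what the text assigns to boundary cells, adaptive stages, memories, I/O and pads.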
Electrical image uploading and downloading are realized row by row through 32 I/O bonding pads.

The digital circuitry at each cell includes a 4 bit static memory (LLM), a completely programmable two-input digital gate (LLU) and initialization and control circuitry for many different operations. The 4 bit memory at each cell allows the network to store four complete images. Two additional 'read-only memories' with fixed +1 and -1 values are also available.

In addition to the CNN processing capability and the programmable digital gate, every cell can perform the following operations, which are generally realized simultaneously at every cell in the array:

(a) storing an electrically loaded image into a specific memory location; this operation is realized sequentially, row by row
(b) storing an optically captured image into a specific memory location
(c) storing the result $y(\infty)$ of a CNN process into a specific memory location
(d) storing the result of a particular digital operation between two images into a specific memory location
(e) moving the content of one memory location to another
(f) using a specific memory location as input $u$ of the cell
(g) using a specific memory location as initial condition $x(0)$ of the cell
(h) downloading a particular memory location through the I/O bonding pads; this operation is realized sequentially, row by row

The programme control functionality is based on a large static digital memory located at the periphery of the cell array. This memory is used to store up to eight complete sets of coefficients. Each set specifies all the parameters required to define the CNN operation (APR) and the truth table of the programmable digital gate (LPR). The content of each set of coefficients is described in Table I. The combined APR and LPR memory can be viewed as a digital memory of eight words of 160 bits (each describing a complete set of coefficients).
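To illustrate how one 160 bit set of coefficients might be assembled, the sketch below packs the fields listed for Table I and detailed in Section 4.1 (18 template weights and one offset of 8 bits each, 'seven plus sign', followed by a final byte holding the boundary state, boundary input and LLU truth table). The bit ordering, the sign-magnitude layout with the sign in the most significant bit, the 2 bit codes for the boundary values and all function names are assumptions, since the paper does not specify them at this level.

```python
def encode_sign_magnitude(code):
    """Encode an integer in [-127, 127] as one byte: bit 7 = sign, bits 6..0 = magnitude.

    The 'seven plus sign' format is from the paper; placing the sign in the
    most significant bit is an assumption.
    """
    if not -127 <= code <= 127:
        raise ValueError("coefficient code out of range")
    return (0x80 if code < 0 else 0x00) | abs(code)


# Hypothetical 2 bit codes for the boundary values -1, 0 and +1.
BOUNDARY_CODES = {0: 0b00, 1: 0b01, -1: 0b11}


def pack_coefficient_set(feedback, control, offset, boundary_state, boundary_input, truth_table):
    """Pack one complete set of coefficients into 20 bytes (160 bits)."""
    if len(feedback) != 9 or len(control) != 9:
        raise ValueError("each template needs nine coefficients")
    if not 0 <= truth_table <= 0b1111:
        raise ValueError("the LLU truth table has four bits")
    last_byte = (
        (BOUNDARY_CODES[boundary_state] << 6)
        | (BOUNDARY_CODES[boundary_input] << 4)
        | truth_table
    )
    words = [encode_sign_magnitude(c) for c in feedback]
    words += [encode_sign_magnitude(c) for c in control]
    words.append(encode_sign_magnitude(offset))
    words.append(last_byte)
    return bytes(words)


if __name__ == "__main__":
    word = pack_coefficient_set(
        feedback=[0, 0, 0, 0, 64, 0, 0, 0, 0],
        control=[-1] * 9,
        offset=-10,
        boundary_state=0,
        boundary_input=1,
        truth_table=0b1000,
    )
    print(len(word) * 8, "bits:", word.hex())
```

Eight such 20 byte words fill the programme memory, which matches the view of the memory as 20 blocks of 8 × 8 bits: block k holds byte k of each of the eight sets.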
The loading process is performed through an 8 bit data bus on a sequential schedule controlled by two complementary address buses, one indicating the particular 8 bit data being loaded and the other indicating the set of coefficients. After the internal memory has been loaded, the eight sets of coefficients can be used in any order, any number of times. The state and input values of boundary cells and the truth table of the local logic unit are grouped in a single 8 bit word. Hence the memory can be viewed as a group of 20 blocks of 8 × 8 bit RAM memories. Each of the 20 blocks corresponds to a particular coefficient and contains eight different values of 8 bits. Each different value corresponds to a different 'set' of coefficients.

Table I. Information content of one set of coefficients in the joined APR and LPR static memory
Columns: Data description, Symbol, Count, Bits, Values, Increment
Rows: Feedback coefficients, Control coefficients, Offset term, Boundary cell state, Boundary cell input, LLU truth table

3. CELL CIRCUITRY

3.1. Multiplier selection

Programmable scaling blocks or multipliers are the key block in a programmable CNN cell, owing to the large number of them required (one per coefficient). We have selected the multiplier shown in Figure 3.²¹ This structure is a fully differential, four-quadrant multiplier with high linearity and presents an excellent area/accuracy figure. The four transistors are identical and operate in the triode (ohmic) region. Nodes $I_{op}$ and $I_{on}$ are connected to the differential input of the cells in the neighbourhood.
The state variable of the cell is represented differentially by signals $V_{xp}$ and $V_{xn}$. The differential weight signal is composed of signals $V_{pp}$ and $V_{pn}$. The scaled signal, in current form, is differential as well and given by the difference of the currents flowing out of the multiplier from nodes $I_{op}$ and $I_{on}$. Analysis of the structure in Figure 3 yields

$$I_{op} - I_{on} = 2\beta (V_{pp} - V_{pn})(V_{xp} - V_{xn}) \qquad (2)$$

where $\beta$ is the large-signal transconductance of one transistor ($\beta = \mu_0 C_{ox} W / 2L$). Note that all transistors are n-channel, resulting in high area efficiency (no wells required).

Figure 3. Analogue four-quadrant multiplier used in the prototype

Both input signals (state variable and weight) are given as voltages, while the output is represented as a current. These characteristics facilitate the following three operations:

(a) the distribution of the state variable of each cell to the multipliers used to implement the different coefficients (one per neighbour);
(b) the distribution of the programmed coefficient values to every cell in the array (weights are invariant from cell to cell, since the network is uniform);
(c) the summation of the contributions coming into every cell from the different cells in the neighbourhood.

The high input impedance of the state variable input nodes of the multiplier avoids the use of on-cell low-impedance buffers. On the other hand, the weight signals must be driven by buffers with an extremely low output impedance, since the already low input impedances of a high number of multipliers (one per cell) are connected in parallel. The input nodes of the cells must have an extremely low input impedance, since the output impedance of the multipliers is low and a number of multipliers (one per cell in the neighbourhood) are also connected in parallel at the input node of every cell.
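Equation (2) can be checked numerically with a simple square-law triode model. The sketch below is only an illustration under stated assumptions: the connection of the four transistors (gates on the state rails, channels between the weight rails and output nodes held at a common voltage `v0`) is one plausible reading of Figure 3, not taken from the paper, and all numerical values and names are hypothetical. Under the further assumption that the weight rails are symmetric about `v0`, the sum of the two output currents depends only on the weight voltage.

```python
import math


def triode_current(beta, vt, v_gate, v_from, v_to):
    """Current flowing from node v_from to node v_to through an n-channel
    transistor in the triode region, with beta = mu0*Cox*W/(2L).

    The square-law expression is symmetric in drain and source, so it holds
    for either sign of (v_from - v_to) while the device stays in triode.
    """
    vds = v_from - v_to
    vgs = v_gate - v_to
    return beta * (2.0 * (vgs - vt) * vds - vds ** 2)


def multiplier_currents(beta, vt, vxp, vxn, vpp, vpn, v0):
    """Output currents of the assumed four-transistor topology.

    Assumed connections (hypothetical reading of Figure 3):
      gate Vxp: Vpp -> Iop and Vpn -> Ion
      gate Vxn: Vpn -> Iop and Vpp -> Ion
    Both output nodes are held at v0 by the low-impedance cell inputs.
    """
    i_op = triode_current(beta, vt, vxp, vpp, v0) + triode_current(beta, vt, vxn, vpn, v0)
    i_on = triode_current(beta, vt, vxn, vpp, v0) + triode_current(beta, vt, vxp, vpn, v0)
    return i_op, i_on


if __name__ == "__main__":
    beta, vt, v0 = 1.0e-5, 0.8, 1.0          # illustrative values only
    vxp, vxn = 3.2, 2.8                       # differential state variable
    vpp, vpn = 1.05, 0.95                     # weight rails, symmetric about v0
    i_op, i_on = multiplier_currents(beta, vt, vxp, vxn, vpp, vpn, v0)
    print("differential:", i_op - i_on, "expected:", 2 * beta * (vpp - vpn) * (vxp - vxn))
    print("common mode sum:", i_op + i_on)
```

In this model the quadratic and threshold-dependent terms cancel in the difference, which is why the differential output is linear in both inputs.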
The common mode current at the output of the multiplier is given by

$$I_{op} + I_{on} = -\beta (V_{pp} - V_{pn})^2 \qquad (3)$$

which is small and depends not on the state variable but only on the weight value. This allows the common mode signal to be eliminated in a preprocessing step, by storing its value in analogue memories, and avoids the use of continuous-time common mode feedback circuitry within the cell. The input capacitance at the state variable nodes of the multiplier can be used as the integrating capacitor, as explained later.

3.2. Cell architecture and operating mode: analogue circuitry

Figure 4 shows the analogue core of the cell, excluding the multipliers. The architecture is fully differential. Contributions from all neighbours are added at the low-impedance differential input nodes of the cell. This is accomplished by connecting the output nodes $I_{op}$ and $I_{on}$ of the corresponding multipliers in the neighbouring cells to the positive and negative input nodes, respectively. Correspondingly, multipliers implementing the scaled replicas of the state variable of this cell, with output directed towards the cells in the neighbourhood, have their input nodes $V_{xp}$ and $V_{xn}$ connected to the nodes with the same name in Figure 4, whose voltages represent the differential state variable of the cell.

Switches labelled Lo are used to set the cell in open- or closed-loop configuration. That is, while the signal Lo is high (switches are off), no dynamic evolution occurs. Diodes in the figure are actually realized by a p-n-p vertical device and a diode-connected n-channel transistor as shown in Figure 5. Their function is to implement the outer segments of the non-linear loss term in equation (1).²² The central piece of equation (1) is realized by redefining the central feedback coefficient $a$ as $a - 1$.
Each rail of the state variable is limited to the interval $[V_{ref} - V_{Tn}(V_{ref}),\ V_{ref} + V_{on}]$, where $V_{on}$ is the 'on-voltage' of the emitter-base junction of the vertical p-n-p device and $V_{Tn}(V_{ref})$ is the threshold voltage of the diode-connected n-channel transistor. The aspect ratio of this device is chosen to be very high to obtain a sharp non-linearity. Integrating capacitors C at nodes $V_{xp}$ and $V_{xn}$ are realized by parasitic capacitors, corresponding to the sum of the gate capacitances of the multipliers, as mentioned earlier. Since all the transistors in the multipliers are permanently in strong inversion, this capacitance is quite linear. In addition, area efficiency is higher than with the usual two-poly structures in analogue technologies.

Figure 4. Schematic diagram of the analogue core of the cell. Multipliers are not included

Figure 5. Diodes and feedthrough compensation capacitor realization in Figure 4

Only nine analogue multipliers (not shown in Figure 4) are included in each cell. These multipliers generate the nine scaled contributions (one for each neighbour) of the cell input or state variable signal. In a first step the control contributions are generated by connecting the multipliers' input nodes to a pair of voltage levels representing the input value $u_i$ of the cell and setting the weight signals to the values corresponding to the control template. In this manner, each cell receives the sum of the control contributions from its neighbourhood and stores it in an analogue memory. In a second step the same multipliers are used to generate the feedback contributions. Weight signals are set to the values corresponding to the feedback template, initial state values $x_i(0)$ are introduced in the integrating capacitors and the network is allowed to perform its dynamic evolution. This strategy halves the number of multipliers in every cell.
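The two-step use of the multipliers described above can be mimicked in a behavioural simulation. The sketch below is not the authors' circuit or simulator. It assumes the usual CNN state equation form (loss term, feedback, control and offset contributions with unit time constant), models the outer segments of the loss term as a hard limiter on the state (the limit of a very large slope m), uses forward-Euler integration, and treats cells outside the array as a constant boundary value. All function and parameter names are hypothetical.

```python
def correlate(image, template, boundary=0.0):
    """For every cell, sum template[k][l] * image over its radius-1 neighbourhood."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(3):
                for l in range(3):
                    r, c = i + k - 1, j + l - 1
                    inside = 0 <= r < rows and 0 <= c < cols
                    acc += template[k][l] * (image[r][c] if inside else boundary)
            out[i][j] = acc
    return out


def run_fsr(u, x0, feedback, control, offset, dt=0.05, steps=200, boundary=0.0):
    """Two-step FSR evolution (behavioural sketch under the stated assumptions)."""
    # Step 1: control contributions plus offset are computed once and stored,
    # playing the role of the analogue memories in each cell.
    stored = [[v + offset for v in row] for row in correlate(u, control, boundary)]
    # Step 2: the same "multipliers" now carry the feedback template while
    # the state evolves; the state is clipped to [-1, 1] after every step.
    x = [row[:] for row in x0]
    for _ in range(steps):
        fb = correlate(x, feedback, boundary)
        x = [
            [max(-1.0, min(1.0, x[i][j] + dt * (-x[i][j] + fb[i][j] + stored[i][j])))
             for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x


if __name__ == "__main__":
    centre = lambda v: [[0, 0, 0], [0, v, 0], [0, 0, 0]]
    u = [[1.0, -1.0], [-1.0, 1.0]]
    x0 = [[-1.0, -1.0], [-1.0, -1.0]]          # uniform -1 initial condition
    print(run_fsr(u, x0, feedback=centre(2.0), control=centre(3.0), offset=0.0))
```

With this illustrative template the central feedback coefficient 2 acts as an effective gain of 2 - 1 = 1 inside the linear region, which is the redefinition of the central coefficient mentioned in the text, and the stored control term drives each cell to the saturated value of its own input.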
Furthermore, the input-stage offset is stored together with the control contributions and cancelled during the dynamic evolution. A structure similar to that of the multiplier is used to generate the offset term contribution.

The required analogue memories are implemented by the p-channel transistors in Figure 4 with gates driven by the switches labelled A. The n-channel transistors located in the lower part of the figure provide the necessary bias shifting to allow the memories to store either positive or negative currents. Most of the remaining circuitry in Figure 4 is used to eliminate the common mode component of the differential current. An adaptive feedthrough cancellation technique is used which involves the feedthrough compensation capacitor. This capacitor is implemented by a shorted p-channel transistor as shown in Figure 5.

For the sake of clarity, let us describe a typical analogue operation of the cell, step by step:

1. Set the state variable (nodes $V_{xp}$ and $V_{xn}$) of every cell i to the saturated value (binary images are assumed) corresponding to the cell input $u_i$. This is done by the inverters driven by the local data lines $D$ and $\bar{D}$ (see Figure 6 below) and through the switches labelled R at the output of these inverters in Figure 4. Excess current from the inverters is taken care of by the voltage limiter.

2. Connect the control template weight voltages to the nine multipliers in every cell. This step is performed simultaneously with the previous one. During this process, signals A, $c_{me}$ and R are low, while c and Lo are high. This part of the process ends with the rising edge of A, which results in storing the control contributions from the cells in the neighbourhood in the analogue memories. A sign inversion is present in this process, which is solved internally by using opposite-sign control weights at every cell. A significant feedthrough error from the memories will be present.
However, its dominant component corresponds to a common mode error in the stored differential current. This is cancelled in the next steps.

3. Set the state variable (nodes $V_{xp}$ and $V_{xn}$) of every cell i to the saturated value (binary images are assumed) corresponding to the initial conditions $x_i(0)$. This is again done by the inverters driven by $D$ and $\bar{D}$.

4. Connect the feedback template weight voltages to the nine multipliers in every cell. This step is performed simultaneously with the previous one.

5. This step begins with the rising edge of $c_{me}$ and the falling edge of c. As a result, the common mode of the currents going into the cell, which corresponds to the common mode of the control contributions (stored in the memories), plus the feedthrough error of the memories, plus the common mode of the feedback contributions at the initial point, flows through the two p-channel transistors enabled by the switch controlled by $c_{me}$, which are 'diode connected' by the switches controlled by c. An additional bias shifting is provided by the n-channel transistors connected through the n-channel switches controlled by $c_{me}$.

6. After a short settling time, signal c rises, storing the common mode signal in the associated p-channel transistors. The feedthrough error introduced represents a significant common mode error. Unlike in fully differential linear structures, common mode errors are significant owing to the non-linear nature of the system, since the limits of the state variable signal range do not move with the common mode voltage. The rather particular switching configuration, involving three switches, results in strong feedthrough attenuation. Nevertheless, the remnant error levels are still the limiting factor for system accuracy. Thus an identical stage is replicated outside the cell array, which reproduces this feedthrough error and, using a common mode adaptive loop, modifies the voltage at the gate of the p-channel transistors through the feedthrough compensation capacitor.
Even though different cells have different control and feedback contributions owing to different inputs $u_i$ and initial conditions $x_i(0)$, this is irrelevant because the common mode of the multipliers being used depends only on the weight values (uniform throughout the array) and not on the signal levels. This scheme results in an extremely accurate cancellation of the common mode.

7. Signal Lo falls, closing the loop of the cells. However, the network remains 'tied' to the initial conditions until signal R rises in the next step. This is done to avoid charge redistribution between the nodes at the sides of the Lo switches, which would result in a large error in the initial conditions. This is due to two facts. First, after the switches controlled by c are turned off, the impedance at these nodes is high and hence the voltage at these nodes settles close to the bias supplies. Second, the capacitance at the drain of the p-channel transistors is quite high, since these transistors must necessarily be wide owing to the high current levels that may flow into the cell under some combinations of weights and signals. Keeping the network tied while the loop is closed avoids this problem.

8. The final step consists of raising signal R, which releases the network and allows its evolution.

3.3. Cell architecture and operating mode: digital circuitry

The digital circuitry of the cell is composed of three major blocks, shown in Figures 6-8. Figure 6 shows the local logic memory. It consists of four flip-flops, used as a 4 bit memory, and has the combined functionality of the LLM and LAM of a general CNN universal machine. A 'strong inverter' driven by a control signal is used to force the 'weak' inverters in the flip-flops. Any of the four bits can be addressed at any time using global signals $m_1$, $m_2$, $m_3$ and $m_4$.
Signals $W$ and $R(i)$† are used to control the writing procedure, which can be time multiplexed row by row when an image is being loaded electrically or completely parallel if it is being loaded in optical form. Signals $D$ and $\bar{D}$ are used to initialize the cells (see Figure 4) and are obtained from one of the flip-flops or directly from the power supplies when a uniform +1 or -1 input or initial condition is desired (selected through two additional global signals). Using the parasitic capacitance in the initialization circuitry as an intermediate memory, the content of a given flip-flop or a uniform +1 or -1 can be moved to any memory location.

† In what follows, i and j are used to denote rows and columns, respectively.

Figure 6. Local logic memory: merged functionality of LLM and LAM

Figure 7 illustrates the implementation of the local logic unit (LLU), which is a two-input digital operator with a fully programmable truth table, transmitted through global signals $D_{00}$, $D_{01}$, $D_{10}$ and $D_{11}$. The two input signals of the LLU are always taken from memories 1 and 2.

Figure 7. Local logic unit: LLU

Figure 8. Local communication and control unit: LCCU

Figure 9. Photomicrograph of one cell of the FSR CNN universal chip

Figure 8 describes the local communication and control unit (LCCU). It is a digital multiplexer which selects the origin of the signal to be stored in one of the digital memories. Possible sources are the photosensor (F), the external I/O signal $EIO(j)$ corresponding to a particular cell column j, the cell output obtained from node $V_{xn}$ with the help of a simple inverter (a bipolar to unipolar converter, B-U) and the output of the logic unit (LO).
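The local logic unit described above behaves as a 4 bit lookup table applied at every cell to the bits held in memories 1 and 2. A minimal behavioural sketch follows; the function names are hypothetical and the convention that the first index of $D_{xy}$ refers to memory 1 is an assumption.

```python
def make_llu(d00, d01, d10, d11):
    """Build a two-input gate from its four truth-table bits.

    Assumed convention: d_xy is the output for memory-1 bit x and memory-2 bit y.
    """
    table = {(0, 0): d00, (0, 1): d01, (1, 0): d10, (1, 1): d11}

    def gate(m1, m2):
        return table[(m1, m2)]

    return gate


def apply_llu(gate, image1, image2):
    """Apply the same gate at every cell, as the array does in parallel."""
    return [[gate(a, b) for a, b in zip(row1, row2)] for row1, row2 in zip(image1, image2)]


if __name__ == "__main__":
    logic_and = make_llu(0, 0, 0, 1)
    logic_xor = make_llu(0, 1, 1, 0)
    img1 = [[0, 1], [1, 1]]
    img2 = [[1, 1], [0, 1]]
    print(apply_llu(logic_and, img1, img2))
    print(apply_llu(logic_xor, img1, img2))
```

Because all 16 possible truth tables can be loaded, any two-input Boolean function of the two stored images is available without changing the cell hardware.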
In addition, by selecting two signal paths simultaneously, the multiplexer permits downloading of the cell output (E and C) on a row-by-row schedule, as well as reading of the output of the photosensors (E and F) or of the local logic unit. The content of any memory location can also be read by first using it as an initial cell condition. In this manner the output of the cell contains the memory information (before the evolution is allowed) and can be read at the cell output.

This concludes a brief description of the cell architecture and its operating mode. A final observation concerns the layout strategy. Over 50 global signals must be connected to every cell, including weight voltages, biasing, analogue adaptive signals, digital signals and control. For this reason, cell routing is realized using only polysilicon, diffusion and the first metal layer. Global signals are transmitted using the second metal layer over the transistors and local routing area. This approach results in an area saving which can be estimated from the fact that almost the whole surface of the cell is covered by metal 2 global lines, as shown in Figure 9.

4. SYSTEM ARCHITECTURE AND CONTROL

4.1. Architecture

The system architecture is represented schematically in Figure 10. As mentioned previously, system operation relies on a number of adaptive stages to tune electrical variables, compensate for inaccuracies and perform automatic weight adjustment. Bias and tuning stages generating analogue reference voltages are located at every corner of the chip area and interconnected. This is intended to produce biasing levels which correspond to the average value of the electrical parameters in the die. This is an important consideration given the large area of the prototype and the unavoidable spatial variation of the process parameters. Other miscellaneous circuitry, such as digital gates used to generate some control signals and the feedthrough adaptive stage, is located on the left side of the array.
Figure 10. Schematic system architecture (blocks shown: 32 I/O cells, bias-adaptive stages, adaptive stages for the weights, SRAM)

A digital decoder placed on the right side of the cell array is used to generate the 32 signals $R(i)$ required for the row-by-row I/O protocol. The decoder includes an 'all-enabled' signal, since signals $R(i)$ must all be high during normal parallel operations and during optical image uploading from the photosensors. The 32 I/O cells located at the top of the chip include input and output digital buffers as well as the circuitry required to multiplex the input and output signals through the same 32 lines $EIO(j)$.

The circuitry located at the bottom of the cell array can be divided into two large sections. The first, located below, consists of a set of 20 SRAM blocks, each with eight 8 bit words. The second, located above, contains 10 adaptive stages. A more detailed block diagram of this circuitry is shown in Figure 11.

Considered as a whole, the 20 SRAM blocks constitute an SRAM memory of eight 160 bit words. Each of these words contains all the information required to describe a CNN network and a truth table of the LLU. Hence the system stores up to eight complete programmes which, after being loaded in a preinitialization step, can be selected any number of times in any order. If each of these complete programmes is considered as an instruction of the analogue processor, the content of the SRAM memory can be interpreted as the microcode defining the set of instructions available. This microcode is user-programmable in an initialization process.

Figure 11. Programme memories (APR and LPR) architecture

The 160 bits of each instruction are distributed as follows. Each coefficient of the feedback and control templates is codified by a group of 8 bits, seven plus sign, as described in Table I. This accounts for 144 bits storing the 18 weights of the feedback and control templates. The next 8 bits codify the value of the offset term, again seven plus sign. Finally, the last 8 bits are used to store three different parameters: the state variable of the boundary cells (2 bits), the input to the boundary cells (2 bits) and the truth table of the LLU (4 bits). The state and input of the boundary cells can take the values -1, 0 or +1. A zero value results in the absence of boundary conditions.

Nine of the 10 adaptive stages are identical and tune the weight voltages of the feedback or control templates. Bear in mind that the feedback and control templates are not used simultaneously by the network. The input to the adaptive stages is connected to the corresponding feedback or control coefficient by global signals A and B. In this manner, switches are avoided in the analogue weight signal paths. Given the extremely low impedance required from the analogue weight signals, switches would produce large errors if inserted in the weight signal paths. The last adaptive stage differs slightly from the others and tunes the value of the offset term.

4.2. Weight control

The use of either analogue or digital programmability presents advantages and disadvantages. Even for a reduced number of bits, digitally programmable multipliers require much larger areas than common analogue multipliers. In addition, a large number of control lines is required, which typically results in a dominant area requirement. On the other hand, analogue multipliers are sensitive to process parameter variations, making accurate setting of the coefficients difficult.
In addition, on-chip storage of the analogue weight values requires analogue memories, which generally suffer from degradation of the stored value over time. In summary, the disadvantages of analogue programmability relate to the control and storage of the weight values and their dependence on process parameters, while digitally programmed multipliers require large areas and an excessive number of control lines.

We have used a hybrid approach in this design, based on the use of analogue programmable multipliers within the cells and digital control from the exterior of the network. This combines the advantages of analogue and digital programmability as summarized in Table II.¹⁶,²³ The use of analogue programmable multipliers within the cells provides higher area efficiency and a low number of control lines, while the external digital control facilitates the control of the weights and their on-chip storage. The analogue weight signal is generated from the digital word using an adaptive loop, which involves a linear D/A converter and an analogue multiplier identical with those used within the cells. The adaptive control eliminates the dependence of the weight values on process parameters.

Table II. Simplified comparison of different alternatives for programmable CNNs

Programmability alternative     Analogue     7-8 bit digital    Hybrid
Effective resolution            7-8 bits     7-8 bits           7-8 bits
Area consumption                Low          Very high          Low
Number of internal signals      Low          Very high          Low
Power consumption               Variable     High               Variable
Process variation effects       High         Low                Low
Design effort                   High         Low                Very high
External weight control         Difficult    Simple             Simple
Global linearity                Difficult    Simple             Difficult
On-chip weight storage          Difficult    Simple             Simple

This hybrid control strategy relies on the proper behaviour of the adaptive stages used to generate the analogue weight voltages from digital words. The use of multipliers based on MOS transistors in the ohmic region requires that some of the input terminals be driven by low-impedance nodes. We have used the high-impedance input nodes of the multipliers as the signal input to avoid the use of low-impedance buffers within the cells. As a result, the weight signals must be introduced through the low-impedance nodes of the multipliers and hence be driven by low-impedance nodes. Furthermore, since the weight signals drive all the cells in the array (1024 cells in our system), the weight signal buffers must have extremely low output impedance and high output current capability.

The architecture of one adaptive stage is illustrated schematically in Figure 12 and can be briefly described as follows. Two low-impedance input stages, identical with those used in the cells, are driven by one multiplier, also identical with those used in the cells. The differential output current of the multiplier is made single-ended by a p-channel current mirror. The resulting single-ended current is compared with the current generated by a D/A converter with current-form output. The current generated by the D/A converter always flows in the same direction and its value is 2pI_sat, where p is the digitally programmed weight in the range [0, 4] and I_sat is the current corresponding to the saturation level of one of the rails of the differential state variable of the cell. The input signals of the multiplier correspond to the voltage saturation levels of the state variable, i.e. the limits of the signal range, and are generated by the same ‘diodes’ as described in Figure 5 and used in the cells to limit the voltage range of the state variable.
The weight signals of the multipliers are driven by two analogue buffers, whose output voltages are the weight signals transmitted to every cell in the system. During a precalibration step controlled by φ_of, these buffers are disconnected from the multiplier and the differential weight signals are shorted at the input of the multiplier. This has the effect of setting the common-mode voltage of the weight signals to the voltage at the input of the low-impedance input stages. The current then flowing out of the p-channel current mirror is due to the offset of the low-impedance input stages and of the current mirror itself. This error is stored in the p-channel transistors with gates driven by switches, which constitute the analogue memories.

Figure 12. Architecture of a weight adaptive stage

After this calibration step the output of the current memories is added to the difference of the output signals of the analogue multiplier and the D/A converter, and the resulting current is integrated at the input nodes of the analogue buffers. The output of these buffers controls the weight signals of the multiplier, thus closing a feedback loop which settles once the analogue weight signals have reached the correct, adapted value. Output impedance is extremely low owing to the feedback loop. Also, the offset of the buffers is irrelevant, since its effect is included in the loop. The sign of the programmed weight is introduced by swapping the connections of the input and output nodes of the buffers with respect to the adaptive core of the circuitry. The signals driving the cells have no switch in their path, to avoid output impedance degradation.

The common mode of the weight signals is cancelled with the help of two resistors. These resistors can be implemented with polysilicon, n-well regions or MOS transistors in the ohmic region.
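The adaptive loop described above (offset calibration followed by integration of the error current) can be illustrated with a simple behavioural model. The following Python sketch is not the circuit of Figure 12: it assumes an ideal multiplier whose output current is gain·w·x, a constant offset current, an ideal integrator and forward-Euler time stepping. The function name settle_weight and all numerical values are illustrative assumptions. Under these assumptions the loop settles where the multiplier current equals the D/A current, whatever the process-dependent multiplier gain.

```python
def settle_weight(i_dac, gain, x_sat, offset, calibrate=True,
                  c_int=1e-12, dt=1e-9, steps=5000):
    """Return the weight voltage reached by a behavioural adaptive loop.

    i_dac  : reference current from the D/A converter (A)
    gain   : process-dependent multiplier gain (A/V^2), assumed model
    x_sat  : multiplier signal input, i.e. the state saturation level (V)
    offset : offset current of the input stages and current mirror (A)
    """
    # Calibration step: with the weight inputs shorted (w = 0) the measured
    # current is the offset alone; it is stored in the current memory.
    i_mem = offset if calibrate else 0.0
    w = 0.0
    for _ in range(steps):
        i_mult = gain * w * x_sat + offset   # multiplier output plus offset
        i_err = i_dac - (i_mult - i_mem)     # net current to be integrated
        w += i_err * dt / c_int              # integrator sets the buffer output
    return w


if __name__ == "__main__":
    i_dac, x_sat, offset = 4e-6, 0.5, 0.3e-6
    for gain in (14e-6, 20e-6, 26e-6):       # assumed +/-30 % process spread
        w = settle_weight(i_dac, gain, x_sat, offset)
        print(f"gain={gain:.1e}  w={w:.4f} V  i_mult={gain * w * x_sat:.3e} A")
```

In this model the settled weight voltage changes with the gain while the product gain·w·x stays equal to the D/A current. Calling the function with calibrate=False leaves the offset current uncorrected in the settled value.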
Mismatch among these resistors has little effect on the behaviour of the multipliers, which are almost insensitive to the common mode of the weight signals. The D/A converter is implemented by a binary-weighted array of n-channel transistors in saturation. Errors of the D/A converter are not cancelled and hence the unit transistor must be carefully chosen. All current sources in Figure 12 are realized by MOS transistors in saturation. Cascode structures are actually used for these current sources and for the p-channel current mirror.

A final important issue is the extremely high power dissipation of the buffers. Note that most of the current flowing in the system actually originates at the buffers of the adaptive stages, since the state variables of the cells drive capacitive nodes. The amount of current required from the output of one buffer depends on the value of the programmed weight. For this reason the implementation of the buffers, illustrated at the top right of Figure 12, includes a weight-dependent bias current controlled by the most significant bits of the digital word encoding the weight value. A fixed bias current is permanently present to ensure proper functioning of the buffer. An additional power dissipation reduction is achieved by maintaining the bias current of the buffer generating the positive rail of the weight signal fixed to the minimum weight-independent level.

5. RESULTS

The testing of the prototype has been affected by an unexpectedly low yield of the technology and the reduced sample set (30 units). Optical and electrical I/O functionalities, internal (cell level) memory, digital and data-transfer operations, adaptive stage behaviour and global control and memory functionalities have been separately confirmed on reduced subsets of samples. The analogue processing circuitry has also been verified to the limits allowed by the faulty digital shell, which masks the internal analogue behaviour.
Random catastrophic faults have also been observed on the analogue circuitry of isolated cells within every sample. The verification of the different functionalities of the prototype chip is complete, although none of the samples is completely free of isolated defects. These results validate the design approach and the employed circuit techniques.

As an example, Figure 13 illustrates simulated results of the analogue processing circuitry, using extracted net lists from the lay-out of a reduced 4 × 4 cell network employed during the design process for verification. The use of this reduced network is imposed by simulation constraints on CPU time and memory. The example corresponds to a diagonal connected component detection process. The wave-forms in the lower part of the figure, spatially distributed like the cells in the array, correspond to the positive rail of the state variables of the different cells. The two sets of wave-forms in the upper part of the figure correspond to selected control lines (right) and to the four I/O data lines of the network, one per column of cells. Serial input and output processes take place at a clock frequency of 10 MHz. Thus the loading or downloading of a complete image in the real prototype (32 × 32) takes 3.2 µs (32 rows at one 100 ns clock period per row). The image-processing time is strongly dependent on the particular CNN application, ranging from less than 1 µs to several microseconds.

Figure 13. Result example: diagonal connected component detection in a 4 × 4 network (panels: output lines EZO(1) to EZO(4), control lines and cell state wave-forms versus time)

Figure 14 shows a microphotograph and measured results of the static transfer characteristic of the adaptive stages (output weight signals versus time-swept digital code). The settling time of the adaptive stages is in the range of 0.5 µs, which represents an increase in image-processing time of about 1 µs (two adaptations of about 0.5 µs each) due to the sequential use of the control and feedback templates.

Figure 14. Photomicrograph (a) and measured output of an adaptive stage: (b) differential weight voltages; (c) difference of weight voltages

6. CONCLUSIONS

This paper describes the design of a programmable cellular neural network (CNN) chip with added functionalities similar to those of the CNN universal machine. The prototype contains 1024 cells and has been designed and manufactured in a 1.0 µm, n-well CMOS technology. A modified CNN algorithm which is significantly advantageous from an implementation point of view has been employed.²¹ This, together with careful selection of the circuit topology and design parameters, has resulted in a cell density of 31 cells mm⁻² and approximately 7 bits accuracy in the weight values. Adaptive techniques have been employed to ensure accurate external control and system robustness against process parameter variations. Measurements of the prototype validate the circuit techniques employed for the implementation of the different functionalities required by a CNN universal chip. We are currently finishing a redesign of the prototype using a high-reliability process and yield-oriented lay-out techniques, which is expected to result in fully operative units.

ACKNOWLEDGEMENT

The research of Ricardo Carmona has been partially supported by Iberdrola S.A. under contract INDES-94/377.

REFERENCES

1.
J. M. Zurada, Introduction to Artificial Neural Systems, West, 1992.
2. C. Mead and M. Ismail (eds), Analogue VLSI Implementation of Neural Systems, Kluwer, Dordrecht, 1989.
3. M. M. Gupta and G. K. Knopf (eds), Neuro-Vision Systems: Principles and Applications, IEEE, New York, 1994.
4. L. O. Chua and L. Yang, ‘Cellular neural networks: theory’, IEEE Trans. Circuits and Systems, CAS-35, 1257-1272 (1988).
5. L. O. Chua and L. Yang, ‘Cellular neural networks: applications’, IEEE Trans. Circuits and Systems, CAS-35, 1273-1290 (1988).
6. L. O. Chua and T. Roska, ‘The CNN paradigm’, IEEE Trans. Circuits and Systems I, CAS-40, 147-156 (1993).
7. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Budapest, December 1990.
8. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Munich, December 1992.
9. Proc. IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994.
10. T. Roska and L. O. Chua, ‘The CNN universal machine: an analogic array computer’, IEEE Trans. Circuits and Systems II, CAS-40, 163-173 (1993).
11. J. M. Cruz and L. O. Chua, ‘A CNN chip for connected component detection’, IEEE Trans. Circuits and Systems, CAS-38, 812-817 (1991).
12. H. Harrer and J. Nossek, ‘An analog implementation of discrete-time cellular neural networks’, IEEE Trans. Neural Networks, NN-3, 466-476 (1992).
13. P. Kinget and M. S. J. Steyaert, ‘A programmable analogue cellular neural network CMOS chip for high speed image processing’, IEEE J. Solid-State Circuits, SC-30, 235-243 (1995).
14. A. Rodríguez-Vázquez, S. Espejo, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, ‘Current-mode techniques for the implementation of continuous-time and discrete-time cellular neural networks’, IEEE Trans. Circuits and Systems II, CAS-40, 132-146 (1993).
15. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, ‘Smart-pixel cellular neural networks in analog current-mode CMOS technology’, IEEE J. Solid-State Circuits, SC-29, 895-905 (1994).
16. S. Espejo, ‘VLSI design and modeling of CNNs’, Ph.D. Dissertation, University of Seville, 1994.
17. H. Harrer, J. A. Nossek, T. Roska and L. O. Chua, ‘Measurement results of the DTCNN-universal chip’, Proc. 4th Int. Conf. on Microelectronics for Neural Networks and Fuzzy Systems, Turin, September 1994, pp. 95-99.
18. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro and R. Carmona, ‘A VLSI-oriented continuous-time CNN model’, Int. J. Cir. Theor. Appl., 24, (1996), to appear.
19. M. J. M. Pelgrom, A. C. J. Duinmaijer and A. P. G. Welbers, ‘Matching properties of MOS transistors’, IEEE J. Solid-State Circuits, SC-24, 1433-1440 (1989).
20. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro, J. L. Huertas and E. Sánchez-Sinencio, ‘An analogue design technique for smart-pixel CMOS chips’, Proc. Eur. Solid-State Circuits Conf., Seville, September 1993, pp. 78-81.
21. N. I. Khachab and M. Ismail, ‘Linearization techniques for nth-order sensor models in MOS VLSI technology’, IEEE Trans. Circuits and Systems, CAS-38, 1439-1449 (1991).
22. S. Espejo, A. Rodríguez-Vázquez, R. Domínguez-Castro and R. Carmona, ‘Convergence and stability of the FSR CNN model’, Proc. Third IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994, pp. 411-416.
23. S. Espejo, R. Domínguez-Castro, A. Rodríguez-Vázquez and R. Carmona, ‘Weight-control strategy for programmable CNN chips’, Proc. Third IEEE Int. Workshop on Cellular Neural Networks and Their Applications, Rome, December 1994, pp. 405-410.