PROTEINS: Structure, Function, and Genetics 29:252–257 (1997) SAmBA: An Interactive Software for Optimizing the Design of Biological Macromolecules Crystallization Experiments Stéphane Audic, Fabrice Lopez, Jean-Michel Claverie, Olivier Poirot, and Chantal Abergel* Information Génétique et Structurale, E.P. 91-C.N.R.S., Institut de Biologie Structurale et Microbiologie, Marseille, France ABSTRACT SAmBA is a new software for the design of minimal experimental protocols using the notion of orthogonal arrays of strength 2. The main application of SAmBA is the search of protein crystallization conditions. Given a user input defining the relevant effectors/variables (e.g., pH, temperature, salts) and states (e.g., pH: 5, 6, 7 and 8), this software proposes an optimal set of experiments in which all tested variables and the pairwise interactions between them are symmetrically sampled. No a priori restrictions on the number and range of experimental variables is imposed. SAmBA consists of two complementary programs, SAm and BA, using a simulated annealing approach and a backtracking algorithm, respectively. The software is freely available as C code or as an interactive JAVA applet at http://igs-server.cnrs-mrs.fr. Proteins 29:252– 257, 1997. r 1997 Wiley-Liss, Inc. Key words: software; crystallization; experimental design; JAVA INTRODUCTION The limiting step in the crystallography of biological macromolecules is still the identification of physical and chemical conditions leading to optimal nucleation and crystal growth. Given the variety of putative effectors (pH, temperature, ionic strength, precipitating agents, etc.), a full-factorial screening for ideal conditions would require a prohibitive number of experiments. Following a seminal article by Carter and Carter,1 various design strategies have been proposed to delineate crystallization conditions while trying to minimize the number of experiments. Recent works in this area include the use of orthogonal arrays2 and sampling techniques.3 The concept of incomplete factorial design4 and its use in the exploration and optimization of protein crystallization conditions has been reviewed by Carter.5 Given N variables (salts, pH, precipitating agent, temperature, etc.) and a finite number of states for each variable (NaCl 1M, 0.1M; pH 5.0, 7.0; etc.), the number of experiments required can be r 1997 WILEY-LISS, INC. significantly reduced by varying more than one variable at a time. In theory, the incomplete factorial design can provide a minimal set of experiments in which the influence of each variable can still be rationally estimated, for instance, using multiple linear regression. This approach was first implemented in the widely used INFAC program.6 The input to INFAC consists of the following parameters: the total number of experiments required by the user, the number of variables (effectors such as pH, temperature, and salt), and the number of states per variable. INFAC then randomly generates up to 450 matrices (protocols) obeying the constraints that the states of each variable are the most equally represented (e.g., given a request for 35 experiments and a variable with 5 states, each of them will be represented 7 times) and that each proposed experiment is different. In a primitive way, INFAC also incorporates the notion that the best protocols are those in which all pairs of states are represented. For each of the proposed protocol, the number of missing pairs of states is listed, so the user can select a protocol with this additional constraint in mind. In its present version, INFAC suffers of a number of limitations of practical importance. First, the user has to iteratively guess an adequate number of experiments. Second, the program often requires more experiments than is really needed to find a solution. Third, a careful analysis of the proposed protocols reveals that they are rarely ‘‘balanced,’’ for example, the sampling of all variables and states is not symmetrical and prevents the experimental design to be fully informative, in particular for pairwise interactions. This problem is illustrated in Table I. Finally, INFAC limits the number of variables to 15, and the number of states to 6. In our experience, crystallization protocols tend to make use of fewer variables but more states (e.g., trying 12 different pH *Correspondence to: Chantal Abergel, Information Gènétique et Structurale, E.P.91-C.N.R.S., Institut de Biologie Structurale et Microbiologie, Marseille, France. E-mail: firstname.lastname@example.org Received 20 February 1997; Accepted 6 May 1997 SOFTWARE FOR DESIGNING CRYSTALLIZATION PROTOCOLS TABLE I. An ‘‘Unbalanced’’ Matrix* Experiment no. V1 1 4 2 3 4 5 6 7 8 3 1 2 4 3 2 1 9 2 10 11 12 4 1 3 13 3 14 2 15 1 16 4 Variables V2 V3 V4 ..... 3 ..... 4 1 2 4 1 4 4 ..... 3 ..... 1 2 2 ..... 3 ..... 1 ..... 3 ..... 2 ..... 1 ..... 2 2 1 2 1 1 2 ..... 2 ..... 1 1 2 ..... 1 ..... 2 ..... 1 ..... 2 1 3 2 3 3 3 2 1 2 2 1 2 1 1 3 3 *This output of the INFAC program was generated from a query for a protocol of 16 experiments involving 4 variables with 4, 4, 3, and 2 states, respectively. Association of the second and fourth variables (V2. and V4) is not symmetrical. State number 1 of V4 is associated 3 times with the state number 3 of V2 while the state number 2 of V4 is associated only once with state number 3 of V2. A similar problem occurs for state number 4 of V2. It is associated 3 times with state number 2 of V4 and only once with state number 1 of V4. 253 ments required is computed by SAmBA and corresponds to the minimum needed for every pair of states to be represented at least once (e.g., NaCl: 1M/pH: 7.0). In addition, the protocols proposed by SAmBA are correctly ‘‘balanced,’’ that is, every state is equally represented for each variable, and every pair of states is equally represented for every association of any two variables, which allows for the identification of pairwise interactions. Such experimental design ensures that the influence of effectors can be recognized individually, as well as the most probable pairwise interactions between them. Clearly, the absence of some pairs in the final protocol and an unequal representation of the states in a same variable will introduce a bias in the experiment, preventing the user to retrieve the influence of each state from the study of its interaction with all the other. Moreover, to identify pairwise interactions, these should all be represented at least once and an equal number of times. Analyses will otherwise become statistically meaningless, either due to the absence of information, or to the overexpression of some pairs relative to the others. A proper interpretation of the results thus requires ‘‘balanced’’ crystallization protocols. Optimal crystallization conditions are rarely discovered with the first protocol. SAmBA will generally be used in the iterative refinement of experimental conditions, following a scoring of their capacity in inducing various types of precipitates and/or crystals.10 EXPERIMENTAL PROCEDURES values rather than exploring 12 variables in parallel, such as pH, precipitating agent, salts, cofactor, additives, and ions). SAmBA, a new software introduced herein, consists of 2 complementary programs, one based on Simulated Annealing (SAm), the other on a more classical Backtracking (BA) algorithm (see the Methods section); it overcomes all the limitations cited above, and offers a more user-friendly interface (Fig. 1). For users of WWW browsers such as Netscape, no program downloading is required, and the software can be directly used on our server. The user input to SAmBA simply consists of the number of variables and of the number of states per variable, as defined from initial precipitation experiments of the protein of interest at different pH (e.g., isoelectric point and activity pH) and in presence of different precipitating agents. The precipitating agents to be tested will vary from one protein to an other and can be retrieved from the Biological Macromolecule Crystallization Databank7,8 when there are similarities to previously crystallized macromolecules.9,10 The expected minimal number of experi- The protocol matrix corresponding to Incomplete Factorial Design is constituted of Nexp rows and Nv columns, where Nexp is the number of experiments to perform, and Nv is the number of variables V1, V2, . . . , VNv. Each variable assumes a number of states; for example, V1 may consist of five states 5S11, S12, S13, S14, S156 or in general the ith variable Vi may assume the ni distinct states 5Si1, Si2, . . ., Sini6. SAmBA solves the problem of finding a minimal protocol matrix (i.e., the smallest possible number of experiments) obeying three constraints: 1. The states of each variable have to be the most equally represented. 2. All experiments (rows in the matrix) have to be different. 3. All possible pairwise associations between states of different variables have to be represented once, and if more than once, an equal number of times. An estimate of the minimal number of experiments required to fulfill those constraints, noted Nexp, is needed to initiate the search process. Nexp is such that no solution is clearly possible with a lesser Fig. 1. Typical SAmBA session using the JAVA version. Left, top to bottom: Input windows for the various options, parameters and variables. Each variable is given a name and a nominative list of states. Right: listing of the optimal protocol found by SAmBA. This protocol corresponds to a 4-variable matrix with 4, 4, 3, and 2 states, respectively. This is to compare to the problem presented in Table I for which INFAC could not produce a relevant protocol. SAmBA produced a solution that led to crystallization. In this real experiment, lines 1, 11, 12, and 16 provided the crystallization conditions of a new high molecular weight Desulfuromonas acetoxidans cytochrome (Dr. Bruschi M., B.I.P., CNRS Marseille). The incomplete factorial protocol consists of 16 different conditions, while the full-factorial protocol would amount to 96 different conditions. SOFTWARE FOR DESIGNING CRYSTALLIZATION PROTOCOLS number. However, certain difficult cases might require more experiments (see the Results section). Nexp is computed as follows. First, consider the two variables with the largest number of states: Vk, having nk distinct states 5Sk1, Sk2, . . . , Skn 6, and Vl, k having nl states 5Sl1, Sl2, . . . , Sln 6. If the total numl ber of variables Nv is not larger than the maximal number of states nk, Nexp is computed as nk 3 nl. Otherwise, (i.e., Nv . nk) Nexp is computed as Nv 3 nk. The latter situation is rarely encountered in the design of actual crystallization experiments. The program SAm and BA both initiate the search of optimal protocols involving Nexp distinct experiments. If they fail to find a solution in a given elapsed time (or a given number of iterations), this number is incremented until a solution obeying the constraints listed above is found. For more flexibility, SAmBA also retains the option for the user to request a specific number of experiments, larger or equal to Nexp. SAm: A Simulated Annealing Algorithm The design of an experimental protocol is viewed here as a global minimization problem. The goal is to meet the constraints described above: for each variable, every state must be represented about the same number of times and every pairwise combination of states and variables should be represented at least once and/or about the same number of times. We now define Qkplq as the desired number of occurrences of the pair 5Vk 5 Skp, Vl 5 Slq6 in the experimental protocol. Similarly Pkplq is defined as the observed number of occurrences of the same pair. To satisfy constraint 1, given Nexp, one can calculate the number of occurrences of each state for the ith variable as Nexp/ni where ni is the number of distinct states for the ith variable. When the quotient is not an integer, some states will be represented by one more occurrence than others. By examining each row, representing a single experiment, with every other row, it can be assured that constraint 2 holds. Now the energy function can be defined as: Nv21 E5 Nv nk nl o oo o (Pkplq 2 Qkplq)2 k51 l5k11 p51 q5p11 The matrices satisfying the first requirement may correspond to a local minimum of the energy function, and among them are hidden matrices satisfying both requirements and corresponding to the global minimum of the energy function. Switching from one matrix to an other is done by swapping two elements of the same column. Swapping leaves constraint 1 satisfied since the total number of occurrences of each of the two swapped states is left unchanged. However, since constraint 2 may no longer be fulfilled, a matrix satisfying constraints 1 and 3 will be checked for constraint 2 at the end. 255 Simulated annealing is a method of choice for such minimization problems where the global minimum is hidden among many local minima.11,12 Starting from a randomized matrix satisfying constraints 1 and 2, we perform a random swap and compute the modification of energy it produces on the system. The most obvious algorithm would be to keep only the swaps decreasing the system energy (DE , 0); however, this is most likely to result in trapping the system in a local minimum. Instead, the simulated annealing approach consists in accepting all swaps reducing the system energy, but also to keep those increasing the system energy with a probability e(2DE/T) where T is the temperature of the system. At high temperatures the system wanders freely throughout the space of configurations. Then, the temperature is progressively lowered, increasing the tendency for the system to become trapped in local minima. At the beginning of the experiment, T is large with respect to the typical energy changes produced by random swaps. Then, T decreases geometrically, i.e. T 5 0.8 3 T. Temperature changes are programmed to occur either after 50,000 swaps, or after 5,000 successful ones (DE , 0). Random swaps are generated by the random draw of a variable and two experiment numbers, and by exchanging the states of this variable in the two experiments. As a single cycle of cooling is not always sufficient to reach the global minimum (E 5 0), multiple cycles are performed to find the solution of the most difficult problems, such as near-square matrices involving variables with numbers of states larger than 6. BA: A Backtracking Algorithm The backtracking approach simply consists in building row after row the matrix satisfying the constraint on the number of occurrence of each state (constraint 1) and pair of states (constraint 3). After each row is added, Pkplq is checked to see that it does not exceed Qkplq, that none of the previous row is duplicated (constraint 2) nor (constraint 1) is violated. If this is the case, the program then proceeds to build the next row. Otherwise, if one of the checks is violated, another possible row generated by stepping to the next possible combinations of states is investigated. When all the combinations have been unsuccessfully exhausted for the current row, the program goes back one row. The last successful row is replaced by the next possible row, the new combination of states is investigated for the 3 constraints. Since the number of possibilities increases rapidly with the number of variables, the straightforward implementation of the backtracking algorithm soon becomes impractical. Additional strategies are used to speed up the convergence. Before going back, the program also tries to exchange the last successful row with another row. The row to be replaced is chosen on the basis of an energy function similar to the one used by the simulated annealing algorithm. 256 S. AUDIC ET AL. The energy is computed for all the incomplete matrices constituted of all but one row of the current matrix. The row chosen is the one such that its removal from the incomplete list minimizes the energy function defined in the simulated annealing protocol. If this procedure still fails to unlock the situation and after 30 seconds of elapsed time, the next row targeted for replacement is randomly chosen. Prime Number Square Matrix Prime number square matrices constitute a peculiar case.13,14 A square matrix corresponds to protocols where the number of states is the same for all variables, and equal to the number of variables. Perfectly balanced solutions are particularly easy to find for those matrices if they involve a prime number (3, 5, 7, 11, etc.) of variables. The algorithm involves a succession of well-defined permutations. The required number of experiments correspond to the square of the prime number: P2. Every state will appear p times. The matrix can be viewed as p blocks of dimension ( p 3 p). The first column of each block bears p time the block number, the first block being given number 1. The second column bears a sequential list of the states numbers of the second variable. Starting from column 3, the columns are generated by circular permutation of the previous column. The step of this permutation is equal to the block number minus 1. RESULTS AND DISCUSSION There is no known simple and general answer (i.e., for an arbitrary number of variables and states) to the problem of building pairwise balanced matrices as defined above. The problem exhibits some aesthetic and surprising features. For instance, experimental designs involving prime number square matrices (e.g., 5 variables, each of them with 5 states) of any size are particularly easy to find.15,16 On the other hand, experimental designs involving even square matrices larger or equal to 6 (e.g., 6 variables at 6 states) seems to require more experiments and produce less symmetrical solutions than expected from intuition. Outside of those specific categories, the difficulty of solving given queries varies greatly, with no apparent rule. The perfectly pairwise balanced matrices we are attempting to build are mathematically known as orthogonal arrays of strength 2.17–19 The construction of orthogonal arrays is still an active field of research, but it is already clear that these can only exist for a small subset of all possible parameters (e.g., numbers of variables, states and experiments). Orthogonal arrays have been discussed in the context of protein crystallization by Kingston et al.2 They advocated the use of a general 64-experiment protocol built on an orthogonal array of strength 3 (all possible triplets of states are represented at least once, and an equal number of times). We believe that this scheme is not flexible enough to encompass all experimental situations. Furthermore, given the strict constraints governing orthogonal arrays of strength 3, most other combinations of variables and states will lead to a near full-factorial experimental design. From the laboratory point of view, a satisfactory experimental strategy must be a compromise between a symmetrical sampling of the full-factorial space and a minimal number of experiments, for any combination of variables and states. Protocols generated using the SAmBA software appear to fulfill those constraints. All protocols involving a reasonable number of variables and states are handled very easily by both the SAm and BA programs. Most often, the proposed protocols consists of the precomputed minimal number of experiments (see the Methods section). For instance, a fully balanced protocol can be found for an experiment involving 5 variables with 14, 12, 4, 3, and 2 states, respectively, in the predicted minimal number of experiments (14 3 12 5 168; see the Methods section). Noticeable exceptions concerned even square matrices, like 6 3 6 and 8 3 8, for which neither the SAm or BA programs could find a solution within the precomputed minimum number of experiments (36 and 64, respectively) in a reasonable time (less than a few hours). For those cases, or protocols involving larger than usual variables and number of states, the minimal number of experiments was progressively incremented until convergence occurred. Usually, adding two or three experiments was sufficient. It is noteworthy that the responsibility of choosing the relevant variables and their states still lies in the scientist’s hands. From this information, SAmBA effectively samples the factorial space of all possible experiments and provides a balanced protocol that corresponds to a minimal number of experiments. The SAmBA software offers the choice between SAm and BA, which complement each other well. In general, BA is faster for designs in which no variable has more than 6 states. Table II compares the performance of the SAm and BA algorithm for a set of representative queries. For none of these queries could INFAC find a fully balanced protocol in the minimal number of experiments. Interactive Network access and a user-friendly interface are other key features of SAmBA. The program can be used to its full potential without any internal knowledge of the algorithms. By default, the software automatically activates the a priori most efficient algorithm for the current query. The default can be overridden, including the possibility of specifying a number of experiments (larger than the minimum). Variables and states are handled as text, allowing printout of the protocols to be immediately usable in the laboratory. The software package is available in 3 versions: C code, JAVA code (for Silicon Graphics, SUN, and P.C.), and as an interactive SOFTWARE FOR DESIGNING CRYSTALLIZATION PROTOCOLS TABLE II. Compared Performances of the Simulated Annealing (SAm) and Backtracking (BA) Programs on the Same Platform (Silicon Graphics Indy 175 MHz IP22, FPU: MIPS R4010, CPU: MIPS R4400, Memory: 128 Mb.)* Design Time (SAm) Protocol size Time (BA) Protocol size (5, 4, 4, 3, 2) (4, 4, 3, 3, 2) (4, 4, 3, 3, 3, 2) (6, 5, 4, 3, 2) (6, 6, 5, 4, 2) (8, 6, 4, 3, 2) (6, 5, 5, 4, 3) (6, 5, 5, 4, 3, 2) (7, 6, 5, 4, 3, 2) (12, 8, 4, 3, 2) (14, 8, 4, 3, 2) (14, 12, 4, 3, 2) 28139 28339 268949 68479 338259 98139 .1 h .1 h .1 h 18179 378209 28109 20 20 24 36 57 48 30 36 42 96 112 168 0.569 0.489 7908 0.959 598 158 38 358 .1 h .1 h 88 .1 h 20 16 16 30 36 48 30 30 42 96 112 168 *Queries are shown in column 1, listing the number of states for each variable. Thus (5, 4, 4, 3, 2) means 5 variables with 5, 4, 4, 3, and 2 states, respectively. The table focuses on queries for which INFAC could not produce a balanced protocol within the minimal number of experiments. Running times for the SAm and BA programs as well as protocol size (i.e., total number of experiences to be performed) are shown. JAVA applet running from a Netscape window at http://igs-server.cnrs-mrs.fr. Experimental design is useful in other areas than crystallization. This concept is relevant each time one wishes to interpret the properties of complex mixtures, such as in food science (e.g., Refs. 20 and 21), cell culturing (e.g., Ref. 22), pharmacy (e.g., Ref. 23), gasoline design (e.g., Ref. 24), or agricultural science (e.g., Ref. 25). SAmBA might thus prove useful in those areas as well. ACKNOWLEDGMENTS We thank Dr. M. Tegoni, Dr. P. Marchot, and Dr. Y. Bourne for their critical comments on the manuscript, and Prof. L. Abergel for helpful suggestions. Finally, we wish to thank a reviewer for helping us to clarify the ‘‘Experimental Procedure.’’ REFERENCES 1. Carter, C.W. Jr., Carter, C.W. Protein crystallization using incomplete factorial experiments. J. Biol. Chem., 254: 12219–12223, 1979. 257 2. Kingston, R.L., Baker, H.M., Baker E.N. Search designs for protein crystallization based on orthogonal arrays. Acta Crystallogr. D50:429–440, 1994. 3. Shieh, H.-S., Stallings, W.C., Stevens, A.M., Stegeman, R.A. Using sampling techniques in protein crystallization. Acta Crystallogr. D51:305–310, 1995. 4. Fisher, R.A. ‘‘The Design of Experiments,’’ 3rd Ed. London: Olivier and Boyd, 1942. 5. Carter, C.W. Jr. Efficient factorial designs and the analysis of macromolecular crystal growth conditions. Methods 1:12–24, 1990. 6. Carter, C.W., Jr., Baldwin, E.T., Frick, L. Statistical design of experiments for protein crystal growth and the use of precrystallization assay. J. Cryst. Growth, 90:60–73, 1988. 7. Gilliland, G.L., Bickham, D.M. The Biological Macromolecule Crystallization Database: A tool for developing crystallization strategies. Methods. 1:6–12, 1990. 8. Gilliland, G.L., Tung, M., Blakeslee, D.M., Ladner, J. The Biological Macromolecule Crystallization Database, Version 3.0: New features, Data, and the NASA archive for ptoyein crystal growth data. Acta Crystallogr. D50:408– 413, 1990 9. Carter, C.W. Jr. Design of crystallization experiments and protocols. In: ‘‘Crystallization of Nucleic Acids and Proteins, Ducruix A. and Giegé, The Practical Approach Series.’’ Rickwood D., Hames B.D. (eds.). London: Oxford University Press, 1992:47–73. 10. Abergel, C., Moulard, M., Moreau, H., Loret, E., Cambillau, C., Fontecilla-Camps, J.C. Systematic use of the incomplete factorial approach in the design of protein crystallization experiments, J. Biol. Chem. 266:20131–20138, 1990. 11. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. In: ‘‘Numerical Recipes in C,’’ 2nd Ed. Cambridge, UK: Cambridge, University Press, 1992:444–455. 12. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. Optimization by simulated annealing. Science 220:671–680, 1983. 13. Bose, R.C., Bush, K.A. Orthogonal arrays of strength two and three. Ann. Math. Stat. 23:508–524, 1952. 14. Bush, K.A. Orthogonal arrays of index unity, Ann. Math. Stat. 23:426–434, 1952. 15. Rao, C.R. General methods of analysis for incomplete block designs. J. Am. Stat. Assoc. 42:541–561, 1947. 16. Mandeli, J.P. Construction of asymmetrical orthogonal arrays having factors with a large non-prime power number of levels. J. Stat. Plan. Infer. 47:377–391, 1995. 17. Wang, J.C., Wu, C.F.J. An approach to the construction of asymmetrical orthogonal arrays. J. Am. Stat. Assoc. 86:450– 456, 1991. 18. Kissell, L.T. Optimization of white layer cake formulations by a multiple factor experimental design. Cereal Chem. 44:253–268, 1967. 19. Hare, L.B. Mixture designs applied to food formulation. Food Technol. 28:50–62, 1974. 20. Huor, S.S., Ahmed, E.M., Rao, P.V., Cornell, J.A. Formulation and sensory evaluation of a fruit punch containing watermelon juice. J. Food Sci. 19:809–813, 1980. 21. Moll, M., Flayeux, R., Mathieu, D., Phan-Tan-Luu, R. Optimal blending of malts. J. Inst. Brew. 88:139–144, 1982. 22. Zhang, J., Marcin, C., Shifflet, M.A., Salmon, P., Brix, T., Greasham, R., Buckland, B., Chartrain, M. Development of a defined medium fermentation process for physostigmine production by Streptomyces griseofuscus. Appl. Microbiol. Biotechnol. 44:568–575, 1996. 23. Hewlett, P.S. Measurement of the potencies of drug mixtures. Biometrics 25:477–487, 1969. 24. Morris, W.E., Snee R.D. Blending relationships among gasoline component mixtures. Autom. Technol., 3:56–61, 1979. 25. Nigam, A.K. Some designs and models for mixture experiments for the sequential exploration of response surfaces. J. Indian Soc. Agric. Stat. 26:120–124, 1974.