вход по аккаунту



код для вставкиСкачать
PROTEINS: Structure, Function, and Genetics 29:252–257 (1997)
SAmBA: An Interactive Software for Optimizing
the Design of Biological Macromolecules
Crystallization Experiments
Stéphane Audic, Fabrice Lopez, Jean-Michel Claverie, Olivier Poirot, and Chantal Abergel*
Information Génétique et Structurale, E.P. 91-C.N.R.S., Institut de Biologie Structurale et Microbiologie,
Marseille, France
SAmBA is a new software for
the design of minimal experimental protocols
using the notion of orthogonal arrays of
strength 2. The main application of SAmBA is
the search of protein crystallization conditions. Given a user input defining the relevant
effectors/variables (e.g., pH, temperature, salts)
and states (e.g., pH: 5, 6, 7 and 8), this software
proposes an optimal set of experiments in
which all tested variables and the pairwise
interactions between them are symmetrically
sampled. No a priori restrictions on the number and range of experimental variables is
imposed. SAmBA consists of two complementary programs, SAm and BA, using a simulated
annealing approach and a backtracking algorithm, respectively. The software is freely available as C code or as an interactive JAVA applet
at Proteins 29:252–
257, 1997. r 1997 Wiley-Liss, Inc.
Key words: software; crystallization; experimental design; JAVA
The limiting step in the crystallography of biological macromolecules is still the identification of physical and chemical conditions leading to optimal nucleation and crystal growth. Given the variety of
putative effectors (pH, temperature, ionic strength,
precipitating agents, etc.), a full-factorial screening
for ideal conditions would require a prohibitive number of experiments.
Following a seminal article by Carter and Carter,1
various design strategies have been proposed to
delineate crystallization conditions while trying to
minimize the number of experiments. Recent works
in this area include the use of orthogonal arrays2 and
sampling techniques.3
The concept of incomplete factorial design4 and its
use in the exploration and optimization of protein
crystallization conditions has been reviewed by
Carter.5 Given N variables (salts, pH, precipitating
agent, temperature, etc.) and a finite number of
states for each variable (NaCl 1M, 0.1M; pH 5.0, 7.0;
etc.), the number of experiments required can be
significantly reduced by varying more than one
variable at a time. In theory, the incomplete factorial
design can provide a minimal set of experiments in
which the influence of each variable can still be
rationally estimated, for instance, using multiple
linear regression. This approach was first implemented in the widely used INFAC program.6
The input to INFAC consists of the following
parameters: the total number of experiments required by the user, the number of variables (effectors
such as pH, temperature, and salt), and the number
of states per variable. INFAC then randomly generates up to 450 matrices (protocols) obeying the
constraints that the states of each variable are the
most equally represented (e.g., given a request for 35
experiments and a variable with 5 states, each of
them will be represented 7 times) and that each
proposed experiment is different. In a primitive way,
INFAC also incorporates the notion that the best
protocols are those in which all pairs of states are
represented. For each of the proposed protocol, the
number of missing pairs of states is listed, so the
user can select a protocol with this additional constraint in mind.
In its present version, INFAC suffers of a number
of limitations of practical importance. First, the user
has to iteratively guess an adequate number of
experiments. Second, the program often requires
more experiments than is really needed to find a
solution. Third, a careful analysis of the proposed
protocols reveals that they are rarely ‘‘balanced,’’ for
example, the sampling of all variables and states is
not symmetrical and prevents the experimental design to be fully informative, in particular for pairwise interactions. This problem is illustrated in
Table I.
Finally, INFAC limits the number of variables to
15, and the number of states to 6. In our experience,
crystallization protocols tend to make use of fewer
variables but more states (e.g., trying 12 different pH
*Correspondence to: Chantal Abergel, Information Gènétique et Structurale, E.P.91-C.N.R.S., Institut de Biologie Structurale et Microbiologie, Marseille, France.
Received 20 February 1997; Accepted 6 May 1997
TABLE I. An ‘‘Unbalanced’’ Matrix*
Experiment no.
*This output of the INFAC program was generated from a
query for a protocol of 16 experiments involving 4 variables
with 4, 4, 3, and 2 states, respectively. Association of the second
and fourth variables (V2. and V4) is not symmetrical. State
number 1 of V4 is associated 3 times with the state number 3 of
V2 while the state number 2 of V4 is associated only once with
state number 3 of V2. A similar problem occurs for state number
4 of V2. It is associated 3 times with state number 2 of V4 and
only once with state number 1 of V4.
ments required is computed by SAmBA and corresponds to the minimum needed for every pair of
states to be represented at least once (e.g., NaCl:
1M/pH: 7.0).
In addition, the protocols proposed by SAmBA are
correctly ‘‘balanced,’’ that is, every state is equally
represented for each variable, and every pair of
states is equally represented for every association of
any two variables, which allows for the identification
of pairwise interactions.
Such experimental design ensures that the influence of effectors can be recognized individually, as
well as the most probable pairwise interactions
between them. Clearly, the absence of some pairs in
the final protocol and an unequal representation of
the states in a same variable will introduce a bias in
the experiment, preventing the user to retrieve the
influence of each state from the study of its interaction with all the other. Moreover, to identify pairwise
interactions, these should all be represented at least
once and an equal number of times. Analyses will
otherwise become statistically meaningless, either
due to the absence of information, or to the overexpression of some pairs relative to the others. A proper
interpretation of the results thus requires ‘‘balanced’’
crystallization protocols.
Optimal crystallization conditions are rarely discovered with the first protocol. SAmBA will generally
be used in the iterative refinement of experimental
conditions, following a scoring of their capacity in
inducing various types of precipitates and/or crystals.10
values rather than exploring 12 variables in parallel,
such as pH, precipitating agent, salts, cofactor, additives, and ions).
SAmBA, a new software introduced herein, consists of 2 complementary programs, one based on
Simulated Annealing (SAm), the other on a more
classical Backtracking (BA) algorithm (see the Methods section); it overcomes all the limitations cited
above, and offers a more user-friendly interface (Fig.
1). For users of WWW browsers such as Netscape, no
program downloading is required, and the software
can be directly used on our server.
The user input to SAmBA simply consists of the
number of variables and of the number of states per
variable, as defined from initial precipitation experiments of the protein of interest at different pH (e.g.,
isoelectric point and activity pH) and in presence of
different precipitating agents. The precipitating
agents to be tested will vary from one protein to an
other and can be retrieved from the Biological Macromolecule Crystallization Databank7,8 when there are
similarities to previously crystallized macromolecules.9,10 The expected minimal number of experi-
The protocol matrix corresponding to Incomplete
Factorial Design is constituted of Nexp rows and Nv
columns, where Nexp is the number of experiments to
perform, and Nv is the number of variables V1,
V2, . . . , VNv. Each variable assumes a number of
states; for example, V1 may consist of five states 5S11,
S12, S13, S14, S156 or in general the ith variable Vi may
assume the ni distinct states 5Si1, Si2, . . ., Sini6.
SAmBA solves the problem of finding a minimal
protocol matrix (i.e., the smallest possible number of
experiments) obeying three constraints:
1. The states of each variable have to be the most
equally represented.
2. All experiments (rows in the matrix) have to be
3. All possible pairwise associations between states
of different variables have to be represented once,
and if more than once, an equal number of times.
An estimate of the minimal number of experiments required to fulfill those constraints, noted
Nexp, is needed to initiate the search process. Nexp is
such that no solution is clearly possible with a lesser
Fig. 1. Typical SAmBA session using the JAVA version. Left, top to bottom: Input windows
for the various options, parameters and variables. Each variable is given a name and a
nominative list of states. Right: listing of the optimal protocol found by SAmBA. This protocol
corresponds to a 4-variable matrix with 4, 4, 3, and 2 states, respectively. This is to compare to
the problem presented in Table I for which INFAC could not produce a relevant protocol. SAmBA
produced a solution that led to crystallization. In this real experiment, lines 1, 11, 12, and 16
provided the crystallization conditions of a new high molecular weight Desulfuromonas
acetoxidans cytochrome (Dr. Bruschi M., B.I.P., CNRS Marseille). The incomplete factorial
protocol consists of 16 different conditions, while the full-factorial protocol would amount to 96
different conditions.
number. However, certain difficult cases might require more experiments (see the Results section).
Nexp is computed as follows. First, consider the two
variables with the largest number of states: Vk,
having nk distinct states 5Sk1, Sk2, . . . , Skn 6, and Vl,
having nl states 5Sl1, Sl2, . . . , Sln 6. If the total numl
ber of variables Nv is not larger than the maximal
number of states nk, Nexp is computed as nk 3 nl.
Otherwise, (i.e., Nv . nk) Nexp is computed as Nv 3
nk. The latter situation is rarely encountered in the
design of actual crystallization experiments.
The program SAm and BA both initiate the search
of optimal protocols involving Nexp distinct experiments. If they fail to find a solution in a given
elapsed time (or a given number of iterations), this
number is incremented until a solution obeying the
constraints listed above is found. For more flexibility,
SAmBA also retains the option for the user to
request a specific number of experiments, larger or
equal to Nexp.
SAm: A Simulated Annealing Algorithm
The design of an experimental protocol is viewed
here as a global minimization problem. The goal is to
meet the constraints described above: for each variable, every state must be represented about the
same number of times and every pairwise combination of states and variables should be represented at
least once and/or about the same number of times.
We now define Qkplq as the desired number of occurrences of the pair 5Vk 5 Skp, Vl 5 Slq6 in the experimental protocol. Similarly Pkplq is defined as the observed
number of occurrences of the same pair. To satisfy
constraint 1, given Nexp, one can calculate the number of occurrences of each state for the ith variable as
Nexp/ni where ni is the number of distinct states for
the ith variable. When the quotient is not an integer,
some states will be represented by one more occurrence than others. By examining each row, representing a single experiment, with every other row, it can
be assured that constraint 2 holds. Now the energy
function can be defined as:
o oo o
(Pkplq 2 Qkplq)2
k51 l5k11 p51 q5p11
The matrices satisfying the first requirement may
correspond to a local minimum of the energy function, and among them are hidden matrices satisfying
both requirements and corresponding to the global
minimum of the energy function. Switching from one
matrix to an other is done by swapping two elements
of the same column. Swapping leaves constraint 1
satisfied since the total number of occurrences of
each of the two swapped states is left unchanged.
However, since constraint 2 may no longer be fulfilled, a matrix satisfying constraints 1 and 3 will be
checked for constraint 2 at the end.
Simulated annealing is a method of choice for such
minimization problems where the global minimum
is hidden among many local minima.11,12 Starting
from a randomized matrix satisfying constraints 1
and 2, we perform a random swap and compute the
modification of energy it produces on the system. The
most obvious algorithm would be to keep only the
swaps decreasing the system energy (DE , 0); however, this is most likely to result in trapping the
system in a local minimum. Instead, the simulated
annealing approach consists in accepting all swaps
reducing the system energy, but also to keep those
increasing the system energy with a probability
e(2DE/T) where T is the temperature of the system. At
high temperatures the system wanders freely
throughout the space of configurations. Then, the
temperature is progressively lowered, increasing the
tendency for the system to become trapped in local
minima. At the beginning of the experiment, T is
large with respect to the typical energy changes
produced by random swaps. Then, T decreases geometrically, i.e. T 5 0.8 3 T. Temperature changes are
programmed to occur either after 50,000 swaps, or
after 5,000 successful ones (DE , 0). Random swaps
are generated by the random draw of a variable and
two experiment numbers, and by exchanging the
states of this variable in the two experiments. As a
single cycle of cooling is not always sufficient to reach
the global minimum (E 5 0), multiple cycles are
performed to find the solution of the most difficult
problems, such as near-square matrices involving
variables with numbers of states larger than 6.
BA: A Backtracking Algorithm
The backtracking approach simply consists in
building row after row the matrix satisfying the
constraint on the number of occurrence of each state
(constraint 1) and pair of states (constraint 3). After
each row is added, Pkplq is checked to see that it does
not exceed Qkplq, that none of the previous row is
duplicated (constraint 2) nor (constraint 1) is violated. If this is the case, the program then proceeds
to build the next row. Otherwise, if one of the checks
is violated, another possible row generated by stepping to the next possible combinations of states is
investigated. When all the combinations have been
unsuccessfully exhausted for the current row, the
program goes back one row. The last successful row
is replaced by the next possible row, the new combination of states is investigated for the 3 constraints.
Since the number of possibilities increases rapidly
with the number of variables, the straightforward
implementation of the backtracking algorithm soon
becomes impractical. Additional strategies are used
to speed up the convergence. Before going back, the
program also tries to exchange the last successful
row with another row. The row to be replaced is
chosen on the basis of an energy function similar to
the one used by the simulated annealing algorithm.
The energy is computed for all the incomplete matrices constituted of all but one row of the current
matrix. The row chosen is the one such that its
removal from the incomplete list minimizes the
energy function defined in the simulated annealing
protocol. If this procedure still fails to unlock the situation and after 30 seconds of elapsed time, the next row
targeted for replacement is randomly chosen.
Prime Number Square Matrix
Prime number square matrices constitute a peculiar case.13,14 A square matrix corresponds to protocols where the number of states is the same for all
variables, and equal to the number of variables.
Perfectly balanced solutions are particularly easy to
find for those matrices if they involve a prime
number (3, 5, 7, 11, etc.) of variables. The algorithm
involves a succession of well-defined permutations.
The required number of experiments correspond to
the square of the prime number: P2. Every state will
appear p times. The matrix can be viewed as p blocks
of dimension ( p 3 p). The first column of each block
bears p time the block number, the first block being
given number 1. The second column bears a sequential list of the states numbers of the second variable.
Starting from column 3, the columns are generated
by circular permutation of the previous column. The
step of this permutation is equal to the block number
minus 1.
There is no known simple and general answer (i.e.,
for an arbitrary number of variables and states) to
the problem of building pairwise balanced matrices
as defined above. The problem exhibits some aesthetic and surprising features. For instance, experimental designs involving prime number square matrices (e.g., 5 variables, each of them with 5 states) of
any size are particularly easy to find.15,16 On the
other hand, experimental designs involving even
square matrices larger or equal to 6 (e.g., 6 variables
at 6 states) seems to require more experiments and
produce less symmetrical solutions than expected
from intuition. Outside of those specific categories,
the difficulty of solving given queries varies greatly,
with no apparent rule.
The perfectly pairwise balanced matrices we are
attempting to build are mathematically known as
orthogonal arrays of strength 2.17–19 The construction of orthogonal arrays is still an active field of
research, but it is already clear that these can only
exist for a small subset of all possible parameters
(e.g., numbers of variables, states and experiments).
Orthogonal arrays have been discussed in the context of protein crystallization by Kingston et al.2
They advocated the use of a general 64-experiment
protocol built on an orthogonal array of strength 3
(all possible triplets of states are represented at least
once, and an equal number of times). We believe that
this scheme is not flexible enough to encompass all
experimental situations. Furthermore, given the
strict constraints governing orthogonal arrays of
strength 3, most other combinations of variables and
states will lead to a near full-factorial experimental
From the laboratory point of view, a satisfactory
experimental strategy must be a compromise between a symmetrical sampling of the full-factorial
space and a minimal number of experiments, for any
combination of variables and states. Protocols generated using the SAmBA software appear to fulfill
those constraints. All protocols involving a reasonable number of variables and states are handled very
easily by both the SAm and BA programs. Most
often, the proposed protocols consists of the precomputed minimal number of experiments (see the Methods section). For instance, a fully balanced protocol
can be found for an experiment involving 5 variables
with 14, 12, 4, 3, and 2 states, respectively, in the
predicted minimal number of experiments
(14 3 12 5 168; see the Methods section). Noticeable
exceptions concerned even square matrices, like 6 3
6 and 8 3 8, for which neither the SAm or BA
programs could find a solution within the precomputed minimum number of experiments (36 and 64,
respectively) in a reasonable time (less than a few
hours). For those cases, or protocols involving larger
than usual variables and number of states, the
minimal number of experiments was progressively
incremented until convergence occurred. Usually,
adding two or three experiments was sufficient.
It is noteworthy that the responsibility of choosing
the relevant variables and their states still lies in the
scientist’s hands. From this information, SAmBA
effectively samples the factorial space of all possible
experiments and provides a balanced protocol that
corresponds to a minimal number of experiments.
The SAmBA software offers the choice between
SAm and BA, which complement each other well. In
general, BA is faster for designs in which no variable
has more than 6 states. Table II compares the
performance of the SAm and BA algorithm for a set
of representative queries. For none of these queries
could INFAC find a fully balanced protocol in the
minimal number of experiments.
Interactive Network access and a user-friendly
interface are other key features of SAmBA. The
program can be used to its full potential without any
internal knowledge of the algorithms. By default, the
software automatically activates the a priori most
efficient algorithm for the current query. The default
can be overridden, including the possibility of specifying a number of experiments (larger than the
minimum). Variables and states are handled as text,
allowing printout of the protocols to be immediately
usable in the laboratory. The software package is
available in 3 versions: C code, JAVA code (for Silicon
Graphics, SUN, and P.C.), and as an interactive
TABLE II. Compared Performances of the
Simulated Annealing (SAm) and Backtracking (BA)
Programs on the Same Platform (Silicon Graphics
Indy 175 MHz IP22, FPU: MIPS R4010, CPU: MIPS
R4400, Memory: 128 Mb.)*
(5, 4, 4, 3, 2)
(4, 4, 3, 3, 2)
(4, 4, 3, 3, 3, 2)
(6, 5, 4, 3, 2)
(6, 6, 5, 4, 2)
(8, 6, 4, 3, 2)
(6, 5, 5, 4, 3)
(6, 5, 5, 4, 3, 2)
(7, 6, 5, 4, 3, 2)
(12, 8, 4, 3, 2)
(14, 8, 4, 3, 2)
(14, 12, 4, 3, 2)
.1 h
.1 h
.1 h
.1 h
.1 h
.1 h
*Queries are shown in column 1, listing the number of states
for each variable. Thus (5, 4, 4, 3, 2) means 5 variables with 5, 4,
4, 3, and 2 states, respectively. The table focuses on queries for
which INFAC could not produce a balanced protocol within the
minimal number of experiments. Running times for the SAm
and BA programs as well as protocol size (i.e., total number of
experiences to be performed) are shown.
JAVA applet running from a Netscape window at
Experimental design is useful in other areas than
crystallization. This concept is relevant each time
one wishes to interpret the properties of complex
mixtures, such as in food science (e.g., Refs. 20 and
21), cell culturing (e.g., Ref. 22), pharmacy (e.g., Ref.
23), gasoline design (e.g., Ref. 24), or agricultural
science (e.g., Ref. 25). SAmBA might thus prove
useful in those areas as well.
We thank Dr. M. Tegoni, Dr. P. Marchot, and Dr. Y.
Bourne for their critical comments on the manuscript, and Prof. L. Abergel for helpful suggestions.
Finally, we wish to thank a reviewer for helping us to
clarify the ‘‘Experimental Procedure.’’
1. Carter, C.W. Jr., Carter, C.W. Protein crystallization using
incomplete factorial experiments. J. Biol. Chem., 254:
12219–12223, 1979.
2. Kingston, R.L., Baker, H.M., Baker E.N. Search designs for
protein crystallization based on orthogonal arrays. Acta
Crystallogr. D50:429–440, 1994.
3. Shieh, H.-S., Stallings, W.C., Stevens, A.M., Stegeman,
R.A. Using sampling techniques in protein crystallization.
Acta Crystallogr. D51:305–310, 1995.
4. Fisher, R.A. ‘‘The Design of Experiments,’’ 3rd Ed. London:
Olivier and Boyd, 1942.
5. Carter, C.W. Jr. Efficient factorial designs and the analysis
of macromolecular crystal growth conditions. Methods
1:12–24, 1990.
6. Carter, C.W., Jr., Baldwin, E.T., Frick, L. Statistical design
of experiments for protein crystal growth and the use of
precrystallization assay. J. Cryst. Growth, 90:60–73, 1988.
7. Gilliland, G.L., Bickham, D.M. The Biological Macromolecule Crystallization Database: A tool for developing crystallization strategies. Methods. 1:6–12, 1990.
8. Gilliland, G.L., Tung, M., Blakeslee, D.M., Ladner, J. The
Biological Macromolecule Crystallization Database, Version 3.0: New features, Data, and the NASA archive for
ptoyein crystal growth data. Acta Crystallogr. D50:408–
413, 1990
9. Carter, C.W. Jr. Design of crystallization experiments and
protocols. In: ‘‘Crystallization of Nucleic Acids and Proteins, Ducruix A. and Giegé, The Practical Approach Series.’’ Rickwood D., Hames B.D. (eds.). London: Oxford
University Press, 1992:47–73.
10. Abergel, C., Moulard, M., Moreau, H., Loret, E., Cambillau,
C., Fontecilla-Camps, J.C. Systematic use of the incomplete factorial approach in the design of protein crystallization experiments, J. Biol. Chem. 266:20131–20138, 1990.
11. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery,
B.P. In: ‘‘Numerical Recipes in C,’’ 2nd Ed. Cambridge, UK:
Cambridge, University Press, 1992:444–455.
12. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. Optimization by
simulated annealing. Science 220:671–680, 1983.
13. Bose, R.C., Bush, K.A. Orthogonal arrays of strength two
and three. Ann. Math. Stat. 23:508–524, 1952.
14. Bush, K.A. Orthogonal arrays of index unity, Ann. Math.
Stat. 23:426–434, 1952.
15. Rao, C.R. General methods of analysis for incomplete block
designs. J. Am. Stat. Assoc. 42:541–561, 1947.
16. Mandeli, J.P. Construction of asymmetrical orthogonal
arrays having factors with a large non-prime power number of levels. J. Stat. Plan. Infer. 47:377–391, 1995.
17. Wang, J.C., Wu, C.F.J. An approach to the construction of
asymmetrical orthogonal arrays. J. Am. Stat. Assoc. 86:450–
456, 1991.
18. Kissell, L.T. Optimization of white layer cake formulations
by a multiple factor experimental design. Cereal Chem.
44:253–268, 1967.
19. Hare, L.B. Mixture designs applied to food formulation.
Food Technol. 28:50–62, 1974.
20. Huor, S.S., Ahmed, E.M., Rao, P.V., Cornell, J.A. Formulation and sensory evaluation of a fruit punch containing
watermelon juice. J. Food Sci. 19:809–813, 1980.
21. Moll, M., Flayeux, R., Mathieu, D., Phan-Tan-Luu, R.
Optimal blending of malts. J. Inst. Brew. 88:139–144, 1982.
22. Zhang, J., Marcin, C., Shifflet, M.A., Salmon, P., Brix, T.,
Greasham, R., Buckland, B., Chartrain, M. Development of
a defined medium fermentation process for physostigmine
production by Streptomyces griseofuscus. Appl. Microbiol.
Biotechnol. 44:568–575, 1996.
23. Hewlett, P.S. Measurement of the potencies of drug mixtures. Biometrics 25:477–487, 1969.
24. Morris, W.E., Snee R.D. Blending relationships among gasoline component mixtures. Autom. Technol., 3:56–61, 1979.
25. Nigam, A.K. Some designs and models for mixture experiments for the sequential exploration of response surfaces.
J. Indian Soc. Agric. Stat. 26:120–124, 1974.
Без категории
Размер файла
146 Кб
Пожаловаться на содержимое документа