CCBF - A System for the Computer Processing of Chemical and
Biological Facts [**]
By Gerhard Ohnacker and Werner Kalbfleisch [*]
A computer-based system for the documentation of company research results in drug
research is described. The system stores the numerous results of standardized biological
tests and processes chemical formulas by topological techniques. The stored material is
processed to provide computer-printed card indexes and lists on chemical and/or biological topics. The programs enable searches to be made for any desired substructure and
for pharmacophoric groups, thus providing highly effective assistance in the search for
structure-activity relationships.
1. Introduction
The development of a new, biologically active substance to the point where it can be introduced as a
drug takes at present an average of six to eight years.
As a rule it requires the synthesis of a few thousand
new chemical substances, which must be tested by a
wide variety of biological test methods. The requirements for proof of biological activity, medical usefulness, and the relative degree of safety are continually
refined and made more stringent, and we must give
serious consideration to the predictions [1] that in
future the time required to develop a new drug resulting from a large research project might increase to as
much as 15 to 20 years.
During this long period scientists in many disciplines
will be working simultaneously or successively on a
large number of promising groups of substances, and
as the project progresses the number of data that must
be evaluated will increase. In such a case individual
results are only meaningful when confirmed by other
similar results. Thus for a given substance large numbers of results, obtained either simultaneously or at
different times, have to be compared with other individual values or with the activity profiles of other substances.
Such comparisons can assist the scientists' efforts in
optimizing biological activity by variation of both
substance groups and substituents. The usefulness of
these comparisons increases with:
- the size of the cooperating team, i.e. with the number and range of the tests from which the given individual results are derived;
- the degree of certainty of the biological results, i.e. with the number of standardized tests that can be used, the results of which remain comparable over long periods and are largely unaffected by external influences;
- the extent and accuracy of the physicochemical data that can be used for regression and correlation analyses.
[*] Dr. G. Ohnacker and W. Kalbfleisch
Chemische Forschung der Dr. K. Thomae GmbH
795 Biberach an der Riss (Germany)
[**] CCBF = Computer processing of Chemical and Biological Facts.
[1] Medicine in 1990 - a technical forecast by the Office of Health Economics, London, Med. Pharm. Studienges. e. V., Frankfurt, 1969.
Angew. Chem. internat. Edit. / Vol. 9 (1970) / No. 8
Today the ever-increasing mass of data can be stored
and processed efficiently by large computers. The usefulness of computers depends on the speed and certainty with which access is offered to the collection of
formulas and results, and on the clearness and readability of the answer. In an effort to meet these requirements as closely as possible, a system for the documentation of company research results can use different approaches and more specific techniques than
those used by a documentation system for published
material in the same field which is of interest to
numerous people.
Every documentation system of published material
has to strive for completeness. It aims to cover every
publication in a given field and all publications in the
marginal fields. In addition, it must try to ensure that
each document is indexed to the same depth. Often
this is only partly achieved, because it is determined
not only by the actual content of the publication but
also by a more or less detailed thesaurus which lists
the terms considered worth indexing, and also by the
subjective on-the-spot decisions of the abstractor.
Contrariwise, company research results can be deliberately indexed at different levels and depths, because
the subsequent use of the system can be planned very
exactly to match the needs of a team of limited size.
If the documentation system is to give effective support
to the research effort, the best system for a given case
will depend not only on the nature and extent of the
data to be stored and processed but also on the specific
scientific requirements. For this reason we need the
collaboration of the scientists themselves in the design
of the means for data recording, in the establishment
of SDI (Selective Dissemination of Information)
profiles, and in the continuous enlargement of the
system.
This is the basis upon which the CCBF system for the
computer processing of chemical and biological research results was developed. As in all documentation
systems, data recording is the bottleneck. Therefore,
particular care was taken that the various facilities
of the system can be offered with a minimum of data-recording effort.
2. Functions of the CCBF System
The system is required to:
- store chemical structures and biological results;
- provide information in the search for structure-activity relationships;
- supply the scientist, either as a current-awareness service or on special request, with computer-produced card indexes and lists of results;
- prepare statistical data for management.
Fig. 1. CCBF chemical index card. (Original format 6” x 4”.)
To meet these requirements the programs enable the
chemical file to be searched for individual compounds,
substructures, and general formulas. The biological
file can be searched for individual results or activity
profiles. The computer prints out the appropriate
formulas and data. To a large extent the printouts can
be arranged to suit the inquirer’s wishes. Table 1 shows
examples of some of the types of questions that can be
put, together with the types of printouts. In addition
to these specifically requested searches, the CCBF
system also offers a continuing service in the form of
chemical (Fig. 1) and biological (Fig. 2) card index
systems. These subsystems are continually updated
and distributed to the scientific workers.
Table 1. Summary of types of questions and answers in the CCBF system.
Type of question | Format of search result
Chemical questions: |
Search for a definite structure | Printout of structural formula
Search for a definite class of compounds; search for a general formula; search for any given substructure | Printout of the structural formulas of all the compounds found. The list can be arranged by code names or by substituents
Combined questions: |
Search for the biological action of a definite substance | Printout of all biological data stored for this substance
Search for compounds which are effective in one or more tests; dosage limits can be set | Printout of the code names and/or formulas of the substances found together with the pharmacological data. The substances can also be arranged in order of effectiveness
Search for compounds with one or more given side effects | Printout of code names and/or formulas of the compounds found, arranged by dose, animal, and mode of application
Statistical questions: |
Summary of all new compounds synthesized in a given period | List of code names and/or formulas, arranged by classes of substances
Summary of all biological results reported in a given period | List arranged by test number and/or customer
Fig. 2. CCBF biological index card. (Original format 6” x 4”.)
3. Scope of the CCBF Store
3.1. Biological Facts
All results obtained on screening with standardized
biological test methods in pharmacology, biochemistry, and microbiology are stored numerically. Results
from special studies of individual substances are also
stored, but they are not indexed in such depth since
they have no significance in the search for structure-activity relationships. These can only be detected with
standardized tests on a series of structurally related
compounds. However, the individual results are important in searching for activity profiles of individual
substances. But in these cases it is usually sufficient to
find indications of the dates, authorships, and locations of the reports on the results of special biological tests.
3.2. Chemical Facts
The file contains the structures of all new compounds
synthesized within the company for biological testing.
It also includes the formulas of published compounds
that have been used for biological testing within the
company.
4. Documentation Techniques
4.1. Biological Results
The indirect coding of biological information with numerical codes or thesauri [2] is certainly of universal application, but it also involves considerable expenditure. Conversion of the information into combinations of numbers or keywords, and also the continual amplification of the codes and thesauri, can only be performed with the necessary degree of accuracy by well-trained experts. The same applies to direct codings with coding forms on which the relevant information must be marked.
[2] For example the Biology Code of the Chemical Biological Coordination Center, ed. by P. G. Seitner, G. A. Livingston, and A. S. Williams, National Academy of Sciences, National Research Council, Washington D.C., 1960.
The great majority of the biological results stored in
the CCBF system is derived from tests performed by
standardized methods. For this reason the format of
all the results from a given test is always the same;
therefore, a data sheet suitable for direct data recording has been developed for each test method. All the
data sheets can be filled in while experiments are
going on. The data on these forms are transferred to
the biological file either in text form or as numerical
values. For example, Fig. 3 shows the data sheet for
the motility test with the revolving drum. The entry of
data becomes still simpler in the case of tests whose
results are calculated using biometrical methods by
computer. In these cases the results form (Fig. 4) is
also produced by computer; it is first sent to the
scientist so that he can eliminate any useless results.
Only the data worth recording are transferred from
the results tape to the CCBF store. All the data are
first checked by the computer for formal errors and
plausibility. Formal errors are indicated, for example,
if the substance tested has not yet been entered in the
chemical file, or if the sequence of results (card sequence) is incorrect. Testing for plausibility shows
whether the data supplied make good sense; the
program will give a warning if, for example, the
results are illogical or the doses are nonsensical.
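The two kinds of input check can be illustrated by a minimal sketch in Python (the CCBF programs themselves were written in Assembler and PL/I). The record layout, the field names `dose_mg_per_kg` and `percent_effect`, and the dose limit are assumptions made for the example only; they are not taken from the CCBF data sheets.

```python
from dataclasses import dataclass


@dataclass
class ResultCard:
    substance: str         # key data linking the result to the chemical file
    sequence: int          # position of the card within the submitted batch
    dose_mg_per_kg: float
    percent_effect: float  # hypothetical result field, expected in 0..100


def check_cards(cards, chemical_file, max_dose=5000.0):
    """Return (formal_errors, plausibility_warnings) for one batch of cards."""
    formal, warnings = [], []
    expected = 1
    for card in cards:
        # Formal check 1: the substance must already be in the chemical file.
        if card.substance not in chemical_file:
            formal.append(f"card {card.sequence}: unknown substance {card.substance}")
        # Formal check 2: cards must arrive in unbroken sequence.
        if card.sequence != expected:
            formal.append(f"card {card.sequence}: expected sequence number {expected}")
        expected = card.sequence + 1
        # Plausibility: doses and results must lie in a sensible range.
        if not 0 < card.dose_mg_per_kg <= max_dose:
            warnings.append(f"card {card.sequence}: implausible dose {card.dose_mg_per_kg}")
        if not 0 <= card.percent_effect <= 100:
            warnings.append(f"card {card.sequence}: implausible result {card.percent_effect}")
    return formal, warnings


cards = [
    ResultCard("AB 123", 1, 12.5, 82.5),
    ResultCard("ZZ 999", 3, -4.0, 140.0),
]
formal, warnings = check_cards(cards, chemical_file={"AB 123"})
print(formal)
print(warnings)
```

Formal errors and plausibility warnings are kept in separate lists because a formal error would normally block the entry, whereas a warning only asks the scientist to look again.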
The data record of all the results relating to an individual substance in the biological file begins with the
key data of the compound, which also provide the
link to the chemical file. The test results follow in the
order of their entry. For each test there are the
characteristic data (test number, type of animal, mode
of application, etc.), the dosages, the numerical results, and finally any remarks made by the investigator,
in clear text.
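As a rough illustration of this record layout, the following Python sketch models one substance record with its test results. The class and field names are hypothetical; only the grouping (key data first, then the tests in order of entry, each with characteristic data, dosages, results, and remarks) follows the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResultEntry:
    test_no: str        # code of the standardized test method
    animal: str         # type of animal
    application: str    # mode of application
    doses: List[float]
    results: List[float]
    remarks: str = ""   # stored as clear text; printed but not searchable


@dataclass
class SubstanceRecord:
    key_data: str                                   # link to the chemical file
    tests: List[ResultEntry] = field(default_factory=list)

    def add(self, entry: ResultEntry) -> None:
        # Results are kept in the order of their entry.
        self.tests.append(entry)


record = SubstanceRecord(key_data="AB 123")
record.add(ResultEntry("0371", "rat", "p.o.", [12.5], [82.5], "slight sedation"))
record.add(ResultEntry("0412", "mouse", "i.v.", [5.0, 10.0], [40.0, 75.0]))
print([entry.test_no for entry in record.tests])
```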
4.2. Chemical Structures
Only substances of clearly defined structure are
entered into the chemical file of the CCBF system;
Markush formulas do not occur [*]. Thus it is possible
to process the chemical formulas topologically. In
this method, the fundamentals of which were first
described by Mooers [3], the structures are stored
unambiguously and searches may be carried out for
any desired substructure. In developing our coding
system for formulas we took advantage of the experience of Horowitz and Crane [4], Waldo [5], and
Meyer [6]. The formulas are entered manually onto a
grid sheet (Fig. 5). Certain rules have to be observed,
mainly with regard to the data recording equipment.
Fig. 3. Data sheet for the input of biological data into the CCBF
system. Code symbols are only used in the “Test No.” and “Modif.”
fields for the method of test, in the “Type of animal” field, and in the
“Side effects” field. Anything that the scientist has entered in the “Remarks” section will be stored as text. Such remarks cannot be searched
for but they can be printed out together with biological search results.
(Original format 8” x 6”.)
Fig. 5 . Coding sheet for recording chemical formulas in the CCBF
system. (Original format 12” x S”.)
For printing formulas by computer we use a high-speed IBM 1403/N1 line printer with additional characters to print some chemical symbols not included in the normal character set. These are: \ for the one single bond; ‖, //, and \\ for the double bonds; ≡ and ||| for the triple bonds; and the symbol U to characterize an aromatic ring. In order to get a suitable printed formula, the line density is altered to eight lines per inch. Since these symbols do not appear on a normal key punch, the aromatic ring is coded as @, the single bond \ as *, the double bond in any direction as #, and the triple bond as [...]. The coding program ensures that the correct symbols are filed. For the bonds of six-membered aromatic systems the program generates the “aromatic” bond type.
To obtain a suitable printed formula it is also possible to draw dotted bonds to indicate ionic bonds or steric positions; charge signs can also be drawn. A special code relates these and other data (e.g. radioactive labeling) to the corresponding atoms. This code gives the grid coordinates (row and column number of the atom in the grid sheet) together with a keyword for the special fact.
Topological coding is concerned solely with atoms other than hydrogen. Hydrogen atoms are therefore not drawn on the grid sheet, unless they are required for the improvement of the formula printout. In addition to the descriptive data (key data of the compound, month of its synthesis, and the chemical name or bibliographic data), the empirical formula and the number of rings present in the molecule are entered for checking purposes.
Fig. 4. Computer-produced results form for biometrically calculated biological data.
[*] A Markush formula is a general formula such as R’NH2, in which R’ might for example stand for any alkyl group from C1 to C10.
[3] C. N. Mooers, Zator Techn. Bull. 59, 1 (1951).
[4] P. Horowitz and E. M. Crane: Hecsagon: A System for Computer Storage and Retrieval of Chemical Structure. Eastman Kodak Company, Rochester, N.Y., 1961.
[5] W. H. Waldo, J. chem. Documentation 2, 1 (1962).
[6] E. Meyer, Angew. Chem. 77, 240 (1965); Angew. Chem. internat. Edit. 4, 347 (1965).
One technician is able to fill in the grid sheets for an
average of 40 formulas per hour. The coding is easy
to check visually and does not preclude future
developments; for instance, the grid sheets
might later be processed by optical scanners [7]. Moreover,
this kind of input provides the information for computer printing of the formulas without any additional
manual operation. The input data are checked by the
computer for formal validity. Error messages are produced for invalid atom or bonding symbols, missing
lines of the formula matrix, and similar formal errors.
In order to produce the topological connectivity table,
the computer numbers the atoms of each formula. The
program also ascertains the quantity and type of the
bonds from each atom. No attention is paid to
hydrogens, unless they are specially coded. The computer then produces a compact list [8] by removing the
redundant information from the connectivity table. In
order to speed up searching through higher
selectivity we have deviated from the topological
methods described by Gluck [8] and Morgan [9], insofar as the CCBF encoding program calculates
further information about the individual atoms and
includes it in the compact list. This information
records whether or not the atom is a ring member,
and also the number of hydrogen atoms attached to it
(this can shorten the search time if it is required that
hydrogen atoms should be present at definite places
in the structure or substructure that is searched for).
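The additional per-atom information described above can be illustrated with a small Python sketch. It is not the CCBF encoding program: the valence table, the function name `build_atom_records`, and the way ring membership is detected (a bond lies in a ring if its two atoms remain connected when the bond is ignored) are assumptions chosen for brevity, and only integer bond orders are handled.

```python
from collections import defaultdict

# Assumed standard valences, used only to derive the hydrogen counts.
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "Cl": 1}


def build_atom_records(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j, order) tuples."""
    neighbours = defaultdict(dict)
    for i, j, order in bonds:
        neighbours[i][j] = order
        neighbours[j][i] = order

    def connected_without(i, j):
        # Can atom j still be reached from atom i when the bond i-j is ignored?
        seen, stack = {i}, [i]
        while stack:
            a = stack.pop()
            for b in neighbours[a]:
                if (a, b) == (i, j) or b in seen:
                    continue
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
        return False

    # Both atoms of a bond that lies on a cycle are ring members.
    ring_atoms = set()
    for i, j, _ in bonds:
        if connected_without(i, j):
            ring_atoms.update((i, j))

    records = []
    for n, element in enumerate(atoms):
        used = sum(neighbours[n].values())
        records.append({
            "atom": n,
            "element": element,
            "ring": n in ring_atoms,
            "hydrogens": max(VALENCE[element] - used, 0),
            "bonds": sorted(neighbours[n].items()),  # (partner atom, bond order)
        })
    return records


# Methylcyclopropane: atoms 0-2 form the ring, atom 3 is the methyl carbon.
atoms = ["C", "C", "C", "C"]
bonds = [(0, 1, 1), (1, 2, 1), (2, 0, 1), (0, 3, 1)]
records = build_atom_records(atoms, bonds)
for record in records:
    print(record)
```

Storing the ring flag and the hydrogen count with each atom means that a search can reject a candidate atom by a simple comparison before any neighbouring atoms are examined.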
In the computer file, each atom of a structure is described by:
- the code for the type of atom;
- the information “Ring member or not ring member”;
- the information “Bond from atom No. ...”;
- the code for the type of bond;
- the code for the number of hydrogens connected with the atom;
- the codes for any special data.
Some of these codes are bit chains, so that Boolean logic operators (AND, OR, NOT) can be used in searches.
The computer calculates the empirical formula and the number of rings and compares them with the values given. Erroneous codings are rejected.
Since even third-generation computers still require too much time for the iterative atom-by-atom comparison described by Ray and Kirsch [10], the CCBF system also employs some preliminary checks. These screens are produced by the computer. At present the system uses:
- an empirical formula screen;
- a two-atom fragment screen, which indicates how often multiple bonds occur between two carbon atoms or single and multiple bonds between carbon and hetero atoms or between two hetero atoms; single, double, aromatic, and triple bonds are counted separately;
- a ring screen, in which every basic ring in the molecule is described. This ring screen corresponds approximately to the ring fragments of the Ring Code [11].
These preliminary checks eliminate at least 90 %, and in many cases over 99 %, of the stored compounds very quickly as “not relevant”. Only those remaining need to be subjected to a topological comparison.
The entire data record for a structural formula thus comprises:
- the key data (for company-produced substances the company code, for compounds known in the literature the name in clear text);
- the computer-generated screens;
- the computer-generated compact list;
- the information for the printing of the formula (a computer-condensed form of the input data);
- the chemical name and/or bibliographic details.
On average such a data record occupies 330 bytes, so that one magnetic tape can hold about 120000 formulas. The CCBF encoding program processes an average of 150 formulas per minute.
[7] W. E. Cossum, M. E. Hardenbrook, and R. N. Wolfe, Proc. Amer. Documentation Inst. 1964, 269.
[8] D. J. Gluck, J. chem. Documentation 5, 43 (1965).
[9] H. L. Morgan, J. chem. Documentation 5, 107 (1965).
[10] L. C. Ray and R. A. Kirsch, Science (Washington) 126, 814 (1957).
[11] W. Steidle, Pharm. Ind. 19, 88 (1957).
5. Search Program
Not only does the CCBF system perform the continuing service of producing card index subsystems (Figs. 1 and 2) but it also enables searches to be made on chemical or biological topics or on structure-activity relationships (Table 1).
5.1. Coding of Biological Questions
In biological searches every field of the biological data record can be searched. Logical connections are possible within similar parts of different records. All
questions about dosages and results must be connected
with the appropriate test code. It is also possible to
put an overall request for certain data, e.g. toxicity
data (percentage of animals that died during the test)
and side effects. Biological questions are generally
coded in free form in the order keyword - operator - search requirements. The keywords name the data
fields in the biological record that are to be checked
in the search. The operators and search requirements
define all the conditions that must be fulfilled or items
that must be excluded in order to answer the question.
The permitted operators are “equal to”, “not equal
to”, “greater than”, and “lower than”. The search requirements can also include ranges of values, e.g. ED-50
“between 10 and 50 mg/kg”.
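The keyword - operator - search requirement form of a biological question can be sketched in Python as follows. The field names (`TEST`, `ED50`, and so on), the operator abbreviations, and the dictionary layout of the records are assumptions for illustration; the actual CCBF question format is not reproduced here.

```python
import operator

# The four permitted comparison operators; ranges are handled separately.
OPERATORS = {
    "EQ": operator.eq,   # equal to
    "NE": operator.ne,   # not equal to
    "GT": operator.gt,   # greater than
    "LT": operator.lt,   # lower than
}


def matches(result, conditions):
    """True if one test result satisfies every (keyword, operator, requirement)."""
    for keyword, op, requirement in conditions:
        value = result.get(keyword)
        if value is None:
            return False
        if op == "BETWEEN":
            low, high = requirement
            if not low <= value <= high:
                return False
        elif not OPERATORS[op](value, requirement):
            return False
    return True


def search(records, conditions):
    """Return the substances with at least one test result meeting all conditions."""
    return [
        substance
        for substance, results in records.items()
        if any(matches(result, conditions) for result in results)
    ]


records = {
    "AB 123": [{"TEST": "0371", "ANIMAL": "rat", "ED50": 30.0}],
    "AB 124": [
        {"TEST": "0371", "ANIMAL": "rat", "ED50": 80.0},
        {"TEST": "0412", "ANIMAL": "mouse", "ED50": 20.0},
    ],
}
question = [("TEST", "EQ", "0371"), ("ED50", "BETWEEN", (10, 50))]
hits = search(records, question)
print(hits)
```

All conditions are evaluated against the same test result, which reflects the requirement that questions about dosages and results be connected with the appropriate test code: substance AB 124 has an ED-50 in range, but only in a different test, so it is not retrieved.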
5.2. Coding of Chemical Questions
The formula file can be searched for definite structures, for general formulas, and for substructures of any desired type and size. Substructures are sets of connected atoms within complete structures, and may contain straight chain, branched, and ring elements or ring fragments in any desired combination.
A chemical search starts with manual coding of the connectivity table of the required structure or substructure. The search time can be shortened by choosing numbers as low as possible for those elements of the structure that are “rare” in relation to all the stored formulas [12]. In addition, all the screens relevant for the required structure or substructure are coded.
In topological searches for definite compounds the number of H atoms with which every non-H atom in the formula is to be connected is given. This means that in fact only the formula explicitly requested would be retrieved.
In topological searches for substructures and general formulas it is possible to replace one or more atoms by dummy symbols, e.g. “Any halogen”, “Any hetero atom”, “N or O or P or S”, and “C or O or N or S”. In this way it is possible to search for groups of isosteric compounds in a single question. In coding of the bond type it is possible to choose for each bond OR or NOT combinations, e.g. “Single or double bond”, “Single or aromatic or double bond (= NOT triple bond)”, and “Any bond”.
By specifying the number of hydrogen atoms that are to be attached to a non-H atom of a given substructure or general formula it is possible to request that, for example, in substructure (1) the α-carbon shall have two hydrogens, and each of the carbon atoms 3, 4, 5, and 6 of the phenyl group shall carry one hydrogen.
[Structural formula (1)]
Thus, in this example only phenylacetonitrile and those mono-ortho derivatives which are not substituted in the α-position are retrieved.
By specifying that an atom should or should not be a ring member, we can, for example, request for substructure (2) that the α-carbon be a ring member; then only formula (4) will give a match. Without this restriction, the question would also retrieve structure (3).
[12] W. E. Cossum, M. L. Krakiwsky, and M. F. Lynch, J. chem. Documentation 5, 33 (1965).
5.3. Searching
The search program first converts the coding of the
biological and/or chemical questions from punched
cards to an internal format. It is checked for formal
and logical errors analogous to the input checking.
In combined searches it is possible to select whether
the chemical or the biological file shall be searched
first. If, for example, the question refers to a rare
biological test, then the biological file is searched
first. If, on the other hand, the structures required do
not occur frequently, then the chemical file is searched
first.
The search program transfers the data records of all
the substances retrieved to an intermediate file. From
this file the editing program selects all the parts
for the desired printout (Table 1). Printout instructions
are given at the same time as the initial question.
Provision is also made for printout instructions to be
given even after the search has been completed, so
that the whole of the information for all the substances
retrieved can be used repeatedly and for different
purposes.
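For the chemical part of a search, the two-stage procedure of Sections 4.2 and 5.2 (a cheap screen first, then an atom-by-atom comparison only for the survivors) can be sketched in Python. This is an illustrative backtracking matcher, not the CCBF search program: the data layout, the function names, and the simplified empirical formula screen are assumptions. Dummy symbols are modelled as sets of permitted elements, and OR combinations of bond types as sets of permitted bond types.

```python
from collections import Counter

ANY_HALOGEN = frozenset({"F", "Cl", "Br", "I"})  # dummy symbol "Any halogen"


def formula_screen(query_atoms, structure_atoms):
    """The structure must hold at least as many atoms of each fixed query element."""
    need = Counter(a for a in query_atoms if isinstance(a, str))
    have = Counter(structure_atoms)
    return all(have[element] >= count for element, count in need.items())


def atom_ok(query_atom, atom):
    # A query atom is an element symbol or a set of permitted elements.
    return atom == query_atom if isinstance(query_atom, str) else atom in query_atom


def substructure_match(q_atoms, q_bonds, s_atoms, s_bonds):
    """q_bonds: {(i, j): set of permitted bond types}; s_bonds: {(i, j): bond type}."""
    s_adj, q_adj = {}, {}
    for (a, b), bond_type in s_bonds.items():
        s_adj[(a, b)] = s_adj[(b, a)] = bond_type
    for (a, b), permitted in q_bonds.items():
        q_adj[(a, b)] = q_adj[(b, a)] = permitted

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = len(mapping)  # query atoms are placed in numerical order
        for s in range(len(s_atoms)):
            if s in mapping.values() or not atom_ok(q_atoms[q], s_atoms[s]):
                continue
            # Every query bond to an already placed atom must be present
            # in the structure with a permitted bond type.
            if all(s_adj.get((mapping[p], s)) in q_adj[(p, q)]
                   for p in mapping if (p, q) in q_adj):
                mapping[q] = s
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def chemical_search(q_atoms, q_bonds, formula_file):
    hits = []
    for code, (atoms, bonds) in formula_file.items():
        if not formula_screen(q_atoms, atoms):
            continue  # eliminated by the screen as "not relevant"
        if substructure_match(q_atoms, q_bonds, atoms, bonds):
            hits.append(code)
    return hits


formula_file = {
    "chloroethane": (["C", "C", "Cl"], {(0, 1): "single", (1, 2): "single"}),
    "ethanol": (["C", "C", "O"], {(0, 1): "single", (1, 2): "single"}),
    "ammonia": (["N"], {}),
}
# Question: a carbon atom singly bonded to any halogen.
halide_hits = chemical_search(["C", ANY_HALOGEN], {(0, 1): {"single"}}, formula_file)
# Question: a carbon atom joined to oxygen by a single or double bond.
oxygen_hits = chemical_search(["C", "O"], {(0, 1): {"single", "double"}}, formula_file)
print(halide_hits, oxygen_hits)
```

Because the query atoms are placed in numerical order, giving the lowest numbers to the rarest elements makes the matcher fail early on most stored formulas, which is the reason given in Section 5.2 for numbering “rare” elements low.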
6. Experience Gained
In 1968, after two years of development, the CCBF
system was installed to modernize a research documentation system which for the previous eight years
had been recording chemical formulas with the Ring
Code [11] and biological information by direct coding.
Our experience with these documentation systems
enables us to make the following statements:
- a research documentation system which is to be an
effective aid both now and in the future must continually be adjusting its techniques to keep up with
developments in the field of data processing and with
progress in the documentation methods for the particular subject concerned;
- in a documentation system concerned with chemical
formulas in drug research, the selectivity of pure
fragmentation codes is insufficient and, particularly
for substructure searching, only topological methods
meet all the requirements;
- the only expedient way to run a biological documentation system is to keep the data recording so simple
that no specialists are required for encoding;
- a research documentation system will only be attractive to the scientist if it provides both the opportunity
to carry out individual searches and also a continuing
service in the form of SDI;
- a research documentation system will only remain up-to-date
if the system is controlled by the research department,
and if the scientist whose research results are documented continually collaborates in its extension and
maintenance.
In cooperation with the department Zentrale Datenverarbeitung der Firma C. H. Boehringer Sohn, Ingelheim, the programs for the CCBF system were written
partly in Assembler and partly in PL/I for an IBM
360/40. Dr. B. Braun, Ingelheim, coordinated the programs for the various sections of the system and made
many valuable suggestions for the system analysis, the
system maintenance, and the continual extensions to
the system. Fräulein U. Zech helped considerably in
the extensive planning for the computer processing of
biological information and supplied numerous practical
ideas. Herr J. Becker contributed very useful details in writing the encoding
program for chemical formulas. Herr J. Gruber programmed the biological
section; his shrewd suggestions helped considerably in
the development of the system. The updating and editing
programs were written by Herr P. Oppitz, who contributed many
fruitful ideas.
Received: May 25, 1970
[A 774 IE]
German version: Angew. Chem. 82, 628 (1970)
Translated by Express Translation Service, London
Gas-Liquid Chromatography and Mass Spectrometry in Methylation Analysis
of Polysaccharides
By Håkan Björndal, Carl Gustaf Hellerqvist, Bengt Lindberg, and Sigfrid Svensson [*]
New methylation procedures and the combined application of gas-liquid chromatography and mass spectrometry for the qualitative and quantitative analysis of mixtures of methylated sugars permit methylation analysis of polysaccharides to be performed more accurately, faster, and with less material than previously.
1. Introduction
Methylation analysis is an important method in structural polysaccharide chemistry. It involves exhaustive methylation of the polysaccharide and hydrolysis to a mixture of monomeric methylated sugars, which are then separated, identified, and quantitatively estimated. The positions of glycosidic linkages in the polysaccharide correspond to the positions of unsubstituted hydroxyl groups in these methylated monosaccharides. The method gives no information on the relative order of the sugar residues or on their anomeric nature. Determination of the complete structure of a polysaccharide also requires complementary analyses, the most important of which are graded hydrolysis, by acids or enzymes, followed by isolation and identification of the oligosaccharides formed, and various modifications of periodate oxidation. Unlike most other methods used in structural polysaccharide chemistry, methylation analysis provides quantitative information [1].
2. Methylation of Polysaccharides
2.1. Haworth and Purdie Methylation
The aim of methylation is to achieve etherification of all the free hydroxyl groups in the polysaccharide. In the original procedure, used by Denham and Woodhouse [2] and by Haworth [3], this was achieved by repeated reaction with dimethyl sulfate and sodium hydroxide. Often, only a partially methylated product was obtained, which was then fully methylated by repeated treatment with silver oxide in boiling methyl iodide, according to Purdie and Irvine [4].
Purdie’s technique was considerably improved by Kuhn and co-workers [5], who carried out the reaction in the polar solvent N,N-dimethylformamide. Polysaccharides which dissolve or swell in this solvent may be
[*] Dr. H. Björndal, Dr. C. G. Hellerqvist, Prof. B. Lindberg, and Dr. S. Svensson
Kungl. Universitetet i Stockholm
Institutionen för organisk kemi
Stockholm Va, Sandåsgatan 2 (Sweden)
[1] H. O. Bouveng and B. Lindberg, Advances Carbohydrate Chem. 15, 53 (1960).
[2] W. S. Denham and H. Woodhouse, J. chem. Soc. (London) 103, 1735 (1913).
[3] W. N. Haworth, J. chem. Soc. (London) 107, 8 (1915).
[4] T. Purdie and J. C. Irvine, J. chem. Soc. (London) 83, 1021 (1903).
[5] R. Kuhn, H. Trischmann, and I. Löw, Angew. Chem. 67, 32 (1955).