SMILES is a simple yet comprehensive chemical nomenclature.
The answer to the most commonly asked question about SMILES is: yes,
it is an acronym, meaning Simplified Molecular Input Line Entry System.
(SMILES originated in the depths of the US government, where humorous
names for things are frowned upon unless they are acronyms.)
SMILES is widely used as a general-purpose chemical nomenclature and
data exchange format. However, SMILES differs in several fundamental ways
from most chemical nomenclatures and other chemical formats. It is useful
to review a few fundamental concepts before digging into the specifics
of the SMILES language.
Substructure searching, the process
of finding a particular pattern (subgraph) in a molecule (graph), is one
of the most important tasks for computers in chemistry. It is used in
virtually every application that employs a digital representation of a
molecule, including depiction (to highlight a particular functional group),
drug design (searching a database for similar structures and activity),
analytical chemistry (looking for previously-characterized structures
and comparing their data to that of an unknown), and a host of other problems.
SMARTS is a language that allows you to specify substructures using
rules that are straightforward extensions of SMILES. For example, to search
a database for phenol-containing structures, one would use the SMARTS
string "[OH]c1ccccc1", which should be familiar to those acquainted
with SMILES. In fact, almost all SMILES specifications are valid SMARTS
targets (see "SMARTS Exceptions," below). Using SMARTS, flexible
and efficient substructure-search specifications can be made in terms
that are meaningful to chemists.
In the SMILES language, there are two fundamental types of symbols:
atoms and bonds. Using these SMILES symbols, one can specify
a molecule's graph (its "nodes" and "edges") and assign
"labels" to the components of the graph (that is, say what type
of atom each node represents, and what type of bond each edge represents).
The same is true in SMARTS: One uses atomic and bond symbols to specify
a graph. However, in SMARTS the labels for the graph's nodes and edges
(its "atoms" and "bonds") are extended to include
"logical operators" and special atomic and bond symbols; these
allow SMARTS atoms and bonds to be more general. For example, the SMARTS
atomic symbol [C,N] is an atom that can be aliphatic C or aliphatic N;
the SMARTS bond symbol "~" (tilde) matches any bond.
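The idea of a graph whose node and edge labels are generalized into tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based molecule layout and the names find_matches, aliphatic_c_or_n, aliphatic_o, any_bond and single_bond are hypothetical, not part of any SMARTS toolkit, and no SMARTS string is parsed here.

```python
# Minimal sketch: a query is a graph whose atom and bond labels are predicates.
# The data layout and all names below are illustrative assumptions.

def find_matches(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return every mapping of query atoms onto distinct target atoms.

    q_atoms: list of predicates, one per query atom
    q_bonds: {(i, j): predicate on bond order} between query atoms i and j
    t_atoms: list of atom property dicts
    t_bonds: {frozenset((a, b)): bond order} between target atoms a and b
    """
    results = []

    def extend(mapping):
        i = len(mapping)                      # next query atom to place
        if i == len(q_atoms):
            results.append(tuple(mapping))
            return
        for t in range(len(t_atoms)):
            if t in mapping or not q_atoms[i](t_atoms[t]):
                continue
            ok = True
            for (a, b), bond_test in q_bonds.items():
                if max(a, b) != i:
                    continue                  # other end is not placed yet
                other = mapping[min(a, b)]
                order = t_bonds.get(frozenset((t, other)))
                if order is None or not bond_test(order):
                    ok = False
                    break
            if ok:
                extend(mapping + [t])

    extend([])
    return results


# Atom tests corresponding to [C,N] and O; bond tests corresponding to ~ and -
aliphatic_c_or_n = lambda a: a["element"] in ("C", "N") and not a["aromatic"]
aliphatic_o = lambda a: a["element"] == "O" and not a["aromatic"]
any_bond = lambda order: True
single_bond = lambda order: order == 1

# Acetone, CC(=O)C, written out by hand as a hydrogen-suppressed graph
acetone_atoms = [{"element": e, "aromatic": False} for e in "CCOC"]
acetone_bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 2, frozenset((1, 3)): 1}

tilde_hits = find_matches([aliphatic_c_or_n, aliphatic_o], {(0, 1): any_bond},
                          acetone_atoms, acetone_bonds)
single_hits = find_matches([aliphatic_c_or_n, aliphatic_o], {(0, 1): single_bond},
                           acetone_atoms, acetone_bonds)
```

With the "any bond" test the query finds the carbonyl carbon and oxygen (target atoms 1 and 2). Restricting the edge to a single bond finds nothing, because acetone's only C-O bond is double.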
SMARTS provides a number of primitive symbols describing atomic properties
beyond those used in SMILES (atomic symbol, charge, and isotopic specifications).
The following tables list the atomic primitives used in SMARTS (all SMILES
atomic symbols are also legal). In these tables <n> stands for a
digit, <c> for chiral class.
SMARTS Atomic Primitives

Symbol      Atomic property requirements (default when <n> is omitted)
D<n>        <n> explicit connections
H<n>        <n> attached hydrogens
h<n>        <n> implicit hydrogens
R<n>        in <n> SSSR rings (default: any ring atom)
r<n>        in smallest SSSR ring of size <n> (default: any ring atom)
v<n>        total bond order <n>
X<n>        <n> total connections
-<n>        -<n> charge (default: -1 charge; -- is -2, etc.)
+<n>        +<n> formal charge (default: +1 charge; ++ is +2, etc.)
#<n>        atomic number <n>
@           anticlockwise, default class
@@          clockwise, default class
@<c><n>     chiral class <c>, chirality <n>
@<c><n>?    chiral or unspecified: chirality <c><n> or unspecified
<n>         explicit atomic mass
Note that the atomic primitive
"H" can have two meanings, implying a property or the element
itself. [H] means hydrogen atom. [*H2]
means any atom with exactly two hydrogens attached.
Examples:

C             aliphatic carbon atom
c             aromatic carbon atom
[++]          atom with a +2 charge
[R]           atom in any ring
[D3]          atom with 3 explicit bonds (implicit H's don't count)
[X3]          atom with 3 total bonds (includes implicit H's)
[v3]          atom with bond orders totaling 3 (includes implicit H's)
C[C@H](F)O    match chirality (H-F-O anticlockwise viewed from C)
C[C@?H](F)O   matches if chirality is as specified or is not specified
Atom and bond primitive specifications may be combined to form expressions
by using logical operators. In the following table, "e" is an
atom or bond SMARTS expression (which may be a primitive). The logical
operators are listed in order of decreasing precedence (high precedence
operators are evaluated first).
SMARTS Logical Operators

Expression   Meaning
!e1          not e1
e1&e2        e1 and e2 (high precedence)
e1,e2        e1 or e2
e1;e2        e1 and e2 (low precedence)
All atomic expressions which are not simple primitives must be enclosed
in brackets. The default operation is `&' (high precedence "and"),
i.e., two adjacent primitives without an intervening logical operator
must both be true for the expression (or subexpression) to be true.
The ability to form expressions gives the SMARTS user a great deal of
power to specify exactly what is desired. The two forms of the AND operator
are used in SMARTS instead of grouping operators.
[CH2]         aliphatic carbon with two hydrogens (methylene carbon)
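To make the precedence rules concrete, here is a small sketch of an evaluator for the contents of a bracketed atom expression. It is an illustration only, not the Daylight implementation: it assumes a tiny subset of primitives (C, N, O, c, n, o, R, H<n>, *), always treats H as a hydrogen count, and the atom dictionaries and function names are hypothetical.

```python
import re

# Tokens for a deliberately small, assumed subset of SMARTS atom primitives.
TOKEN = re.compile(r"!|&|,|;|H\d*|[CNOcno]|R|\*")
OPERATORS = {"!", "&", ",", ";"}
BINARY = {"&", ",", ";"}


def tokenize(expr):
    tokens = TOKEN.findall(expr)
    if "".join(tokens) != expr:
        raise ValueError("unsupported expression: " + expr)
    out = []
    for tok in tokens:
        # Adjacent primitives imply the high-precedence "and" (&).
        if out and out[-1] not in OPERATORS and tok not in BINARY:
            out.append("&")
        out.append(tok)
    return out


def split(tokens, sep):
    parts, current = [], []
    for tok in tokens:
        if tok == sep:
            parts.append(current)
            current = []
        else:
            current.append(tok)
    parts.append(current)
    return parts


def primitive_matches(tok, atom):
    if tok == "*":
        return True
    if tok == "R":
        return atom["in_ring"]
    if tok[0] == "H":                       # simplification: H is always a count
        return atom["h_count"] == int(tok[1:] or 1)
    if tok.isupper():                       # aliphatic element symbol
        return atom["element"] == tok and not atom["aromatic"]
    return atom["element"] == tok.upper() and atom["aromatic"]


def unary_matches(tokens, atom):
    negate = False
    while tokens and tokens[0] == "!":      # "not" binds tightest
        negate = not negate
        tokens = tokens[1:]
    if len(tokens) != 1:
        raise ValueError("malformed expression")
    return primitive_matches(tokens[0], atom) != negate


def atom_matches(expr, atom):
    # Lowest precedence first: ';' (and), then ',' (or), then '&' (and).
    return all(
        any(
            all(unary_matches(term, atom) for term in split(alt, "&"))
            for alt in split(low, ",")
        )
        for low in split(tokenize(expr), ";")
    )


ring_nitrogen = {"element": "N", "aromatic": False, "in_ring": True, "h_count": 1}
chain_methylene = {"element": "C", "aromatic": False, "in_ring": False, "h_count": 2}
```

For the chain methylene carbon, "C,N;R" is false because it reads as (C or N) and R, while "C,N&R" is true because it reads as C or (N and R). This is how the two forms of "and" stand in for grouping operators.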
SMARTS may contain "zero-level" parentheses which can be used
to group dot-disconnected fragments. This grouping operator allows SMARTS
to express more powerful component queries. In general, a single set of
parentheses may surround any legal SMARTS expression. Two or more of these
expressions may be combined into more complex SMARTS:
The semantics of the "zero-level" parentheses are that all
of the atom and bond expressions within a set of zero-level parentheses
must match within a single component of the target.
SMARTS       SMILES       Match behavior
C.C          CCCC         yes, no component level grouping specified
(C.C)        CCCC         yes, both carbons in the query match the same component
(C).(C)      CCCC         no, the query must match carbons in two different components
(C).(C)      CCCC.CCCC    yes, the query does match carbons in two different components
(C).C        CCCC         yes, both carbons in the query match the same component
(C).(C).C    CCCC.CCCC    yes, the first two carbons match different components, the third matches a carbon anywhere
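The component-level semantics can be sketched with a toy matcher. The sketch makes strong simplifying assumptions: every query fragment is a single atom, bonds are ignored, a target is a list of components, and each component is just a list of element symbols. The function name component_match and the (group, element) query encoding are hypothetical, and the queries shown in the comments are illustrative.

```python
# Toy model of zero-level grouping: atoms in the same group must land in one
# target component, different groups in different components, and ungrouped
# atoms may land anywhere. All names and encodings here are assumptions.

def component_match(query, target):
    """query: list of (group_id or None, element); target: list of components."""
    atoms = [(ci, el) for ci, comp in enumerate(target) for el in comp]

    def search(i, used, group_to_comp):
        if i == len(query):
            return True
        group, element = query[i]
        for ai, (ci, el) in enumerate(atoms):
            if ai in used or el != element:
                continue
            if group is not None:
                if group in group_to_comp:
                    if group_to_comp[group] != ci:
                        continue            # group already pinned elsewhere
                elif ci in group_to_comp.values():
                    continue                # component taken by another group
            next_map = group_to_comp if group is None else {**group_to_comp, group: ci}
            if search(i + 1, used | {ai}, next_map):
                return True
        return False

    return search(0, frozenset(), {})


butane = [["C", "C", "C", "C"]]                       # CCCC
two_butanes = [["C", "C", "C", "C"], ["C", "C", "C", "C"]]   # CCCC.CCCC

results = {
    "C.C on CCCC": component_match([(None, "C"), (None, "C")], butane),
    "(C.C) on CCCC": component_match([(0, "C"), (0, "C")], butane),
    "(C).(C) on CCCC": component_match([(0, "C"), (1, "C")], butane),
    "(C).(C) on CCCC.CCCC": component_match([(0, "C"), (1, "C")], two_butanes),
    "(C).C on CCCC": component_match([(0, "C"), (None, "C")], butane),
    "(C).(C).C on CCCC.CCCC": component_match(
        [(0, "C"), (1, "C"), (None, "C")], two_butanes),
}
```

Only the query with two separate groups against a single-component target fails, since the two groups cannot be placed in different components.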
These component-level grouping operators were added specifically for
reaction processing. Without this construct, it is impossible to distinguish
inter- versus intra-molecular reaction queries. For example:
All SMILES expressions are also valid SMARTS expressions, but the semantics
changes because SMILES describes molecules whereas SMARTS describes patterns.
The molecule represented by a SMILES string is usually, but not always,
matched by the same string when used as a SMARTS.
SMILES is interpreted as a molecule, and it is the resultant molecule
(not the SMILES string) which is subject to searching. Similarly, SMARTS
is interpreted as a pattern; it is this pattern (not the SMARTS string)
which is matched against molecules. For instance, the SMILES "C1=CC=CC=C1"
(cyclohexatriene) is interpreted as the benzene molecule. This molecule
will be matched by the SMARTS c1ccccc1, which is interpreted as the pattern
"6 aromatic carbons in a ring". The SMARTS "C1=CC=CC=C1"
makes a pattern ("six aliphatic carbons in a ring with alternating
single and double bonds") which will not match benzene. It
will, however, match the nonaromatic phenylate cation with SMILES C1=CC=CC=[CH+]1.
When atoms are specified without brackets in SMILES, default values
are used; in SMARTS, unspecified properties are not defined to be part
of the pattern. For instance, the SMILES O means an aliphatic oxygen with
zero charge and two hydrogens, i.e. water. In SMARTS, the same expression
means any aliphatic oxygen regardless of charge, hydrogen count, etc,
e.g. it will match the oxygen in water, but also those in ethanol, acetone,
molecular oxygen, hydroxide and hydronium ions, etc. Specifying [OH2] limits
the pattern to match only water (this is also the fully specified SMILES
for water).
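The difference between SMILES defaults and unspecified SMARTS properties can be shown with a small sketch. It assumes a hypothetical representation: each oxygen atom is a fully specified dictionary (as a SMILES reader would produce after filling in defaults), and a pattern is a dictionary holding only the properties that were actually written. The names pattern_matches, oxygens, smarts_O and smarts_OH2 are illustrative.

```python
# A pattern constrains only the properties it mentions; everything else is free.

def pattern_matches(pattern, atom):
    return all(atom[key] == value for key, value in pattern.items())


# Fully specified oxygen atoms, written out by hand for this sketch.
oxygens = {
    "water":            {"element": "O", "aromatic": False, "charge": 0, "h_count": 2},
    "ethanol":          {"element": "O", "aromatic": False, "charge": 0, "h_count": 1},
    "acetone":          {"element": "O", "aromatic": False, "charge": 0, "h_count": 0},
    "molecular oxygen": {"element": "O", "aromatic": False, "charge": 0, "h_count": 0},
    "hydroxide":        {"element": "O", "aromatic": False, "charge": -1, "h_count": 1},
    "hydronium":        {"element": "O", "aromatic": False, "charge": 1, "h_count": 3},
}

# SMARTS "O": aliphatic oxygen, nothing else specified.
smarts_O = {"element": "O", "aromatic": False}
# SMARTS "[OH2]": aliphatic oxygen with exactly two attached hydrogens.
smarts_OH2 = {"element": "O", "aromatic": False, "h_count": 2}

matched_by_O = sorted(n for n, a in oxygens.items() if pattern_matches(smarts_O, a))
matched_by_OH2 = sorted(n for n, a in oxygens.items() if pattern_matches(smarts_OH2, a))
```

In this toy data set the first pattern matches all six oxygens, while the second matches only the oxygen of water.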
There are a few anachronisms in most SMILES interpreters which can also
lead to confusion. Some SMILES interpreters allow implicit hydrogens to
be added as explicit atoms on input as a shortcut. E.g., the SMILES for
1H-pyrrole is "[nH]1cccc1" which is matched by itself as SMARTS
and by "n1cccc1". The current Daylight SMILES interpreter will
also accept "Hn1cccc1" for (not very good) reasons of historical
compatibility; this generates the same (hydrogen-suppressed) molecule
as does "[nH]1cccc1" and is matched by the same SMARTS. However,
the SMARTS "Hn1cccc1" does not match this molecule.
Most SMARTS expressions are not valid SMILES expressions. For instance,
the string "cOc" is a valid SMARTS, matching an aliphatic oxygen
connected to two aromatic carbons as part of a larger molecule (e.g. diphenyl
ether). However, "cOc" does not describe a molecule per se,
and is therefore not a valid SMILES.
The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(),
which automatically optimizes a SMARTS by reordering, expanding, and/or
consolidating atom and bond expressions. Programs which use this feature
(e.g. the Merlin program) can be expected to be near optimal in terms
of the time used to search typical organic structures.
When this optimization method is not used, there are some things which
can be done to facilitate efficient (fast) searching operations using
SMARTS. It is important to recognize that SMARTS target strings are processed
in strictly left-to-right order. For this reason, substantial gains in
speed can be achieved by following these guidelines:
Uncommon atoms or bond arrangements should be placed
early in SMARTS targets.
In an "and-expression", the less common atom
or bond specifications should be placed early.
In an "or-expression", the less common atom
or bond specifications should be placed last.
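The reasoning behind the and/or guidelines is short-circuit evaluation: a left-to-right matcher can stop testing an "and" at the first failure and an "or" at the first success. The sketch below counts primitive tests over a made-up population of 1000 atoms. The population, the predicates, and the function count_tests are assumptions chosen for illustration, not measurements of any real search engine.

```python
# Count how many primitive tests a left-to-right, short-circuiting matcher
# performs for a given ordering of the same expression.

def count_tests(predicates, atoms, mode):
    tests = 0
    for atom in atoms:
        for predicate in predicates:
            tests += 1
            result = predicate(atom)
            if mode == "and" and not result:
                break                       # "and" fails at the first False
            if mode == "or" and result:
                break                       # "or" succeeds at the first True
    return tests


# Hypothetical population: bromine is rare, carbon and ring atoms are common.
atoms = (
    [{"element": "Br", "in_ring": False}] * 10
    + [{"element": "C", "in_ring": True}] * 590
    + [{"element": "C", "in_ring": False}] * 400
)

is_bromine = lambda a: a["element"] == "Br"
is_carbon = lambda a: a["element"] == "C"
in_ring = lambda a: a["in_ring"]

and_rare_first = count_tests([is_bromine, in_ring], atoms, "and")
and_common_first = count_tests([in_ring, is_bromine], atoms, "and")
or_common_first = count_tests([is_carbon, is_bromine], atoms, "or")
or_rare_first = count_tests([is_bromine, is_carbon], atoms, "or")
```

On this population, putting the rare test first in the "and" needs 1010 tests instead of 1590, and putting the rare test last in the "or" needs 1010 tests instead of 1990.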