Online Software Help Manual

SMILES and SMARTS Nomenclature

SMILES

SMILES is a simple yet comprehensive chemical nomenclature.

The answer to the most commonly asked question about SMILES is: yes, it is an acronym, meaning Simplified Molecular Input Line Entry Specification. (SMILES originated in the depths of the US government, where humorous names for things are frowned upon unless they are acronyms.)

SMILES is widely used as a general-purpose chemical nomenclature and data exchange format. However, SMILES differs in several fundamental ways from most chemical nomenclatures and other chemical formats. It is useful to review a few fundamental concepts before digging into the specifics of the SMILES language.

SMARTS

Substructure searching, the process of finding a particular pattern (subgraph) in a molecule (graph), is one of the most important tasks for computers in chemistry. It is used in virtually every application that employs a digital representation of a molecule, including depiction (to highlight a particular functional group), drug design (searching a database for similar structures and activity), analytical chemistry (looking for previously-characterized structures and comparing their data to that of an unknown), and a host of other problems.

SMARTS is a language that allows you to specify substructures using rules that are straightforward extensions of SMILES. For example, to search a database for phenol-containing structures, one would use the SMARTS string "[OH]c1ccccc1", which should be familiar to those acquainted with SMILES. In fact, almost all SMILES specifications are valid SMARTS targets (see "SMARTS vs. SMILES," below). Using SMARTS, flexible and efficient substructure-search specifications can be made in terms that are meaningful to chemists.

In the SMILES language, there are two fundamental types of symbols: atoms and bonds. Using these SMILES symbols, one can specify a molecule's graph (its "nodes" and "edges") and assign "labels" to the components of the graph (that is, say what type of atom each node represents, and what type of bond each edge represents).

The same is true in SMARTS: One uses atomic and bond symbols to specify a graph. However, in SMARTS the labels for the graph's nodes and edges (its "atoms" and "bonds") are extended to include "logical operators" and special atomic and bond symbols; these allow SMARTS atoms and bonds to be more general. For example, the SMARTS atomic symbol [C,N] is an atom that can be aliphatic C or aliphatic N; the SMARTS bond symbol "~" (tilde) matches any bond.

Atomic Primitives

SMARTS provides a number of primitive symbols describing atomic properties beyond those used in SMILES (atomic symbol, charge, and isotopic specifications). The following tables list the atomic primitives used in SMARTS (all SMILES atomic symbols are also legal). In these tables <n> stands for a digit, <c> for chiral class.

 SMARTS Atomic Primitives

Symbol     Symbol name       Atomic property requirements       Default
---------  ----------------  ---------------------------------  ----------------------------
*          wildcard          any atom                           (no default)
a          aromatic          aromatic                           (no default)
A          aliphatic         aliphatic                          (no default)
D<n>       degree            <n> explicit connections           exactly one [1]
H<n>       total-H-count     <n> attached hydrogens             exactly one [1]
h<n>       implicit-H-count  <n> implicit hydrogens             exactly one [1]
R<n>       ring membership   in <n> SSSR rings                  any ring atom
r<n>       ring size         in smallest SSSR ring of size <n>  any ring atom
v<n>       valence           total bond order <n>               exactly one [1]
X<n>       connectivity      <n> total connections              exactly one [1]
-<n>       negative charge   -<n> charge                        -1 charge (-- is -2, etc.)
+<n>       positive charge   +<n> formal charge                 +1 charge (++ is +2, etc.)
#<n>       atomic number     atomic number <n>                  (no default)
@          chirality         anticlockwise                      anticlockwise, default class
@@         chirality         clockwise                          clockwise, default class
@<c><n>    chirality         chiral class <c> chirality <n>     (no default)
@<c><n>?   chiral or unspec  chirality <c><n> or unspecified    (no default)
<n>        atomic mass       explicit atomic mass               unspecified mass

[1] Note that atomic primitive "H" can have two meanings, implying a property or the element itself. [H] means hydrogen atom. [*H2] means any atom with exactly two hydrogens attached.

Examples:

C              aliphatic carbon atom
c              aromatic carbon atom
a              aromatic atom
[#6]           carbon atom
[Ca]           calcium atom
[++]           atom with a +2 charge
[R]            atom in any ring
[D3]           atom with 3 explicit bonds (implicit H's don't count)
[X3]           atom with 3 total bonds (includes implicit H's)
[v3]           atom with bond orders totaling 3 (includes implicit H's)
C[C@H](F)O     match chirality (H-F-O anticlockwise viewed from C)
C[C@?H](F)O    matches if chirality is as specified or is not specified
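The difference between the D, X, H and v primitives is easiest to see on concrete atoms. The following is a minimal sketch, not any toolkit's API: the `Atom` record and the helper names are hypothetical, all hydrogens are assumed to be implicit (hydrogen-suppressed graph), and aromatic bonds are not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    """Hypothetical hydrogen-suppressed atom record (illustration only)."""
    symbol: str
    implicit_h: int = 0                               # hydrogens not stored as nodes
    bond_orders: list = field(default_factory=list)   # one entry per explicit bond

def degree(atom):        # D<n>: explicit connections only
    return len(atom.bond_orders)

def connectivity(atom):  # X<n>: explicit connections plus implicit hydrogens
    return len(atom.bond_orders) + atom.implicit_h

def valence(atom):       # v<n>: total bond order, each implicit H counts as 1
    return sum(atom.bond_orders) + atom.implicit_h

def total_h(atom):       # H<n>: attached hydrogens (all implicit in this model)
    return atom.implicit_h

# Central carbon of isobutane, CC(C)C
isobutane_ch = Atom("C", implicit_h=1, bond_orders=[1, 1, 1])
# Carbonyl carbon of acetaldehyde, CC=O
carbonyl_c = Atom("C", implicit_h=1, bond_orders=[1, 2])
# Nitrogen of trimethylamine, CN(C)C
amine_n = Atom("N", implicit_h=0, bond_orders=[1, 1, 1])
```

Under these assumptions, [D3] would select the isobutane carbon and the amine nitrogen, [X3] the carbonyl carbon and the amine nitrogen, and [v3] only the amine nitrogen.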

Bond Primitives

Various bond symbols are available to match connections between atoms. A missing bond symbol is interpreted as "single or aromatic".

 SMARTS Bond Primitives

Symbol  Bond property requirements
------  --------------------------------------
-       single bond (aliphatic)
/       directional single bond "up"
\       directional single bond "down"
/?      directional bond "up or unspecified"
\?      directional bond "down or unspecified"
=       double bond
#       triple bond
:       aromatic bond
~       any bond (wildcard)
@       any ring bond

 

Examples:

C      any aliphatic carbon
cc     any pair of attached aromatic carbons
c:c    aromatic carbons joined by an aromatic bond
c-c    aromatic carbons joined by a single bond (e.g. biphenyl)
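The bond primitives can be read as predicates on a bond. The sketch below is an illustration under stated assumptions: the `Bond` record is hypothetical, each bond carries exactly one kind ("single", "double", "triple" or "aromatic"), directional bonds are left out, and the empty string stands for a missing bond symbol.

```python
from collections import namedtuple

# Hypothetical bond record: kind is "single", "double", "triple" or "aromatic".
Bond = namedtuple("Bond", "kind in_ring")

BOND_PRIMITIVES = {
    "-": lambda b: b.kind == "single",
    "=": lambda b: b.kind == "double",
    "#": lambda b: b.kind == "triple",
    ":": lambda b: b.kind == "aromatic",
    "~": lambda b: True,                              # wildcard
    "@": lambda b: b.in_ring,                         # any ring bond
    "":  lambda b: b.kind in ("single", "aromatic"),  # missing bond symbol
}

def bond_matches(symbol, bond):
    """True if `bond` satisfies the bond primitive `symbol`."""
    return BOND_PRIMITIVES[symbol](bond)

# The bond joining the two rings of biphenyl, and a bond inside one ring.
biphenyl_link = Bond(kind="single", in_ring=False)
ring_bond = Bond(kind="aromatic", in_ring=True)
```

With these two bonds, "cc" (no bond symbol) accepts both, "c:c" accepts only the ring bond, and "c-c" accepts only the inter-ring link.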

Logical Operators

Atom and bond primitive specifications may be combined to form expressions by using logical operators. In the following table, "e" is an atom or bond SMARTS expression (which may be a primitive). The logical operators are listed in order of decreasing precedence (high precedence operators are evaluated first).

 SMARTS Logical Operators

Symbol       Expression  Meaning
-----------  ----------  ---------------------------
exclamation  !e1         not e1
ampersand    e1&e2       e1 and e2 (high precedence)
comma        e1,e2       e1 or e2
semicolon    e1;e2       e1 and e2 (low precedence)

 

All atomic expressions which are not simple primitives must be enclosed in brackets. The default operation is "&" (high-precedence "and"), i.e., two adjacent primitives without an intervening logical operator must both be true for the expression (or subexpression) to be true.

The ability to form expressions gives the SMARTS user a great deal of power to specify exactly what is desired. The two forms of the AND operator are used in SMARTS instead of grouping operators.

 Examples:

[CH2]          aliphatic carbon with two hydrogens (methylene carbon)
[!C;R]         (NOT aliphatic carbon) AND in ring
[!C;!R0]       same as above ("!R0" means not in zero rings)
[n;H1]         H-pyrrole nitrogen
[n&H1]         same as above
[nH1]          same as above
[c,n&H1]       any aromatic carbon OR H-pyrrole nitrogen
[X3&H0]        atom with 3 total bonds and no H's
[c,n;H1]       (aromatic carbon OR aromatic nitrogen) AND exactly one H
[Cl]           any chlorine atom
[35*]          any atom of mass 35
[35Cl]         chlorine atom of mass 35
[F,Cl,Br,I]    the first four halogens
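The precedence rules can be checked with a small evaluator. This is only a sketch of the precedence order described above, not a SMARTS parser: the function names are hypothetical, the caller supplies a truth table saying which primitives hold for one atom, and malformed expressions are not rejected.

```python
def tokenize(expr, names):
    """Split a bracket-free atom expression into operators and known primitives."""
    ordered = sorted(names, key=len, reverse=True)   # longest primitive first
    tokens, i = [], 0
    while i < len(expr):
        if expr[i] in "!&,;":
            tokens.append(expr[i])
            i += 1
            continue
        for name in ordered:
            if expr.startswith(name, i):
                tokens.append(name)
                i += len(name)
                break
        else:
            raise ValueError(f"unknown primitive at {expr[i:]!r}")
    return tokens

def split_on(tokens, sep):
    """Split a token list on one operator symbol."""
    parts, current = [], []
    for tok in tokens:
        if tok == sep:
            parts.append(current)
            current = []
        else:
            current.append(tok)
    parts.append(current)
    return parts

def evaluate(expr, truth):
    """Evaluate expr for one atom; truth maps primitive -> bool for that atom."""
    def high_and(tokens):
        # '!' binds tightest; '&' is optional between adjacent primitives.
        result, negate = True, False
        for tok in tokens:
            if tok == "&":
                continue
            if tok == "!":
                negate = not negate
                continue
            result = result and (truth[tok] != negate)
            negate = False
        return result

    def either(tokens):                     # ',' binds looser than '&'
        return any(high_and(part) for part in split_on(tokens, ","))

    tokens = tokenize(expr, truth)          # ';' binds loosest of all
    return all(either(part) for part in split_on(tokens, ";"))

# An aromatic carbon carrying no hydrogen, and an H-pyrrole nitrogen.
arom_c_no_h = {"c": True, "n": False, "H1": False}
pyrrole_n = {"c": False, "n": True, "H1": True}
```

The aromatic carbon without a hydrogen separates the two forms of "and": it satisfies "c,n&H1" (read as c OR (n AND H1)) but not "c,n;H1" (read as (c OR n) AND H1).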

Recursive SMARTS

Any SMARTS expression may be used to define an atomic environment by writing a SMARTS starting with the atom of interest in this form:

$(SMARTS)

Such definitions may be considered atomic properties. These expressions can be used in the same manner as other atomic primitives (also, they can be nested).

Recursive SMARTS expressions are used as in the following examples:

*C                atom connected to methyl (or methylene) carbon
*CC               atom connected to ethyl carbon
[$(*C);$(*CC)]    atom in both above environments (matches CCC)

The additional power of such expressions is illustrated by the following example which derives an expression for methyl carbons which are ortho to oxygen and meta to a nitrogen on an aromatic ring.

CaaO                 C ortho to O
CaaaN                C meta to N
Caa(O)aN             C ortho to O and meta to N (but 2O,3N only)
Ca(aO)aaN            C ortho to O and meta to N (but 2O,5N only)
C[$(aaO);$(aaaN)]    C ortho to O and meta to N (all cases)
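The reason the recursive form covers all cases is that each $(...) environment is tested on its own, starting from the same ring atom, so the two paths may run in either direction around the ring. The sketch below illustrates this on a toy graph. Everything in it is an assumption made for illustration: the adjacency-dict molecule, the (element, aromatic) labels, the helper names, and the simplification that bond types are ignored (every bond in the toy molecule is single or aromatic, which is what a missing bond symbol accepts).

```python
def path_match(graph, labels, start, preds):
    """True if some simple path beginning at `start` satisfies preds in order."""
    def walk(atom, idx, seen):
        if not preds[idx](labels[atom]):
            return False
        if idx == len(preds) - 1:
            return True
        return any(walk(nbr, idx + 1, seen | {atom})
                   for nbr in graph[atom] if nbr not in seen)
    return walk(start, 0, frozenset())

# Atom predicates on (element, aromatic) labels.
def is_arom(label):    return label[1]
def is_O(label):       return label == ("O", False)
def is_N(label):       return label == ("N", False)
def is_aliph_C(label): return label == ("C", False)

def carbons_ortho_O_meta_N(graph, labels):
    """Atoms matching the leading C of C[$(aaO);$(aaaN)] in this toy model."""
    hits = []
    for atom, label in labels.items():
        if not is_aliph_C(label):
            continue
        for nbr in graph[atom]:
            ortho_O = path_match(graph, labels, nbr, [is_arom, is_arom, is_O])
            meta_N = path_match(graph, labels, nbr,
                                [is_arom, is_arom, is_arom, is_N])
            if ortho_O and meta_N:          # the ';' in the recursive SMARTS
                hits.append(atom)
                break
    return hits

def make_ring(n_atom):
    """Aromatic ring atoms 0-5; CH3 (6) on atom 0, O (7) on atom 1, N (8) on n_atom."""
    graph = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    labels = {i: ("C", True) for i in range(6)}
    substituents = [("C", 0), ("O", 1), ("N", n_atom)]
    for new, (symbol, ring_atom) in enumerate(substituents, start=6):
        graph[new] = [ring_atom]
        graph[ring_atom].append(new)
        labels[new] = (symbol, False)
    return graph, labels
```

Calling `carbons_ortho_O_meta_N(*make_ring(4))` (the 2O,5N arrangement) and `carbons_ortho_O_meta_N(*make_ring(2))` (the 2O,3N arrangement) both report the methyl carbon, atom 6, while `make_ring(3)` puts the nitrogen para to the methyl group and yields no match.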

Component-level grouping of SMARTS

SMARTS may contain "zero-level" parentheses which can be used to group dot-disconnected fragments. This grouping operator allows SMARTS to express more powerful component queries. In general, a single set of parentheses may surround any legal SMARTS expression. Two or more of these expressions may be combined into more complex SMARTS:

(SMARTS)
(SMARTS).(SMARTS)
(SMARTS).SMARTS

The semantics of the "zero-level" parentheses are that all of the atom and bond expressions within a set of zero-level parentheses must match within a single component of the target.

SMARTS       SMILES       Match behavior
-----------  -----------  ----------------------------------------------------------
C.C          CCCC         yes, no component-level grouping specified
(C.C)        CCCC         yes, both carbons in the query match the same component
(C).(C)      CCCC         no, the query must match carbons in two different components
(C).(C)      CCCC.CCCC    yes, the query does match carbons in two different components
(C).C        CCCC         yes, both carbons in the query match the same component
(C).(C).C    CCCC.CCCC    yes, the first two carbons match different components, the third matches a carbon anywhere
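The table can be reproduced with a brute-force check of the grouping constraint. This is a minimal sketch restricted to the case in the table, where every query fragment is a single carbon atom; the function name and the data encoding (a group id or `None` per query atom, a component id per target atom) are hypothetical.

```python
from itertools import permutations

def component_match(query_groups, target_components):
    """Check component-level grouping for single-atom carbon fragments.

    query_groups: per query atom, the id of its zero-level parenthesis
                  group, or None if the atom is ungrouped.
    target_components: per target carbon atom, the id of its component.
    """
    n = len(query_groups)
    for atoms in permutations(range(len(target_components)), n):
        comps = [target_components[a] for a in atoms]
        ok = True
        for i in range(n):
            for j in range(i + 1, n):
                gi, gj = query_groups[i], query_groups[j]
                if gi is None or gj is None:
                    continue                  # ungrouped atoms match anywhere
                # same group -> same component; different groups -> different
                if (gi == gj) != (comps[i] == comps[j]):
                    ok = False
        if ok:
            return True
    return False

butane = [0, 0, 0, 0]                      # CCCC: one component
two_butanes = [0, 0, 0, 0, 1, 1, 1, 1]     # CCCC.CCCC: two components
```

For example, `component_match([0, 1], butane)` encodes (C).(C) against CCCC and fails, while the same query against `two_butanes` succeeds.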

 

These component-level grouping operators were added specifically for reaction processing. Without this construct, it is impossible to distinguish inter- versus intra-molecular reaction queries. For example:

Reaction SMARTS expression      Match behavior
------------------------------  -------------------------------------------------------
C(=O)O.OCC>>C(=O)OCC.O          Matches esterifications
(C(=O)O).(OCC)>>C(=O)OCC.O      Matches intermolecular esterifications
(C(=O)O.OCC)>>C(=O)OCC.O        Matches intramolecular esterifications (lactonizations)

SMARTS vs. SMILES

All SMILES expressions are also valid SMARTS expressions, but the semantics change because SMILES describes molecules whereas SMARTS describes patterns. The molecule represented by a SMILES string is usually, but not always, matched by the same string when used as a SMARTS.

SMILES is interpreted as a molecule, and it is the resultant molecule (not the SMILES string) which is subject to searching. Similarly, SMARTS is interpreted as a pattern; it is this pattern (not the SMARTS string) which is matched against molecules. For instance, the SMILES "C1=CC=CC=C1" (cyclohexatriene) is interpreted as the benzene molecule. This molecule will be matched by the SMARTS c1ccccc1, which is interpreted as the pattern "6 aromatic carbons in a ring". The SMARTS "C1=CC=CC=C1" makes a pattern ("six aliphatic carbons in a ring with alternating single and double bonds") which will not match benzene. It will, however, match the nonaromatic phenylate cation with SMILES C1=CC=CC=[CH+]1.

When atoms are specified without brackets in SMILES, default values are used; in SMARTS, unspecified properties are not defined to be part of the pattern. For instance, the SMILES O means an aliphatic oxygen with zero charge and two hydrogens, i.e. water. In SMARTS, the same expression means any aliphatic oxygen regardless of charge, hydrogen count, etc, e.g. it will match the oxygen in water, but also those in ethanol, acetone, molecular oxygen, hydroxy and hydronium ions, etc. Specifying [OH2] limits the pattern to match only water (this is also the fully specified SMILES for water).
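The point that a SMARTS atom constrains only the properties it names can be sketched with plain dictionaries. The property names and records below are hypothetical and cover only the atoms mentioned in the paragraph above; a real toolkit derives these values from the parsed molecule.

```python
def pattern_matches(pattern, atom):
    """A pattern constrains only the properties it names; the rest are free."""
    return all(atom.get(key) == wanted for key, wanted in pattern.items())

# Oxygen atoms as a SMILES reader would fill them in (all properties defined).
water_O = {"symbol": "O", "aromatic": False, "charge": 0, "h_count": 2}
ethanol_O = {"symbol": "O", "aromatic": False, "charge": 0, "h_count": 1}
hydronium_O = {"symbol": "O", "aromatic": False, "charge": 1, "h_count": 3}

# SMARTS O: aliphatic oxygen, nothing said about charge or hydrogens.
smarts_O = {"symbol": "O", "aromatic": False}
# SMARTS [OH2]: additionally requires exactly two attached hydrogens.
smarts_OH2 = {"symbol": "O", "aromatic": False, "h_count": 2}
```

Here `smarts_O` accepts all three oxygens, whereas `smarts_OH2` accepts only the water oxygen.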

There are a few anachronisms in most SMILES interpreters which can also lead to confusion. Some SMILES interpreters allow implicit hydrogens to be added as explicit atoms on input as a shortcut. E.g., the SMILES for 1H-pyrrole is "[nH]1cccc1" which is matched by itself as SMARTS and by "n1cccc1". The current Daylight SMILES interpreter will also accept "Hn1cccc1" for (not very good) reasons of historical compatibility; this generates the same (hydrogen-suppressed) molecule as does "[nH]1cccc1" and is matched by the same SMARTS. However, the SMARTS "Hn1cccc1" does not match this molecule.

Most SMARTS expressions are not valid SMILES expressions. For instance, the string "cOc" is a valid SMARTS, matching an aliphatic oxygen connected to two aromatic carbons as part of a larger molecule (e.g. diphenyl ether). However, "cOc" does not describe a molecule per se, and is therefore not a valid SMILES.

Efficiency Considerations

The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(), which automatically optimizes a SMARTS by reordering, expanding, and/or consolidating atom and bond expressions. Programs which use this feature (e.g. the Merlin program) can be expected to be near optimal in terms of the time used to search typical organic structures.

When this optimization method is not used, there are some things which can be done to facilitate efficient (fast) searching operations using SMARTS. It is important to recognize that SMARTS target strings are processed in strictly left-to-right order. For this reason, substantial gains in speed can be achieved by following these guidelines:

  • Uncommon atoms or bond arrangements should be placed early in SMARTS targets.

  • In an "and-expression", the less common atom or bond specifications should be placed early.

  • In an "or-expression", the less common atom or bond specifications should be placed last.
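The effect of these ordering guidelines can be estimated by counting primitive tests. The sketch below assumes that an expression is evaluated left to right and that evaluation stops as soon as the outcome is known (short-circuiting); that stopping rule is an assumption of this illustration, as are the function name and the invented atom population.

```python
def count_tests(atoms, predicates, mode):
    """Count predicate calls for left-to-right, short-circuit evaluation."""
    calls = 0
    for atom in atoms:
        for pred in predicates:
            calls += 1
            result = pred(atom)
            if mode == "and" and not result:
                break                      # one failure settles an "and"
            if mode == "or" and result:
                break                      # one success settles an "or"
    return calls

def is_C(atom):    return atom[0] == "C"
def is_N(atom):    return atom[0] == "N"
def in_ring(atom): return atom[1]

# Invented population of (symbol, in_ring) atoms: carbon common, nitrogen rare.
atoms = ([("C", True)] * 60 + [("C", False)] * 30 +
         [("N", True)] * 5 + [("N", False)] * 5)

and_rare_first = count_tests(atoms, [is_N, in_ring], "and")   # like [N;R]
and_rare_last = count_tests(atoms, [in_ring, is_N], "and")    # like [R;N]
or_rare_last = count_tests(atoms, [is_C, is_N], "or")         # like [C,N]
or_rare_first = count_tests(atoms, [is_N, is_C], "or")        # like [N,C]
```

On this population the and-expression needs 110 tests with the rare primitive first versus 165 with it last, and the or-expression needs 110 tests with the rare primitive last versus 190 with it first.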

Examples

cc               any pair of attached aromatic carbons
c:c              aromatic carbons joined by an aromatic bond
c-c              aromatic carbons joined by a single bond (e.g. biphenyl)
O                any aliphatic oxygen
[O;H1]           simple hydroxy oxygen
[O;D1]           1-connected (hydroxy or hydroxide) oxygen
[O;D2]           2-connected (etheric) oxygen
[C,c]            any carbon
[F,Cl,Br,I]      the first four halogens
[N;R]            must be aliphatic nitrogen AND in a ring
[!C;R]           (NOT aliphatic carbon) AND in a ring
[n;H1]           H-pyrrole nitrogen
[n&H1]           same as above
[c,n&H1]         any aromatic carbon OR H-pyrrole nitrogen
[c,n;H1]         (aromatic carbon OR aromatic nitrogen) AND exactly one H
*!@*             two atoms connected by a non-ring bond
*@;!:*           two atoms connected by a non-aromatic ring bond
[C,c]=,#[C,c]    two carbons connected by a double or triple bond

References

"SMILES 1. Introduction and Encoding Rules", Weininger, D., J. Chem. Inf. Comput. Sci., 1988, 28, 31.

Daylight Chemical Information Systems, Inc.