Description

The selection string parser is a relatively simple parser patterned after C/PERL expressions and includes support for PERL-style regular expressions via Boost. There are two kinds of literals supported: strings and numbers. Numbers are any valid integer. Strings are delimited by either single quotes or double quotes, so both of the following are valid strings:

"a string"
'another string'

An important caveat to integer numbers is that LOOS assumes that none will be negative. In other words, no atomid, resid, or number extracted from a segid (see Magical Operations below) will evaluate to a negative number. The relational operators < and <= behave differently if either operand is a negative number: in that case, they evaluate to false, for reasons that will become obvious when you read about the magical operators below...

The parser also recognizes a small set of keywords that evaluate to Atom properties. These keywords fall into two types as well: those that evaluate to a number (id, resid) and those that evaluate to a string (name, resname, chainid, and segname or segid). Keep in mind that keywords are not substitutions, but are more like a pre-defined function that returns that atom property. So you cannot put a keyword in a string and expect it to be substituted with the appropriate value, for example.

Relational Operators

Operator	Operation	Strings	Numbers	Example
>	Greater than	yes	yes	resid > 10
>=	Greater than or equals	yes	yes	resid >= 10
<=	Less than or equals	yes	yes	resid <= 50
<	Less than	yes	yes	resid < 50
==	Exactly equals	yes	yes	name == "CA"
!=	Does not equal exactly	yes	yes	segname != "SOLV"
=~	Regular expression match	yes	no	name =~ "^(C[A]?|N|O)$"

Logical Operators

Operator	Operation	Example
&&	Logical And	name == "CA" && segid == "PROT"
||	Logical Or	segid == "SOLV" || segid == "BULK"
!	Not (Negate)	!(segid == "SOLV")

Magical Operators

Operator	Operation	Example
->	Extracts a number from a string	segid -> "L(\d+)"

Keywords

Keyword	Atom Property	Evaluates to...	Operators
name	Atom name	string	>, >=, <=, <, ==, !=, =~
id	Atom ID	number	>, >=, <=, <, ==, !=
index	Atom index in model file (0-based)	number	>, >=, <=, <, ==, !=
resname	Residue name	string	>, >=, <=, <, ==, !=, =~
resid	Residue ID	number	>, >=, <=, <, ==, !=
segid	Atom segid	string	>, >=, <=, <, ==, !=, =~
segname	Synonym for segid	string	>, >=, <=, <, ==, !=, =~
chainid	Chain ID	string	>, >=, <=, <, ==, !=, =~
all	Evaluates to true	number
hydrogen	Evaluates to true if atom is a hydrogen	number
backbone	Evaluates to true if atom is a backbone atom (nucleic acids and proteins, and includes hydrogens)	number

Notes:
The hydrogen selector looks for low-mass atoms with names starting with H. In order to work correctly when hydrogen mass repartitioning is used, the threshold mass has been set to 4.1 amu. This means the selector will produce false positive matches if the system contains helium.

The all keyword is used to force a selection string to match all atoms in instances where a selection is required. For example, a program to align frames of a trajectory DCD to a reference structure might require a selection to pick which atoms to use when computing the rotations and then another selection to pick which atoms are actually rotated. If you wanted to apply the rotation to all atoms, you just use the all keyword, i.e.

aligner --selection='name == "CA" && segid =~ "BAR[12]"' --transform='all' foo.pdb foo.dcd newfoo

Regular Expression Matching

The regular expression matching operator "=~" deserves special attention. Its use is more restrictive than the other operators in that it can only take a keyword that evaluates to a string on the left-hand side and a string on the right-hand side. So, the following expressions are valid:

name =~ "CA"
name =~ "^(C|O|N)$"
segid =~ "PROT|HEME"

While the following are not valid:

resid =~ "10[0-9][0-9]"
segid =~ 0010
name =~ resname

The regular expression syntax supported is the PERL syntax as implemented by the Boost libraries. While you can write regular expressions that look a lot like globbing (a la VMD selections), keep in mind that they are not globs. They are regular expressions, which are more powerful anyway... You do need to be careful, though, that your shell does not munge any of the regex operators. It's a good idea to use single quotes when you're writing regexes in a shell, or to use configuration files to supply the arguments instead (see the wiki for a discussion of how to do that).

The string equality operators ("==" and "!=") both consider the entire string.

"CA" == "C"  --> false
 "C" == "C"  --> true

You can use the "=~" operator to perform a substring match.

"CA" == "C"  --> false
 "C" == "C"  --> true
"CA" =~ "C"  --> true

This brings up an important point about using regular expressions: be careful of unexpected substring matches. For example, suppose you want to pick out all backbone atoms and you write this selection string:

name =~ "C|CA|CB|O|N"

Now look what happens when the following atom names are matched:

 "CG" --> true
"CD1" --> true
 "NE" --> true
"OH2" --> true

The problem is that the regular expression is not constrained, so even though you explicitly put "CA" and "CB" in there, you also have a bare "C", which says any atom name with a "C" in it is a match. Likewise, "N" and "O" match any atom name containing those letters. If you want to match a string exactly with a regular expression, you must anchor it:

name =~ "^(C|CA|CB|O|N)$"
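The difference between the unanchored and anchored patterns can be checked outside of LOOS. The following is a minimal sketch that assumes Python's re module behaves like the Boost PERL-syntax engine for these simple patterns; it is not LOOS code, and the list of atom names is made up for illustration.

```python
import re

# Atom names used for illustration only.
names = ["C", "CA", "CB", "O", "N", "CG", "CD1", "NE", "OH2"]

unanchored = re.compile(r"C|CA|CB|O|N")
anchored = re.compile(r"^(C|CA|CB|O|N)$")

# re.search succeeds if the pattern matches anywhere in the string,
# which mirrors the substring behavior described for "=~".
loose = [n for n in names if unanchored.search(n)]
strict = [n for n in names if anchored.search(n)]

print("unanchored:", loose)
print("anchored:  ", strict)
```

In this sketch the unanchored pattern accepts every name in the list, while the anchored one keeps only C, CA, CB, O, and N.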

Magical Operations

There is currently only one "magical operator" defined: "->". This operator takes a string keyword on the left-hand side (i.e. name, resname, or segid/segname) and a string on the right-hand side representing a regular expression pattern. It will then try to extract a numeric value (integer) from the subexpression matches. For example, suppose you have a range of segments that all follow a pattern such as "PG1", "PG2", "PG3", ..., "PG120". The regular expression "PG(\d+)" matches these, and the pattern within the parentheses is a subexpression. Applying the same idea to segids of the form "L1", "L2", "L3", ..., the selection

(segid->"L(\d+)") >= 10 && (segid->"L(\d+)") <= 50

will match segids "L10" through "L50". Since each matched subexpression will be examined for a valid integer conversion, the following will work as expected:

segid->"(L|PG)(\d+)"

There is a small hitch with the magical operator. If there is no match, it evaluates to -1. Since -1 is a valid integer, an ordinary comparison such as

segid->"L(\d+)" <= 100

would match every segid, including those that do not fit the pattern at all. To avoid this, the < and <= operators are a little bit special: if either operand is a negative number, they treat it as a flag for a null match and return false. It's a bit of a kludge, but it works...
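The interaction between "->" and the special-cased comparison can be illustrated with a small emulation. This is only a sketch of the behavior described above, not the LOOS implementation: extract_number and less_equal are hypothetical helper names, Python's re module stands in for Boost, and returning the first subexpression that converts to an integer is an assumption.

```python
import re

NO_MATCH = -1  # value the text says "->" yields when nothing matches


def extract_number(value, pattern):
    """Hypothetical stand-in for `keyword -> "pattern"`."""
    m = re.search(pattern, value)
    if m is None:
        return NO_MATCH
    # Examine each captured subexpression for a valid integer conversion
    # (assumption: the first one that converts is used).
    for group in m.groups():
        if group is None:
            continue
        try:
            return int(group)
        except ValueError:
            continue
    return NO_MATCH


def less_equal(a, b):
    """Hypothetical stand-in for "<=": a negative operand flags a null match."""
    if a < 0 or b < 0:
        return False
    return a <= b


segids = ["L5", "L10", "L50", "L100", "SOLV"]
pattern = r"L(\d+)"

# (segid->"L(\d+)") >= 10 && (segid->"L(\d+)") <= 50
in_range = [s for s in segids
            if extract_number(s, pattern) >= 10
            and less_equal(extract_number(s, pattern), 50)]

# An ordinary <= lets the -1 from "SOLV" slip through ...
naive = [s for s in segids if extract_number(s, pattern) <= 100]
# ... while the special-cased version rejects it.
special = [s for s in segids if less_equal(extract_number(s, pattern), 100)]

print(in_range)
print(naive)
print(special)
```

Here the ordinary comparison also selects "SOLV", which is exactly the false match that the special handling of negative operands is meant to prevent.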

Putting It All Together...

When you perform a selection on an AtomicGroup using the selection language, the expression is evaluated once for each atom in the group. If it evaluates to "true" (integer 1), then the atom is added to the new selection. Only one atom is considered at a time.

Here are some example selections:

Extract C-alphas:
  name == "CA"

Solvent:
  segid == "SOLV" || segid == "BULK"

Solvent heavy atoms (oxygens only):
  name =~ "O" && (segid == "SOLV" || segid == "BULK")

C-alphas from a range of residues:
  name == "CA" && resid >= 10 && resid <= 50
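The one-atom-at-a-time evaluation can be pictured as a simple filter. The sketch below is purely illustrative: the Atom class and select function are hypothetical stand-ins, not the LOOS API, and the predicate is a hand-written Python equivalent of the last selection above.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical atom record carrying a few of the keyword properties."""
    id: int
    name: str
    resid: int
    segid: str


def select(atoms, predicate):
    # The predicate is evaluated once per atom; atoms for which it is
    # true are collected into the new selection.
    return [atom for atom in atoms if predicate(atom)]


atoms = [
    Atom(1, "N", 9, "PROT"),
    Atom(2, "CA", 9, "PROT"),
    Atom(3, "CA", 10, "PROT"),
    Atom(4, "CB", 10, "PROT"),
    Atom(5, "CA", 50, "PROT"),
    Atom(6, "CA", 51, "PROT"),
    Atom(7, "OH2", 1, "SOLV"),
]

# Equivalent of: name == "CA" && resid >= 10 && resid <= 50
selected = select(atoms, lambda a: a.name == "CA" and 10 <= a.resid <= 50)
selected_ids = [a.id for a in selected]
print(selected_ids)
```

Because the predicate only ever sees a single atom, a selection cannot express conditions that depend on other atoms in the group.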

Usage

Most tools based on LOOS will accept selection strings from the command line. They must be enclosed in quotes so that the whole selection is passed to the tool as a single argument. As mentioned before, if you're using regular expressions, it's a good idea to use single quotes to prevent your shell from misinterpreting the regular expression operators. Back-slash escapes may also need doubling.

You can store your selection in a file if you want. To use it then, use the back-quote feature of your shell to "cat" your selection file. Since your selection must be one argument, you must enclose the back-quote within double-quotes, i.e.

  a_tool_name "`cat myselection.txt`" arg arg arg

If you store your selection in a file, then you can also use comments. A comment is anything after a "#" on a line. Here's an example of a selection in a file:

### Select water oxygens only...
# Pick out any atom that contains an oxygen
name =~ "O" &&
(segid == "SOLV" || # any segment named SOLV
segid == "BULK")    # or named BULK
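To see what the selection file above amounts to once the comments are gone, the comment rule can be applied by hand. The following is a naive sketch, not how LOOS processes the file: strip_comments is a hypothetical helper, and it assumes that "#" never appears inside a quoted string.

```python
def strip_comments(text):
    """Drop everything from '#' to the end of each line and join the rest.

    Naive: assumes '#' is never part of a quoted string literal.
    """
    parts = []
    for line in text.splitlines():
        code = line.split("#", 1)[0].strip()
        if code:
            parts.append(code)
    return " ".join(parts)


selection_file = '''### Select water oxygens only...
# Pick out any atom that contains an oxygen
name =~ "O" &&
(segid == "SOLV" || # any segment named SOLV
segid == "BULK")    # or named BULK
'''

flattened = strip_comments(selection_file)
print(flattened)
```

The result is the same one-line selection given earlier for solvent oxygens.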