Why Peptide Nomenclature Matters
A peptide's name is not merely a label. When written according to established conventions, it functions as a compressed structural description, communicating sequence, length, stereochemistry, and chemical modifications in a compact string of letters, prefixes, and suffixes. The designation Ac-D-Phe-Cys-Tyr-D-Trp-Lys-Val-Cys-Thr-NH2, for instance, conveys terminus chemistry, D-amino acid positions, and the full residue sequence before a single diagram is consulted.
For anyone reading peptide research — whether evaluating a preclinical study, cross-referencing a compound in a regulatory submission, or assessing structural similarity to a known sequence — fluency in nomenclature is a foundational skill. Errors in naming conventions have contributed to reproducibility problems in the literature, and inconsistent designation across databases complicates systematic review [1].
This guide covers the principal conventions in current use, from the standard amino acid codes to modification prefixes, structural variants, and regulatory naming requirements.
Standard Amino Acid Codes: Three-Letter and Single-Letter Systems
The Three-Letter Code
The three-letter amino acid code (Ala for alanine, Gly for glycine, Leu for leucine) was formalised by the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature and remains the dominant convention in structural biology, medicinal chemistry, and regulatory documentation [1]. Each code is written with only its first letter capitalised (e.g., Phe, Trp, Ser), and residues in a sequence are typically separated by hyphens: Ala-Gly-Phe-Leu.
The three-letter system is preferred when clarity is paramount, particularly in documents where a single ambiguous character could be misread. Regulatory submissions to agencies such as the FDA and EMA routinely require three-letter notation for peptide active substances to reduce transcription errors [3].
The Single-Letter Code
The single-letter code — A for alanine, G for glycine, L for leucine — was introduced to facilitate sequence alignment, database storage, and computational analysis [6]. In this system, a sequence is written as an unbroken string: AGFL. The single-letter code is standard in genomic and proteomic databases such as UniProt and in bioinformatics tools that process large sequence datasets.
The critical limitation of the single-letter system is ambiguity when modifications are present. Non-standard amino acids, D-forms, and chemical modifications cannot be represented without supplementary notation, which is why the three-letter code is retained for modified peptides in primary research literature.
The Twenty Standard Residues at a Glance
The table below lists the twenty canonical amino acids with both codes. Researchers encountering an unfamiliar three-letter code in a compound name should verify it against this set before assuming a non-standard residue is present.
| Three-Letter | Single-Letter | Amino Acid    |
|--------------|---------------|---------------|
| Ala          | A             | Alanine       |
| Arg          | R             | Arginine      |
| Asn          | N             | Asparagine    |
| Asp          | D             | Aspartic acid |
| Cys          | C             | Cysteine      |
| Gln          | Q             | Glutamine     |
| Glu          | E             | Glutamic acid |
| Gly          | G             | Glycine       |
| His          | H             | Histidine     |
| Ile          | I             | Isoleucine    |
| Leu          | L             | Leucine       |
| Lys          | K             | Lysine        |
| Met          | M             | Methionine    |
| Phe          | F             | Phenylalanine |
| Pro          | P             | Proline       |
| Ser          | S             | Serine        |
| Thr          | T             | Threonine     |
| Trp          | W             | Tryptophan    |
| Tyr          | Y             | Tyrosine      |
| Val          | V             | Valine        |
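The table translates directly into a lookup. The following is a minimal sketch in Python of converting between the two codes for unmodified sequences; the function names `to_single_letter` and `to_three_letter` are illustrative, not part of any standard library, and the sketch assumes hyphen-separated input with no terminal caps or D- prefixes.

```python
# Three-letter to single-letter codes for the twenty canonical residues.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_single_letter(sequence: str) -> str:
    """Convert 'Ala-Gly-Phe-Leu' to 'AGFL'; raise on non-canonical codes."""
    codes = sequence.split("-")
    unknown = [code for code in codes if code not in THREE_TO_ONE]
    if unknown:
        raise ValueError(f"Not in the canonical twenty: {unknown}")
    return "".join(THREE_TO_ONE[code] for code in codes)


def to_three_letter(sequence: str) -> str:
    """Convert 'AGFL' to 'Ala-Gly-Phe-Leu'."""
    return "-".join(ONE_TO_THREE[letter] for letter in sequence)


print(to_single_letter("Ala-Gly-Phe-Leu"))  # AGFL
print(to_three_letter("AGFL"))              # Ala-Gly-Phe-Leu
```

Raising an error on an unknown code, rather than silently skipping it, mirrors the advice above: an unrecognised code should prompt a check for a non-standard residue.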
Decoding Modification Notation
N-Terminus and C-Terminus Chemistry
By convention, peptide sequences are written from left to right, N-terminus to C-terminus. This orientation is not merely stylistic — it mirrors the direction of ribosomal synthesis and determines which end carries which modification [1].
The free N-terminus (an amino group, –NH₂) and free C-terminus (a carboxylic acid, –COOH) are the default states in an unmodified linear peptide. Modifications to these termini are denoted as prefixes and suffixes respectively.
Acetylation of the N-terminus is written as Ac- at the start of the sequence. Acetylation caps the free amino group with an acetyl group (CH₃CO–), converting it to an amide. This eliminates the positive charge at physiological pH, increases metabolic stability, and reduces susceptibility to aminopeptidase degradation [4].
Amidation of the C-terminus is written as -NH2 or -amide at the end of the sequence. This modification replaces the free carboxyl group with a primary amide (–CONH₂), removing the negative charge and similarly improving resistance to carboxypeptidase cleavage [4].
A fully annotated example: Ac-Gly-Gly-Phe-Leu-NH2 describes a tetrapeptide with an acetylated N-terminus, the sequence glycine-glycine-phenylalanine-leucine, and an amidated C-terminus. Both terminal modifications are present, which is a common pattern in synthetic research peptides designed for enhanced stability.
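As an illustration of how the prefix and suffix can be read off mechanically, here is a small Python sketch. The helper name `split_termini` is hypothetical, and the sketch assumes a hyphen-separated name that contains only the Ac- prefix and the -NH2 or -amide suffix described above, with no D- prefixes in the sequence.

```python
from typing import List, Optional, Tuple


def split_termini(name: str) -> Tuple[Optional[str], List[str], Optional[str]]:
    """Return (n_cap, residues, c_cap) for a hyphen-separated peptide name."""
    tokens = name.split("-")
    n_cap = None
    c_cap = None
    # A leading 'Ac' marks an acetylated N-terminus.
    if tokens and tokens[0] == "Ac":
        n_cap = "acetyl"
        tokens = tokens[1:]
    # A trailing 'NH2' or 'amide' marks an amidated C-terminus.
    if tokens and tokens[-1] in ("NH2", "amide"):
        c_cap = "amide"
        tokens = tokens[:-1]
    return n_cap, tokens, c_cap


print(split_termini("Ac-Gly-Gly-Phe-Leu-NH2"))
# ('acetyl', ['Gly', 'Gly', 'Phe', 'Leu'], 'amide')
```

A value of `None` for either cap corresponds to the default free amine or free acid.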
D-Amino Acid Notation
Amino acids in naturally occurring proteins are almost exclusively of the L-configuration. When a D-amino acid is incorporated into a peptide sequence, it is denoted by the prefix D- (or, less formally, a lowercase d-) before the three-letter code: D-Phe, D-Trp, d-Ala.
D-amino acid substitutions are a deliberate strategy in peptide design. Because most endogenous proteases are stereospecific for L-residues, incorporation of D-forms at vulnerable positions — particularly at the N-terminus or at cleavage-prone internal sites — substantially extends half-life in biological matrices [5]. Recognising D-amino acid notation in a compound name is therefore a direct signal that the compound has been engineered for metabolic resistance.
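Because the D- prefix is itself joined by a hyphen, a naive split on hyphens separates it from its residue. The Python sketch below shows one way to reattach it and report which positions carry D-residues. The function name `tokenize_residues` is an assumption for illustration, and the input is assumed to have had any terminal caps removed already.

```python
from typing import List, Tuple


def tokenize_residues(sequence: str) -> List[Tuple[str, str]]:
    """Split 'D-Phe-Cys-Tyr-D-Trp' into (code, configuration) pairs."""
    residues = []
    pending_d = False
    for token in sequence.split("-"):
        if token in ("D", "d"):
            # The prefix applies to the next residue code.
            pending_d = True
            continue
        # 'L' is only the default label here; glycine, for instance, is achiral.
        residues.append((token, "D" if pending_d else "L"))
        pending_d = False
    if pending_d:
        raise ValueError("Dangling D- prefix with no residue after it")
    return residues


residues = tokenize_residues("D-Phe-Cys-Tyr-D-Trp-Lys-Val-Cys-Thr")
d_positions = [i for i, (_, cfg) in enumerate(residues, start=1) if cfg == "D"]
print(d_positions)  # [1, 4]
```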
A retro-inverso peptide, in which the entire sequence is reversed and all residues are D-form, is denoted with the prefix RI- or by listing all residues as D-forms in reverse order. This class of modification is discussed further under structural variants below [5].
Non-Standard and Non-Proteinogenic Amino Acids
Research peptides frequently incorporate amino acids not found in the canonical twenty. Common examples include:
- Nle (norleucine): a structural isomer of leucine with a straight rather than branched side chain, used as a methionine surrogate.
- Cit (citrulline): a non-proteinogenic residue derived from arginine, relevant in post-translational modification research.
- Aib (α-aminoisobutyric acid): a non-proteinogenic residue that strongly induces helical conformation.
- Orn (ornithine): a lysine homologue relevant in metabolic pathway research.
- Hyp (hydroxyproline): a post-translationally modified proline common in collagen sequences [4].
When a three-letter code appears in a compound name that is not in the canonical twenty, the reader should consult IUPAC nomenclature tables or the original publication's methods section for the precise structural definition before drawing conclusions about the compound's properties.
Position Numbering and Structural Orientation
In longer peptides and in peptides derived from protein sequences, residues are often numbered according to their position in the parent protein. The notation [Sar1, Ile8]-angiotensin II, for example, indicates that position 1 (normally Asp) has been replaced by sarcosine (Sar, N-methylglycine) and position 8 (normally Phe) by isoleucine. The bracketed residue codes with their position numbers, which are often typeset as superscripts, allow precise identification of substitution sites without listing the entire sequence [2].
This positional notation is particularly common in pharmacological literature, where analogue series are described relative to a reference native sequence. Evaluators reading such designations should identify the reference peptide first, then apply the stated substitutions sequentially.
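The two-step reading (reference first, substitutions second) can be sketched in Python as follows. The function name `apply_substitutions` is hypothetical. The angiotensin II sequence is written out in full here as the reference, and the bracket format handled is only the simple `[Code + position, ...]` form shown above.

```python
import re
from typing import List

# Native angiotensin II, with Asp at position 1 and Phe at position 8.
ANGIOTENSIN_II = ["Asp", "Arg", "Val", "Tyr", "Ile", "His", "Pro", "Phe"]


def apply_substitutions(reference: List[str], notation: str) -> List[str]:
    """Apply substitutions such as '[Sar1, Ile8]' to a reference sequence."""
    analogue = list(reference)
    for item in notation.strip().strip("[]").split(","):
        # Residue code (optionally D-prefixed) followed by a 1-based position.
        match = re.fullmatch(r"((?:D-)?[A-Za-z]+)(\d+)", item.strip())
        if match is None:
            raise ValueError(f"Cannot parse substitution: {item!r}")
        code, position = match.group(1), int(match.group(2))
        if not 1 <= position <= len(reference):
            raise ValueError(f"Position {position} is outside the reference")
        analogue[position - 1] = code
    return analogue


print("-".join(apply_substitutions(ANGIOTENSIN_II, "[Sar1, Ile8]")))
# Sar-Arg-Val-Tyr-Ile-His-Pro-Ile
```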
Recognising Structural Variants Through Nomenclature
Cyclic Peptides
Cyclic peptides are formed by a covalent bond between two residues — most commonly a disulfide bridge between two cysteine residues, a lactam between a lysine amine and an aspartate or glutamate carboxyl, or a head-to-tail cyclisation between the N- and C-termini.
Disulfide-bridged cyclic peptides are denoted with brackets or parentheses around the cyclised region, or with the notation cyclo(...). A disulfide bridge between Cys³ and Cys⁸ in a sequence may appear as Cys³–Cys⁸ with a bridge symbol, or the sequence may be enclosed in parentheses with a superscript indicating the bond type [5].
Head-to-tail cyclic peptides — where the N- and C-termini are directly bonded — are typically written as cyclo(Gly-Pro-Phe-Leu) or c(GPFL). The absence of terminal modification notation (no Ac- or -NH2) combined with the cyclo prefix is a reliable indicator of this architecture.
PEGylated Peptides
PEGylation — the covalent attachment of polyethylene glycol chains — is denoted by the prefix or suffix PEG, often with a subscript indicating molecular weight: PEG2000-Lys-peptide or peptide-Lys(PEG5000). PEGylation at a specific lysine residue is indicated by placing the modification in parentheses after the residue code: Lys(PEG). This modification extends hydrodynamic radius, reduces renal clearance, and can shield immunogenic epitopes [4].
Retro-Inverso and Branched Peptides
Retro-inverso peptides reverse the sequence direction and invert stereochemistry, producing a compound that mimics the side-chain topology of the parent peptide while resisting proteolysis. The prefix RI- or the explicit listing of all D-residues in reverse order identifies this class [5].
Branched peptides, in which multiple chains are attached to a central scaffold or to a lysine core, are denoted by the scaffold designation followed by the branch sequences. Multiple antigenic peptides (MAP) use this architecture and are typically designated MAP-core-[sequence]n, where n indicates the number of branches.
Cross-Referencing Name, Sequence Diagram, and Molecular Weight
A compound designation should be internally consistent with its sequence diagram and calculated molecular weight. Verification proceeds in three steps.
First, parse the name to extract the residue sequence and all modifications. Second, sum the residue masses (monoisotopic or average, as appropriate), add one water for the free termini of a linear peptide, and add the masses contributed by terminal modifications: in average-mass terms, acetylation adds 42.04 Da and amidation subtracts 0.98 Da relative to the free acid form (the monoisotopic values are 42.01 Da and 0.98 Da). Third, compare the calculated mass against the reported molecular weight [2].
Discrepancies greater than 1 Da (for monoisotopic calculations) or 0.1% (for average mass calculations on larger peptides) warrant scrutiny. Common sources of discrepancy include unlisted disulfide bonds (each bridge removes 2.02 Da), unreported PEG chains, or errors in the stated sequence. Mass spectrometry data, when provided in a publication's supplementary material, is the most reliable cross-reference for verifying structural consistency.
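The arithmetic in these steps can be sketched in Python. This is an illustration under stated assumptions rather than a validated calculator: the average residue masses are rounded to two decimal places and should be checked against an authoritative table before real use, and the names `average_mass` and `is_consistent` are hypothetical. The modification offsets are the ones quoted above.

```python
# Average residue masses in Da (amino acid minus water), rounded to 2 decimals.
AVERAGE_RESIDUE_MASS = {
    "Ala": 71.08, "Arg": 156.19, "Asn": 114.10, "Asp": 115.09, "Cys": 103.14,
    "Gln": 128.13, "Glu": 129.11, "Gly": 57.05, "His": 137.14, "Ile": 113.16,
    "Leu": 113.16, "Lys": 128.17, "Met": 131.20, "Phe": 147.17, "Pro": 97.12,
    "Ser": 87.08, "Thr": 101.10, "Trp": 186.21, "Tyr": 163.17, "Val": 99.13,
}
WATER = 18.02          # H and OH of the free termini of a linear peptide
ACETYL_DELTA = 42.04   # N-terminal acetylation
AMIDE_DELTA = -0.98    # C-terminal amidation relative to the free acid
DISULFIDE_DELTA = -2.02  # per disulfide bridge


def average_mass(residues, n_acetyl=False, c_amide=False, disulfides=0):
    """Average mass of a linear peptide with optional terminal caps."""
    mass = sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER
    if n_acetyl:
        mass += ACETYL_DELTA
    if c_amide:
        mass += AMIDE_DELTA
    mass += DISULFIDE_DELTA * disulfides
    return round(mass, 2)


def is_consistent(calculated, reported, tolerance_fraction=0.001):
    """True if the masses agree within 0.1% (the average-mass threshold)."""
    return abs(calculated - reported) <= tolerance_fraction * reported


# Ac-Gly-Gly-Phe-Leu-NH2
mass = average_mass(["Gly", "Gly", "Phe", "Leu"], n_acetyl=True, c_amide=True)
print(mass)  # 433.51
print(is_consistent(mass, 433.5))
```

A head-to-tail cyclic peptide would omit the water term, since it has no free termini; that case is not handled in this sketch.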
IUPAC Nomenclature Versus Proprietary and Abbreviated Designations
Formal IUPAC nomenclature for peptides is systematic and unambiguous but becomes unwieldy even for short sequences. The tripeptide glutathione, for example, has the IUPAC name (2S)-2-amino-4-{[(1R)-1-[(carboxymethyl)carbamoyl]-2-sulfanylethyl]carbamoyl}butanoic acid, which is precise but impractical for routine use [1].
In practice, research literature uses a tiered approach. Short peptides are named by their residue sequence in three-letter code with modification notation. Longer peptides derived from known proteins are named by their parent protein, residue range, and any substitutions. Proprietary or investigational compounds are assigned alphanumeric codes (e.g., BPC-157, AOD-9604) that carry no intrinsic structural information and must be cross-referenced against primary literature or patent filings to determine composition.
When evaluating research quality, the presence of a full sequence designation alongside a proprietary code is a positive indicator of methodological transparency. Studies that reference only a proprietary alphanumeric code without disclosing the underlying sequence make independent verification difficult [2].
Using Nomenclature to Assess Structural Similarity and Homology
A peptide's sequence can be compared against known endogenous peptides, protein fragments, and approved drug substances to identify potential structural homology. Even partial sequence similarity — particularly at receptor-binding motifs — is relevant to understanding a compound's likely off-target interactions.
Database tools such as BLAST (for protein sequence alignment) and the UniProt peptide search function take sequence input in the single-letter code, so a sequence written in three-letter notation must be converted first [6]. BLAST returns homologous sequences with percentage identity scores. A researcher encountering an unfamiliar compound designation can extract the sequence from the name, convert it, submit it to these tools, and rapidly determine whether it shares significant identity with characterised peptides.
D-amino acid substitutions and retro-inverso configurations cannot be expressed in a standard BLAST query, because these tools assume L-amino acid sequences. For modified peptides, searching with the L-form equivalent sequence (for a retro-inverso peptide, the residue order read in reverse) and comparing the results manually is the appropriate starting point.
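Deriving the L-form equivalent query can be sketched in Python as below. The function name `l_equivalent_query` is hypothetical, and the sketch assumes terminal caps have been removed and that every residue is one of the canonical twenty; non-standard residues would need a manual decision about the closest standard letter.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def l_equivalent_query(sequence: str, retro_inverso: bool = False) -> str:
    """Drop D- prefixes and return a single-letter string for a search tool.

    With retro_inverso=True the residue order is also reversed, recovering
    the parent L-sequence that the retro-inverso peptide was derived from.
    """
    # Stereochemistry is discarded: the query describes the L-form analogue.
    codes = [t for t in sequence.split("-") if t not in ("D", "d")]
    if retro_inverso:
        codes.reverse()
    return "".join(THREE_TO_ONE[code] for code in codes)


print(l_equivalent_query("D-Phe-Cys-Tyr-D-Trp-Lys-Val-Cys-Thr"))  # FCYWKVCT
print(l_equivalent_query("D-Leu-D-Phe-Gly-Gly", retro_inverso=True))  # GGFL
```

Any hits returned for such a query describe similarity to the L-form sequence only, so they are a starting point for comparison rather than evidence about the modified compound itself.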
Regulatory Naming Requirements
Regulatory agencies require unambiguous peptide designation in investigational new drug (IND) applications and new drug applications (NDA). FDA guidance on chemistry, manufacturing, and controls (CMC) for peptide drug substances specifies that the complete amino acid sequence must be provided in three-letter code, with explicit notation of all modifications, stereochemistry, and disulfide bond positions [3].
The International Nonproprietary Name (INN) system, administered by the World Health Organization, assigns standardised nonproprietary names to approved and investigational drug substances. Peptide INNs typically carry the stem -tide (for peptides) or specific sub-stems indicating pharmacological class, such as -pressin for vasopressin analogues or -relix for gonadotropin-releasing hormone antagonists. Recognising these stems allows rapid identification of a compound's pharmacological class from its INN alone.
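Stem recognition amounts to a suffix lookup, sketched below in Python. Only the three stems named above are included, so the table is deliberately incomplete, and the function name `classify_inn` is hypothetical. The more specific sub-stems are checked before the general -tide stem.

```python
from typing import Optional

# (stem, class) pairs; specific sub-stems come before the general stem.
INN_STEMS = [
    ("pressin", "vasopressin analogue"),
    ("relix", "gonadotropin-releasing hormone antagonist"),
    ("tide", "peptide"),
]


def classify_inn(name: str) -> Optional[str]:
    """Return the class suggested by a known stem, or None if no stem matches."""
    lowered = name.lower()
    for stem, pharmacological_class in INN_STEMS:
        if lowered.endswith(stem):
            return pharmacological_class
    return None


for inn in ("desmopressin", "cetrorelix", "octreotide", "BPC-157"):
    print(inn, "->", classify_inn(inn))
```

A result of `None`, as for a proprietary alphanumeric code, simply means the name carries no stem from this short table.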
EMA guidelines similarly require full structural characterisation, including molecular formula, molecular weight, and sequence notation consistent with IUPAC-IUBMB conventions, in marketing authorisation applications [3]. Submissions that employ only proprietary codes or abbreviated notation without full structural disclosure are considered incomplete.
Practical Application: Reading an Unfamiliar Compound Name
Applying these conventions to an unfamiliar designation follows a consistent workflow. Consider the hypothetical name Ac-D-Phe-Pro-Arg-chloromethylketone.
Parsing from left to right: Ac- indicates an acetylated N-terminus. D-Phe is a D-phenylalanine at position 1. Pro is L-proline at position 2. Arg is L-arginine at position 3. chloromethylketone is a C-terminal reactive warhead rather than a standard terminus modification, indicating this is an affinity label or irreversible inhibitor rather than a conventional research peptide.
This single name therefore communicates sequence, stereochemistry, terminus chemistry, and mechanism of action — all before consulting a diagram. That level of information density is what makes systematic nomenclature indispensable for evaluating research compounds efficiently and accurately.
Conclusion
Peptide nomenclature is a structured language with consistent grammar. Mastery of its conventions — amino acid codes, modification prefixes and suffixes, position numbering, structural variant notation, and regulatory naming standards — enables rapid, reliable interpretation of compound designations across literature, databases, and regulatory documents. For researchers and evaluators working with research-stage peptide compounds, this fluency is not a peripheral skill but a prerequisite for rigorous assessment of compound identity, structural consistency, and relationship to known sequences.