Naail Kashif-Khan: A Multimodal Approach towards Genomic Identification of Protein Inhibitors of Uracil-DNA Glycosylase
An example is Ung-family uracil-DNA glycosylase inhibition, which prevents Ung-mediated degradation via the stoichiometric protein blockade of the Ung DNA-binding cleft. This is significant where uracil-DNA is a key determinant in the replication and distribution of virus genomes. Unrelated protein folds support a common physicochemical spatial strategy for Ung inhibition, characterised by pronounced sequence plasticity within the diverse fold families. That, and the fact that relatively few template sequences are biochemically verified to encode Ung inhibitor proteins, presents a barrier to the straightforward identification of Ung inhibitors in genomic sequences.
In this study, distant homologs of known Ung inhibitors were characterised via structural biology and structure prediction methods. A recombinant cellular survival assay and an in vitro biochemical assay were used to screen distant variants and mutants to further explore tolerated sequence plasticity in motifs supporting Ung inhibition. The resulting validated sequence repertoire defines an expanded set of heuristic sequence and biophysical signatures shared by known Ung inhibitor proteins. A computational search of genome database sequences is presented here, together with the results of recombinant tests of selected output sequences.
The pyrimidine base uracil occurs frequently in DNA under ambient cellular conditions: in a mammalian cell, approximately 100–500 cytosine bases per day will deaminate to form U:G mismatches, and around 10^4 misincorporations of dUTP (a natural precursor for thymidine biosynthesis) will occur in each round of replication to create U:A base pairs. Assuming per-base rates are comparable in bacterial cells, whose genomes are roughly a thousandfold smaller, on the order of ten dUTP misincorporations would occur per round of replication, and fewer than one cytosine would deaminate per day. Perturbed cellular function could be expected unless uracil is removed and replaced with the canonical base. For example, thymine methyl groups are important chemical signatures for DNA binding proteins such as transcription factors during motif recognition, while U:G base pairs would transition to T:A during the replication of the affected DNA strand.
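The scaling from mammalian to bacterial cells can be checked with a short back-of-envelope calculation. The sketch below assumes that both kinds of event scale linearly with genome length; the two genome sizes are illustrative assumptions (a diploid human genome and an Escherichia coli K-12 genome) and are not taken from the text, whereas the mammalian rates are those quoted above.

```python
# Back-of-envelope scaling of uracil load from a mammalian to a bacterial cell.
# ASSUMPTION: events scale linearly with genome length.
# ASSUMPTION: genome sizes below are illustrative, not values from the text.
MAMMALIAN_GENOME_BP = 6.0e9   # assumed diploid human genome
BACTERIAL_GENOME_BP = 4.6e6   # assumed E. coli K-12 genome

# Mammalian per-cell figures quoted in the text.
DEAMINATIONS_PER_DAY = (100.0, 500.0)
DUTP_MISINCORPORATIONS_PER_ROUND = 1.0e4


def scale_to_bacterium(per_mammalian_cell: float) -> float:
    """Rescale a per-cell event count by the ratio of genome sizes."""
    return per_mammalian_cell * BACTERIAL_GENOME_BP / MAMMALIAN_GENOME_BP


deam_low, deam_high = (scale_to_bacterium(x) for x in DEAMINATIONS_PER_DAY)
misincorporations = scale_to_bacterium(DUTP_MISINCORPORATIONS_PER_ROUND)

print(f"Deaminations per bacterial cell per day: {deam_low:.2f} to {deam_high:.2f}")
print(f"dUTP misincorporations per replication round: {misincorporations:.1f}")
```

Under these assumptions the estimate comes out at well under one deamination per cell per day and a handful of misincorporations per replication round; different genome sizes shift the numbers but not the order of magnitude.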
When viruses replicate in a cell, the nucleotide pools are imbalanced to the extent that the further misincorporation of uracil would be inevitable. Some viruses, particularly bacteriophages, manage uracil-DNA via specialised viral genome-encoded factors, either by utilising it as a protective strategy against restriction enzymes or by suppressing it if it interferes with the replication strategy [4,5,6,7].
Bacterial cellular dUTPase will minimise the misincorporation of dUTP, and the host’s uracil-DNA glycosylase (UDG) will remove uracil from DNA. Ung (family 1 of the UDG superfamily) is the most prevalent form of UDG and removes uracil bases, regardless of their state of base pairing or sequence context, via the cleavage of the glycosyl bond attaching them to the deoxyribose. Unless it is enzymatically processed, the abasic site created by Ung is relatively stable under ambient cellular conditions.
For DNA repair purposes, UDG/Ung initiates a process known as base excision repair (BER), in which a specialised DNA endonuclease (endonuclease IV in bacterial cells) will rapidly recognise and cleave the DNA backbone and under ordinary circumstances would recruit a specialised repair DNA polymerase (PolA in bacterial cells); repair is then completed by a DNA ligase (LigA in bacterial cells) [11,12].
However, in the presence of replicating viral DNA, where millions of copies accumulate over so-called “burst” periods (measured in minutes), Ung is in effect a restriction factor. Since Ung promotes endonuclease cleavage of the DNA backbone, double-stranded viral DNA will fragment if uracil bases appear in close proximity. This DNA fragmentation will also affect viruses whose DNA replication involves programmed backbone nicks or exposed single-stranded DNA regions, and transposable elements where the translocation of naked single-stranded DNA is necessary during conjugative transfer.
To mitigate Ung restriction effects, bacteriophages may encode protective factors, such as a dUTPase or uracil-DNA glycosylase inhibitors [7,14,15]. The latter are presently known to target Ung specifically. Ung inhibitors (referred to interchangeably herein as UngIns) are DNA mimetic proteins that use charge-based alignment with the Ung DNA-binding cleft to dock Ung. Docking promotes the hydrophobic sequestration of the sidechain of the apical residue in the Ung DNA minor groove binding loop, which is essentially irreversible under biological conditions, to inhibit Ung [16,17].
So effective is the strategy of Ung inhibition that under its protection, some phages are known to encode entire nucleotide biosynthesis pathways with specialised DNA and RNA polymerases to replicate their genomes in the form of thymine-free uracil-DNA, thereby avoiding sensitivity to host restriction endonucleases [18,19,20,21,22,23].
UngIn protein structures described to date fall into one of three observed folds: Ugi/SAUGI (note that Ugi and SAUGI share the same protein fold at ~12% sequence residue identity), p56, and Vpr; yet, within each fold type, there is pronounced sequence plasticity, to the extent that the unambiguous identification of an Ung inhibitor in any genome is rarely straightforward. In some organisms using uracil in their own genomes, such as the uracil-DNA Yersinia phage PhiR1-37, the identification of an encoded UngIn has proven to be impossible with common bioinformatics tools.
It is possible that other forms of Ung modulation could be deployed, such as that described for Escherichia phage T5, or that some UngIn encoding sequences could be beyond our present limits of detection. These undetectable sequences may belong to as-yet unknown, structurally diverse families.
At best, novel UngIns may be discovered through homology to known inhibitor sequences, their known genomic loci, or through predicted biophysical properties of gene products; however, such methods have not thus far permitted UngIn identification in genomes such as Yersinia phage PhiR1-37.
The prototypical known UngIn families (Ugi, SAUGI, and p56) display significant sequence diversity. Sequence identities as low as 20% are typical, which makes the computational identification of UngIns difficult in the absence of experimental data. This poor sequence conservation can lead to the misannotation of UngIn sequences. Not all such sequences in genome records are annotated as uracil-DNA glycosylase inhibitors, and even when they are so annotated, it cannot be taken for granted that these proteins are viable or act as functional UngIns.
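To make the notion of percent sequence identity concrete, the following minimal sketch computes it for two sequences that have already been aligned. It assumes one common convention, counting identical residues over columns in which neither sequence has a gap; other conventions use a different denominator and give different values. The two short sequences are invented for illustration and are not real UngIn sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over ungapped columns of a pairwise alignment.

    ASSUMPTION: the inputs are already aligned (equal length, gaps as '-').
    Columns where either sequence has a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment: 5 ungapped columns, 4 of them identical.
toy_identity = percent_identity("MKT-AYE", "MRTLA-E")
print(f"{toy_identity:.1f}% identity")
```

At identities near 20%, a value like this is within the range where unrelated sequences can score similarly by chance, which is why identity alone is a weak criterion for annotating UngIns.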
For these reasons, distant homologs of each UngIn type (Ugi, SAUGI, and p56) were selected for the experimental screening of UngIn activity. Heuristic signatures of UngIn sequences were defined from verified UngIn variants and subsequently used to search for UngIns in virus or other pathogen genomes.
In this study, the aim is to develop an expanded, validated sequence repertoire for UngIns, based upon structural insights and directed random mutagenesis, and use it to search phage genomes for uracil-DNA glycosylase inhibitors. Genomic sequences are searched using heuristic signatures that essentially define the properties of currently known Ung inhibitors, and the output sequences are assessed via recombinant expression and UngIn function assays.
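The idea of screening predicted gene products against heuristic signatures can be illustrated with a deliberately simple filter. This is a sketch under stated assumptions and not the search procedure used in this study: the two criteria (a short polypeptide and a net negative charge, loosely motivated by the DNA-mimetic, charge-based docking of UngIns), the threshold values, and the example sequences are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Mapping

ACIDIC = frozenset("DE")  # residues counted as -1 at neutral pH
BASIC = frozenset("KR")   # residues counted as +1 (histidine ignored; crude)


@dataclass(frozen=True)
class Signature:
    """Hypothetical thresholds; not values taken from the study."""
    max_length: int = 150     # assumed upper size limit for a candidate
    max_net_charge: int = -5  # assumed requirement for an acidic protein


def net_charge(seq: str) -> int:
    """Crude net charge: basic residue count minus acidic residue count."""
    seq = seq.upper()
    return sum(aa in BASIC for aa in seq) - sum(aa in ACIDIC for aa in seq)


def passes(seq: str, sig: Signature = Signature()) -> bool:
    """True if the sequence satisfies every criterion in the signature."""
    return len(seq) <= sig.max_length and net_charge(seq) <= sig.max_net_charge


def screen(orfs: Mapping[str, str], sig: Signature = Signature()) -> List[str]:
    """Return the names of predicted proteins that pass the signature."""
    return [name for name, seq in orfs.items() if passes(seq, sig)]


# Invented example sequences, for illustration only.
toy_orfs = {
    "orf_acidic": "MDEELDDEIEESLEDY",
    "orf_basic": "MKRKLRKKAGRK",
}
candidates = screen(toy_orfs)
print(candidates)
```

A filter of this kind can only shortlist sequences; as the text makes clear, candidates still have to be assessed by recombinant expression and functional assays.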