Identification of ApaH-like phosphatases in available eukaryotic proteomes
ALPHs belong to the PPP family of phosphatases and possess the four conserved signature motifs (motifs 1–4) of this family, GDxHG, GDxxDRG, GNHE, and HGG, sometimes with conservative substitutions [2]. One distinctive feature of ALPHs is a pair of changes in the GDxxDRG motif: the second Asp is replaced by a neutral amino acid and the Arg residue is replaced by Lys. In addition, ALPHs have two C-terminal motifs [3, 6] that we here call motifs 5 and 6. We screened 827 complete eukaryotic proteomes (Additional file 3: Table S1a) for the presence of ApaH-like phosphatases with a custom Python algorithm; these included all reference proteomes available on UniProt [27] and all Kinetoplastida proteomes available on TriTrypDB [28, 29]. The algorithm is based on recognising the six sequence motifs that are characteristic of ALPHs. The matrices used to define these motifs were optimised stepwise using yeast and Kinetoplastida proteomes so as not to miss any ALPH (controlled by BLAST) while at the same time not recognising other PPPs, in particular the closely related phosphatases SLP, RLPH and ApaH (sequences taken from [6]). The final algorithm also included restrictions on the distances between motifs 1 and 2, 2 and 3, and 5 and 6, which we found to be highly conserved. More details are given in Materials and Methods. BLAST screens on selected proteomes without ALPH proteins revealed that ALPHs were either fully absent or, in rare cases, present in a truncated version that was missing at least one of the motifs, mostly the N- or C-terminal one. ALPH proteins with missing N- or C-terminal motifs may have arisen from sequencing or annotation errors and were not included in the final list (Additional file 4: Table S2c) (11 proteins in total, all of which belong to phylogenetic groups with ALPH proteins present). Only for the Kinetoplastida were ALPHs with wrongly annotated start codons manually included (based on comparison with related Kinetoplastida and available genome information).
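As a rough illustration of the screening logic described above, the following Python sketch scans a protein sequence for six ordered motifs and applies distance restrictions between motifs 1 and 2, 2 and 3, and 5 and 6. It is not the algorithm used in this study: the regular expressions are simplified stand-ins for the optimised matrices, the motif 5 pattern is copied from the Euglena sequence quoted later in the text, the distance windows are taken from the ranges reported for Fig. 2D, and counting the residues between the end of one motif and the start of the next is an assumed convention. All function and variable names are hypothetical, and the demonstration sequence is synthetic.

```python
import re

# Simplified regex stand-ins for the motif matrices (assumption, not the published matrices).
MOTIFS = [
    ("motif1", r"GD.HG"),        # GDxHG
    ("motif2", r"GD..[^DE]KG"),  # GDxxDRG with second Asp neutral and Arg -> Lys
    ("motif3", r"GNH[ED]"),      # GNHE (GNHD in Euglena)
    ("motif4", r"H[GA]G"),       # HGG (HAG in Euglena)
    ("motif5", r"VFFGH"),        # copied from the Euglena example; illustrative only
    ("motif6", r"[IL]D[TS]G"),   # IDTG / LDSG / LDTG
]

# Allowed number of residues between consecutive motifs (illustrative windows).
GAP_WINDOWS = {
    ("motif1", "motif2"): (28, 30),
    ("motif2", "motif3"): (25, 33),
    ("motif5", "motif6"): (19, 19),
}


def find_hits(seq, pattern):
    """Return (start, end) of every match, including overlapping ones."""
    return [(m.start(1), m.end(1)) for m in re.finditer(f"(?=({pattern}))", seq)]


def scan_for_alph(seq):
    """Return [(motif_name, start), ...] for an ordered chain of all six motifs
    that satisfies the gap windows, or None if no such chain exists."""

    def extend(idx, prev_end):
        if idx == len(MOTIFS):
            return []
        name, pattern = MOTIFS[idx]
        for start, end in find_hits(seq, pattern):
            if prev_end is not None:
                gap = start - prev_end
                if gap < 0:
                    continue  # motifs must appear in order without overlap
                low, high = GAP_WINDOWS.get((MOTIFS[idx - 1][0], name), (0, None))
                if gap < low or (high is not None and gap > high):
                    continue
            rest = extend(idx + 1, end)
            if rest is not None:
                return [(name, start)] + rest
        return None

    return extend(0, None)


# Synthetic sequence with alanine fillers, built only to exercise the function.
demo_seq = (
    "MMMMM" + "GDVHG" + "A" * 29 + "GDLVGKG" + "A" * 26 + "GNHE"
    + "A" * 60 + "HGG" + "A" * 40 + "VFFGH" + "A" * 19 + "LDTG" + "AAA"
)

if __name__ == "__main__":
    print(scan_for_alph(demo_seq))
```

In this sketch a canonical PPP motif 2 such as GDYVDRG is rejected, because the pattern requires a non-acidic residue at the second Asp position and Lys in place of Arg. A matrix-based scoring scheme, as used in the study, would tolerate conservative substitutions that fixed patterns like these cannot.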
Figure 1 summarises all organisms of this study in phylogenetic groups based on the latest eukaryotic classification suggested by [30]. For each group, the fraction of organisms with and without ALPH is indicated in orange and blue, respectively. 332 of all 827 organisms included in this study have at least one ALPH, and these organisms are distributed in a patchy way throughout all eukaryotes. Most Euglenozoa have ALPHs (32 of 33), many Stramenopiles (25 of 28) and Fungi (238 of 298), but also Rhodophyceae (3 of 4), Chlorophyta (11 of 18), Haptista (2 of 2) and some Metazoa (17 of 296). ALPHs are absent from land plants (100 proteomes tested), Apicomplexa (20 proteomes tested), Ciliata (3 proteomes tested) and Dinoflagellates (1 proteome tested). They are largely absent from Chordata (134 proteomes tested, only Branchiostoma floridae has ALPH) and from Ecdysozoa (144 proteomes tested, only 4 Arachnida have ALPH) and fully absent from the few available proteomes of Amoebozoa (9 proteomes tested) and Metamonada (3 proteomes tested). A phylogenetic tree built from the catalytic domains of ALPHs mostly reflects the eukaryotic tree (Additional file 1: Figure S1). Taken together, the data indicate that ALPH was present in the last common ancestor of all eukaryotes and was then selectively lost in certain sub-branches. Our data fully agree with the data of [6], regarding absences and presences of ALPH proteins in eukaryotic subgroups. We extend the available dataset from 52 ALPHs (38 organisms) to 441 ALPHs (332 organisms), this way providing a better resolution of ALPH distribution across eukaryotes.
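The fractions quoted above can be recomputed directly from the counts given in the text. The short Python sketch below does this for the groups with explicit counts; the dictionary and function names are hypothetical, and only numbers stated in this section are used.

```python
# (organisms with ALPH, proteomes screened), as quoted in the text above.
ALPH_COUNTS = {
    "Euglenozoa": (32, 33),
    "Stramenopiles": (25, 28),
    "Fungi": (238, 298),
    "Rhodophyceae": (3, 4),
    "Chlorophyta": (11, 18),
    "Haptista": (2, 2),
    "Metazoa": (17, 296),
    "All eukaryotes": (332, 827),
}


def percent_with_alph(group):
    """Percentage of screened proteomes in a group with at least one ALPH."""
    positive, total = ALPH_COUNTS[group]
    return round(100 * positive / total, 1)


if __name__ == "__main__":
    for group in ALPH_COUNTS:
        print(f"{group}: {percent_with_alph(group)}%")
```

The overall value obtained this way (about 40%) and the value for Fungi (about 80%) match the percentages given elsewhere in the text.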
The aim of this work was not to analyse horizontal gene transfer between eukaryotes, bacteria and archaea; however, we could confirm the presence of ALPHs in a subgroup (11/25) of Myxococcales, as described in [6] (Additional file 5: Table S3A). We detected ALPH in 1 of 285 archaeal proteomes (OX = 1906665 GN = EON65_52185, UniProt: UP000292173) (Additional file 5: Table S3B); it is possible that this protein is a contaminant. All prokaryotic ALPHs consisted mostly of the catalytic domain with almost no N- or C-terminal extensions.
General features of ApaH-like phosphatases
The dataset was used to refine the characteristics of ALPH proteins (Fig. 2). 40% of all analysed organisms have at least one ALPH isoform. The highest percentage is found in Discoba with 94%, the lowest in Diaphoretickes with 24% (Fig. 2A). Of the 332 ALPH-positive organisms, 25% have more than one ALPH isoform: most (81%) have two and, with one exception, no organism has more than four (Fig. 2B). Organisms with multiple ALPHs were most strongly enriched among the Discoba (94% of organisms with ALPH had at least 2 ALPHs) (Fig. 2B). The vast majority of all ALPH proteins are very short and consist mainly of the catalytic domain with short N- and C-terminal extensions (Fig. 2C). The median size of the C-terminal extensions is 26 amino acids: only 52 of all 441 ALPH proteins have C-termini longer than 100 amino acids and most of these (31) are ALPHs of Discoba. The size of the N-terminus is slightly more variable and has a median of 87 amino acids. Only 61 of all 441 ALPH proteins have N-termini longer than 200 amino acids and many of these (28) belong to ALPHs of Discoba. The largest variance in the size of ALPH N- and C-termini is found in Discoba, reflecting the presence of two different ALPH variants in the Kinetoplastida (discussed below).
The amino acid distances between some of the ALPH motifs are highly conserved (Fig. 2D): 93.7% of all ALPHs have between 28 and 30 amino acids between motif 1 and 2 (72% have exactly 29). The distance between motif 2 and 3 is exactly 26 amino acids for 83.0% of ALPHs and between 25 and 33 for 98.0% of ALPHs. The distance between the two C-terminal motifs 5 and 6 is exactly 19 amino acids for 92.5% of ALPHs. The distances between motif 3 and 4 are well conserved in ALPHs of Diaphoretickes and Discoba (94% are between 55 and 64 amino acids) but less well within Amorphea. The distances between motif 4 and 5 are poorly conserved. Sequence motifs were created for all six motifs (Fig. 2E) [31]. Mostly, these are conserved with some group-specific preferences at certain positions indicated with orange bars (Fig. 2E). The ALPH sequence of S. cerevisiae is shown as an example, to illustrate the definitions of all features discussed above (Fig. 2F).
ALPHs of Opisthokonts (fungi and holozoa)
The majority of available ALPH sequences (292) are from Fungi, because the number of available proteomes is high (298) and 80% of these proteomes contain at least one ALPH. We investigated these ALPH sequences further by looking for predicted domains (InterPro [32]), signal peptides and trans-membrane helices (Phobius [33] and TargetP [34]) and predicted localisation (DeepLoc [35] and WoLF PSORT [36]) (Fig. 3A and Additional file 3: Table S1b). ALPHs of Fungi have very short C-termini (median = 26 amino acids) and only slightly larger N-termini (median = 97 amino acids). Most fungal ALPHs (95.2%) do not contain any predictable domain in addition to the catalytic ALPH domain. Of the 14 ALPHs that have a further domain, three have domains with functions in cytochrome c complex assembly (IPR021150, IPR018793), indicating mitochondrial functions, and five ALPHs have a THIF-type NAD/FAD binding fold, usually found in the ubiquitin-activating E1 family, indicating a possible function in protein degradation. The ALPH of Lentinula edodes has a Peroxin-3 domain, indicating a peroxisomal function. Two ALPHs of Rachicladosporium have pectate lyase domains, indicating a possible function in degradation of cell wall material. The ALPH of Phycomyces blakesleeanus has a spore coat protein CotH (IPR014867) domain, a domain of bacterial origin with unknown function in eukaryotes. The ALPH of Rhizopogon vesiculosus has a second ALPH domain.
The most interesting finding was the presence of predicted trans-membrane regions and/or signal peptides within the C-terminal region in 78.1% of all fungal ALPH proteins: 184 ALPH proteins have predicted membrane helices (mostly one), a further 41 have predicted signal peptides, 3 have both and only 64 have neither (Fig. 3A and B, Additional file 3: Table S1b). In agreement with these data, only 49 ALPH proteins (16.8%) have a predicted cytoplasmic localisation (mostly the ones without predicted membrane helices and signal peptides) (Fig. 3A and B, Additional file 3: Table S1b). The remaining proteins are predicted to be in the Golgi (52.1%), endoplasmic reticulum (19.5%), mitochondrion (6.5%), lysosome/vacuole (3.8%), nucleus (2 proteins, 0.7%), peroxisome (1 protein, 0.3%) or extracellular (1 protein, 0.3%). Prediction data need to be considered with care in the absence of experimental evidence, but taken together, the data provide strong evidence that the majority of fungal ALPH proteins are non-cytoplasmic. Experimental data confirm non-cytoplasmic localisation for the ALPH protein of S. cerevisiae (YNL217w, Ppn2): two high-throughput studies indicate vacuolar localisation [37, 38] and recent data show that Ppn2 is delivered to the vacuolar lumen via the multivesicular body pathway, where it functions as an endopolyphosphatase [7].
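The counts and percentages in this paragraph are internally consistent, which can be verified with a few lines of Python. The sketch below uses only the numbers quoted above; the variable names are hypothetical.

```python
# Fungal ALPH proteins by predicted C-terminal feature (counts from the text).
feature_counts = {
    "membrane_helix_only": 184,
    "signal_peptide_only": 41,
    "both": 3,
    "neither": 64,
}
total_fungal_alphs = sum(feature_counts.values())
with_feature = total_fungal_alphs - feature_counts["neither"]
percent_with_feature = round(100 * with_feature / total_fungal_alphs, 1)

# Predicted cytoplasmic localisation (count from the text).
cytoplasmic = 49
percent_cytoplasmic = round(100 * cytoplasmic / total_fungal_alphs, 1)

# Localisation percentages as quoted; they should add up to 100%.
localisation_percent = {
    "cytoplasm": 16.8,
    "Golgi": 52.1,
    "endoplasmic reticulum": 19.5,
    "mitochondrion": 6.5,
    "lysosome/vacuole": 3.8,
    "nucleus": 0.7,
    "peroxisome": 0.3,
    "extracellular": 0.3,
}
percent_sum = round(sum(localisation_percent.values()), 1)

if __name__ == "__main__":
    print(total_fungal_alphs, percent_with_feature, percent_cytoplasmic, percent_sum)
```

The four feature classes add up to the 292 fungal ALPH sequences, and the recomputed fractions agree with the 78.1% and 16.8% stated in the text.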
ALPHs are underrepresented in Holozoans (Fig. 1). In particular, all 130 vertebrate proteomes lack ALPHs and of the three available non-vertebrate proteomes of Chordata, only the lancelet Branchiostoma floridae is ALPH-positive. Out of the 140 available Ecdysozoan proteomes, only four organisms (Arachnida species) contain ALPH. ALPHs are present in subgroups of Cnidarians (2/3), Echinodermatans (1/2), Lophotrochozoans (8/11), Placozoans (1/2) and Ichthyosporeans (1/2). Seven of these 18 Holozoan ALPH proteins have a predicted signal peptide or transmembrane helix at their C-termini and, with the exception of ALPH from Lottia gigantea, all proteins have non-cytoplasmic localisation predictions, mostly to the mitochondria (13/18) (Fig. 3C, Additional file 3: Table S1c). Two proteins have an additional domain: ALPH of Pomacea canaliculata has a domain of the FAD/NAD(P)-binding superfamily N-terminal of the catalytic domain and ALPH of Leptotrombidium deliense may interact with actin, as it has a Kaptin domain C-terminal of its ALPH domain.
All the 42 organisms containing an ALPH protein with a cytoplasmic localisation prediction also have a Dcp2 homologue (Blast, Additional file 3: Table S1b and c), indicating that a function of the ALPH protein in mRNA decapping, if present, is not exclusive.
ALPHs of Diaphoretickes
We found no ALPHs in land plants (100 proteomes), Apicomplexa (20 proteomes), Ciliata (3 proteomes) or Dinoflagellata (1 proteome). ALPH is present in 3 of 4 species of red algae, 11 of 18 species of green algae (Chlorophyta), in the filamentous green alga Klebsormidium nitens, in the photosynthetic Alveolate Vitrella brassicaformis, in the cryptophyte alga Guillardia theta and in the two available Haptista proteomes. 25 of 28 Stramenopiles have ALPHs: these are mostly (non-photosynthetic) Oomycetes, including for example Phytophthora parasitica, all strains of which have three ALPH isoforms. Predicted signal peptides or transmembrane helices are present at the C-termini of many Chloroplastida ALPHs (11/19), as well as in ALPH proteins of the Alveolate Vitrella brassicaformis, the diatom Phaeodactylum tricornutum and the Haptista Emiliania huxleyi; in all cases the presence of transmembrane helices or signal peptides correlates with a predicted non-cytoplasmic localisation (Fig. 4). With two exceptions, all 19 ALPH proteins from Chloroplastida have non-cytoplasmic localisation predictions (11 mitochondrion, 4 chloroplast, 2 endoplasmic reticulum). In contrast, the remaining 74 ALPH proteins from non-Chloroplastida have mostly cytoplasmic localisation predictions, with only 7 exceptions (6 mitochondrion, 1 chloroplast).
The majority of ALPH proteins from Diaphoretickes are very short and, in particular for ALPHs of Oomycetes, consist almost exclusively of the catalytic domain with very short N- and C-terminal extensions. Of all 63 Diaphoretickes ALPH proteins, only five have additional domains: the ALPH of Chlamydomonas reinhardtii has a predicted Transposase IS605 domain (a transposon of bacterial origin), the ALPH of Helicosporidium has a coatomer delta subunit domain and the three large and almost identical ALPH proteins of Chlorella sorokiniana have a PsbQ-like domain (IPR023222), an Arm-like repeat and a second ALPH domain. A PsbQ-like domain is typical for proteins of photosystem II, consistent with localisation predictions of the three Chlorella sorokiniana ALPH proteins to the chloroplast.
Of the 29 Diaphoretickes that have ALPH proteins with predicted cytoplasmic localisation, 26 have a readily identifiable homologue to Dcp2 of Arabidopsis thaliana (in some cases only as a fragment as proteomes were not complete) (Additional file 3: Table S1d). The three remaining organisms with no identifiable Dcp2 homologue are Chrysochromulina tobinii, Hyaloperonospora arabidopsidis and Pythium insidiosum. However, closely related species to all three organisms have Dcp2 and it is likely that the absence of a Dcp2 is caused by genome incompleteness: the proteome of C. tobinii is 37% incomplete and the proteomes of H. arabidopsidis and P. insidiosum are 7% and 1% incomplete, respectively. Thus, any function of an ALPH protein in mRNA decapping is likely not exclusive but occurs in addition to Dcp2.
ALPHs of Euglenozoa
ALPHs are present in all Kinetoplastida and in their close relative Euglena gracilis (Additional file 3: Table S1e). The only exception is the free-living, non-parasitic Bodo saltans.
ALPHs of Kinetoplastida fall into two groups (Fig. 5A): each organism has exactly one ALPH that is homologous to T. brucei ALPH1, the mRNA decapping enzyme [10]. These ALPHs all have a C-terminal extension of between 220 and 278 amino acids and, with two exceptions (Leptomonas pyrrhocoris and T. grayi), they all have N-terminal extensions of a similar size. The in vitro mRNA decapping activity of T. brucei ALPH1 does not require the N-terminus [10], and the two ALPHs that lack the N-terminus are therefore also likely active in mRNA decapping. The fact that no Kinetoplastida strain has lost its ALPH1 homologue, and the absence of homologues to the canonical mRNA decapping enzymes (DCP1/DCP2), indicates that all Kinetoplastida rely on ALPH1 for mRNA decapping.
In addition to the ALPH1 homologue, all but one Kinetoplastid have between one and three additional ALPH proteins that consist exclusively of the catalytic domain (Fig. 5A). These ALPH proteins of Leishmania, Leptomonas, Crithidia and Endotrypanum have extensions within the catalytic domain that are mainly caused by enlarged distances between motif 1 and 2 (up to 115 nucleotides in Leptomonas and Crithidia, more than in any other ALPH) and between motif 4 and 5 (up to 156 nucleotides). The only Kinetoplastida that has no ALPH apart from ALPH1 is Leishmania tarentolae; blast searches identified a protein at the expected position in the genome which lacked both motif 5 and motif 6, indicating a recent loss of the protein.
Differences between the two different groups of Kinetoplastida ALPH proteins are also obvious within the phosphatase and ALPH motifs: several positions show major differences; most pronounced is the difference in motif 1 (GDVHG/GDIHG for ALPH1/non-ALPH1) and in motif 6 (IDTG/LDSG for ALPH1/non-ALPH1) (Fig. 5B). When a phylogenetic tree is constructed from the sequences of the catalytic domains, the Kinetoplastida ALPH1 homologues form a separate group, distant to the group of the non-ALPH1 homologues, indicating that the ALPH1 decapping enzyme has evolved only once in the last common ancestor of the Kinetoplastida (Fig. 5A).
Protein localisation predictions are difficult in Euglenozoa as these organisms have diverse and poorly defined targeting signals: localisation predictions from Phobius, DeepLoc and TargetP are listed in Additional file 3: Table S1e but do not correlate well with each other and, in particular, not with our experimental data. The decapping enzyme T. brucei ALPH1, fused to eYFP either C- or N-terminally, localises to the cytoplasm, P-bodies and to the posterior pole of the cell and this cytoplasmic localisation is consistent with the essential function of ALPH1 in mRNA decapping ([10] Fig. 5A). It is likely that ALPH1 orthologues of all other Kinetoplastida have predominantly cytoplasmic localisations too. To investigate the localisation of a non-ALPH1 homologue, T. brucei ALPH2 was expressed as a C-terminal eYFP fusion in procyclic trypanosomes using an inducible expression system [39]. ALPH2 showed a localisation pattern characteristic of mitochondrial proteins and co-localised with a mitochondrial stain (MitoTracker™ Orange) (Fig. 5A).
Euglena is too distantly related to the Kinetoplastida to unequivocally assign its ALPH to the ALPH1 or non-ALPH1 group, but its sequence motifs (GDIHG, GDLVGKG, GNHD, HAG, VFFGH, LDTG) are more similar to the non-ALPH1 group and this is also suggested by the absence of N- and C-terminal extensions (Fig. 5A). Interestingly, Euglena ALPH is the only Euglenozoa ALPH with a predicted trans-membrane domain. Euglena ALPH was not enriched within purified mitochondria fractions [40] and also not within chloroplasts (Martin Zoltner, Charles University in Prague, personal communication). Like Kinetoplastida, Euglena has no recognisable homologue to the canonical mRNA decapping enzyme DCP1/DCP2 in its genome and the absence of an ALPH1 homologue raises the question of how mRNA decapping is achieved in this organism.
As previously reported, Blast searches cannot identify a Dcp2 homologue in Kinetoplastida, with the one exception of Perkinsela (not examined in this study) [10]. Perkinsela is an obligate intracellular component of the Amoebozoa Neoparamoeba. Its Dcp2 has closest similarity to Dcp2 of flowering plants and the gene may have been taken up by lateral gene transfer.
ALPHs have in vitro mRNA decapping activity
Our phylogenetic data and those of others [6] show that ApaH-like phosphatases were present in the last common ancestor of eukaryotes. Since then, 60% of eukaryotes have lost the enzyme and of those ALPH proteins that are still present, 73% have predicted non-cytoplasmic localisations. ALPH proteins are pyrophosphatases and at least the related bacterial ApaH protein has a rather broad substrate specificity: it cleaves, for example, both NpnN nucleotides [13,14,15] and mRNA capped with NpnN [24, 26]. The need to protect capped mRNAs from uncontrolled degradation may create selection for the loss or non-cytoplasmic localisation of eukaryotic ALPH proteins. To test this hypothesis, we tested ALPH proteins with non-cytoplasmic localisation predictions for mRNA decapping activity in vitro.
We produced recombinant ALPH proteins from randomly selected organisms of the three eukaryotic kingdoms that have ALPHs: we used ALPH of the Ichthyosporea Sphaeroforma arctica, ALPH of the green alga Auxenochlorella protothecoides (A0A087SQ73) and ALPH2 from Trypanosoma brucei (ALPH2 has mitochondrial localisation and is not the mRNA decapping enzyme). As a control, ALPH of Sphaeroforma arctica was also produced as an inactive mutant by mutating a conserved amino acid in the metal ion binding motif (GDVIG:GNVIG [41]). All four proteins were purified from Arctic express cells (Fig. 6A) and subsequently tested in in vitro decapping assays using a 39-nucleotide RNA oligo with an m7G cap structure as a substrate. Capped and uncapped oligos can be distinguished by differences in gel mobility on urea acrylamide gels supplemented with acryloylaminophenyl boronic acid [42, 43]. The decapping activity of the three active enzymes was tested in the presence of different bivalent ions (Mg2+, Mn2+, Co2+ and Zn2+) (Fig. 6B). All the enzymes had mRNA decapping activity, but the ion requirements for optimal activity differed: the ALPH of Sphaeroforma arctica showed the highest decapping activity with Mn2+ and some activity with Co2+, but none with Mg2+ or Zn2+. T. brucei ALPH2 had the best decapping activity with Co2+ and some with Mg2+ and possibly Mn2+, but none with Zn2+. Auxenochlorella protothecoides ALPH had the best activity with Mg2+ and some with Co2+, but none with Zn2+ or Mn2+. These differences in ion requirements indicate that the enzyme activities are likely specific and not caused by contaminating bacterial enzymes, which would be expected to require identical ions in all samples. Moreover, the catalytically inactive mutant ALPH of Sphaeroforma arctica had no decapping activity when tested with its supportive ion (Mn2+), which is further strong evidence for the specificity of the decapping activity.
The data show that all ALPH enzymes tested accept capped mRNA as a substrate in vitro. Importantly, the tested ALPH proteins have non-cytoplasmic localisation predictions, indicating that mRNA cannot be their physiological substrate. Instead, the findings suggest that ALPH proteins may have a broad substrate specificity similar to that previously shown for bacterial ApaH. The data support the hypothesis that the preference for no ALPHs or for non-cytoplasmic ALPHs in many eukaryotes could serve to prevent uncontrolled mRNA degradation.