Step-by-step tutorial

 

Step 1. Enter a query sequence (example). Only one peptide sequence is accepted, and it must be 10 to 40,000 amino acids long.
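If you prepare queries in a script, the two constraints above can be checked before submission. The following is a minimal sketch, not part of aLeaves itself; the function name check_query and the assumption that the query is given as plain residues or as a single FASTA record are ours.

```python
MIN_LEN = 10       # lower length limit stated in Step 1
MAX_LEN = 40_000   # upper length limit stated in Step 1


def check_query(text):
    """Return the residue string if the input holds one sequence of allowed length.

    Hypothetical helper: accepts plain residues or a single FASTA record.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) > 1:
        raise ValueError("only one peptide sequence is accepted")
    # Join all non-header lines into one residue string.
    residues = "".join(ln for ln in lines if not ln.startswith(">"))
    if not MIN_LEN <= len(residues) <= MAX_LEN:
        raise ValueError(
            f"query length {len(residues)} is outside {MIN_LEN}-{MAX_LEN}"
        )
    return residues
```

A call such as check_query(">query\nMKTAYIAKQRQISFVKSHFSRQ") returns the residues, while an input with two '>' lines or fewer than 10 residues raises ValueError.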

Step 2. Select at least one database for the search. The phylogenetic coverage of each listed database is shown graphically on the 'Database' page. Please note that selecting many databases, or large ones, increases computation time.

Step 3. Enter the number of sequences you want. The number of sequences you finally get may be smaller than the number entered here, depending on the divergence and phylogenetic distribution of the homologs and on the search threshold (see Step 4 below).

Step 4. Enter the E-value threshold for the search. Relaxing the threshold (that is, entering a larger E-value) can increase the number of sequences you finally get.
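The interplay between the requested number of sequences (Step 3) and the E-value threshold (Step 4) can be illustrated with a small sketch. It is a simplified illustration only and not the selection procedure used by aLeaves, which also depends on the phylogenetic distribution of the homologs; the function select_hits and the hit list are hypothetical.

```python
def select_hits(hits, max_evalue, max_count):
    """Keep hits at or below the E-value threshold, best first, up to max_count.

    hits is a list of (sequence_id, evalue) pairs (hypothetical layout).
    """
    kept = [h for h in hits if h[1] <= max_evalue]
    kept.sort(key=lambda h: h[1])  # smaller E-value = more significant hit
    return kept[:max_count]


# Made-up hits for illustration only.
hits = [("seqA", 1e-50), ("seqB", 1e-8), ("seqC", 0.5)]

strict = select_hits(hits, max_evalue=1e-10, max_count=100)
relaxed = select_hits(hits, max_evalue=1e-5, max_count=100)
print(len(strict), len(relaxed))
```

Although 100 sequences are requested in both calls, the strict threshold leaves one hit and the relaxed threshold leaves two, which is why the final number of sequences can be smaller than the number entered.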

Step 5. Select 'yes' if you want to turn on the low-complexity filter (see the NCBI Blast help) for the search. By default, the low-complexity filter is not applied.

Step 6. Run the search by clicking the 'Search' button. The search can take from a few seconds to several minutes, depending on the length of the query sequence, the databases selected, etc.

Step 7. If something is wrong with your search settings, an error (example) is reported. Clicking 'Redo the search' takes you back to the top page, where you can run another search with new settings.

Step 8. If the search completes successfully, you are automatically taken to the page titled 'Search finished!' (example), where you can view and download the output. In the middle of the page, in the section 'Proceed to tree building', you will find a link to a page that guides you through tree building on the MAFFT server (example).

 


Frequently Asked Questions


Q. Some species with already sequenced genomes are not covered in the database selection. Why?

A. At the moment, we include only species with sequenced and published genomes whose gene annotation has already been made publicly accessible. Please let us know if you know of any sequenced and published genome that is not covered in the list.



Q. How can we tell which sequences in the output multi-fasta file come from which species?

A. In the databases derived from genome projects (that is, all but Databases #7 and #13), the definition lines starting with the symbol '>' contain abbreviated names of the source species, such as 'ACRDI' for Acropora digitifera. These abbreviations let you tell the included species apart unambiguously.
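As an illustration, the sequences in a downloaded multi-fasta file can be tallied by species with a short script. This is a sketch under assumptions: the exact layout of the definition lines is not described here, so the script only checks whether a known abbreviation occurs anywhere in the line, and the headers in the example are invented. The function name count_by_species is ours.

```python
from collections import Counter


def count_by_species(fasta_text, codes):
    """Count definition lines per species abbreviation.

    codes is a list of abbreviations to look for (e.g. taken from the
    species list); lines matching none of them are counted as 'unknown'.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip residue lines
        label = next((c for c in codes if c in line), "unknown")
        counts[label] += 1
    return counts


# Invented headers; real aLeaves definition lines may look different.
example = (
    ">ACRDI_hypothetical_id_1\nMKT\n"
    ">ACRDI_hypothetical_id_2\nMLS\n"
    ">no_code_here\nMAA\n"
)
counts = count_by_species(example, ["ACRDI"])
print(counts["ACRDI"], counts["unknown"])
```

Substring matching is deliberately simple; if an abbreviation could also occur inside a gene name, a stricter match on the relevant part of the definition line would be needed.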



Q. There are more large-scale sequence resources (e.g., transcriptomes from deep sequencing and Sanger sequencing), but they are not included in the databases at aLeaves. Is there any hope of being able to search such sequence resources in the future?

A. At the moment, we do not include any resource available solely as a transcriptome. We are considering including some such resources for particular species in the future, especially where no whole genome sequence is available for those species.



Q. Are all the sequences I got as a result of an aLeaves run actually expressed as proteins?

A. With aLeaves, Blastp searches are performed against protein sequence databases containing both validated and non-validated sequences. Many of the latter are products of automatic gene prediction programs run on genome assemblies. Please note that sequences in this category in particular include false positives that do not exist in reality, as well as sequences with incomplete ends caused by partial misannotation or gappy genome assemblies.



Q. Are the five-letter species identifiers compatible with other databases?

A. No, not necessarily. Identifiers for some species may be the same as in other databases, but only by chance. In aLeaves, we combined the first three letters of the genus name and the first two letters of the species name to form the five-letter identifier for each species. If the output of aLeaves is passed on to the MAFFT web server and a phylogenetic tree is built from the aLeaves-derived dataset, the five-letter species identifiers are automatically recognized and different species are shown in corresponding colors. This should make it easy to see how different groups of organisms are distributed in the tree. The color code is shown in the table on the 'Species' page.
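The naming rule described above can be written as a short function. This is a sketch of the rule as stated, not code from aLeaves; the function name species_id is ours, and the upper-casing follows the 'ACRDI' example.

```python
def species_id(binomial):
    """Build a five-letter identifier from a 'Genus species' name.

    Takes the first three letters of the genus name and the first two
    letters of the species name, in upper case.
    """
    genus, species = binomial.split()[:2]
    return (genus[:3] + species[:2]).upper()


print(species_id("Acropora digitifera"))  # ACRDI
```

Because only five letters are kept, the same identifier may denote a different species in another database, which is why compatibility is not guaranteed.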



Q. Is aLeaves suitable for searching mitochondrial gene sequences?

A. No. We believe that searches for mitochondrial gene sequences are better performed at other sites, such as the NCBI Blast page. Because mitochondrial sequences make up a considerable proportion of entries in public sequence databases, we removed mitochondrial protein sequences (wherever possible) from Databases #7 and #13 (see the 'Database' page for how they were excluded). Therefore, searches with aLeaves will necessarily miss many mitochondrial sequences.


