What is Genes2WordCloud?
Genes2WordCloud is a web-based application that enables users to create WordClouds from biologically relevant content.
A WordCloud is a visual display of a set of words where the font, size, color or angle can represent some underlying information. A WordCloud is an effective way to visually summarize information about a specific topic of interest. The WordCloud is optimized to maximize the display of the most important terms about a specific topic in the minimum amount of space.
As researchers face a daunting and growing amount of data and text, methods that quickly summarize knowledge about a specific topic from large bodies of text or data are critical. WordClouds are emerging as a method of choice on the web to accomplish this task.
Genes2WordCloud generates WordClouds from the following sources:
- A single gene, or a list of genes. Four different resources are available; the gene(s) are matched to one of:
  - their generifs annotations;
  - their gene ontology annotations;
  - abstracts of PubMed articles linked to the gene(s) through generifs;
  - their mammalian phenotype annotations from MGI.
- Free text, or text extracted from the URL of a website. The text is used directly to generate a WordCloud.
- An author's name. WordClouds can be created from PubMed articles returned for a specific author.
- General PubMed search. A WordCloud can be generated from any PubMed search based on returned abstracts.
- BMC Bioinformatics most viewed articles. Displays a WordCloud created from the most viewed BMC Bioinformatics articles for different time periods.
How does it work?
There are two tasks for creating WordClouds: first, generating the keywords to display; and secondly, displaying the keywords.
Generating the keywords
The keywords are generated on the server in several ways depending on the source chosen. In each case the process can be divided into two main tasks: obtain the text related to the user input, and text-mine the text.
Diagram 1 - Main task 1: obtain text from the user input
Diagram 2 - Main task 2: text-mining
Diagram 3 - Text-mining task details
Python's NLTK has a good lemmatizer, which works well for English and offers benefits over the commonly used Porter stemming algorithm. Lemmatizers are more language-aware and do not merge words that do not actually refer to the same concept. The lemmatized words are used for the WordCloud output.
Users have the option to remove certain words from the keywords. Common English words such as the, is, or are can be removed; the complete list is available in the following file. Common biological terms such as experiments, abstracts, or contributes can also be removed. These terms are available here. They were chosen by hand curation after experimenting with many WordClouds. Common terms are also removed when text-mining generifs and gene ontology annotations. Finally, a stopwords input box is provided for users to blacklist any words they want.
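The filtering step described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the two stopword sets are tiny hypothetical stand-ins for the linked lists, the function name `count_keywords` and its parameters are assumptions, and lemmatization is left out to keep the example self-contained.

```python
import re
from collections import Counter

# Hypothetical, shortened stand-ins for the curated stopword lists.
ENGLISH_STOPWORDS = {"the", "is", "are", "a", "of", "in", "and", "to"}
BIOLOGY_STOPWORDS = {"experiments", "abstracts", "contributes", "cell"}


def count_keywords(text, user_stopwords=(), remove_english=True, remove_biology=True):
    """Return a Counter of lowercase tokens left after stopword removal."""
    # Start from the user's own blacklist, then add the optional lists.
    blocked = {w.lower() for w in user_stopwords}
    if remove_english:
        blocked |= ENGLISH_STOPWORDS
    if remove_biology:
        blocked |= BIOLOGY_STOPWORDS
    # Simple tokenizer: runs of letters, digits and hyphens.
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return Counter(t for t in tokens if t not in blocked)


counts = count_keywords(
    "The stem cell is pluripotent and the stem cell experiments are ongoing",
    user_stopwords=["ongoing"],
)
```

Here `counts` holds only the words that survive all three filters, together with how often each one occurs.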
The source files used to create the database for processing lists of genes to create WordClouds were taken from:
- NCBI for generating a reference of Entrez gene names. Only mouse, rat and human genes were used (file1, file2, file3)
- NCBI file for linking PMIDs to genes. (file4)
- NCBI's GeneRifs annotations. (file5)
- Gene Ontology annotations. Only mouse, rat and human genes were used. (file6, file7, file8, file9)
The different methods to obtain text from the user input and the text-mining algorithms consume a lot of CPU time and memory. To bound this cost, when a query returns more than 150 abstracts or 500 annotations, we use only a random sample of 150 abstracts or 500 annotations.
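The cap on query results can be sketched as below. Only the two limits come from the text above; the helper `cap_results` and the stand-in list of identifiers are hypothetical, and the real server code may sample differently.

```python
import random

MAX_ABSTRACTS = 150    # limit stated in the text above
MAX_ANNOTATIONS = 500  # limit stated in the text above


def cap_results(items, limit, rng=random):
    """Return all items if within the limit, else a random sample of `limit` items."""
    items = list(items)
    if len(items) <= limit:
        return items
    # Sampling without replacement, so no item is picked twice.
    return rng.sample(items, limit)


abstract_ids = list(range(1000))  # hypothetical stand-in for PMIDs returned by a query
selected = cap_results(abstract_ids, MAX_ABSTRACTS)
```

Because the sample is random, two runs of the same large query can produce slightly different WordClouds.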
Displaying the WordCloud
A number of general-purpose WordCloud generators exist, as do several JavaScript libraries, the two primary ones being d3-cloud.js and wordcloud2.js. Both were tried, and ultimately wordcloud2.js was modified to work more like d3-cloud.js, in order to combine d3-cloud's SVG output with wordcloud2's better drawing routine. After the text-mining is processed on the server, your web browser generates and displays the WordCloud itself.
A web-based user-interface was added to Genes2WordCloud where several parameters such as the font or the layout can be changed.
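As an illustration of how a font size range can relate to keyword counts, here is a minimal sketch of linear scaling. It is an assumption made for explanation only: the function `scale_font_sizes` and its defaults are hypothetical, the application does this step in the browser in JavaScript, and its actual scaling rule may differ.

```python
def scale_font_sizes(counts, min_size=10, max_size=60):
    """Linearly map each word's count onto the range [min_size, max_size]."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All words are equally frequent: give every word the midpoint size.
        mid = (min_size + max_size) / 2
        return {word: mid for word in counts}
    span = max_size - min_size
    return {
        word: min_size + span * (count - lo) / (hi - lo)
        for word, count in counts.items()
    }


# Hypothetical counts, for illustration only.
sizes = scale_font_sizes({"kinase": 12, "signal": 7, "apoptosis": 2})
```

With this rule the most frequent word gets the largest size and the least frequent the smallest, so narrowing the font size range makes the words look more alike.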
Examples
In this section we provide some examples of using Genes2WordCloud.
A generif-based WordCloud for NANOG and SOX2
NANOG and SOX2 both encode transcription factors involved in embryonic stem cell self-renewal and pluripotency maintenance. The WordCloud automatically recovered relevant terms such as stem (the word cell was automatically removed because it is considered a common biological term), differentiate, pluripotent, and self-renewal. Oct4, a gene that is often associated with NANOG and SOX2, was also recovered by Genes2WordCloud.
A WordCloud that is based on our laboratory web-page was also created as an example
The Ma'ayan Laboratory is a computational systems biology laboratory and the program correctly extracted the most relevant terms that describe the function of the lab, for example: network, mammalian, software, database, compute, web-based tool.
A WordCloud for the p38 pathway based on a PubMed search
This WordCloud was obtained with the PubMed search: p38 pathway. The algorithm recovered terms such as: kinase, signal, MAPK, phosphorylate, apoptosis which are relevant to the p38 pathway, a signaling pathway involved in cell differentiation and apoptosis.
Troubleshooting
What to do if you don't see the WordCloud?
There are a few possible explanations:
- Your browser is very old. Try to use a more modern browser like Chrome or Firefox.
- Our server is down. Try again in a little while or contact us.
- Your query returned no results. Try a different query to make sure the cloud works on your system to begin with.
- Your parameters are too restrictive and no words could be placed. Try tweaking parameters like frequency significance and font size range.
If it still doesn't work, you can try to identify the error by opening your browser's JavaScript console, and then report the error to us.
Contact us if you experience difficulties with your query and results. We will try to debug the error and get back to you.