What is Genes2WordCloud?

Genes2WordCloud is a web-based application that enables users to create WordClouds from biologically relevant content.

A WordCloud is a visual display of a set of words where the font, size, color or angle can represent some underlying information. A WordCloud is an effective way to visually summarize information about a specific topic of interest. The WordCloud is optimized to maximize the display of the most important terms about a specific topic in the minimum amount of space.


As researchers face a daunting and growing amount of data and text, methods to quickly summarize knowledge about a specific topic from large bodies of text or data are critical. WordClouds are emerging as a method of choice on the web to accomplish this task.

Genes2WordCloud generates WordClouds from the following sources:

How does it work?

There are two tasks for creating WordClouds: first, generating the keywords to display; and secondly, displaying the keywords.

Generating the keywords

The keywords are generated on the server in several ways depending on the source chosen. In each case the process can be divided into two main tasks: obtain the text related to the user input, and text-mine the text.


Diagram 1 - Main task 1: obtain text from the user input

Diagram 2 - Main task 2: text-mining

Diagram 3 - Text-mining task details

Python's NLTK includes a lemmatizer that works well for English and offers benefits over the commonly used Porter stemming algorithm. Lemmatizers are more language-aware and do not merge words that do not actually refer to the same concept. The lemmatized words are used for the WordCloud output.

Users have the option to remove text from the keywords, for instance common English words such as the, is, or are; the complete list is available in the following file. Common biological terms such as experiments, abstracts, or contributes can also be removed. These terms are available here. They were chosen by hand curation after experimenting with many WordClouds. Text-mining of GeneRIFs and Gene Ontology annotations likewise removes common terms. Finally, a stopwords input box lets users blacklist any additional words they want.
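The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the Genes2WordCloud implementation: the two word sets are tiny hypothetical stand-ins for the curated lists, and keyword_counts is a name chosen here for the example.

```python
import re
from collections import Counter

# Tiny hypothetical stand-ins for the curated lists mentioned above.
COMMON_ENGLISH = {"the", "is", "are", "and", "of", "in", "a", "for"}
COMMON_BIOLOGY = {"cell", "experiments", "abstracts", "contributes"}


def keyword_counts(text, user_stopwords=()):
    """Count words in text after removing the built-in and user stopwords."""
    blacklist = COMMON_ENGLISH | COMMON_BIOLOGY | {w.lower() for w in user_stopwords}
    # Lowercase the text and keep alphanumeric tokens (hyphens allowed inside).
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(t for t in tokens if t not in blacklist)


counts = keyword_counts(
    "NANOG is required for self-renewal of the stem cell and the pluripotent state",
    user_stopwords=["required"],
)
print(counts.most_common(3))
```

The resulting counts can serve as the weights that decide how prominently each keyword appears in the WordCloud.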

The source files used to create the database for processing lists of genes to create WordClouds were taken from:

The different methods to obtain text from the user input and the text-mining algorithms consume a lot of CPU time and memory. For this reason each query uses at most 150 abstracts or 500 annotations; when a query returns more than these limits, a random subset of that size is used.
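The capping step can be illustrated with Python's standard random module. This is a sketch under assumptions: the limits are the ones quoted above, while cap_results and the placeholder abstracts are hypothetical names and data used only for this example.

```python
import random

MAX_ABSTRACTS = 150    # limits quoted in the text above
MAX_ANNOTATIONS = 500


def cap_results(items, limit, rng=random):
    """Return all items if within the limit, else a random subset of size limit."""
    items = list(items)
    if len(items) <= limit:
        return items
    # random.sample draws without replacement, so no item is picked twice.
    return rng.sample(items, limit)


abstracts = [f"abstract {i}" for i in range(1000)]  # placeholder query results
subset = cap_results(abstracts, MAX_ABSTRACTS)
print(len(subset))
```

Because the subset is random, two runs of the same large query can produce slightly different WordClouds.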

Displaying the WordCloud

While a number of general-purpose WordCloud generators exist, there are also several JavaScript libraries, the two primary ones being d3-cloud.js and wordcloud2.js. Both were tried, and ultimately wordcloud2.js was modified to work more like d3-cloud.js, combining d3-cloud's SVG output with wordcloud2's better drawing routine. After the text-mining is done on the server, your web browser generates and displays the WordCloud itself.
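One way the server-side result could be handed to the browser is as a list of [word, size] pairs, which is the shape wordcloud2.js accepts for its list option. The Python sketch below is an assumption about how such a payload might be built, not a description of the Genes2WordCloud code: to_word_list, the linear scaling, and the 10 to 60 font-size range are all hypothetical choices.

```python
import json


def to_word_list(counts, min_size=10, max_size=60):
    """Map keyword counts to [word, font_size] pairs, scaled linearly."""
    if not counts:
        return []
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid dividing by zero when all counts are equal
    return [
        [word, round(min_size + (n - lo) * (max_size - min_size) / span)]
        for word, n in sorted(counts.items(), key=lambda kv: -kv[1])
    ]


# Hypothetical counts; real values would come from the text-mining step.
payload = json.dumps(to_word_list({"kinase": 9, "MAPK": 5, "apoptosis": 1}))
print(payload)
```

Sorting by descending count places the most important terms first in the list.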

A web-based user-interface was added to Genes2WordCloud where several parameters such as the font or the layout can be changed.

Examples

In this section we provide some examples of using Genes2WordCloud.

A GeneRIF-based WordCloud for NANOG and SOX2


NANOG and SOX2 are both genes encoding transcription factors involved in embryonic stem cell self-renewal and pluripotency maintenance. The WordCloud automatically recovered relevant terms such as stem (the word cell was automatically removed because it is considered a common biological term), differentiate, pluripotent, and self-renewal. Oct4, a gene that is often associated with NANOG and SOX2, was also recovered by Genes2WordCloud.

A WordCloud that is based on our laboratory web-page was also created as an example


The Ma'ayan Laboratory is a computational systems biology laboratory, and the program correctly extracted the most relevant terms that describe the function of the lab, for example: network, mammalian, software, database, compute, web-based tool.

A WordCloud for the p38 pathway based on a PubMed search


This WordCloud was obtained with the PubMed search: p38 pathway. The algorithm recovered terms such as kinase, signal, MAPK, phosphorylate, and apoptosis, which are relevant to the p38 pathway, a signaling pathway involved in cell differentiation and apoptosis.

Troubleshooting

What to do if you don't see the WordCloud?

There are a few possible explanations:

If it still doesn't work, open your browser's JavaScript console to look for an error message, and include that message when you report the problem.

Contact us if you experience difficulties with your query or results; we will try to debug the error and get back to you.