“Smurf” Gene Aside, Most Gene Names Are a Mess, Expert Says

What’s in a name? With apologies to Shakespeare, Hongfang Liu, PhD, says that a specific gene that goes by any other name doesn’t “smell sweet” at all, nor does it make her job a bed of roses.

That’s because Liu has tasked herself with making it easy for scientists and non-scientists alike to figure out which gene is which. Given the historical looseness of gene naming and the evolving nature of biology research, different genes can have the same name – Asp, for example, refers to 14 genes across 8 species – and a single gene can have many names. Among genes that share a name, some are related, for example by being homologous to each other or by having similar functions or phenotypes, while others are completely different.

Think of it like Church Street, which, it seems, every town and city has, says Liu, an assistant professor in the department of biostatistics, bioinformatics, and biomathematics. If you simply said you live on Church Street, no one outside your town would know where that is, she says. And you could live on a Church Street that no longer has a church on it, or one that has since been renamed to something like Maple Avenue.

“It’s a big mess out there. A lot of people assign names to genes not realizing other people have already done that,” she says.

Liu first became aware of the problem when she began her PhD project at Columbia University, which aimed to automatically extract information about genes from scientific papers. After studying mathematics and computer science, she says with a laugh, “there is no difference between computers and me when reading scientific papers since neither of us have a biological background. I found the project very challenging because different genes had the same name and different names were given to the same gene.”

Her solution? “Since so many papers have been published this way, we need a comprehensive, automatic resource that provides mapping between gene names and genes.” So Liu, who has been at GUMC since 2006, heads a project dedicated to helping people figure out which gene a paper is referring to. Through a National Science Foundation (NSF) grant, she developed the online resource BioThesaurus, which provides mappings between names and genes. Scientists have used it extensively – it receives about 200,000 hits a month – to retrieve synonymous names of a gene and to identify ambiguous names.
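To make the idea of a name-to-gene mapping concrete, here is a minimal sketch in Python. It is not BioThesaurus code and does not reflect how that resource is implemented; the gene identifiers (`gene_001` and so on), the alias `alias_x`, and the function names are all hypothetical placeholders chosen for illustration. The sketch assumes names are matched exactly, including letter case.

```python
from collections import defaultdict

# Hypothetical (name, gene identifier) pairs. The identifiers and "alias_x"
# are placeholders for illustration, not records from any real database.
NAME_GENE_PAIRS = [
    ("Asp", "gene_001"),
    ("Asp", "gene_002"),
    ("alias_x", "gene_001"),
    ("smurf", "gene_003"),
]


def build_index(pairs):
    """Build both directions of the many-to-many mapping."""
    name_to_genes = defaultdict(set)
    gene_to_names = defaultdict(set)
    for name, gene in pairs:
        name_to_genes[name].add(gene)
        gene_to_names[gene].add(name)
    return name_to_genes, gene_to_names


def synonyms(name, name_to_genes, gene_to_names):
    """Return every other name that shares at least one gene with `name`."""
    found = set()
    for gene in name_to_genes.get(name, set()):
        found |= gene_to_names[gene]
    found.discard(name)
    return found


def ambiguous_names(name_to_genes):
    """Return, in sorted order, the names that refer to more than one gene."""
    return sorted(n for n, genes in name_to_genes.items() if len(genes) > 1)


if __name__ == "__main__":
    names, genes = build_index(NAME_GENE_PAIRS)
    print(ambiguous_names(names))
    print(synonyms("Asp", names, genes))
```

Keeping both directions of the mapping is what lets one lookup answer "what else is this gene called?" and the other answer "does this name point to more than one gene?".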

With a $0.8 million NSF CAREER grant and a new two-year, $1.2 million National Institutes of Health (NIH) grant, awarded this fall under the American Recovery and Reinvestment Act of 2009, she plans to develop onto-BioThesaurus, which will enhance BioThesaurus by distinguishing same-named genes that are related to each other from those that are very different – except in name. It does this by comparing the sequences and/or functions of the genes and arranging them into one or several hierarchical trees, which Liu calls an ontology-based approach.

Liu’s research is recognized as critical to storing, retrieving, and extracting knowledge and information in the biomedical domain, and therefore to enabling knowledge-based "-omics" data analysis for systems biology and medicine. Because of that, Liu has many collaborators locally, nationally, and internationally.

But there are a few names that have no competitors, Liu says, and will likely never be used more than once – names such as the “pokemon” gene mutation and “fear of intimacy,” along with “smurf.” They provide a bit of comic relief, Liu says with a smile.

By Renee Twombly, GUMC Communications

(Published December 16, 2009)