TY - GEN
T1 - Characterising DNA/RNA signals with crisp hypermotifs
T2 - 5th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics
AU - Pridgeon, Carey
AU - Corne, David
PY - 2007
Y1 - 2007
AB - A common way to characterise important and conserved signals in nucleotide sequences, such as transcription factor binding sites, is via the use of so-called consensus sequences or consensus patterns. A well-known example is the so-called "TATA-box" commonly found in eukaryotic core promoters. Such patterns are valuable in that they offer an insight into basic molecular biology processes, and can support reasoning regarding the understanding, design and control of these processes. However it is rare for such patterns to be accurate; instead they represent a very approximate characterisation of the signal under study. At the opposite extreme, we may instead characterise such a signal via a neural network, or a high-order Markov model, and so on. These have better sensitivity and specificity, but are unreadable, and consequently unhelpful for conveying an understanding of the underlying molecular biology processes that could support insight or reasoning. We describe a simple pattern language, called crisp hypermotifs (CHMs), that leads to highly readable patterns that can support understanding and reasoning, yet achieve greater sensitivity and specificity than the commonly used approaches to crisply characterise a signal. We use evolutionary computation to discover high-performance CHMs from data, and we argue that CHMs be used in place of classical consensus motifs, and justify that by presenting examples derived from a large dataset of mammalian core promoters. We provide CHM alternatives to the well-known core promoter TATA-box and Initiator patterns that have better sensitivity and specificity than their classical counterparts. © Springer-Verlag Berlin Heidelberg 2007.
UR - http://www.scopus.com/inward/record.url?scp=38049010029&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9783540717829
VL - 4447 LNCS
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 227
EP - 235
BT - Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics - 5th European Conference, EvoBIO 2007, Proceedings
Y2 - 11 April 2007 through 13 April 2007
ER -