Abstract

The biclustering problem has been extensively studied in many areas, including e-commerce, data mining, machine learning, pattern recognition, statistics, and, more recently, computational biology. Given an n × m matrix A (n ≥ m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function that specifies the quality of the found bicluster (formed by the subsets of rows and of columns of A) is optimized. The problem has been proved or conjectured to be NP-hard for various objective functions. In this article, we study a probabilistic model for the implanted additive bicluster problem, where each element in the n × m background matrix is a random integer from [0, L − 1] for some integer L, and a k × k implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L − 1] with probability θ. We propose an O(n²m) time algorithm based on voting to solve the problem. We show that when $k \geq \Omega(\sqrt{n \log n})$, the voting algorithm can correctly find the implanted bicluster with probability at least $1 - \frac{9}{n^{2}}$. We also implement our algorithm as a C++ program named VOTE. The implementation incorporates several ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with overlapping implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with high accuracy and speed.
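The probabilistic model described in the abstract can be illustrated with a minimal Python sketch. It is only a simulation of the input model, not the VOTE program or the voting algorithm itself. The function name `implant_additive_bicluster` and its parameters are hypothetical, and the abstract does not say how an error-free additive bicluster is kept inside [0, L − 1]; the sketch assumes each entry is a row offset plus a column offset taken modulo L.

```python
import random


def implant_additive_bicluster(n, m, k, L, theta, seed=0):
    """Simulate an n x m random background with a noisy k x k additive bicluster.

    Assumption (not stated in the abstract): the error-free bicluster has
    entries (row_offset + col_offset) mod L, so any two of its rows differ
    by a constant modulo L.
    """
    rng = random.Random(seed)

    # Background: every element is a random integer from [0, L - 1].
    A = [[rng.randrange(L) for _ in range(m)] for _ in range(n)]

    # Choose which k rows and k columns carry the implanted bicluster.
    rows = sorted(rng.sample(range(n), k))
    cols = sorted(rng.sample(range(m), k))
    row_off = {i: rng.randrange(L) for i in rows}
    col_off = {j: rng.randrange(L) for j in cols}

    for i in rows:
        for j in cols:
            if rng.random() < theta:
                # With probability theta the element is replaced by a
                # random number in [0, L - 1].
                A[i][j] = rng.randrange(L)
            else:
                A[i][j] = (row_off[i] + col_off[j]) % L
    return A, rows, cols


if __name__ == "__main__":
    A, rows, cols = implant_additive_bicluster(n=12, m=8, k=4, L=10, theta=0.1)
    print("bicluster rows:", rows)
    print("bicluster cols:", cols)
```

Setting `theta=0.0` yields an error-free bicluster, which is a convenient sanity check: within the chosen columns, the difference between any two chosen rows is the same value modulo L.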

An Efficient Voting Algorithm for Finding Additive Biclusters with Random Background
