Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction
This page contains experimental results for the article
Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction, to appear in the proceedings of RECOMB-CG 2009.
Preliminaries
The Consecutive Ones Property (C1P).
A dataset is a set of m subsets, S0, ..., Sm-1, of a ground set {1,2,
..., n}. Each subset is weighted, with its weight being betwen 0 and
1. Such a dataset can naturally be encoded by a binary matrix M (cell
M[i,j]=1 iff j belongs to Si), with m rows and n columns. M is C1P if
one can order its columns in such a way that each row has all its
entries 1 consecutives. The size of a row of a binary matrix is the
number of entries 1.
Minimal Conflicting Set and Maximal C1P Set. A Minimal
Conflicting Set (MCS) is a subset of rows of M that is not C1P and
such that every proper subse is C1P. A MaxC1P Set (MC1P) is a subset
of rows of M that is C1P such that adding any other row of M results
in a non C1P matrix. A co-MaxC1P Set (coMC1P) is a subset of rows of M
whose complement is MC1P.
Datasets description. A dataset represents a set of
ancestral syntenies, also called clusters.
- The ground set represents genomic markers (DNA segments) that are
believed to origin from a unique genome segment of an ancestral genome
of interest G.
- Each row represents an ancestral synteny, that is a set of genomic
markers that are believed to have been contiguous in G. Its weight
corresponds to a measure of how these markers are contiguous in
present species (1 means that the ancestral synteny is contiguous in
all sequenced genomes considered in the experiment).
File formats.
For each dataset we considered, we computed all MCS and all
coMC1P.
- MCS are presented first with their size (number of rows), then
the list of rows.
- CoMC1P are presented first with their weight (sum of the
weights of the rows composing the cMC1P), then the list of rows.
We also show a stat file that shows the number of MCS that contain a
given row (count-B) and the number of coMC1P that contain a given row
(count-A). Datasets are presented with one row per line in the file,
the first field being the identifier of the row, the second field its
weight and the remaining fields its entries 1.
Simulated datasets
We considered 10 datasets, each composed of 45 adjacencies (rows of
size 2), on a ground set {1,2 ...,40}. The 39 first adjacencies are
(1,2), (2,3), ..., (39,40). They obviously define a C1P binary
matrix. We then added 6 adjacencies (i,j) with |i-j|>1 (called "false
positives" as they break the C1P property, numbered from 39 to 44, and
of weight 0.5 while true positives have weight 1). Compared to the
other data we study, this is a relatively high rate of false
positives, but it gives a better idea of what happens with
adjacencies. Here are the results:
- DATASET 1: MCS, coMC1P, STATS.
- DATASET 2: MCS, coMC1P, STATS.
- DATASET 3: MCS, coMC1P, STATS.
- DATASET 4: MCS, coMC1P, STATS.
- DATASET 5: MCS, coMC1P, STATS.
- DATASET 6: MCS, coMC1P, STATS.
- DATASET 7: MCS, coMC1P, STATS.
- DATASET 8: MCS, coMC1P, STATS.
- DATASET 9: MCS, coMC1P, STATS.
- DATASET 10: MCS, coMC1P, STATS.
Real datasets (not discussed in the paper)
We considered the data used in the paper Prediction
of contiguous ancestral regions in the amniote ancestral genome
(data available at Companion
Website), and the problem of reconstructing the ancestral
boreoeutherian (placental mammals) genome.
In these data, each element of the ground set represents a genomic
marker that is found, up to small changes, once and only once in a set
of sequenced amniote genomes: Chicken (Ggal), Opposum (Mdom), Cow
(Btau), Dog (Cfam), Horse (Ecab), Rat (Rnor), Mouse (Mmus), Macaca
(Mmul), Orang-Outan (Ppyg), Chimp (Ptro) and Human (Hsap). Here are
the genomic coordinates of these synteny blocks: GENOMIC_MARKERS.
There are 1101 genomic markers.
A boreoeutherian ancestral synteny is a set of markers that is
contiguous in both either the Chicken or Oppossum genome and another
of these amniote genomes, or in a pimate or a rodent genome (Human,
Chimp, Macaca, Orang-Outan, Mouse, Rat) and a carnivore or herbivore
(Cow, Horse, Dog)genome. We
generated binary matrices corresponding to
- ancestral syntenies of size 2 (adjacencies) (DATASET 2)
- ancestral syntenies of size at most 3 (DATASET 3)
- ancestral syntenies of size at most 4 (DATASET 4)
- ancestral syntenies of size at most 5 (DATASET 5)
- unconstrained ancestral syntenies (DATASET 0)
Each matrix is broken down into submatrices, corresponding to a
PQR-tree, and MCS can only be found in the submatrices induced by the
R-nodes of this PQR-tree (see paper Prediction
of contiguous ancestral regions in the amniote ancestral genome
for details). This explains why each matrix defines several disjoint
datasets.
Here are the results (note that not all computations could finish).
- DATASET 2: ALL ANCESTRAL SYNTENIES.
First a few facts about this dataset. It contains 1086 ancestral
syntenies (covering 1099 of the 1101 markers), 18 of which have been
removed to make it C1P. It defines an ancestral genome with 31 genome
segments. Its R-nodes contain 642 markers (more than 50%), 223 of them
belong to at least one MCS.
-
R-NODE 1:
MCS,
coMC1PS,
STATS.
-
R-NODE 2:
MCS,
coMC1PS,
STATS.
-
R-NODE 3:
MCS,
coMC1PS,
STATS.
-
R-NODE 4:
MCS,
coMC1PS,
STATS.
-
R-NODE 5:
MCS,
coMC1PS,
STATS.
-
R-NODE 6:
MCS,
coMC1PS,
STATS.
-
R-NODE 7:
MCS,
coMC1PS,
STATS.
-
R-NODE 8:
MCS,
coMC1PS,
STATS.
-
R-NODE 9:
MCS,
coMC1PS,
STATS.
-
R-NODE 10:
MCS,
coMC1PS,
STATS.
-
R-NODE 11:
MCS,
coMC1PS,
STATS.
-
R-NODE 12:
MCS,
coMC1PS,
STATS.
-
R-NODE 13:
MCS,
coMC1PS,
STATS.
- DATASET 3: ALL ANCESTRAL SYNTENIES. It contains 1272 ancestral
syntenies (covering 1099 of the 1101 markers), 25 of which have been
removed to make it C1P. It defines an ancestral genome with 31 genome
segments. Its R-nodes (for which computating MCS finished) contain 722 markers, 252 of them
belong to at least one MCS.
-
R-NODE 1:
MCS,
coMC1PS,
STATS.
-
R-NODE 2:
MCS,
coMC1PS,
STATS.
-
R-NODE 3:
MCS,
coMC1PS,
STATS.
-
R-NODE 4:
MCS,
coMC1PS,
STATS.
-
R-NODE 5:
MCS,
coMC1PS,
STATS.
-
R-NODE 6: computation not finished
(input files CC, OG).
-
R-NODE 7:
MCS,
coMC1PS,
STATS.
-
R-NODE 8:
MCS,
coMC1PS,
STATS.
-
R-NODE 9:
MCS,
coMC1PS,
STATS.
-
R-NODE 10:
MCS,
coMC1PS,
STATS.
-
R-NODE 11:
MCS,
coMC1PS,
STATS.
-
R-NODE 12:
MCS,
coMC1PS,
STATS.
-
R-NODE 13:
MCS,
coMC1PS,
STATS.
- DATASET 4: ALL ANCESTRAL SYNTENIES.
It contains 1393 ancestral
syntenies (covering 1100 of the 1101 markers), 32 of which have been
removed to make it C1P. It defines an ancestral genome with 30 genome
segments. Its R-nodes (for which computating MCS finished) contain 558 markers, 215 of them
belong to at least one MCS.
-
R-NODE 1: computation not finished
(input files
CC,
OG).
-
R-NODE 2:
MCS,
coMC1PS,
STATS.
-
R-NODE 3:
MCS,
coMC1PS,
STATS.
-
R-NODE 4:
MCS,
coMC1PS,
STATS.
-
R-NODE 5:computation not finished
(input files
CC,
OG).
-
R-NODE 6: computation not finished
(input files
CC,
OG).
-
R-NODE 7:
MCS,
coMC1PS,
STATS.
-
R-NODE 8:
MCS,
coMC1PS,
STATS.
-
R-NODE 9:
MCS,
coMC1PS,
STATS.
-
R-NODE 10: computation not finished
(input files
CC,
OG).
-
R-NODE 11:
MCS,
coMC1PS,
STATS.
-
R-NODE 12:
MCS,
coMC1PS,
STATS.
-
R-NODE 13:
MCS,
coMC1PS,
STATS.
- DATASET 5: ALL ANCESTRAL SYNTENIES.
It contain 1471 ancestral
syntenies (covering 1100 of the 1101 markers), 31 of which have been
removed to make it C1P. It defines an ancestral genome with 30 genome
segments. Its R-nodes (for which computating MCS finished) contain 623 markers, 281 of them
belong to at least one MCS.
-
R-NODE 1:
MCS,
coMC1PS,
STATS.
-
R-NODE 2:
MCS,
coMC1PS,
STATS.
-
R-NODE 3:
MCS,
coMC1PS,
STATS.
-
R-NODE 4: computation not finished
(input files
CC,
OG).
-
R-NODE 5: computation not finished
(input files
CC,
OG).
-
R-NODE 6:
MCS,
coMC1PS,
STATS.
-
R-NODE 7:
MCS,
coMC1PS,
STATS.
-
R-NODE 8:
MCS,
coMC1PS,
STATS.
-
R-NODE 9: computation not finished
(input files
CC,
OG).
-
R-NODE 10:
MCS,
coMC1PS,
STATS.
-
R-NODE 11:
MCS,
coMC1PS,
STATS.
-
R-NODE 12:
MCS,
coMC1PS,
STATS.
-
R-NODE 13:
MCS,
coMC1PS,
STATS.
- DATASET 0: ALL ANCESTRAL SYNTENIES.
It contains 1064 ancestral
syntenies (covering 1100 of the 1101 markers), 17 of which have been
removed to make it C1P. It defines an ancestral genome with 27 genome
segments. Its R-nodes contain 246 markers, 131 of them
belong to at least one MCS.
-
R-NODE 1:
MCS,
coMC1PS,
STATS.
-
R-NODE 2:
MCS,
coMC1PS,
STATS.
-
R-NODE 3:
MCS,
coMC1PS,
STATS.
-
R-NODE 4:
MCS,
coMC1PS,
STATS.
-
R-NODE 5:
MCS,
coMC1PS,
STATS.
-
R-NODE 6:
MCS,
coMC1PS,
STATS.
-
R-NODE 7:
MCS,
coMC1PS,
STATS.
A few comments.
-
The ratio of ancestral syntenies belonging to at least one MCS in the
R-nodes is smaller than in the simulated data, as it is closer to
50%. This justifies the interest to detect these ones in order to
speed-up later computations.
- Increasing the size of the ancestral syntenies seem to augment
significantly the relative number of false positives, or at least of
ancestral syntenies to diacard to have a C1P matrix, but when the
syntenies are unconstrained. However the difference is not very big.
- The profile of R-nodes in terms of MCS vary widely, from small
R-nodes with very few MCS to large ones with thousands of them. The
number of MCS jumps starting at DATASET 3, while it stays quite small
for DATASET 2.
- CoMC1P Sets show that the optimal solution of the optimization
problem to transform a matrix into a C1P one by removing ros has in
general several solutions that are relatively close to the optimal,
but it is rare to find solutions that are almost equal to the
optimal. In general the difference is of at least 0.5 which is
relatively significant.