Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction

This page contains experimental results for the article Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction, to appear in the proceedings of RECOMB-CG 2009.

Preliminaries

The Consecutive Ones Property (C1P). A dataset is a set of m subsets, S0, ..., Sm-1, of a ground set {1,2, ..., n}. Each subset is weighted, with its weight being betwen 0 and 1. Such a dataset can naturally be encoded by a binary matrix M (cell M[i,j]=1 iff j belongs to Si), with m rows and n columns. M is C1P if one can order its columns in such a way that each row has all its entries 1 consecutives. The size of a row of a binary matrix is the number of entries 1.

Minimal Conflicting Set and Maximal C1P Set. A Minimal Conflicting Set (MCS) is a subset of rows of M that is not C1P and such that every proper subse is C1P. A MaxC1P Set (MC1P) is a subset of rows of M that is C1P such that adding any other row of M results in a non C1P matrix. A co-MaxC1P Set (coMC1P) is a subset of rows of M whose complement is MC1P.

Datasets description. A dataset represents a set of ancestral syntenies, also called clusters.

The ground set represents genomic markers (DNA segments) that are believed to origin from a unique genome segment of an ancestral genome of interest G.
Each row represents an ancestral synteny, that is a set of genomic markers that are believed to have been contiguous in G. Its weight corresponds to a measure of how these markers are contiguous in present species (1 means that the ancestral synteny is contiguous in all sequenced genomes considered in the experiment).

File formats. For each dataset we considered, we computed all MCS and all coMC1P.

MCS are presented first with their size (number of rows), then the list of rows.
CoMC1P are presented first with their weight (sum of the weights of the rows composing the cMC1P), then the list of rows.

We also show a stat file that shows the number of MCS that contain a given row (count-B) and the number of coMC1P that contain a given row (count-A). Datasets are presented with one row per line in the file, the first field being the identifier of the row, the second field its weight and the remaining fields its entries 1.

Simulated datasets

We considered 10 datasets, each composed of 45 adjacencies (rows of size 2), on a ground set {1,2 ...,40}. The 39 first adjacencies are (1,2), (2,3), ..., (39,40). They obviously define a C1P binary matrix. We then added 6 adjacencies (i,j) with |i-j|>1 (called "false positives" as they break the C1P property, numbered from 39 to 44, and of weight 0.5 while true positives have weight 1). Compared to the other data we study, this is a relatively high rate of false positives, but it gives a better idea of what happens with adjacencies. Here are the results:

Real datasets (not discussed in the paper)

We considered the data used in the paper Prediction of contiguous ancestral regions in the amniote ancestral genome (data available at Companion Website), and the problem of reconstructing the ancestral boreoeutherian (placental mammals) genome. In these data, each element of the ground set represents a genomic marker that is found, up to small changes, once and only once in a set of sequenced amniote genomes: Chicken (Ggal), Opposum (Mdom), Cow (Btau), Dog (Cfam), Horse (Ecab), Rat (Rnor), Mouse (Mmus), Macaca (Mmul), Orang-Outan (Ppyg), Chimp (Ptro) and Human (Hsap). Here are the genomic coordinates of these synteny blocks: GENOMIC_MARKERS. There are 1101 genomic markers. A boreoeutherian ancestral synteny is a set of markers that is contiguous in both either the Chicken or Oppossum genome and another of these amniote genomes, or in a pimate or a rodent genome (Human, Chimp, Macaca, Orang-Outan, Mouse, Rat) and a carnivore or herbivore (Cow, Horse, Dog)genome. We generated binary matrices corresponding to

ancestral syntenies of size 2 (adjacencies) (DATASET 2)
ancestral syntenies of size at most 3 (DATASET 3)
ancestral syntenies of size at most 4 (DATASET 4)
ancestral syntenies of size at most 5 (DATASET 5)
unconstrained ancestral syntenies (DATASET 0)

Each matrix is broken down into submatrices, corresponding to a PQR-tree, and MCS can only be found in the submatrices induced by the R-nodes of this PQR-tree (see paper Prediction of contiguous ancestral regions in the amniote ancestral genome for details). This explains why each matrix defines several disjoint datasets. Here are the results (note that not all computations could finish).

DATASET 2: ALL ANCESTRAL SYNTENIES. First a few facts about this dataset. It contains 1086 ancestral syntenies (covering 1099 of the 1101 markers), 18 of which have been removed to make it C1P. It defines an ancestral genome with 31 genome segments. Its R-nodes contain 642 markers (more than 50%), 223 of them belong to at least one MCS.
1. R-NODE 1: MCS, coMC1PS, STATS.
2. R-NODE 2: MCS, coMC1PS, STATS.
3. R-NODE 3: MCS, coMC1PS, STATS.
4. R-NODE 4: MCS, coMC1PS, STATS.
5. R-NODE 5: MCS, coMC1PS, STATS.
6. R-NODE 6: MCS, coMC1PS, STATS.
7. R-NODE 7: MCS, coMC1PS, STATS.
8. R-NODE 8: MCS, coMC1PS, STATS.
9. R-NODE 9: MCS, coMC1PS, STATS.
10. R-NODE 10: MCS, coMC1PS, STATS.
11. R-NODE 11: MCS, coMC1PS, STATS.
12. R-NODE 12: MCS, coMC1PS, STATS.
13. R-NODE 13: MCS, coMC1PS, STATS.
DATASET 3: ALL ANCESTRAL SYNTENIES. It contains 1272 ancestral syntenies (covering 1099 of the 1101 markers), 25 of which have been removed to make it C1P. It defines an ancestral genome with 31 genome segments. Its R-nodes (for which computating MCS finished) contain 722 markers, 252 of them belong to at least one MCS.
1. R-NODE 1: MCS, coMC1PS, STATS.
2. R-NODE 2: MCS, coMC1PS, STATS.
3. R-NODE 3: MCS, coMC1PS, STATS.
4. R-NODE 4: MCS, coMC1PS, STATS.
5. R-NODE 5: MCS, coMC1PS, STATS.
6. R-NODE 6: computation not finished (input files CC, OG).
7. R-NODE 7: MCS, coMC1PS, STATS.
8. R-NODE 8: MCS, coMC1PS, STATS.
9. R-NODE 9: MCS, coMC1PS, STATS.
10. R-NODE 10: MCS, coMC1PS, STATS.
11. R-NODE 11: MCS, coMC1PS, STATS.
12. R-NODE 12: MCS, coMC1PS, STATS.
13. R-NODE 13: MCS, coMC1PS, STATS.
DATASET 4: ALL ANCESTRAL SYNTENIES. It contains 1393 ancestral syntenies (covering 1100 of the 1101 markers), 32 of which have been removed to make it C1P. It defines an ancestral genome with 30 genome segments. Its R-nodes (for which computating MCS finished) contain 558 markers, 215 of them belong to at least one MCS.
1. R-NODE 1: computation not finished (input files CC, OG).
2. R-NODE 2: MCS, coMC1PS, STATS.
3. R-NODE 3: MCS, coMC1PS, STATS.
4. R-NODE 4: MCS, coMC1PS, STATS.
5. R-NODE 5:computation not finished (input files CC, OG).
6. R-NODE 6: computation not finished (input files CC, OG).
7. R-NODE 7: MCS, coMC1PS, STATS.
8. R-NODE 8: MCS, coMC1PS, STATS.
9. R-NODE 9: MCS, coMC1PS, STATS.
10. R-NODE 10: computation not finished (input files CC, OG).
11. R-NODE 11: MCS, coMC1PS, STATS.
12. R-NODE 12: MCS, coMC1PS, STATS.
13. R-NODE 13: MCS, coMC1PS, STATS.
DATASET 5: ALL ANCESTRAL SYNTENIES. It contain 1471 ancestral syntenies (covering 1100 of the 1101 markers), 31 of which have been removed to make it C1P. It defines an ancestral genome with 30 genome segments. Its R-nodes (for which computating MCS finished) contain 623 markers, 281 of them belong to at least one MCS.
1. R-NODE 1: MCS, coMC1PS, STATS.
2. R-NODE 2: MCS, coMC1PS, STATS.
3. R-NODE 3: MCS, coMC1PS, STATS.
4. R-NODE 4: computation not finished (input files CC, OG).
5. R-NODE 5: computation not finished (input files CC, OG).
6. R-NODE 6: MCS, coMC1PS, STATS.
7. R-NODE 7: MCS, coMC1PS, STATS.
8. R-NODE 8: MCS, coMC1PS, STATS.
9. R-NODE 9: computation not finished (input files CC, OG).
10. R-NODE 10: MCS, coMC1PS, STATS.
11. R-NODE 11: MCS, coMC1PS, STATS.
12. R-NODE 12: MCS, coMC1PS, STATS.
13. R-NODE 13: MCS, coMC1PS, STATS.
DATASET 0: ALL ANCESTRAL SYNTENIES. It contains 1064 ancestral syntenies (covering 1100 of the 1101 markers), 17 of which have been removed to make it C1P. It defines an ancestral genome with 27 genome segments. Its R-nodes contain 246 markers, 131 of them belong to at least one MCS.
1. R-NODE 1: MCS, coMC1PS, STATS.
2. R-NODE 2: MCS, coMC1PS, STATS.
3. R-NODE 3: MCS, coMC1PS, STATS.
4. R-NODE 4: MCS, coMC1PS, STATS.
5. R-NODE 5: MCS, coMC1PS, STATS.
6. R-NODE 6: MCS, coMC1PS, STATS.
7. R-NODE 7: MCS, coMC1PS, STATS.

A few comments.

The ratio of ancestral syntenies belonging to at least one MCS in the R-nodes is smaller than in the simulated data, as it is closer to 50%. This justifies the interest to detect these ones in order to speed-up later computations.
Increasing the size of the ancestral syntenies seem to augment significantly the relative number of false positives, or at least of ancestral syntenies to diacard to have a C1P matrix, but when the syntenies are unconstrained. However the difference is not very big.
The profile of R-nodes in terms of MCS vary widely, from small R-nodes with very few MCS to large ones with thousands of them. The number of MCS jumps starting at DATASET 3, while it stays quite small for DATASET 2.
CoMC1P Sets show that the optimal solution of the optimization problem to transform a matrix into a C1P one by removing ros has in general several solutions that are relatively close to the optimal, but it is rare to find solutions that are almost equal to the optimal. In general the difference is of at least 0.5 which is relatively significant.