Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction

This page contains experimental results for the article Minimal Conflicting Sets for the Consecutive Ones Property in ancestral genome reconstruction, to appear in the proceedings of RECOMB-CG 2009.

Preliminaries

The Consecutive Ones Property (C1P). A dataset is a set of m subsets, S0, ..., Sm-1, of a ground set {1,2, ..., n}. Each subset is weighted, with its weight being betwen 0 and 1. Such a dataset can naturally be encoded by a binary matrix M (cell M[i,j]=1 iff j belongs to Si), with m rows and n columns. M is C1P if one can order its columns in such a way that each row has all its entries 1 consecutives. The size of a row of a binary matrix is the number of entries 1.

Minimal Conflicting Set and Maximal C1P Set. A Minimal Conflicting Set (MCS) is a subset of rows of M that is not C1P and such that every proper subse is C1P. A MaxC1P Set (MC1P) is a subset of rows of M that is C1P such that adding any other row of M results in a non C1P matrix. A co-MaxC1P Set (coMC1P) is a subset of rows of M whose complement is MC1P.

Datasets description. A dataset represents a set of ancestral syntenies, also called clusters.

File formats. For each dataset we considered, we computed all MCS and all coMC1P.

We also show a stat file that shows the number of MCS that contain a given row (count-B) and the number of coMC1P that contain a given row (count-A). Datasets are presented with one row per line in the file, the first field being the identifier of the row, the second field its weight and the remaining fields its entries 1.

Simulated datasets

We considered 10 datasets, each composed of 45 adjacencies (rows of size 2), on a ground set {1,2 ...,40}. The 39 first adjacencies are (1,2), (2,3), ..., (39,40). They obviously define a C1P binary matrix. We then added 6 adjacencies (i,j) with |i-j|>1 (called "false positives" as they break the C1P property, numbered from 39 to 44, and of weight 0.5 while true positives have weight 1). Compared to the other data we study, this is a relatively high rate of false positives, but it gives a better idea of what happens with adjacencies. Here are the results:
  1. DATASET 1: MCS, coMC1P, STATS.
  2. DATASET 2: MCS, coMC1P, STATS.
  3. DATASET 3: MCS, coMC1P, STATS.
  4. DATASET 4: MCS, coMC1P, STATS.
  5. DATASET 5: MCS, coMC1P, STATS.
  6. DATASET 6: MCS, coMC1P, STATS.
  7. DATASET 7: MCS, coMC1P, STATS.
  8. DATASET 8: MCS, coMC1P, STATS.
  9. DATASET 9: MCS, coMC1P, STATS.
  10. DATASET 10: MCS, coMC1P, STATS.

Real datasets (not discussed in the paper)

We considered the data used in the paper Prediction of contiguous ancestral regions in the amniote ancestral genome (data available at Companion Website), and the problem of reconstructing the ancestral boreoeutherian (placental mammals) genome. In these data, each element of the ground set represents a genomic marker that is found, up to small changes, once and only once in a set of sequenced amniote genomes: Chicken (Ggal), Opposum (Mdom), Cow (Btau), Dog (Cfam), Horse (Ecab), Rat (Rnor), Mouse (Mmus), Macaca (Mmul), Orang-Outan (Ppyg), Chimp (Ptro) and Human (Hsap). Here are the genomic coordinates of these synteny blocks: GENOMIC_MARKERS. There are 1101 genomic markers. A boreoeutherian ancestral synteny is a set of markers that is contiguous in both either the Chicken or Oppossum genome and another of these amniote genomes, or in a pimate or a rodent genome (Human, Chimp, Macaca, Orang-Outan, Mouse, Rat) and a carnivore or herbivore (Cow, Horse, Dog)genome. We generated binary matrices corresponding to
  1. ancestral syntenies of size 2 (adjacencies) (DATASET 2)
  2. ancestral syntenies of size at most 3 (DATASET 3)
  3. ancestral syntenies of size at most 4 (DATASET 4)
  4. ancestral syntenies of size at most 5 (DATASET 5)
  5. unconstrained ancestral syntenies (DATASET 0)
Each matrix is broken down into submatrices, corresponding to a PQR-tree, and MCS can only be found in the submatrices induced by the R-nodes of this PQR-tree (see paper Prediction of contiguous ancestral regions in the amniote ancestral genome for details). This explains why each matrix defines several disjoint datasets. Here are the results (note that not all computations could finish).
  1. DATASET 2: ALL ANCESTRAL SYNTENIES. First a few facts about this dataset. It contains 1086 ancestral syntenies (covering 1099 of the 1101 markers), 18 of which have been removed to make it C1P. It defines an ancestral genome with 31 genome segments. Its R-nodes contain 642 markers (more than 50%), 223 of them belong to at least one MCS.
    1. R-NODE 1: MCS, coMC1PS, STATS.
    2. R-NODE 2: MCS, coMC1PS, STATS.
    3. R-NODE 3: MCS, coMC1PS, STATS.
    4. R-NODE 4: MCS, coMC1PS, STATS.
    5. R-NODE 5: MCS, coMC1PS, STATS.
    6. R-NODE 6: MCS, coMC1PS, STATS.
    7. R-NODE 7: MCS, coMC1PS, STATS.
    8. R-NODE 8: MCS, coMC1PS, STATS.
    9. R-NODE 9: MCS, coMC1PS, STATS.
    10. R-NODE 10: MCS, coMC1PS, STATS.
    11. R-NODE 11: MCS, coMC1PS, STATS.
    12. R-NODE 12: MCS, coMC1PS, STATS.
    13. R-NODE 13: MCS, coMC1PS, STATS.
  2. DATASET 3: ALL ANCESTRAL SYNTENIES. It contains 1272 ancestral syntenies (covering 1099 of the 1101 markers), 25 of which have been removed to make it C1P. It defines an ancestral genome with 31 genome segments. Its R-nodes (for which computating MCS finished) contain 722 markers, 252 of them belong to at least one MCS.
    1. R-NODE 1: MCS, coMC1PS, STATS.
    2. R-NODE 2: MCS, coMC1PS, STATS.
    3. R-NODE 3: MCS, coMC1PS, STATS.
    4. R-NODE 4: MCS, coMC1PS, STATS.
    5. R-NODE 5: MCS, coMC1PS, STATS.
    6. R-NODE 6: computation not finished (input files CC, OG).
    7. R-NODE 7: MCS, coMC1PS, STATS.
    8. R-NODE 8: MCS, coMC1PS, STATS.
    9. R-NODE 9: MCS, coMC1PS, STATS.
    10. R-NODE 10: MCS, coMC1PS, STATS.
    11. R-NODE 11: MCS, coMC1PS, STATS.
    12. R-NODE 12: MCS, coMC1PS, STATS.
    13. R-NODE 13: MCS, coMC1PS, STATS.
  3. DATASET 4: ALL ANCESTRAL SYNTENIES. It contains 1393 ancestral syntenies (covering 1100 of the 1101 markers), 32 of which have been removed to make it C1P. It defines an ancestral genome with 30 genome segments. Its R-nodes (for which computating MCS finished) contain 558 markers, 215 of them belong to at least one MCS.
    1. R-NODE 1: computation not finished (input files CC, OG).
    2. R-NODE 2: MCS, coMC1PS, STATS.
    3. R-NODE 3: MCS, coMC1PS, STATS.
    4. R-NODE 4: MCS, coMC1PS, STATS.
    5. R-NODE 5:computation not finished (input files CC, OG).
    6. R-NODE 6: computation not finished (input files CC, OG).
    7. R-NODE 7: MCS, coMC1PS, STATS.
    8. R-NODE 8: MCS, coMC1PS, STATS.
    9. R-NODE 9: MCS, coMC1PS, STATS.
    10. R-NODE 10: computation not finished (input files CC, OG).
    11. R-NODE 11: MCS, coMC1PS, STATS.
    12. R-NODE 12: MCS, coMC1PS, STATS.
    13. R-NODE 13: MCS, coMC1PS, STATS.
  4. DATASET 5: ALL ANCESTRAL SYNTENIES. It contain 1471 ancestral syntenies (covering 1100 of the 1101 markers), 31 of which have been removed to make it C1P. It defines an ancestral genome with 30 genome segments. Its R-nodes (for which computating MCS finished) contain 623 markers, 281 of them belong to at least one MCS.
    1. R-NODE 1: MCS, coMC1PS, STATS.
    2. R-NODE 2: MCS, coMC1PS, STATS.
    3. R-NODE 3: MCS, coMC1PS, STATS.
    4. R-NODE 4: computation not finished (input files CC, OG).
    5. R-NODE 5: computation not finished (input files CC, OG).
    6. R-NODE 6: MCS, coMC1PS, STATS.
    7. R-NODE 7: MCS, coMC1PS, STATS.
    8. R-NODE 8: MCS, coMC1PS, STATS.
    9. R-NODE 9: computation not finished (input files CC, OG).
    10. R-NODE 10: MCS, coMC1PS, STATS.
    11. R-NODE 11: MCS, coMC1PS, STATS.
    12. R-NODE 12: MCS, coMC1PS, STATS.
    13. R-NODE 13: MCS, coMC1PS, STATS.
  5. DATASET 0: ALL ANCESTRAL SYNTENIES. It contains 1064 ancestral syntenies (covering 1100 of the 1101 markers), 17 of which have been removed to make it C1P. It defines an ancestral genome with 27 genome segments. Its R-nodes contain 246 markers, 131 of them belong to at least one MCS.
    1. R-NODE 1: MCS, coMC1PS, STATS.
    2. R-NODE 2: MCS, coMC1PS, STATS.
    3. R-NODE 3: MCS, coMC1PS, STATS.
    4. R-NODE 4: MCS, coMC1PS, STATS.
    5. R-NODE 5: MCS, coMC1PS, STATS.
    6. R-NODE 6: MCS, coMC1PS, STATS.
    7. R-NODE 7: MCS, coMC1PS, STATS.
A few comments.