{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Hi-C data analysis\n", "\n", "Welcome to the [Jupyter notebook](http://jupyter.org/) dedicated to Hi-C data analysis. Here we will be working in interactive Python environment with some mixture of bash command line tools. \n", "\n", "Here is the outline of what we are going to do:\n", "\n", "0. [Notebook basics](#basics)\n", "1. [Reads maping](#mapping)\n", "2. [Data filtering](#filtering)\n", "3. [Binning](#binning)\n", "4. [Hi-C data visualisation](#visualisation)\n", "5. [Iterative correction](#correction)\n", "6. [Compartments and TADs](#meta)\n", "\n", "If you have any questions, please, contact Aleksandra Galitsyna (agalitzina@gmail.com)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 0. Notebook basics\n", "\n", "If you are new to Python and Jupyter notebook, please, take a quick look through this small list of tips.\n", "\n", "- First of all, __Jupyter notebook is organised in cells__, which may contain text, comments and code blocks of any size." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is regular Python comment inside Jupyter \"Code\" cell.\n", "# You can easily run \"Hello world\" in the \"Code\" cell (focus on the cell and press Shift+Enter):\n", "print(\"Hello world!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- There are also other types of cells, for example, \"Markdown\". Double click this cell to view raw Markdown markup content.\n", "[comment]: <> (Wow, can you see it? This is Markdown commented line. Please, click Shift+Enter to render Markdown output again.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- You can define functions, classes, run pipelines and visualisations, run thousands of code lines inside a Jupyter cell.\n", "But usually, it is convenient to write simple and clean blocks of code.\n", "\n", "\n", "- Note that behind this interactive notebook you have __regular Python session running__. Thus Python variables are accessible only throughout your history of actions in the notebook. To create a variable, you have to execute the corresponding block of code. All your variables will be lost when you restart the kernel of the notebook. \n", "\n", "- You can pause or stop the kernel, save notebook (.ipynb) file, copy and insert cells via __buttons in the toolbar__. Please, take a look at these useful buttons.\n", "\n", "- Also, try pressing 'Esc' and then 'h'. You will see __shortcuts help__. \n", "\n", "- Jupyter notebook allows you to create __[\"magical\" cells](http://ipython.readthedocs.io/en/stable/interactive/magics.html )__. We will use %%bash, %%capture, %matplotlib. For example, %%bash magic makes it easier to access bash commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "%%bash \n", "echo \"Current directory is: \"; pwd\n", "echo \"List of files in the current directory is: \"; ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- If you are not sure about the function, class or variable then use its name with '?' at the end to get available documentation. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "- If you are not sure about a function, class or variable, add '?' to its name to view the available documentation. Here is an example for the common module numpy:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Module import under a custom name\n", "import numpy as np" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Ask for the module's documentation\n", "np?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "OK, it seems that now we are ready to start our Hi-C data analysis! I've placed a [Go top](#navigation) shortcut in each section so that you can navigate quickly through the notebook." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Reads mapping\n", "[Go top](#navigation)\n", "\n", "#### 1.1 Input raw data \n", "A Hi-C experiment results in paired-end sequencing data, where each read pair represents one possible contact. The analysis starts with the raw sequencing data (.fastq files). \n", "\n", "I've downloaded the raw files from [Flyamer et al. 2017](http://www.nature.com/nature/journal/v544/n7648/full/nature21711.html?foxtrotcallback=true) (GEO ID [GSE80006](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80006)) and placed them in the DATA/FASTQ/ directory. \n", "\n", "We can easily inspect these files with bash. Forward and reverse reads, respectively:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash \n", "head -n 8 '../DATA/FASTQ/K562_B-bulk_R1.fastq'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash \n", "head -n 8 '../DATA/FASTQ/K562_B-bulk_R2.fastq'" ] },
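{ "cell_type": "markdown", "metadata": {}, "source": [ "Each FASTQ record occupies exactly four lines (read ID, sequence, separator, base qualities), so dividing the line count by four gives the number of reads. A quick sanity check, assuming the same file layout as above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "# Count reads: each FASTQ record is exactly 4 lines\n", "echo $(( $(wc -l < ../DATA/FASTQ/K562_B-bulk_R1.fastq) / 4 )) reads" ] },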
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#%%bash\n", "#bowtie2-build /home/jovyan/GENOMES/HG19_FASTA/chr1.fa /home/jovyan/GENOMES/HG19_IND/hg19_chr1\n", "#Time consuming step" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash \n", "ls ../GENOMES/HG19_IND" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3 Iterative mapping\n", "\n", "First of all, we need to import useful Python packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "\n", "from hiclib import mapping\n", "from mirnylib import h5dict, genome" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we need to set some parameters and prepare our environment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash \n", "which bowtie2\n", "# Bowtie 2 path" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "pwd\n", "# Current working directory path" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Setting parameters and environmental variables\n", "bowtie_path = '/opt/conda/bin/bowtie2'\n", "\n", "enzyme = 'DpnII'\n", "\n", "bowtie_index_path = '/home/jovyan/GENOMES/HG19_IND/hg19_chr1'\n", "fasta_path = '/home/jovyan/GENOMES/HG19_FASTA/'\n", "chrms = ['1']\n", "\n", "# Reading the genome\n", "genome_db = genome.Genome(fasta_path, readChrms=chrms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Creating directories for further data processing\n", "\n", "if not os.path.exists('tmp/'):\n", " os.mkdir('tmp/', exists_)\n", "if not os.path.exists('../DATA/SAM/'):\n", " os.mkdir('../DATA/SAM/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Set parameters for iterative mapping\n", "\n", "min_seq_len = 25\n", "len_step = 5\n", "nthreads = 2\n", "temp_dir = 'tmp'\n", "bowtie_flags = '--very-sensitive'\n", "\n", "infile1 = '/home/jovyan/DATA/FASTQ1/K562_B-bulk_R1.fastq'\n", "infile2 = '/home/jovyan/DATA/FASTQ1/K562_B-bulk_R2.fastq'\n", "out1 = '/home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam'\n", "out2 = '/home/jovyan/DATA/SAM/K562_B-bulk_R2.chr1.sam'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Iterative mapping itself. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Iterative mapping itself. Time-consuming step!\n", "\n", "mapping.iterative_mapping(\n", "    bowtie_path = bowtie_path,\n", "    bowtie_index_path = bowtie_index_path,\n", "    fastq_path = infile1,\n", "    out_sam_path = out1,\n", "    min_seq_len = min_seq_len,\n", "    len_step = len_step,\n", "    nthreads = nthreads,\n", "    temp_dir = temp_dir, \n", "    bowtie_flags = bowtie_flags)\n", "\n", "mapping.iterative_mapping(\n", "    bowtie_path = bowtie_path,\n", "    bowtie_index_path = bowtie_index_path,\n", "    fastq_path = infile2,\n", "    out_sam_path = out2,\n", "    min_seq_len = min_seq_len,\n", "    len_step = len_step,\n", "    nthreads = nthreads,\n", "    temp_dir = temp_dir, \n", "    bowtie_flags = bowtie_flags)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the .sam files that were created during iterative mapping:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "ls /home/jovyan/DATA/SAM/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "head -n 10 /home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam.25" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.4 Making sense of mapping output \n", "For each read length and orientation, we have a separate file. Now we need to merge them into a single dataset ([.hdf5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create the directory for output\n", "if not os.path.exists('../DATA/HDF5/'):\n", "    os.mkdir('../DATA/HDF5/')\n", "\n", "# Define the file name for output\n", "out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Open the output file\n", "mapped_reads = h5dict.h5dict(out)\n", "\n", "# Parse mapping data and write to the output file\n", "mapping.parse_sam(\n", "    sam_basename1 = out1,\n", "    sam_basename2 = out2,\n", "    out_dict = mapped_reads,\n", "    genome_db = genome_db,\n", "    enzyme_name = enzyme, \n", "    save_seqs = False,\n", "    keep_ids = False)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the created file:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\n", "ls /home/jovyan/DATA/HDF5/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import h5py\n", "\n", "# Reading the file\n", "a = h5py.File('/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The \"a\" variable has a dictionary-like structure; we can view its keys, for example:\n", "list( a.keys() )" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Mapping positions for forward reads are stored under the 'cuts1' key:\n", "a['cuts1'].value" ] },
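{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that dataset.value is deprecated in recent h5py releases; an equivalent that also closes the file handle afterwards (a good habit before other tools reopen the file) looks like this:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Same data as a['cuts1'].value, via the modern h5py syntax\n", "cuts1 = a['cuts1'][()]\n", "print(cuts1[:10])\n", "\n", "# Close the file before the next processing steps reopen it\n", "a.close()" ] },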
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Data filtering\n", "[Go top](#navigation)\n", "\n", "The raw Hi-C data have now been mapped and parsed; the next step is to filter out possible methodological artefacts:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from hiclib import fragmentHiC" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "inp = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'\n", "out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments_filtered.hdf5'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create the output file\n", "fragments = fragmentHiC.HiCdataset(\n", "    filename = out,\n", "    genome = genome_db,\n", "    maximumMoleculeLength= 500,\n", "    mode = 'w')\n", "\n", "# Parse the input data\n", "fragments.parseInputData(\n", "    dictLike=inp)\n", "\n", "# Filtering\n", "fragments.filterRsiteStart(offset=5)        # remove reads that start too close to a restriction site\n", "fragments.filterDuplicates()                # remove PCR duplicates\n", "fragments.filterLarge()                     # remove too large restriction fragments\n", "fragments.filterExtreme(cutH=0.005, cutL=0) # remove fragments with too high and too low counts" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Some hidden filters were also applied; we can check them all:\n", "fragments.printMetadata()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's arrange the filtering statistics into a more readable table:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df_stat = pd.DataFrame(list(fragments.metadata.items()), columns=['Feature', 'Count'])\n", "df_stat" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Express each count as a percentage of the total (stored in row 2 of the table)\n", "df_stat['Ratio of total'] = 100*df_stat['Count']/df_stat.loc[2,'Count']\n", "df_stat" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Data binning\n", "[Go top](#navigation)\n", "\n", "The previous analysis dealt with interactions between restriction fragments; now we would like to work with interactions between genomic bins." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define the file name for binned data. Note the \"{}\" placeholder prepared for string formatting\n", "out_bin = '/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res_kb = [100, 20] # Several resolutions in kb\n", "\n", "for res in res_kb:\n", "    print(res)\n", "    \n", "    outmap = out_bin.format(str(res)+'kb') # String formatting\n", "\n", "    fragments.saveHeatmap(outmap, res*1000) # Save the heatmap" ] },
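{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (our own addition), we can estimate the size of the resulting contact matrix at each resolution, assuming the genome object exposes chromosome lengths in bp via its chrmLens attribute:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Expected number of bins per resolution: chromosome length divided by bin size\n", "for res in res_kb:\n", "    n_bins = genome_db.chrmLens[0] // (res * 1000) + 1\n", "    print('{} kb resolution: {} x {} matrix'.format(res, n_bins, n_bins))" ] },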
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del fragments # delete the object we no longer need" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Hi-C data visualisation\n", "[Go top](#navigation)\n", "\n", "Let's take a look at the resulting heat maps." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Importing visualisation modules\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set_style('ticks')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from hiclib.binnedData import binnedDataAnalysis" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "res = 100 # Resolution in kb" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Prepare to read the data\n", "data_hic = binnedDataAnalysis(resolution=res*1000, genome=genome_db)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Read the data\n", "data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'.format(str(res)+'kb'),'hic')\n", "mtx = data_hic.dataDict['hic']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Show the heatmap\n", "plt.figure(figsize=[15,15])\n", "plt.imshow(mtx[0:200, 0:200], cmap='jet', interpolation='none')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. Iterative correction\n", "[Go top](#navigation)\n", "\n", "The next typical step is to correct the data for the unequal amplification and accessibility of genomic regions. \n", "We will use iterative correction. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Additional data filtering \n", "data_hic.removeDiagonal()\n", "data_hic.removePoorRegions()\n", "data_hic.removeZeros()\n", "data_hic.iterativeCorrectWithoutSS(force=True)\n", "data_hic.restoreZeros()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx = data_hic.dataDict['hic']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=[15,15])\n", "plt.imshow(mtx[200:500, 200:500], cmap='jet', interpolation='none')" ] },
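{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the idea behind iterative correction concrete, here is a minimal sketch on a toy symmetric matrix (our own illustration; hiclib's implementation additionally handles masked bins, convergence criteria, etc.): row and column biases are divided out repeatedly until all row sums become equal:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch of iterative correction (matrix balancing), illustration only\n", "import numpy as np\n", "\n", "def iterative_correction_sketch(m, n_iter=50):\n", "    m = m.astype(float).copy()\n", "    total_bias = np.ones(len(m))\n", "    for _ in range(n_iter):\n", "        s = m.sum(axis=1)\n", "        s /= s[s != 0].mean()          # relative coverage of each bin\n", "        s[s == 0] = 1                  # leave empty bins untouched\n", "        m /= s[:, None] * s[None, :]   # divide out row and column biases\n", "        total_bias *= s                # accumulate the per-bin bias\n", "    return m, total_bias\n", "\n", "toy = np.array([[10., 4., 2.], [4., 8., 3.], [2., 3., 6.]])\n", "corrected, bias = iterative_correction_sketch(toy)\n", "print(corrected.sum(axis=1))  # row sums converge to a constant" ] },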
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 6. Compartments and TADs\n", "[Go top](#navigation)\n", "\n", "#### 6.1 Comparison with compartments\n", "Compartments can usually be found in whole-genome datasets, but we have only chromosome 1. Still, we can try to find some visual signs of compartments." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load the compartments computed previously for the K562 dataset from Rao et al. 2014\n", "eig = np.loadtxt('/home/jovyan/DATA/ANNOT/comp_K562_100Kb_chr1.tsv')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "eig" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from matplotlib import gridspec" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bgn = 0\n", "end = 500\n", "\n", "fig = plt.figure(figsize=(10,10))\n", "\n", "gs = gridspec.GridSpec(2, 1, height_ratios=[20,2]) \n", "gs.update(wspace=0.0, hspace=0.0)\n", "\n", "ax = plt.subplot(gs[0,0])\n", "\n", "ax.matshow(mtx[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto')\n", "ax.set_xticks([])\n", "ax.set_yticks([])\n", "\n", "axl = plt.subplot(gs[1,0])\n", "plt.plot(range(end-bgn), eig[bgn:end] )\n", "plt.xlim(0, end-bgn)\n", "plt.ylabel('Eigenvector value')\n", "plt.xlabel('Genomic position')\n", "\n", "# Ticks are bin indices; convert them to genomic coordinates (res is in kb)\n", "ticks = range(bgn, end+1, 100)\n", "ticklabels = ['{} Mb'.format(x*res//1000) for x in ticks]\n", "plt.xticks(ticks, ticklabels)\n", "\n", "print('')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There seems to be nothing special about compartments here. What if we had much deeper read coverage? Let's take a look at the dataset from [Rao et al. 2014](https://linkinghub.elsevier.com/retrieve/pii/S0092-8674(14)01497-4), GEO [ GSE63525, HIC069](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx_Rao = np.genfromtxt('../DATA/ANNOT/Rao_K562_chr1.csv', delimiter=',')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bgn = 0\n", "end = 500\n", "\n", "fig = plt.figure(figsize=(10,10))\n", "\n", "gs = gridspec.GridSpec(2, 1, height_ratios=[20,2])\n", "gs.update(wspace=0.0, hspace=0.0)\n", "\n", "ax = plt.subplot(gs[0,0])\n", "\n", "ax.matshow(mtx_Rao[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto', vmax=1000)\n", "ax.set_xticks([])\n", "ax.set_yticks([])\n", "\n", "axl = plt.subplot(gs[1,0])\n", "plt.plot(range(end-bgn), eig[bgn:end] )\n", "plt.xlim(0, end-bgn)\n", "plt.ylabel('Eigenvector value')\n", "plt.xlabel('Genomic position')\n", "\n", "# Ticks are bin indices; convert them to genomic coordinates (res is in kb)\n", "ticks = range(bgn, end+1, 100)\n", "ticklabels = ['{} Mb'.format(x*res//1000) for x in ticks]\n", "plt.xticks(ticks, ticklabels)\n", "\n", "print('')" ] },
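{ "cell_type": "markdown", "metadata": {}, "source": [ "For the curious, here is a rough sketch (our own illustration, not the procedure used to produce the eigenvector loaded above) of how a compartment signal can be derived from a corrected map: observed/expected normalisation, a correlation matrix, and its leading eigenvector:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch of a compartment eigenvector computation (illustration only)\n", "import numpy as np\n", "\n", "def compartment_eig_sketch(m):\n", "    m = m.astype(float).copy()\n", "    n = len(m)\n", "    # Divide each diagonal by its mean -> observed/expected\n", "    # (diagonals with zero mean are left as-is)\n", "    for d in range(n):\n", "        idx = np.arange(n - d)\n", "        diag = m[idx, idx + d]\n", "        if diag.mean() > 0:\n", "            m[idx, idx + d] = m[idx + d, idx] = diag / diag.mean()\n", "    corr = np.corrcoef(m)              # correlation of interaction profiles\n", "    corr[~np.isfinite(corr)] = 0       # zero out empty rows/columns\n", "    vals, vecs = np.linalg.eigh(corr)\n", "    return vecs[:, np.argmax(vals)]    # eigenvector of the largest eigenvalue\n", "\n", "eig_sketch = compartment_eig_sketch(mtx[:500, :500])\n", "print(eig_sketch[:10])" ] },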
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### 6.2 Topologically associating domains (TADs)\n", "\n", "For TAD calling we will use the [lavaburst](https://github.com/nezar-compbio/lavaburst) package. The code below is based on [this example](http://nbviewer.jupyter.org/github/nezar-compbio/lavaburst/blob/master/example/example.ipynb)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import the Python package \n", "import lavaburst" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols where data is missing\n", "\n", "gam = [0.15, 0.25, 0.5, 0.75, 1.0] # Set of gamma parameters for TAD calling\n", "\n", "segments_dict = {}\n", "\n", "for gam_current in gam:\n", "    print(gam_current)\n", "    \n", "    S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)\n", "    model = lavaburst.model.SegModel(S)\n", "    segments = model.optimal_segmentation() # Positions of TADs for the input matrix\n", "\n", "    segments_dict[gam_current] = segments.copy()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "A = mtx.copy()\n", "\n", "good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "At = lavaburst.utils.tilt_heatmap(mtx, n_diags=100)\n", "\n", "start_tmp = 0\n", "end_tmp = 500\n", "\n", "f = plt.figure(figsize=(20, 6))\n", "\n", "ax = f.add_subplot(111)\n", "blues = sns.cubehelix_palette(0.4, gamma=0.5, rot=-0.3, dark=0.1, light=0.9, as_cmap=True)\n", "ax.matshow(np.log(At[start_tmp: end_tmp]), cmap=blues)\n", "\n", "cmap = mpl.cm.get_cmap('brg')\n", "\n", "gammas = sorted(segments_dict.keys())\n", "for n, gamma in enumerate(gammas):\n", "\n", "    segments = segments_dict[gamma]\n", "\n", "    for a in segments[:-1]:\n", "        if a[1] < start_tmp or a[0] > end_tmp: # skip TADs outside the plotted window\n", "            continue\n", "        # float division keeps the colours distinct under Python 2\n", "        ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "        ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "\n", "    a = segments[-1]\n", "    ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/float(len(gammas))), alpha=0.5, label=gamma)\n", "    ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "    \n", "ax.set_xlim([0,end_tmp-start_tmp])\n", "ax.set_ylim([100,-100])\n", "    \n", "ax.legend(bbox_to_anchor=(1.1, 1.05))\n", "ax.set_aspect(0.5)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's check the mean TAD size (in bp) for different gamma values:\n", "\n", "for gam_current in gam:\n", "    segments = segments_dict[gam_current]\n", "    tad_lens = segments[:,1]-segments[:,0]\n", "    good_lens = (tad_lens>=200/res)&(tad_lens<100) # keep TADs between 200 kb and 10 Mb (at 100 kb resolution)\n", "    print(res*1000*np.mean(tad_lens[good_lens]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }