{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Single-cell Hi-C data analysis\n", "\n", "Welcome to the second part of our analysis. Here we will work specifically with single-cell data. \n", "\n", "The outline:\n", "\n", "1. [Reads mapping](#mapping)\n", "2. [Data filtering and binning](#filtering)\n", "3. [Hi-C data visualisation](#visualisation)\n", "4. [Compartments and TADs](#meta)\n", "\n", "If you have any questions, please, contact Aleksandra Galitsyna (agalitzina@gmail.com)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Reads mapping\n", "[Go top](#navigation)\n", "\n", "In this notebook, we will be working with three datasets from [Flyamer et al. 2017](http://www.nature.com/nature/journal/v544/n7648/full/nature21711.html?foxtrotcallback=true) (GEO ID [GSE80006](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80006)) placed in DATA/FASTQ/ directory.\n", "Opposite to previous analysis, now you can work with single-cell Hi-C results. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "\n", "from hiclib import mapping\n", "from mirnylib import h5dict, genome" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bowtie_path = '/opt/conda/bin/bowtie2'\n", "\n", "enzyme = 'DpnII'\n", "\n", "bowtie_index_path = '/home/jovyan/GENOMES/HG19_IND/hg19_chr1'\n", "fasta_path = '/home/jovyan/GENOMES/HG19_FASTA/'\n", "chrms = ['1']\n", "\n", "genome_db = genome.Genome(fasta_path, readChrms=chrms)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "min_seq_len = 120\n", "len_step = 5\n", "nthreads = 2\n", "temp_dir = 'tmp'\n", "bowtie_flags = '--very-sensitive'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we have multiple datasets, thus we have to iterate through the list of files. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "experiment_ids = ['72-sc-1', '54-sc-1', '58-sc-1']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "for exp_id in experiment_ids:\n", " \n", " infile1 = '/home/jovyan/DATA/FASTQ1/K562_{}_R1.fastq'.format(exp_id)\n", " infile2 = '/home/jovyan/DATA/FASTQ1/K562_{}_R2.fastq'.format(exp_id)\n", " out1 = '/home/jovyan/DATA/SAM/K562_{}_R1.chr1.sam'.format(exp_id)\n", " out2 = '/home/jovyan/DATA/SAM/K562_{}_R2.chr1.sam'.format(exp_id)\n", "\n", " mapping.iterative_mapping(\n", " bowtie_path = bowtie_path,\n", " bowtie_index_path = bowtie_index_path,\n", " fastq_path = infile1,\n", " out_sam_path = out1,\n", " min_seq_len = min_seq_len,\n", " len_step = len_step,\n", " nthreads = nthreads,\n", " temp_dir = temp_dir, \n", " bowtie_flags = bowtie_flags)\n", "\n", " mapping.iterative_mapping(\n", " 
bowtie_path = bowtie_path,\n", "        bowtie_index_path = bowtie_index_path,\n", "        fastq_path = infile2,\n", "        out_sam_path = out2,\n", "        min_seq_len = min_seq_len,\n", "        len_step = len_step,\n", "        nthreads = nthreads,\n", "        temp_dir = temp_dir,\n", "        bowtie_flags = bowtie_flags)\n", "\n", "    out = '/home/jovyan/DATA/HDF5/K562_{}.fragments.hdf5'\n", "\n", "    mapped_reads = h5dict.h5dict(out.format(exp_id))\n", "\n", "    mapping.parse_sam(\n", "        sam_basename1 = out1,\n", "        sam_basename2 = out2,\n", "        out_dict = mapped_reads,\n", "        genome_db = genome_db,\n", "        enzyme_name = enzyme,\n", "        save_seqs = False,\n", "        keep_ids = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"filtering\"></a>\n", "## 2. Data filtering and binning\n", "[Go top](#navigation)\n", "\n", "In the case of single-cell analysis, contact filtering is more sophisticated. For example, the standard PCR duplicate filter is replaced by \"filterLessThanDistance\". The code below is based on the hiclib [single-cell scripts for Flyamer et al. 2017](https://bitbucket.org/mirnylab/hiclib/src/85979ac69b55/examples/singleCellScripts/?at=default)."
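The core idea of this filter can be sketched in plain NumPy (a toy model, not the hiclib implementation): two contacts are treated as quasi-duplicates when both of their cut positions fall into the same ~500 bp bins, and only one representative per bin pair is kept. The real filter additionally shifts the positions by multiples of 100 bp and encodes the chromosome/strand combination in the key.

```python
import numpy as np

# Toy contacts: (cut1, cut2) positions on one chromosome/strand combination.
# The values are invented for illustration.
cuts1 = np.array([1000, 1120, 50000, 50010, 90000])
cuts2 = np.array([8000, 8150, 70000, 70020, 95000])

# Bin both sides at 500 bp and build a combined key, as the filter does.
key = (cuts1 // 500).astype(np.int64) * 10000000 + (cuts2 // 500)

# Keep the first contact in each bin pair; the rest are quasi-duplicates.
_, first_idx = np.unique(key, return_index=True)
keep = np.zeros(len(key), dtype=bool)
keep[first_idx] = True

print(keep.sum())  # 3 of 5 contacts survive
```

Here the first two contacts collide (both sides within 500 bp of each other), as do the third and fourth, so three of the five contacts survive. This proximity-based deduplication matters in single cells, where PCR duplicates of a true contact rarely have byte-identical coordinates after sequencing errors and trimming.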
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import h5py\n", "import numpy as np\n", "\n", "from hiclib import fragmentHiC\n", "from hiclib.fragmentHiC import HiCdataset as HiCdatasetorig\n", "from mirnylib.numutils import uniqueIndex\n", "\n", "class HiCdataset(HiCdatasetorig):\n", "    \"Modification of HiCDataset to include all filters\"\n", "\n", "    def filterLessThanDistance(self):\n", "        # This is the old function used to filter \"duplicates\".\n", "        # After the final submission of the manuscript, it was replaced by a better\n", "        # function that does the same, but at bp resolution, not 100 bp.\n", "        M = self.N\n", "        for i in range(5):\n", "            for j in range(5):\n", "                chrStrandID = 10000000 * 10000000 * (np.array(self.chrms1 * (self.strands1 + 1), dtype = np.int64) * 100 + self.chrms2 * (self.strands2 + 1))\n", "                print(len(np.unique(chrStrandID)))\n", "                posid = np.array((self.cuts1 + i * 100) // 500, dtype = np.int64) * 10000000 + (self.cuts2 + j * 100) // 500\n", "                N = self.N\n", "                self.maskFilter(uniqueIndex(posid + chrStrandID))\n", "                print(N, \"filtered to\", self.N)\n", "        self.metadata[\"321_quasiDuplicatesRemoved\"] = M - self.N\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "output = []\n", "\n", "for exp_id in experiment_ids:\n", "\n", "    inp = '/home/jovyan/DATA/HDF5/K562_{}.fragments.hdf5'.format(exp_id)\n", "\n", "    out = '/home/jovyan/DATA/HDF5/K562_{}.tmp.hdf5'.format(exp_id)\n", "    outstat = '/home/jovyan/DATA/HDF5/K562_{}.stat.txt'.format(exp_id)\n", "\n", "    fragments = HiCdataset(\n", "        filename = out,\n", "        genome = genome_db,\n", "        maximumMoleculeLength = 500,\n", "        enzymeName = 1000,\n", "        mode = 'w')\n", "\n", "    fragments.parseInputData(\n", "        dictLike=inp)\n", "\n", "    fragments.filterLessThanDistance()\n", "    fs = fragments.fragmentSum()\n", "    fragments.fragmentFilter(fs < 9)\n", "\n", "    
output.append(list(fragments.metadata.items()))\n", "\n", "    out_bin = '/home/jovyan/DATA/HDF5/K562_{}.binned_{}.hdf5'\n", "\n", "    res_kb = [100, 20]\n", "\n", "    for res in res_kb:\n", "        print(res)\n", "        outmap = out_bin.format(exp_id, str(res)+'kb')\n", "\n", "        fragments.saveHeatmap(outmap, res*1000)\n", "\n", "    del fragments" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"visualisation\"></a>\n", "## 3. Hi-C data visualisation\n", "[Go top](#navigation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from hiclib.binnedData import binnedDataAnalysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "res = 100\n", "data_hic = binnedDataAnalysis(resolution=res*1000, genome=genome_db)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for exp_id in experiment_ids:\n", "    data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_{}.binned_{}.hdf5'.format(exp_id, str(res)+'kb'), exp_id)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set_style('ticks')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "plt.figure(figsize=[10,10])\n", "plt.imshow(data_hic.dataDict['54-sc-1'][200:500, 200:500], cmap='jet', interpolation='none')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison of single-cell and bulk datasets" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'.format(str(res)+'kb'),'bulk')\n", "data_hic.removeDiagonal()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx1 = data_hic.dataDict['bulk']\n", "mtx2 = data_hic.dataDict['54-sc-1']\n", "mtx_tmp = np.triu(mtx1)/np.mean(mtx1) + np.tril(mtx2)/np.mean(mtx2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=[10,10])\n", "plt.imshow(mtx_tmp[200:500, 200:500], cmap='Blues', interpolation='None', vmax=900)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx_merged = sum([data_hic.dataDict[exp_id] for exp_id in experiment_ids]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx1 = data_hic.dataDict['bulk']\n", "mtx2 = mtx_merged\n", "mtx_tmp = np.triu(mtx1)/np.mean(mtx1) + np.tril(mtx2)/np.mean(mtx2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=[10,10])\n", "plt.imshow(mtx_tmp[200:500, 200:500], cmap='Blues', interpolation='None', vmax=800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. 
Compartments and TADs\n", "[Go top](#navigation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from matplotlib import gridspec" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "eig = np.loadtxt('/home/jovyan/DATA/ANNOT/comp_K562_100Kb_chr1.tsv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bgn = 0\n", "end = 500\n", "\n", "fig = plt.figure(figsize=(10,10))\n", "\n", "gs = gridspec.GridSpec(2, 1, height_ratios=[20,2]) #width_ratios=[2,20]\n", "gs.update(wspace=0.0, hspace=0.0)\n", "\n", "ax = plt.subplot(gs[0,0])\n", "\n", "ax.matshow(mtx_tmp[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto')\n", "ax.set_xticks([])\n", "ax.set_yticks([])\n", "\n", "axl = plt.subplot(gs[1,0]) #, sharey=ax\n", "plt.plot(range(end-bgn), eig[bgn:end])\n", "plt.xlim(0, end-bgn)\n", "plt.ylabel('Eigenvector values')\n", "\n", "ticks = range(bgn, end+1, 100)\n", "ticklabels = ['{} Mb'.format(x * res / 1000) for x in ticks]\n", "plt.xticks(ticks, ticklabels)\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Topologically associating domains (TADs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import lavaburst" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mtx = data_hic.dataDict['54-sc-1']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols if data is missing\n", "\n", "gammas = [0.1, 1.0, 5.0, 10.0] # set of gamma parameters for TAD calling\n", "\n", "segments_dict = {}\n", "\n", "for gam_current in gammas:\n", "    print(gam_current)\n", "\n", "    S = 
lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)\n", "    model = lavaburst.model.SegModel(S)\n", "    segments = model.optimal_segmentation() # Positions of TADs for input matrix\n", "\n", "    segments_dict[gam_current] = segments.copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TADs at different parameters for a particular cell (54-sc-1)\n", "\n", "A = mtx.copy()\n", "\n", "good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "At = lavaburst.utils.tilt_heatmap(mtx, n_diags=100)\n", "\n", "start_tmp = 0\n", "end_tmp = 500\n", "\n", "f = plt.figure(figsize=(20, 6))\n", "\n", "ax = f.add_subplot(111)\n", "blues = sns.cubehelix_palette(0.4, gamma=0.5, rot=-0.3, dark=0.1, light=0.9, as_cmap=True)\n", "ax.matshow(np.log(At[start_tmp: end_tmp]), cmap=blues)\n", "\n", "cmap = mpl.cm.get_cmap('brg')\n", "\n", "gammas = sorted(segments_dict.keys())\n", "for n, gamma in enumerate(gammas):\n", "\n", "    segments = segments_dict[gamma]\n", "\n", "    for a in segments[:-1]:\n", "        if a[1] < start_tmp or a[0] > end_tmp:\n", "            continue\n", "        ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "        ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "\n", "    a = segments[-1]\n", "    ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/float(len(gammas))), alpha=0.5, label=gamma)\n", "    ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/float(len(gammas))), alpha=0.5)\n", "\n", "ax.set_xlim([0,end_tmp-start_tmp])\n", "ax.set_ylim([100,-100])\n", "\n", "ax.legend(bbox_to_anchor=(1.1, 1.05))\n", "ax.set_aspect(0.5)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The problem of choosing the free parameter for TAD calling can be resolved via parameter optimization. 
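One simple selection criterion can be sketched with hypothetical numbers (invented for illustration, not computed from the data): scan a grid of gammas and keep the one whose mean TAD length lands closest to the target size.

```python
import numpy as np

# Hypothetical mean TAD sizes (in bins) observed at each gamma; assumed values.
# At 100 kb resolution, 5 bins corresponds to the ~500 kb expectation.
gammas = np.array([0.5, 1.0, 2.0, 4.0])
mean_sizes = np.array([12.0, 8.0, 5.0, 3.0])

target_bins = 500 / 100  # expected TAD size divided by bin size
best_gamma = gammas[np.argmin(np.abs(mean_sizes - target_bins))]
print(best_gamma)  # 2.0
```

The cell below applies exactly this argmin-over-a-gamma-grid idea, except that the mean sizes come from actual lavaburst segmentations of each single-cell matrix.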
For example, the best parameter can be selected so that the mean TAD size matches expectations (~500 Kb in this case). " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optimal_gammas = {}\n", "\n", "for exp_id in experiment_ids:\n", "\n", "    mtx = data_hic.dataDict[exp_id][0:1000, 0:1000]\n", "\n", "    good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols if data is missing\n", "\n", "    gammas = np.arange(2, 24, 1)*1000./3250 # Desired set of gammas for testing (note the float division)\n", "\n", "    means = []\n", "\n", "    for gam_current in gammas:\n", "\n", "        S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)\n", "        model = lavaburst.model.SegModel(S)\n", "        segments = model.optimal_segmentation() # Positions of TADs for input matrix\n", "\n", "        tad_lens = segments[:,1]-segments[:,0]\n", "        good_lens = (tad_lens>=200/res)&(tad_lens<900/res) # We do not consider too large or too small segments as TADs\n", "\n", "        means.append(np.mean(tad_lens[good_lens]))\n", "\n", "    idx = np.argmin(np.abs(np.array(means)-500/res))\n", "    opt_mean, opt_gamma = means[idx], gammas[idx]\n", "\n", "    print(exp_id, opt_mean*res, opt_gamma)\n", "\n", "    optimal_gammas[exp_id] = opt_gamma" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TADs in single cells compared with merged single-cell data\n", "\n", "A = mtx.copy()\n", "\n", "good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "At = lavaburst.utils.tilt_heatmap(mtx_merged, n_diags=100)\n", "\n", "start_tmp = 0\n", "end_tmp = 500\n", "\n", "f = plt.figure(figsize=(20, 6))\n", "\n", "ax = f.add_subplot(111)\n", "ax.matshow(np.log(At[start_tmp: end_tmp]), cmap='Reds')\n", "\n", "for n, exp in enumerate(experiment_ids):\n", "    A = data_hic.dataDict[exp][bgn:end, bgn:end].copy()\n", "\n", "    good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "    gamma = optimal_gammas[exp]\n", "\n", "    S 
= lavaburst.scoring.modularity_score(A, gamma=gamma, binmask=good_bins)\n", "    model = lavaburst.model.SegModel(S)\n", "    segments = model.optimal_segmentation()\n", "\n", "    for a in segments[:-1]:\n", "        if a[1] < start_tmp or a[0] > end_tmp:\n", "            continue\n", "        tad_len = a[1]-a[0]\n", "        if (tad_len<200/res)|(tad_len>=900/res):\n", "            continue\n", "        ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0,\n", "                        facecolor='#6100FF', interpolate=True, alpha=0.2)\n", "\n", "    a = segments[-1]\n", "    tad_len = a[1]-a[0]\n", "    if (tad_len<200/res)|(tad_len>=900/res):\n", "        continue\n", "    ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0,\n", "                    facecolor='#6100FF', interpolate=True, alpha=0.2)\n", "\n", "ax.set_xlim([start_tmp,end_tmp])\n", "ax.set_ylim([100,-100])\n", "\n", "ax.set_aspect(0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TADs in single cells compared with bulk Hi-C data\n", "\n", "A = mtx.copy()\n", "\n", "good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "At = lavaburst.utils.tilt_heatmap(data_hic.dataDict['bulk'], n_diags=100)\n", "\n", "start_tmp = 0\n", "end_tmp = 300\n", "\n", "f = plt.figure(figsize=(20, 6))\n", "\n", "ax = f.add_subplot(111)\n", "ax.matshow(np.log(At[start_tmp: end_tmp]), cmap='Reds')\n", "\n", "for n, exp in enumerate(experiment_ids):\n", "    A = data_hic.dataDict[exp][bgn:end, bgn:end].copy()\n", "\n", "    good_bins = A.astype(bool).sum(axis=0) > 0\n", "\n", "    gamma = optimal_gammas[exp]\n", "\n", "    S = lavaburst.scoring.modularity_score(A, gamma=gamma, binmask=good_bins)\n", "    model = lavaburst.model.SegModel(S)\n", "    segments = model.optimal_segmentation()\n", "\n", "    for a in segments[:-1]:\n", "        if a[1] < start_tmp or a[0] > end_tmp:\n", "            continue\n", "        tad_len = a[1]-a[0]\n", "        if (tad_len<200/res)|(tad_len>=900/res):\n", "            continue\n", "        ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, 
a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0,\n", "                        facecolor='#6100FF', interpolate=True, alpha=0.2)\n", "\n", "    a = segments[-1]\n", "    tad_len = a[1]-a[0]\n", "    if (tad_len<200/res)|(tad_len>=900/res):\n", "        continue\n", "    ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0,\n", "                    facecolor='#6100FF', interpolate=True, alpha=0.2)\n", "\n", "ax.set_xlim([start_tmp,end_tmp])\n", "ax.set_ylim([100,-100])\n", "\n", "ax.set_aspect(0.5)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }