{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Single-cell Hi-C data analysis\n",
"\n",
"Welcome to the second part of our analysis. Here we will work specifically with single-cell data. \n",
"\n",
"The outline:\n",
"\n",
"1. [Reads mapping](#mapping)\n",
"2. [Data filtering and binning](#filtering)\n",
"3. [Hi-C data visualisation](#visualisation)\n",
"4. [Compartments and TADs](#meta)\n",
"\n",
"If you have any questions, please, contact Aleksandra Galitsyna (agalitzina@gmail.com)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 1. Reads mapping\n",
"[Go top](#navigation)\n",
"\n",
"In this notebook, we will be working with three datasets from [Flyamer et al. 2017](http://www.nature.com/nature/journal/v544/n7648/full/nature21711.html?foxtrotcallback=true) (GEO ID [GSE80006](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80006)) placed in DATA/FASTQ/ directory.\n",
"Opposite to previous analysis, now you can work with single-cell Hi-C results. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"from hiclib import mapping\n",
"from mirnylib import h5dict, genome"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bowtie_path = '/opt/conda/bin/bowtie2'\n",
"\n",
"enzyme = 'DpnII'\n",
"\n",
"bowtie_index_path = '/home/jovyan/GENOMES/HG19_IND/hg19_chr1'\n",
"fasta_path = '/home/jovyan/GENOMES/HG19_FASTA/'\n",
"chrms = ['1']\n",
"\n",
"genome_db = genome.Genome(fasta_path, readChrms=chrms)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"min_seq_len = 120\n",
"len_step = 5\n",
"nthreads = 2\n",
"temp_dir = 'tmp'\n",
"bowtie_flags = '--very-sensitive'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we have multiple datasets, thus we have to iterate through the list of files. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"experiment_ids = ['72-sc-1', '54-sc-1', '58-sc-1']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"for exp_id in experiment_ids:\n",
" \n",
" infile1 = '/home/jovyan/DATA/FASTQ1/K562_{}_R1.fastq'.format(exp_id)\n",
" infile2 = '/home/jovyan/DATA/FASTQ1/K562_{}_R2.fastq'.format(exp_id)\n",
" out1 = '/home/jovyan/DATA/SAM/K562_{}_R1.chr1.sam'.format(exp_id)\n",
" out2 = '/home/jovyan/DATA/SAM/K562_{}_R2.chr1.sam'.format(exp_id)\n",
"\n",
" mapping.iterative_mapping(\n",
" bowtie_path = bowtie_path,\n",
" bowtie_index_path = bowtie_index_path,\n",
" fastq_path = infile1,\n",
" out_sam_path = out1,\n",
" min_seq_len = min_seq_len,\n",
" len_step = len_step,\n",
" nthreads = nthreads,\n",
" temp_dir = temp_dir, \n",
" bowtie_flags = bowtie_flags)\n",
"\n",
" mapping.iterative_mapping(\n",
" bowtie_path = bowtie_path,\n",
" bowtie_index_path = bowtie_index_path,\n",
" fastq_path = infile2,\n",
" out_sam_path = out2,\n",
" min_seq_len = min_seq_len,\n",
" len_step = len_step,\n",
" nthreads = nthreads,\n",
" temp_dir = temp_dir, \n",
" bowtie_flags = bowtie_flags)\n",
" \n",
" out = '/home/jovyan/DATA/HDF5/K562_{}.fragments.hdf5'\n",
"\n",
" mapped_reads = h5dict.h5dict(out.format(exp_id))\n",
"\n",
" mapping.parse_sam(\n",
" sam_basename1 = out1,\n",
" sam_basename2 = out2,\n",
" out_dict = mapped_reads,\n",
" genome_db = genome_db,\n",
" enzyme_name = enzyme, \n",
" save_seqs = False,\n",
" keep_ids = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 2. Data filtering and binning\n",
"[Go top](#navigation)\n",
"\n",
"In case of single cell analysis, the contact filtering is much more sophisticated. For example, standard PCR duplicates filter is replaced by \"filterLessThanDistance\". The code below is based on hiclib [single cell scripts for Flyamer et al. 2017](https://bitbucket.org/mirnylab/hiclib/src/85979ac69b55/examples/singleCellScripts/?at=default)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np\n",
"\n",
"from hiclib import fragmentHiC\n",
"from hiclib.fragmentHiC import HiCdataset as HiCdatasetorig\n",
"from mirnylib.numutils import uniqueIndex\n",
"\n",
"class HiCdataset(HiCdatasetorig):\n",
" \"Modification of HiCDataset to include all filters\"\n",
"\n",
" def filterLessThanDistance(self):\n",
" # This is the old function used to filter \"duplicates\".\n",
" #After the final submission of the manuscript, It was replaced by a better function that does the same,\n",
" #but at bp resolution, not 100 bp.\n",
" M = self.N\n",
" for i in range(5):\n",
" for j in range(5):\n",
" chrStrandID = 10000000 * 10000000 * (np.array(self.chrms1 * (self.strands1 + 1), dtype = np.int64) * 100 + self.chrms2 * (self.strands2 + 1))\n",
" print(len(np.unique(chrStrandID)))\n",
" posid = np.array((self.cuts1 + i * 100) // 500, dtype = np.int64) * 10000000 + (self.cuts2 + i * 100) // 500\n",
" N = self.N\n",
" self.maskFilter(uniqueIndex(posid + chrStrandID))\n",
" print(N, \"filtered to\", self.N)\n",
" self.metadata[\"321_quasiDuplicatesRemoved\"] = M - self.N\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"output = []\n",
"\n",
"for exp_id in experiment_ids:\n",
" \n",
" inp = '/home/jovyan/DATA/HDF5/K562_{}.fragments.hdf5'.format(exp_id)\n",
" \n",
" out = '/home/jovyan/DATA/HDF5/K562_{}.tmp.hdf5'.format(exp_id)\n",
" outstat = '/home/jovyan/DATA/HDF5/K562_{}.stat.txt'.format(exp_id)\n",
" \n",
" fragments = HiCdataset(\n",
" filename = out,\n",
" genome = genome_db,\n",
" maximumMoleculeLength= 500,\n",
" enzymeName = 1000,\n",
" mode = 'w')\n",
"\n",
" fragments.parseInputData(\n",
" dictLike=inp)\n",
"\n",
" fragments.filterLessThanDistance() \n",
" fs = fragments.fragmentSum()\n",
" fragments.fragmentFilter(fs < 9)\n",
"\n",
" output.append(list(fragments.metadata.items()))\n",
" \n",
" out_bin = '/home/jovyan/DATA/HDF5/K562_{}.binned_{}.hdf5'\n",
"\n",
" res_kb = [100, 20]\n",
" \n",
" for res in res_kb:\n",
" print(res)\n",
" outmap = out_bin.format(exp_id, str(res)+'kb')\n",
"\n",
" fragments.saveHeatmap(outmap, res*1000)\n",
" \n",
" del fragments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 4. Hi-C data visualisation\n",
"[Go top](#navigation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from hiclib.binnedData import binnedDataAnalysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"res = 100\n",
"data_hic = binnedDataAnalysis(resolution=res*1000, genome=genome_db)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for exp_id in experiment_ids:\n",
" data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_{}.binned_{}.hdf5'.format(exp_id, str(res)+'kb'), exp_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"sns.set_style('ticks')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"plt.figure(figsize=[10,10])\n",
"plt.imshow(data_hic.dataDict['54-sc-1'][200:500, 200:500], cmap='jet', interpolation='None')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison of single-cell and bulk datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'.format(str(res)+'kb'),'bulk')\n",
"data_hic.removeDiagonal()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mtx1 = data_hic.dataDict['bulk']\n",
"mtx2 = data_hic.dataDict['54-sc-1']\n",
"mtx_tmp = np.triu(mtx1)/np.mean(mtx1) + np.tril(mtx2)/np.mean(mtx2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(figsize=[10,10])\n",
"plt.imshow(mtx_tmp[200:500, 200:500], cmap='Blues', interpolation='None', vmax=900)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mtx_merged = sum([data_hic.dataDict[exp_id] for exp_id in experiment_ids]) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mtx1 = data_hic.dataDict['bulk']\n",
"mtx2 = mtx_merged\n",
"mtx_tmp = np.triu(mtx1)/np.mean(mtx1) + np.tril(mtx2)/np.mean(mtx2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.figure(figsize=[10,10])\n",
"plt.imshow(mtx_tmp[200:500, 200:500], cmap='Blues', interpolation='None', vmax=800)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 5. Compartmanets and TADs\n",
"[Go top](#navigation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from matplotlib import gridspec"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"eig = np.loadtxt('/home/jovyan/DATA/ANNOT/comp_K562_100Kb_chr1.tsv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"bgn = 0\n",
"end = 500\n",
"\n",
"fig = plt.figure(figsize=(10,10))\n",
"\n",
"gs = gridspec.GridSpec(2, 1, height_ratios=[20,2])#width_ratios=[2,20], \n",
"gs.update(wspace=0.0, hspace=0.0)\n",
"\n",
"ax = plt.subplot(gs[0,0])\n",
"\n",
"ax.matshow(mtx_tmp[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto')\n",
"ax.set_xticks([])\n",
"ax.set_yticks([])\n",
"\n",
"axl = plt.subplot(gs[1,0])#, sharey=ax)\n",
"plt.plot(range(end-bgn), eig[bgn:end] )\n",
"plt.xlim(0, end-bgn)\n",
"plt.xlabel('Eigenvector values')\n",
"\n",
"ticks = range(bgn, end+1, 100)\n",
"ticklabels = ['{} Kb'.format(x) for x in ticks]\n",
"plt.xticks(ticks, ticklabels)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Topologically associating domains (TADs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import lavaburst"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mtx = data_hic.dataDict['54-sc-1']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols if data is missing\n",
"\n",
"gammas=[0.1, 1.0, 5.0, 10.0] # set of parameters gamma for TADs calling\n",
"\n",
"segments_dict = {}\n",
"\n",
"for gam_current in gammas:\n",
" print(gam_current)\n",
" \n",
" S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)\n",
" model = lavaburst.model.SegModel(S)\n",
" segments = model.optimal_segmentation() # Positions of TADs for input matrix\n",
"\n",
" segments_dict[gam_current] = segments.copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# TADs at different parameters for particular cell (54-sc-1)\n",
"\n",
"A = mtx.copy()\n",
"\n",
"good_bins = A.astype(bool).sum(axis=0) > 0\n",
"\n",
"At = lavaburst.utils.tilt_heatmap(mtx, n_diags=100)\n",
"\n",
"start_tmp = 0\n",
"end_tmp = 500\n",
"\n",
"f = plt.figure(figsize=(20, 6))\n",
"\n",
"ax = f.add_subplot(111)\n",
"blues = sns.cubehelix_palette(0.4, gamma=0.5, rot=-0.3, dark=0.1, light=0.9, as_cmap=True)\n",
"ax.matshow(np.log(At[start_tmp: end_tmp]), cmap=blues)\n",
"\n",
"cmap = mpl.cm.get_cmap('brg')\n",
"\n",
"gammas = segments_dict.keys()\n",
"for n, gamma in enumerate(gammas):\n",
"\n",
" segments = segments_dict[gamma]\n",
"\n",
" for a in segments[:-1]:\n",
" if a[1]end_tmp:\n",
" continue\n",
" ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5)\n",
" ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)\n",
"\n",
" a = segments[-1]\n",
" ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5, label=gamma)\n",
" ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)\n",
" \n",
"ax.set_xlim([0,end_tmp-start_tmp])\n",
"ax.set_ylim([100,-100])\n",
" \n",
"ax.legend(bbox_to_anchor=(1.1, 1.05))\n",
"ax.set_aspect(0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The problem of the variable parameter for TADs calling might be resolved via parameter optimization. For example, the best parameter could be selected based on mean TADs size fitting expectations (~ 500 Kb in this case). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"optimal_gammas = {}\n",
"\n",
"for exp_id in experiment_ids:\n",
" \n",
" mtx = data_hic.dataDict[exp_id][0:1000, 0:1000]\n",
" \n",
" good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols if data is missing\n",
"\n",
" gammas = np.arange(2, 24, 1)*1000/3250 # Desired set of gammas for testing\n",
"\n",
" means = []\n",
" \n",
" for gam_current in gammas:\n",
"\n",
" S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)\n",
" model = lavaburst.model.SegModel(S)\n",
" segments = model.optimal_segmentation() # Positions of TADs for input matrix\n",
"\n",
" tad_lens = segments[:,1]-segments[:,0]\n",
" good_lens = (tad_lens>=200/res)&(tad_lens<900) # We do not consider too large or too small segments as TADs\n",
" \n",
" means.append(np.mean(tad_lens[good_lens]))\n",
" \n",
" idx = np.argmin(np.abs(np.array(means)-500/res))\n",
" opt_mean, opt_gamma = means[idx], gammas[idx]\n",
" \n",
" print(exp_id, opt_mean*res, opt_gamma)\n",
" \n",
" optimal_gammas[exp_id] = opt_gamma"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# TADs in single cells compared with merged single-cell data\n",
"\n",
"A = mtx.copy()\n",
"\n",
"good_bins = A.astype(bool).sum(axis=0) > 0\n",
"\n",
"At = lavaburst.utils.tilt_heatmap(mtx_merged, n_diags=100)\n",
"\n",
"start_tmp = 0\n",
"end_tmp = 500\n",
"\n",
"f = plt.figure(figsize=(20, 6))\n",
"\n",
"ax = f.add_subplot(111)\n",
"ax.matshow(np.log(At[start_tmp: end_tmp]), cmap='Reds')\n",
"\n",
"for n, exp in enumerate(experiment_ids):\n",
" A = data_hic.dataDict[exp][bgn:end, bgn:end].copy()\n",
"\n",
" good_bins = A.astype(bool).sum(axis=0) > 0\n",
"\n",
" gamma = optimal_gammas[exp]\n",
" \n",
" S = lavaburst.scoring.modularity_score(A, gamma=gamma, binmask=good_bins)\n",
" model = lavaburst.model.SegModel(S)\n",
" segments = model.optimal_segmentation()\n",
"\n",
" for a in segments[:-1]:\n",
" if a[1]end_tmp:\n",
" continue\n",
" tad_len = a[1]-a[0]\n",
" if (tad_len<200/res)|(tad_len>=900):\n",
" continue\n",
" ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0, \n",
" facecolor='#6100FF', interpolate=True, alpha=0.2)\n",
"\n",
" a = segments[-1]\n",
" tad_len = a[1]-a[0]\n",
" if (tad_len<200/res)|(tad_len>=900):\n",
" continue\n",
" ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0, \n",
" facecolor='#6100FF', interpolate=True, alpha=0.2)\n",
"\n",
"ax.set_xlim([start_tmp,end_tmp])\n",
"ax.set_ylim([100,-100])\n",
"\n",
"ax.set_aspect(0.5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# TADs in single cells compared with bulk Hi-C data\n",
"\n",
"A = mtx.copy()\n",
"\n",
"good_bins = A.astype(bool).sum(axis=0) > 0\n",
"\n",
"At = lavaburst.utils.tilt_heatmap(data_hic.dataDict['bulk'], n_diags=100)\n",
"\n",
"start_tmp = 0\n",
"end_tmp = 300\n",
"\n",
"f = plt.figure(figsize=(20, 6))\n",
"\n",
"ax = f.add_subplot(111)\n",
"ax.matshow(np.log(At[start_tmp: end_tmp]), cmap='Reds')\n",
"\n",
"for n, exp in enumerate(experiment_ids):\n",
" A = data_hic.dataDict[exp][bgn:end, bgn:end].copy()\n",
"\n",
" good_bins = A.astype(bool).sum(axis=0) > 0\n",
"\n",
" gamma = optimal_gammas[exp]\n",
" \n",
" S = lavaburst.scoring.modularity_score(A, gamma=gamma, binmask=good_bins)\n",
" model = lavaburst.model.SegModel(S)\n",
" segments = model.optimal_segmentation()\n",
"\n",
" for a in segments[:-1]:\n",
" if a[1]end_tmp:\n",
" continue\n",
" tad_len = a[1]-a[0]\n",
" if (tad_len<200/res)|(tad_len>=900):\n",
" continue\n",
" ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0, \n",
" facecolor='#6100FF', interpolate=True, alpha=0.2)\n",
"\n",
" a = segments[-1]\n",
" tad_len = a[1]-a[0]\n",
" if (tad_len<200/res)|(tad_len>=900):\n",
" continue\n",
" ax.fill_between([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [0, -(a[1]-a[0]), 0], 0, \n",
" facecolor='#6100FF', interpolate=True, alpha=0.2)\n",
"\n",
"ax.set_xlim([start_tmp,end_tmp])\n",
"ax.set_ylim([100,-100])\n",
"\n",
"ax.set_aspect(0.5)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}