Analysis review§
Overview§
This notebook follows the tutorial by mousepixels/sanbomics, which has an accompanying screencast.
Analysis is illustrated with single-nucleus RNA sequencing data from the following paper [MBH+21]
Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al. A molecular single-cell lung atlas of lethal COVID-19. Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0
This paper examined about 116,000 nuclei from the lungs of nineteen patients who died of COVID-19 and underwent autopsy. Findings reported in the abstract of the paper include:
activated monocyte-derived macrophages and alveolar macrophages
impaired T cell activation
monocyte/macrophage-derived interleukin-1β and epithelial cell-derived interleukin-6
alveolar type 2 cells adopted an inflammation-associated transient progenitor cell state and failed to undergo full transition into alveolar type 1 cells
expansion of CTHRC1+ pathological fibroblasts
protein activity and ligand–receptor interactions suggest putative drug targets
This notebook makes extensive use of scanpy [WAT18] and scvi-tools [LRC+18], including updates that have been made to these packages since their initial publication.
Setup§
Import libraries§
[1]:
from inspect import getmembers
from pprint import pprint
from types import FunctionType
from datetime import datetime
import os
import numpy as np
import pickle
import pandas as pd
import scanpy as sc
from scipy.sparse import csr_matrix
import scvi
import seaborn as sns
Global seed set to 0
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
new_rank_zero_deprecation(
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
return new_rank_zero_deprecation(*args, **kwargs)
Setup plotting§
[2]:
import matplotlib.font_manager
import matplotlib.pyplot as plt
# import matplotlib_inline
[3]:
# fonts_path = "/usr/share/texmf/fonts/opentype/public/lm/" #ubuntu
# fonts_path = "~/Library/Fonts/" # macos
fonts_path = "/usr/share/fonts/OTF/" # arch
# user_path = "$HOME/" # user
# fonts_path = user_path + "fonts/latinmodern/opentype/public/lm/" # home
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmsans10-regular.otf")
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmroman10-regular.otf")
[4]:
# https://stackoverflow.com/a/36622238/446907
%config InlineBackend.figure_formats = ['svg']
[5]:
plt.style.use("default") # reset default parameters
# https://stackoverflow.com/a/3900167/446907
plt.rcParams.update(
{
"font.size": 16,
"font.family": ["sans-serif"],
"font.serif": ["Latin Modern Roman"] + plt.rcParams["font.serif"],
"font.sans-serif": ["Latin Modern Sans"] + plt.rcParams["font.sans-serif"],
}
)
Utility functions§
[6]:
def attributes(obj):
"""
get object attributes
"""
disallowed_names = {
name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
}
return {
name: getattr(obj, name)
for name in dir(obj)
if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
}
def print_attributes(obj):
"""
print object attributes
"""
pprint(attributes(obj))
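As a quick check of the filtering logic in attributes (exclude any name that corresponds to a method on the class), the same pattern can be exercised on a hypothetical toy class:

```python
from inspect import getmembers
from types import FunctionType

class Point:  # hypothetical toy class, not part of the analysis
    def __init__(self):
        self.x, self.y = 1, 2

    def norm(self):
        return (self.x**2 + self.y**2) ** 0.5

p = Point()
# names of functions defined on the class (methods), to be excluded
disallowed = {n for n, v in getmembers(type(p)) if isinstance(v, FunctionType)}
attrs = {n: getattr(p, n) for n in dir(p) if n[0] != "_" and n not in disallowed}
print(attrs)  # {'x': 1, 'y': 2} -- the norm method is excluded
```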
Set output folder for session§
[7]:
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
timestamp = "20220930_215746"  # override to reuse an existing session's output directory
[8]:
output_directory = f"output/{timestamp}"
[9]:
if not os.path.exists(output_directory):
os.makedirs(output_directory)
print(f"created {output_directory}")
[10]:
print(f"using {output_directory}")
using output/20220930_215746
Import data§
Here we review how the data were downloaded, and proceed to import and inspect the data.
Data download§
Data with GEO accession GSE171524 was downloaded using ./data/download_geo_data.sh with parameters
./download_geo_data.sh \
-a GSE171524 \
-f 'ftp.*RAW.*' \
-j '..|.supplementary_files?|..|.url?|select(length>0)'
A skeleton of this script that may work in this case is
#!/usr/bin/env bash
#-- debugging (comment to reduce stderr output)
#-- https://wiki.bash-hackers.org/scripting/debuggingtips
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
set -o xtrace
# get metadata
# Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al.
# A molecular single-cell lung atlas of lethal COVID-19.
# Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0
# GSE171524
ffq -l 1 -o GSE171524.json GSE171524
# download raw data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/GSE171524_RAW.tar
# list contents
tar -tvf GSE171524_RAW.tar
# untar
mkdir -p GSE171524 && \
tar -xvf GSE171524_RAW.tar -C GSE171524
Data load§
See the documentation for scanpy's read_csv, which returns an AnnData object.
[11]:
adata = None
adata = sc.read_csv(
"data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[11]:
AnnData object with n_obs × n_vars = 6099 × 34546
Note the scanpy.read_csv function accepts gzipped files.
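The transpose in the cell above is needed because the GEO csv stores genes in rows and cell barcodes in columns, while AnnData expects observations (cells) in rows. A minimal sketch with a hypothetical miniature of the same layout:

```python
import pandas as pd

# Hypothetical miniature of the csv layout: rows are genes, columns are barcodes.
counts = pd.DataFrame(
    [[0, 2], [1, 0], [3, 5]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["BARCODE_1", "BARCODE_2"],
)

# Transposing puts cells in rows, matching sc.read_csv(...).T above.
cells_by_genes = counts.T
print(cells_by_genes.shape)  # (2, 3): 2 cells x 3 genes
```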
Data properties§
[12]:
type(adata)
[12]:
anndata._core.anndata.AnnData
[13]:
adata.X.shape
[13]:
(6099, 34546)
[14]:
print_attributes(adata)
{'T': AnnData object with n_obs × n_vars = 34546 × 6099,
'X': array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
'file': Backing file manager: no file is set.,
'filename': None,
'is_view': False,
'isbacked': False,
'isview': False,
'layers': Layers with keys: ,
'n_obs': 6099,
'n_vars': 34546,
'obs': Empty DataFrame
Columns: []
Index: [TAGGTACCATGGCCAC-1_1, ATTCACTGTAACAGGC-1_1, TAACTTCCAACCACGC-1_1, TTGGGTACACGACAAG-1_1, AGGCCACAGAGTCACG-1_1, CACTGAAGTCGAAGCA-1_1, ACTGATGTCTGCACCT-1_1, TTACCGCCACTCAGAT-1_1, TTGGTTTTCCTAGCTC-1_1, TGGGAAGTCAGTGATC-1_1, CCACGAGTCTCTTAAC-1_1, ACTTCCGCACAACGCC-1_1, GGGAAGTAGCGACCCT-1_1, TGGTAGTTCCCGTGTT-1_1, CGCATAACATGCCGGT-1_1, TCTATCACAAGGCTTT-1_1, ATCCACCAGAGGTATT-1_1, TAACGACAGATGACCG-1_1, TCTTAGTGTATGAGGC-1_1, CACTTCGCAGTACTAC-1_1, GTCAAACAGAACGTGC-1_1, GCAACCGAGGGCAGGA-1_1, CATACTTTCATCACTT-1_1, AAGAACATCGGATTAC-1_1, GGGTATTGTACGATGG-1_1, CTGTAGATCAACGTGT-1_1, GTCATTTGTATCTCGA-1_1, CCTTGTGCAGAGGGTT-1_1, AAGTTCGCAACACGTT-1_1, TCATTCACAAATCAAG-1_1, TCCATGCCAACGACTT-1_1, TCCTTCTCAGTTTCAG-1_1, TGTGAGTCAAATGATG-1_1, AAACGAAGTACAGAGC-1_1, CAACCAAAGTATTCCG-1_1, CTTCTCTCAGAGACTG-1_1, TACAACGGTGGCTGAA-1_1, AACGGGACATGCCGGT-1_1, AACCAACGTTGGGAAC-1_1, TATATCCAGCGTCAGA-1_1, AGACAAACATCCCGTT-1_1, ATGACCAGTCTTCATT-1_1, CTTACCGTCAGACATC-1_1, CGGGACTGTTAGTTCG-1_1, ATTCATCCACTGAGTT-1_1, TCATGAGAGAGGCGGA-1_1, TCCCACATCTAGTACG-1_1, CTTCCTTCATATCTCT-1_1, CCGGACACACTCGATA-1_1, ACACGCGCACCTGTCT-1_1, GAATCGTCAGAAGTGC-1_1, GGTGGCTCAAGCTCTA-1_1, CCTGCATCACATATGC-1_1, GTGGGAAGTTAAAGTG-1_1, CGTTCTGGTACTAGCT-1_1, ACCCTCACAATAGTCC-1_1, GCCCGAACAAACTAAG-1_1, GTGGAAGCACATGACT-1_1, GTTGTGACATCGATAC-1_1, GACAGCCCAGGTCCGT-1_1, TAAGCACGTTGGCTAT-1_1, GGGACAAGTCACCACG-1_1, CTGGCAGGTTCGGTAT-1_1, GACTCAACACTGTGAT-1_1, GCCAGTGGTGTGGTCC-1_1, TCTAACTGTAGGCAGT-1_1, GAAGAATGTAGCTTGT-1_1, TCACTCGCAATCTCTT-1_1, CAAGACTTCCCACAGG-1_1, CAGATACGTGACTCTA-1_1, TGGGATTAGAGGGTCT-1_1, ACCTGAACACTCCTTG-1_1, GACACGCCACTCGATA-1_1, CTCATCGTCACCGCTT-1_1, AGGTAGGGTCCCTGTT-1_1, TACCCGTCAACACTAC-1_1, TGCTGAAAGACGGATC-1_1, ACACCAACACAACGCC-1_1, AAGATAGCAAATGGAT-1_1, CTTCAATGTGACAGGT-1_1, GAAGCGAAGAGTTGAT-1_1, GCCGTGACACAAGCCC-1_1, CCTCAACCATACAGGG-1_1, ACAAAGATCCACAGGC-1_1, CAGATACAGTCCCAGC-1_1, GGCAGTCTCCGGTTCT-1_1, TAAGTCGAGCTGAGCA-1_1, GAGACCCGTCTGTGCG-1_1, TAACACGCATGTGTCA-1_1, TCAATTCGTTCTCGCT-1_1, 
GCTTTCGCACAGTGTT-1_1, AACCAACAGATAACAC-1_1, ATCGGCGCACATCATG-1_1, TCATCCGCACGAGGAT-1_1, CTGATCCTCTTTACAC-1_1, TCACACCCAACTTCTT-1_1, TGAGGGACACCGTACG-1_1, GTGCACGTCATCTGTT-1_1, GGTAATCAGTTGCATC-1_1, ATACTTCCAAGGTCTT-1_1, ...]
[6099 rows x 0 columns],
'obs_names': Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
'TGGGAAGTCAGTGATC-1_1',
...
'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
'ATGCATGAGTCATGAA-1_1'],
dtype='object', length=6099),
'obsm': AxisArrays with keys: ,
'obsp': PairwiseArrays with keys: ,
'raw': None,
'shape': (6099, 34546),
'uns': OverloadedDict, wrapping:
OrderedDict()
With overloaded keys:
['neighbors'].,
'var': Empty DataFrame
Columns: []
Index: [AL627309.1, AL627309.5, AL627309.4, AL669831.2, LINC01409, FAM87B, LINC01128, LINC00115, FAM41C, AL645608.6, AL645608.2, LINC02593, SAMD11, NOC2L, KLHL17, PLEKHN1, PERM1, AL645608.7, HES4, ISG15, AL645608.1, AGRN, C1orf159, AL390719.3, LINC01342, AL390719.2, TTLL10-AS1, TTLL10, TNFRSF18, TNFRSF4, SDF4, B3GALT6, C1QTNF12, UBE2J2, LINC01786, SCNN1D, ACAP3, PUSL1, INTS11, AL139287.1, CPTP, DVL1, MXRA8, AURKAIP1, CCNL2, MRPL20-AS1, MRPL20, AL391244.2, ANKRD65, AL391244.1, LINC01770, VWA1, ATAD3C, ATAD3B, ATAD3A, TMEM240, SSU72, AL645728.1, FNDC10, AL691432.4, AL691432.2, MIB2, MMP23B, CDK11B, FO704657.1, SLC35E2B, CDK11A, SLC35E2A, NADK, GNB1, AL109917.1, CALML6, TMEM52, CFAP74, GABRD, AL391845.1, PRKCZ, AL590822.2, PRKCZ-AS1, FAAP20, AL590822.1, SKI, AL590822.3, MORN1, AL589739.1, AL513477.2, RER1, PEX10, PLCH2, AL139246.4, PANK4, HES5, AL139246.5, TNFRSF14-AS1, TNFRSF14, AL139246.3, PRXL2B, MMEL1, TTC34, AC242022.2, ...]
[34546 rows x 0 columns],
'var_names': Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
...
'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
dtype='object', length=34546),
'varm': AxisArrays with keys: ,
'varp': PairwiseArrays with keys: }
/tmp/ipykernel_16232/22660072.py:11: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
/tmp/ipykernel_16232/22660072.py:9: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
name: getattr(obj, name)
[15]:
adata.obs
[15]:
| TAGGTACCATGGCCAC-1_1 |
|---|
| ATTCACTGTAACAGGC-1_1 |
| TAACTTCCAACCACGC-1_1 |
| TTGGGTACACGACAAG-1_1 |
| AGGCCACAGAGTCACG-1_1 |
| ... |
| CGCCATTGTTTGCCGG-1_1 |
| CACTGGGGTCTACGTA-1_1 |
| CATACTTGTAGAGGAA-1_1 |
| TTTGGTTTCCACGGAC-1_1 |
| ATGCATGAGTCATGAA-1_1 |
6099 rows × 0 columns
Gene names are stored in adata.var.
[16]:
adata.var
[16]:
| AL627309.1 |
|---|
| AL627309.5 |
| AL627309.4 |
| AL669831.2 |
| LINC01409 |
| ... |
| VN1R2 |
| AL031676.1 |
| SMIM34A |
| AL050402.1 |
| AL445072.1 |
34546 rows × 0 columns
[17]:
adata.obs_names
[17]:
Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
'TGGGAAGTCAGTGATC-1_1',
...
'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
'ATGCATGAGTCATGAA-1_1'],
dtype='object', length=6099)
[18]:
adata.var_names
[18]:
Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
...
'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
dtype='object', length=34546)
There are no layers in this data set.
[19]:
adata.layers
[19]:
Layers with keys:
There are no multidimensional observations or variables.
[20]:
print(adata.obsm)
print(adata.varm)
print(adata.obsp)
print(adata.varp)
AxisArrays with keys:
AxisArrays with keys:
PairwiseArrays with keys:
PairwiseArrays with keys:
The data appears to contain reads mapped to 6099 cell-associated barcodes and 34546 RNA molecule-associated features.
Doublet removal§
Filter transcripts by minimum number of cells with non-zero counts§
We may choose to filter out transcript types that are detected in a relatively small number of cells. The optimum threshold is not known. Here we use 10 as a base case.
[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 6099 × 34546
See the documentation for scanpy pre-processing filter-genes.
[22]:
sc.pp.filter_genes(adata, min_cells=10)
[23]:
adata
[23]:
AnnData object with n_obs × n_vars = 6099 × 19896
var: 'n_cells'
After filtering, 19896 transcript types remain that are present in at least 10 cells, meaning 14650 transcript types were removed at the threshold of 10. Notice that an annotation named n_cells has been added to the genes, indicating the number of cells with non-zero counts for each transcript type.
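The effect of min_cells can be sketched in plain numpy (illustrative values, not the real counts):

```python
import numpy as np

# Toy counts matrix: 4 cells x 3 genes (illustrative values only).
X = np.array([
    [0, 1, 2],
    [0, 0, 1],
    [0, 3, 0],
    [0, 2, 4],
])

min_cells = 2
n_cells = (X > 0).sum(axis=0)   # per-gene count of cells with non-zero counts
keep = n_cells >= min_cells     # mask of genes sc.pp.filter_genes would retain
X_filtered = X[:, keep]
print(n_cells.tolist(), X_filtered.shape)  # [0, 3, 3] (4, 2)
```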
[24]:
adata.var
[24]:
| n_cells | |
|---|---|
| AL627309.5 | 33 |
| LINC01409 | 274 |
| LINC01128 | 81 |
| LINC00115 | 15 |
| SAMD11 | 16 |
| ... | ... |
| MAFIP | 34 |
| AC011043.1 | 11 |
| AL354822.1 | 133 |
| AL592183.1 | 1003 |
| AC240274.1 | 162 |
19896 rows × 1 columns
[25]:
adata.var.describe()
[25]:
| n_cells | |
|---|---|
| count | 19896.000000 |
| mean | 293.307600 |
| std | 451.440237 |
| min | 10.000000 |
| 25% | 34.000000 |
| 50% | 111.000000 |
| 75% | 345.000000 |
| max | 6090.000000 |
[26]:
sns.histplot(adata.var['n_cells'])
/usr/lib/python3.10/site-packages/matplotlib_inline/config.py:59: DeprecationWarning: InlineBackend._figure_formats_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
def _figure_formats_changed(self, name, old, new):
[26]:
<AxesSubplot:xlabel='n_cells', ylabel='Count'>
Select highly variable genes§
It is common to select transcript types with high variability among the cell population under the assumption that this will help to focus on features that distinguish the cells. Again there is no perfect threshold. Here we select the 2000 highest variability genes.
[27]:
adata
[27]:
AnnData object with n_obs × n_vars = 6099 × 19896
var: 'n_cells'
See the documentation for scanpy pre-processing highly-variable-genes.
[28]:
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="seurat_v3")
[29]:
adata
[29]:
AnnData object with n_obs × n_vars = 6099 × 2000
var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg'
We have filtered down to 2000 transcript types and added 5 annotations to the genes including a binary variable indicating membership in the highly-variable class, the ranking among highly variable genes, and the mean, variance, and normalized variance for each gene across cells.
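The seurat_v3 flavor ranks genes by a normalized variance statistic; as a simplified sketch (plain per-gene variance rather than the actual seurat_v3 computation, with randomly generated counts), "keep the top-k most variable columns" looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 10)).astype(float)  # 100 cells x 10 genes

k = 3
variances = X.var(axis=0)                      # per-gene variance across cells
top_idx = np.sort(np.argsort(variances)[-k:])  # k most variable genes, in gene order
X_hvg = X[:, top_idx]
print(X_hvg.shape)  # (100, 3)
```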
[30]:
adata.var
[30]:
| n_cells | highly_variable | highly_variable_rank | means | variances | variances_norm | |
|---|---|---|---|---|---|---|
| TTLL10 | 112 | True | 903.0 | 0.028857 | 0.069354 | 1.901045 |
| TNFRSF18 | 15 | True | 1604.0 | 0.002951 | 0.004911 | 1.513808 |
| CFAP74 | 159 | True | 1370.0 | 0.041154 | 0.087024 | 1.607399 |
| TTC34 | 209 | True | 245.0 | 0.080341 | 0.363502 | 3.044086 |
| AJAP1 | 31 | True | 1922.0 | 0.006886 | 0.011432 | 1.421002 |
| ... | ... | ... | ... | ... | ... | ... |
| MT-ND4L | 650 | True | 1212.0 | 0.191671 | 0.559353 | 1.695637 |
| MT-ND4 | 1328 | True | 178.0 | 0.833087 | 9.742224 | 3.486984 |
| MT-ND5 | 886 | True | 443.0 | 0.332514 | 1.669344 | 2.430210 |
| MT-ND6 | 821 | True | 123.0 | 0.383178 | 3.357087 | 3.985056 |
| MT-CYB | 1295 | True | 414.0 | 0.622397 | 4.384615 | 2.487058 |
2000 rows × 6 columns
Doublet removal§
Here we use an scvi-tools model as an interface to solo from [BFL+20]. scvi adds some unstructured data to interface with torch.
[31]:
adata.uns
[31]:
OverloadedDict, wrapping:
OrderedDict([('hvg', {'flavor': 'seurat_v3'})])
With overloaded keys:
['neighbors'].
[32]:
scvi.model.SCVI.setup_anndata(adata)
[33]:
adata.uns
[33]:
OverloadedDict, wrapping:
OrderedDict([('hvg', {'flavor': 'seurat_v3'}), ('_scvi_uuid', 'efb3a791-15ff-4864-b57c-7536e8ea9df6'), ('_scvi_manager_uuid', 'ccc84e47-0ab2-4ab3-852b-55f888509f9e')])
With overloaded keys:
['neighbors'].
Train the scvi model.
[34]:
vae = scvi.model.SCVI(adata)
vae.train()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00, 1.98it/s, loss=323, v_num=1]
`Trainer.fit` stopped: `max_epochs=400` reached.
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00, 1.85it/s, loss=323, v_num=1]
After training the model, adata.obs contains _scvi_batch and _scvi_labels annotations.
[35]:
adata
[35]:
AnnData object with n_obs × n_vars = 6099 × 2000
obs: '_scvi_batch', '_scvi_labels'
var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', '_scvi_uuid', '_scvi_manager_uuid'
[36]:
adata.obs
[36]:
| _scvi_batch | _scvi_labels | |
|---|---|---|
| TAGGTACCATGGCCAC-1_1 | 0 | 0 |
| ATTCACTGTAACAGGC-1_1 | 0 | 0 |
| TAACTTCCAACCACGC-1_1 | 0 | 0 |
| TTGGGTACACGACAAG-1_1 | 0 | 0 |
| AGGCCACAGAGTCACG-1_1 | 0 | 0 |
| ... | ... | ... |
| CGCCATTGTTTGCCGG-1_1 | 0 | 0 |
| CACTGGGGTCTACGTA-1_1 | 0 | 0 |
| CATACTTGTAGAGGAA-1_1 | 0 | 0 |
| TTTGGTTTCCACGGAC-1_1 | 0 | 0 |
| ATGCATGAGTCATGAA-1_1 | 0 | 0 |
6099 rows × 2 columns
Train the solo model.
[37]:
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()
INFO Creating doublets, preparing SOLO model.
/usr/lib/python3.10/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 207/400: 52%|██████████████████████████████████████████████████▏ | 207/400 [01:51<01:44, 1.85it/s, loss=0.299, v_num=1]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.295. Signaling Trainer to stop.
[38]:
type(solo)
[38]:
scvi.external.solo._model.SOLO
We can now use the solo model to add a prediction annotation for doublet or singlet to our data.
[39]:
df = solo.predict()
df["prediction"] = solo.predict(soft=False)
df.index = df.index.map(lambda x: x[:-2])
df
[39]:
| doublet | singlet | prediction | |
|---|---|---|---|
| TAGGTACCATGGCCAC-1_1 | 1.729540 | -0.992126 | doublet |
| ATTCACTGTAACAGGC-1_1 | 2.275840 | -1.639090 | doublet |
| TAACTTCCAACCACGC-1_1 | 0.735813 | -0.259502 | doublet |
| TTGGGTACACGACAAG-1_1 | 1.618703 | -0.680344 | doublet |
| AGGCCACAGAGTCACG-1_1 | 1.398426 | -0.673206 | doublet |
| ... | ... | ... | ... |
| CGCCATTGTTTGCCGG-1_1 | -0.691126 | 1.362337 | singlet |
| CACTGGGGTCTACGTA-1_1 | -2.632991 | 2.468916 | singlet |
| CATACTTGTAGAGGAA-1_1 | -2.506424 | 2.567346 | singlet |
| TTTGGTTTCCACGGAC-1_1 | -3.308819 | 2.476630 | singlet |
| ATGCATGAGTCATGAA-1_1 | -2.357370 | 2.206591 | singlet |
6099 rows × 3 columns
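The index trim in the cell above is needed because solo's predictions are indexed by barcodes with a short suffix appended (assumed here to be two characters, e.g. "-0"; the exact suffix may depend on the scvi-tools version). Stripping the last two characters restores the original obs_names:

```python
# Hypothetical suffixed barcodes as produced by solo.predict().
solo_index = ["TAGGTACCATGGCCAC-1_1-0", "ATTCACTGTAACAGGC-1_1-0"]

# Equivalent to df.index.map(lambda x: x[:-2]) above.
restored = [name[:-2] for name in solo_index]
print(restored)  # ['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1']
```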
Counting the predicted doublets and singlets reveals about 20% doublets.
[40]:
df.groupby("prediction").count()
[40]:
| doublet | singlet | |
|---|---|---|
| prediction | ||
| doublet | 1206 | 1206 |
| singlet | 4893 | 4893 |
We can assess the magnitude of the prediction by taking the difference between the doublet and singlet scores.
[41]:
df["dif"] = df.doublet - df.singlet
df
[41]:
| doublet | singlet | prediction | dif | |
|---|---|---|---|---|
| TAGGTACCATGGCCAC-1_1 | 1.729540 | -0.992126 | doublet | 2.721666 |
| ATTCACTGTAACAGGC-1_1 | 2.275840 | -1.639090 | doublet | 3.914930 |
| TAACTTCCAACCACGC-1_1 | 0.735813 | -0.259502 | doublet | 0.995315 |
| TTGGGTACACGACAAG-1_1 | 1.618703 | -0.680344 | doublet | 2.299047 |
| AGGCCACAGAGTCACG-1_1 | 1.398426 | -0.673206 | doublet | 2.071632 |
| ... | ... | ... | ... | ... |
| CGCCATTGTTTGCCGG-1_1 | -0.691126 | 1.362337 | singlet | -2.053463 |
| CACTGGGGTCTACGTA-1_1 | -2.632991 | 2.468916 | singlet | -5.101907 |
| CATACTTGTAGAGGAA-1_1 | -2.506424 | 2.567346 | singlet | -5.073770 |
| TTTGGTTTCCACGGAC-1_1 | -3.308819 | 2.476630 | singlet | -5.785450 |
| ATGCATGAGTCATGAA-1_1 | -2.357370 | 2.206591 | singlet | -4.563961 |
6099 rows × 4 columns
Plotting the distribution of doublet-singlet score differences, we see that a large fraction of barcodes only marginally favor doublet over singlet.
[42]:
sns.histplot(df[df.prediction == "doublet"], x="dif")
[42]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
Since we would like to avoid unnecessarily discarding barcodes, we arbitrarily retain doublet-labeled barcodes whose score difference is at most one (this threshold is not intended to be principled).
[43]:
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
[43]:
| doublet | singlet | prediction | dif | |
|---|---|---|---|---|
| TAGGTACCATGGCCAC-1_1 | 1.729540 | -0.992126 | doublet | 2.721666 |
| ATTCACTGTAACAGGC-1_1 | 2.275840 | -1.639090 | doublet | 3.914930 |
| TTGGGTACACGACAAG-1_1 | 1.618703 | -0.680344 | doublet | 2.299047 |
| AGGCCACAGAGTCACG-1_1 | 1.398426 | -0.673206 | doublet | 2.071632 |
| CACTGAAGTCGAAGCA-1_1 | 1.305717 | -0.544905 | doublet | 1.850623 |
| ... | ... | ... | ... | ... |
| CAATACGCAATGTGGG-1_1 | 0.471522 | -0.923330 | doublet | 1.394853 |
| GGGTATTTCAGCGCAC-1_1 | 1.101736 | -0.658175 | doublet | 1.759911 |
| TTGCTGCAGTGCGACA-1_1 | 0.720216 | -0.624341 | doublet | 1.344557 |
| CATCCCAAGACGCCAA-1_1 | 0.937599 | -0.486121 | doublet | 1.423720 |
| CACTGTCAGGGACAGG-1_1 | 1.190305 | -0.678417 | doublet | 1.868721 |
451 rows × 4 columns
[44]:
sns.histplot(doublets[doublets.prediction == "doublet"], x="dif")
[44]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
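The cutoff trades doublet removal against barcode retention; a sketch with hypothetical score differences (not the real values) shows how the number of discarded barcodes falls as the threshold rises:

```python
import pandas as pd

# Hypothetical doublet-minus-singlet differences.
dif = pd.Series([2.7, 3.9, 1.0, 0.4, 2.3, 0.8, 1.4])

for threshold in (0.0, 0.5, 1.0):
    print(threshold, int((dif > threshold).sum()))  # 7, 6, and 4 discarded
```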
Save/load checkpoint§
Save current variables to file.
[45]:
pickle.dump([df, doublets, solo, vae], open(f"{output_directory}/auxiliary.p", "wb"))
Variables can be reloaded if necessary.
[46]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[47]:
adata.write(f"{output_directory}/adata_doublets.h5ad", compression="gzip")
[48]:
bdata_doublets = None
bdata_doublets = sc.read_h5ad(f"{output_directory}/adata_doublets.h5ad")
bdata_doublets
[48]:
AnnData object with n_obs × n_vars = 6099 × 2000
obs: '_scvi_batch', '_scvi_labels'
var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: '_scvi_manager_uuid', '_scvi_uuid', 'hvg'
Reload data and filter doublets§
[49]:
adata = sc.read_csv(
"data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[49]:
AnnData object with n_obs × n_vars = 6099 × 34546
Annotate adata with doublet column.
[50]:
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
[51]:
adata
[51]:
AnnData object with n_obs × n_vars = 6099 × 34546
obs: 'doublet'
[52]:
adata.obs
[52]:
| doublet | |
|---|---|
| TAGGTACCATGGCCAC-1_1 | True |
| ATTCACTGTAACAGGC-1_1 | True |
| TAACTTCCAACCACGC-1_1 | False |
| TTGGGTACACGACAAG-1_1 | True |
| AGGCCACAGAGTCACG-1_1 | True |
| ... | ... |
| CGCCATTGTTTGCCGG-1_1 | False |
| CACTGGGGTCTACGTA-1_1 | False |
| CATACTTGTAGAGGAA-1_1 | False |
| TTTGGTTTCCACGGAC-1_1 | False |
| ATGCATGAGTCATGAA-1_1 | False |
6099 rows × 1 columns
Use the doublet column to filter out doublets.
[53]:
adata = adata[~adata.obs.doublet]
[54]:
adata
[54]:
View of AnnData object with n_obs × n_vars = 5648 × 34546
obs: 'doublet'
We filtered about 7% of the barcodes as putative doublets.
Preprocessing§
Filter mitochondrial transcripts§
We can add a binary annotation flagging mitochondrial transcripts, whose gene names begin with 'MT-'.
[55]:
adata.var['mt'] = adata.var.index.str.startswith('MT-')
/tmp/ipykernel_16232/310233584.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var['mt'] = adata.var.index.str.startswith('MT-')
[56]:
adata.var
[56]:
| mt | |
|---|---|
| AL627309.1 | False |
| AL627309.5 | False |
| AL627309.4 | False |
| AL669831.2 | False |
| LINC01409 | False |
| ... | ... |
| VN1R2 | False |
| AL031676.1 | False |
| SMIM34A | False |
| AL050402.1 | False |
| AL445072.1 | False |
34546 rows × 1 columns
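The prefix test used above can be illustrated on a handful of hypothetical gene names:

```python
import pandas as pd

# Illustrative gene names; mitochondrial genes are prefixed "MT-".
gene_names = pd.Index(["MT-CO1", "MT-ND1", "MALAT1", "ACTB"])
is_mt = gene_names.str.startswith("MT-")
print(list(is_mt))  # [True, True, False, False]
```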
Filter ribosomal transcripts§
We can download a list of ribosomal genes from MSigDB.
[57]:
ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
ribo_genes
[57]:
| 0 | |
|---|---|
| 0 | FAU |
| 1 | MRPL13 |
| 2 | RPL10 |
| 3 | RPL10A |
| 4 | RPL10L |
| ... | ... |
| 83 | RPS9 |
| 84 | RPSA |
| 85 | RSL24D1 |
| 86 | RSL24D1P11 |
| 87 | UBA52 |
88 rows × 1 columns
We add a binary annotation to indicate presence in the list of ribosomal genes.
[58]:
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
[59]:
adata.var
[59]:
| mt | ribo | |
|---|---|---|
| AL627309.1 | False | False |
| AL627309.5 | False | False |
| AL627309.4 | False | False |
| AL669831.2 | False | False |
| LINC01409 | False | False |
| ... | ... | ... |
| VN1R2 | False | False |
| AL031676.1 | False | False |
| SMIM34A | False | False |
| AL050402.1 | False | False |
| AL445072.1 | False | False |
34546 rows × 2 columns
Calculate QC metrics§
Recall that the observation (barcode) annotation currently indicates only the doublet labels from our solo model.
[60]:
adata.obs
[60]:
| doublet | |
|---|---|
| TAACTTCCAACCACGC-1_1 | False |
| TTACCGCCACTCAGAT-1_1 | False |
| TTGGTTTTCCTAGCTC-1_1 | False |
| TCTATCACAAGGCTTT-1_1 | False |
| ATCCACCAGAGGTATT-1_1 | False |
| ... | ... |
| CGCCATTGTTTGCCGG-1_1 | False |
| CACTGGGGTCTACGTA-1_1 | False |
| CATACTTGTAGAGGAA-1_1 | False |
| TTTGGTTTCCACGGAC-1_1 | False |
| ATGCATGAGTCATGAA-1_1 | False |
5648 rows × 1 columns
We can calculate standard quality control metrics with scanpy. See the documentation for scanpy calculate_qc_metrics. These include the number of genes detected per barcode, total counts, and the percentage of counts attributable to mitochondrial and ribosomal transcripts.
[61]:
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
Sorting genes by the number of cells with non-zero counts, we see a number of genes that were not found in association with any barcode.
[62]:
adata.var.sort_values("n_cells_by_counts")
[62]:
| mt | ribo | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | |
|---|---|---|---|---|---|---|
| AL445072.1 | False | False | 0 | 0.000000 | 100.000000 | 0.0 |
| UGT1A5 | False | False | 0 | 0.000000 | 100.000000 | 0.0 |
| AGXT | False | False | 0 | 0.000000 | 100.000000 | 0.0 |
| TEX44 | False | False | 0 | 0.000000 | 100.000000 | 0.0 |
| AC093585.1 | False | False | 0 | 0.000000 | 100.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AKAP13 | False | False | 4059 | 2.881374 | 28.133853 | 16274.0 |
| NEAT1 | False | False | 4148 | 5.001062 | 26.558074 | 28246.0 |
| MBNL1 | False | False | 4150 | 2.695822 | 26.522663 | 15226.0 |
| ZBTB20 | False | False | 4281 | 2.425992 | 24.203258 | 13702.0 |
| MALAT1 | False | False | 5639 | 64.417313 | 0.159348 | 363829.0 |
34546 rows × 6 columns
Sorting barcodes by total counts we see a minimum around 400 suggesting a previously applied filter upstream of this pre-processing.
[63]:
adata.obs.sort_values("total_counts")
[63]:
| doublet | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | |
|---|---|---|---|---|---|---|---|
| AGGCATTCATCCGTTC-1_1 | False | 290 | 401.0 | 1.0 | 0.249377 | 0.0 | 0.000000 |
| TGGTACAGTTGGTGTT-1_1 | False | 323 | 401.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| CAGGGCTTCATGCGGC-1_1 | False | 330 | 401.0 | 7.0 | 1.745636 | 1.0 | 0.249377 |
| GTCGTTCTCCAAGGGA-1_1 | False | 300 | 401.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| CGAGAAGGTGAACTAA-1_1 | False | 308 | 401.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| TCTATCACAAGGCTTT-1_1 | False | 3582 | 8276.0 | 183.0 | 2.211213 | 5.0 | 0.060416 |
| ATCCACCAGAGGTATT-1_1 | False | 3913 | 8286.0 | 191.0 | 2.305093 | 34.0 | 0.410331 |
| TTACCGCCACTCAGAT-1_1 | False | 4144 | 11369.0 | 39.0 | 0.343038 | 5.0 | 0.043979 |
| TTGGTTTTCCTAGCTC-1_1 | False | 3902 | 11472.0 | 463.0 | 4.035913 | 13.0 | 0.113319 |
| TAACTTCCAACCACGC-1_1 | False | 5158 | 15645.0 | 221.0 | 1.412592 | 211.0 | 1.348674 |
5648 rows × 7 columns
[64]:
sns.jointplot(
data=adata.obs,
x="total_counts",
y="n_genes_by_counts",
kind="hex",
)
[64]:
<seaborn.axisgrid.JointGrid at 0x7fedd44cbbb0>
Filter genes by counts§
We can filter out genes found in fewer than some minimum number of cells; here we arbitrarily choose three.
[65]:
sc.pp.filter_genes(adata, min_cells=3)
[66]:
adata.var.sort_values("n_cells_by_counts")
[66]:
| mt | ribo | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | n_cells | |
|---|---|---|---|---|---|---|---|
| GSG1 | False | False | 3 | 0.000531 | 99.946884 | 3.0 | 3 |
| TRAJ47 | False | False | 3 | 0.000531 | 99.946884 | 3.0 | 3 |
| AC002465.2 | False | False | 3 | 0.000531 | 99.946884 | 3.0 | 3 |
| LINC02709 | False | False | 3 | 0.000531 | 99.946884 | 3.0 | 3 |
| AC096536.1 | False | False | 3 | 0.000531 | 99.946884 | 3.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| AKAP13 | False | False | 4059 | 2.881374 | 28.133853 | 16274.0 | 4059 |
| NEAT1 | False | False | 4148 | 5.001062 | 26.558074 | 28246.0 | 4148 |
| MBNL1 | False | False | 4150 | 2.695822 | 26.522663 | 15226.0 | 4150 |
| ZBTB20 | False | False | 4281 | 2.425992 | 24.203258 | 13702.0 | 4281 |
| MALAT1 | False | False | 5639 | 64.417313 | 0.159348 | 363829.0 | 5639 |
24136 rows × 7 columns
Filter cells by number of genes§
For completeness, we filter cells with fewer than 200 detected genes; however, the minimum number of genes per cell is already above 200.
[67]:
adata.obs.sort_values('n_genes_by_counts')
[67]:
| doublet | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | |
|---|---|---|---|---|---|---|---|
| TAGGGTTTCTGGCTGG-1_1 | False | 276 | 419.0 | 1.0 | 0.238663 | 1.0 | 0.238663 |
| CGTGCTTCAAAGGGCT-1_1 | False | 277 | 432.0 | 41.0 | 9.490741 | 0.0 | 0.000000 |
| TGACAGTTCTAAACGC-1_1 | False | 278 | 414.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| CTCTGGTCACGACGAA-1_1 | False | 285 | 407.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| GTAAGTCGTATCGCGC-1_1 | False | 289 | 430.0 | 0.0 | 0.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| TAACGACAGATGACCG-1_1 | False | 3791 | 8187.0 | 94.0 | 1.148162 | 4.0 | 0.048858 |
| TTGGTTTTCCTAGCTC-1_1 | False | 3902 | 11472.0 | 463.0 | 4.035913 | 13.0 | 0.113319 |
| ATCCACCAGAGGTATT-1_1 | False | 3913 | 8286.0 | 191.0 | 2.305093 | 34.0 | 0.410331 |
| TTACCGCCACTCAGAT-1_1 | False | 4144 | 11369.0 | 39.0 | 0.343038 | 5.0 | 0.043979 |
| TAACTTCCAACCACGC-1_1 | False | 5158 | 15645.0 | 221.0 | 1.412592 | 211.0 | 1.348674 |
5648 rows × 7 columns
[68]:
sc.pp.filter_cells(adata, min_genes=200)
[69]:
adata
[69]:
AnnData object with n_obs × n_vars = 5648 × 24136
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
Review and filter QC outliers§
We can plot distributions of the number of genes by counts and total counts as well as the percentage of mitochondrial and ribosomal genes.
[70]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'],
jitter=0.4, multi_panel=True)
If we choose to filter cells beyond the 98th percentile, we can compute the bound via the numpy quantile function.
[71]:
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
upper_lim
[71]:
2310.0599999999995
[72]:
adata = adata[adata.obs.n_genes_by_counts < upper_lim]
This removes 113 additional cells (from 5648 to 5535).
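The percentile-based cutoff can be illustrated on synthetic data; the sketch below assumes nothing beyond numpy and stands in for the n_genes_by_counts column.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic per-cell gene counts standing in for adata.obs.n_genes_by_counts
n_genes_by_counts = rng.integers(200, 5000, size=10_000)

# the 98th-percentile bound, as computed above with np.quantile
upper_lim = np.quantile(n_genes_by_counts, 0.98)

# boolean mask analogous to adata[adata.obs.n_genes_by_counts < upper_lim]
kept = n_genes_by_counts < upper_lim
```

Roughly 2% of cells fall at or above the bound and are dropped by the mask.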
[73]:
adata.obs
[73]:
| doublet | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | n_genes | |
|---|---|---|---|---|---|---|---|---|
| CCTCAACCATACAGGG-1_1 | False | 2276 | 5434.0 | 38.0 | 0.699301 | 0.0 | 0.000000 | 2275 |
| ACAAAGATCCACAGGC-1_1 | False | 2310 | 5504.0 | 1.0 | 0.018169 | 1.0 | 0.018169 | 2308 |
| ATACTTCCAAGGTCTT-1_1 | False | 2122 | 5252.0 | 1.0 | 0.019040 | 4.0 | 0.076161 | 2120 |
| GGGTCACTCTATTCGT-1_1 | False | 2286 | 5056.0 | 0.0 | 0.000000 | 1.0 | 0.019778 | 2280 |
| GATGATCCACAACCGC-1_1 | False | 2263 | 4938.0 | 2.0 | 0.040502 | 2.0 | 0.040502 | 2262 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CGCCATTGTTTGCCGG-1_1 | False | 355 | 410.0 | 3.0 | 0.731707 | 0.0 | 0.000000 | 355 |
| CACTGGGGTCTACGTA-1_1 | False | 346 | 403.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 346 |
| CATACTTGTAGAGGAA-1_1 | False | 360 | 410.0 | 2.0 | 0.487805 | 0.0 | 0.000000 | 360 |
| TTTGGTTTCCACGGAC-1_1 | False | 299 | 405.0 | 0.0 | 0.000000 | 2.0 | 0.493827 | 299 |
| ATGCATGAGTCATGAA-1_1 | False | 351 | 411.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 350 |
5535 rows × 8 columns
Finally, we remove outlier cells with more than 20% mitochondrial counts or more than 2% ribosomal counts.
[74]:
adata = adata[adata.obs.pct_counts_mt < 20]
[75]:
adata = adata[adata.obs.pct_counts_ribo < 2]
[76]:
adata
[76]:
View of AnnData object with n_obs × n_vars = 5518 × 24136
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
We end our preprocessing with 5518 cells and 24136 genes.
Save/load checkpoint§
Save the current version of the post-processed data to file.
[77]:
adata.write(f"{output_directory}/adata_post_processed.h5ad", compression="gzip")
Variables can be reloaded if necessary.
[78]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[79]:
bdata_post_processed = None
bdata_post_processed = sc.read_h5ad(f"{output_directory}/adata_post_processed.h5ad")
bdata_post_processed
[79]:
AnnData object with n_obs × n_vars = 5518 × 24136
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
Normalization§
[80]:
adata.X.sum(axis=1)
[80]:
ArrayView([5433., 5502., 5250., ..., 410., 405., 410.], dtype=float32)
[81]:
sc.pp.normalize_total(adata, target_sum=1e4)
/usr/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
view_to_actual(adata)
[82]:
adata.X.sum(axis=1)
[82]:
array([10000., 10000., 10000., ..., 10000., 10000., 10000.], dtype=float32)
[83]:
sc.pp.log1p(adata)
[84]:
adata.X.sum(axis=1)
[84]:
array([3285.3794, 3163.8867, 2858.1602, ..., 1191.2706, 1022.7512,
1162.8784], dtype=float32)
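What normalize_total and log1p compute can be sketched in plain numpy on a toy matrix (scanpy's implementations additionally handle sparse inputs and options such as excluding highly expressed genes):

```python
import numpy as np

# toy counts matrix: 3 cells x 4 genes, standing in for adata.X
X = np.array([[10., 0., 5., 5.],
              [ 2., 2., 2., 2.],
              [ 0., 0., 0., 8.]])

# normalize_total: rescale each cell so its counts sum to target_sum
target_sum = 1e4
X_norm = X / X.sum(axis=1, keepdims=True) * target_sum

# log1p: natural log of (1 + x), applied element-wise
X_log = np.log1p(X_norm)
```

After the first step every row sums to target_sum, matching the output of adata.X.sum(axis=1) above.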
[85]:
adata.raw = adata
Clustering§
Estimate expression variability§
[86]:
adata.var
[86]:
| mt | ribo | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | n_cells | |
|---|---|---|---|---|---|---|---|
| AL627309.1 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 |
| AL627309.5 | False | False | 26 | 0.004603 | 99.539660 | 26.0 | 26 |
| AL627309.4 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 |
| LINC01409 | False | False | 233 | 0.044972 | 95.874646 | 254.0 | 233 |
| FAM87B | False | False | 6 | 0.001062 | 99.893768 | 6.0 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| AL354822.1 | False | False | 121 | 0.022132 | 97.857649 | 125.0 | 121 |
| AL592183.1 | False | False | 871 | 0.182188 | 84.578612 | 1029.0 | 871 |
| AC240274.1 | False | False | 142 | 0.026027 | 97.485836 | 147.0 | 142 |
| AC007325.4 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 |
| AC007325.2 | False | False | 6 | 0.001062 | 99.893768 | 6.0 | 6 |
24136 rows × 7 columns
Here we use the scanpy pre-processing function for highly-variable genes to annotate genes by their expression variability across cells.
[87]:
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
[88]:
adata.var
[88]:
| mt | ribo | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | n_cells | highly_variable | means | dispersions | dispersions_norm | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AL627309.1 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 | True | 0.009588 | 2.873559 | 1.581447 |
| AL627309.5 | False | False | 26 | 0.004603 | 99.539660 | 26.0 | 26 | False | 0.029244 | 2.300264 | -0.151067 |
| AL627309.4 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 | False | 0.002517 | 1.702826 | -1.956543 |
| LINC01409 | False | False | 233 | 0.044972 | 95.874646 | 254.0 | 233 | False | 0.265776 | 2.334326 | -0.048131 |
| FAM87B | False | False | 6 | 0.001062 | 99.893768 | 6.0 | 6 | False | 0.007217 | 1.947624 | -1.216757 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AL354822.1 | False | False | 121 | 0.022132 | 97.857649 | 125.0 | 121 | False | 0.158275 | 2.407786 | 0.173866 |
| AL592183.1 | False | False | 871 | 0.182188 | 84.578612 | 1029.0 | 871 | False | 0.868576 | 2.511827 | -0.388573 |
| AC240274.1 | False | False | 142 | 0.026027 | 97.485836 | 147.0 | 142 | False | 0.153063 | 2.300034 | -0.151763 |
| AC007325.4 | False | False | 4 | 0.000708 | 99.929178 | 4.0 | 4 | False | 0.003025 | 1.579867 | -2.328129 |
| AC007325.2 | False | False | 6 | 0.001062 | 99.893768 | 6.0 | 6 | False | 0.006377 | 1.995220 | -1.072921 |
24136 rows × 11 columns
[89]:
sc.pl.highly_variable_genes(adata)
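In outline, the default ('seurat') flavor ranks genes by a dispersion (variance over mean) that is z-scored within bins of similar mean expression. The following is a simplified numpy sketch on toy data; it ignores scanpy's expm1 handling and other edge cases, so treat it as an illustration rather than the exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 100)).astype(float)  # toy cells x genes

means = X.mean(axis=0)
dispersions = X.var(axis=0) / np.maximum(means, 1e-12)

# z-score dispersions within bins of similar mean expression
bins = np.digitize(means, np.quantile(means, np.linspace(0, 1, 11)[1:-1]))
norm_disp = np.empty_like(dispersions)
for b in np.unique(bins):
    idx = bins == b
    mu, sd = dispersions[idx].mean(), dispersions[idx].std()
    norm_disp[idx] = (dispersions[idx] - mu) / (sd if sd > 0 else 1.0)

# keep the n_top genes with the highest normalized dispersion
n_top = 20
highly_variable = norm_disp >= np.sort(norm_disp)[-n_top]
```

The binning prevents highly expressed genes, which naturally have larger variance, from dominating the ranking.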
Filter highly variable genes§
Finally, we subset to the highly variable genes, noting the change in n_vars from 24136 to 2000.
[90]:
adata
[90]:
AnnData object with n_obs × n_vars = 5518 × 24136
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
[91]:
adata = adata[:, adata.var.highly_variable]
[92]:
adata
[92]:
View of AnnData object with n_obs × n_vars = 5518 × 2000
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
Principal components analysis§
Regression and rescaling§
We would like to avoid interpreting variation that merely reflects differences in total counts, percentage of mitochondrial counts, or percentage of ribosomal counts. We can reduce the impact of these technical covariates using the `scanpy regress out function <https://scanpy.readthedocs.io/en/latest/generated/scanpy.pp.regress_out.html>`__.
[93]:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
We also scale the data to zero mean and unit variance per gene, clipping any remaining values above 10.
[94]:
sc.pp.scale(adata, max_value=10)
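Per gene, this amounts to a z-score followed by clipping the upper tail; a minimal numpy sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(1000, 5))  # toy cells x genes

# z-score each gene (column), then cap the upper tail at 10
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = np.clip((X - mean) / std, None, 10)  # max_value=10 clips only above
```

Clipping keeps a handful of extreme cells from dominating the principal components computed next.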
Perform PCA§
PCA can be performed with the `scanpy PCA tool <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.pca.html>`__.
[95]:
sc.tl.pca(adata, svd_solver='arpack')
[96]:
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
Clustering§
The UMAP algorithm includes a method for estimating distances between cells; scanpy's neighbors function provides an interface to it.
[97]:
sc.pp.neighbors(adata, n_pcs = 30)
This function stores distances and connectivities, pairwise relationships among the cells, in adata.obsp.
[98]:
adata
[98]:
AnnData object with n_obs × n_vars = 5518 × 2000
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'neighbors'
obsm: 'X_pca'
varm: 'PCs'
obsp: 'distances', 'connectivities'
[99]:
adata.obsp
[99]:
PairwiseArrays with keys: distances, connectivities
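Conceptually, adata.obsp['distances'] is a sparse cell-by-cell matrix that holds, for each cell, the distances to its nearest neighbors in PCA space and zeros elsewhere. A small dense sketch of that structure, assuming only numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 30))  # toy cells x principal components

# all pairwise Euclidean distances
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

# keep each cell's k nearest neighbors (excluding itself), zero elsewhere
k = 15
nn = np.argsort(D, axis=1)[:, 1:k + 1]
knn_dist = np.zeros_like(D)
rows = np.arange(D.shape[0])[:, None]
knn_dist[rows, nn] = D[rows, nn]
```

In practice scanpy stores this as a scipy sparse matrix and uses an approximate neighbor search rather than the exhaustive computation shown here.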
We can now plot the UMAP embedding. Note that no clusters have been defined yet, even though groupings are visually apparent.
[100]:
sc.tl.umap(adata)
[101]:
sc.pl.umap(adata)
The `scanpy interface to the Leiden algorithm <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.leiden.html>`__ can be used to cluster cells into subgroups.
[102]:
sc.tl.leiden(adata, resolution = 0.5)
[103]:
adata.obs
[103]:
| doublet | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | n_genes | leiden | |
|---|---|---|---|---|---|---|---|---|---|
| CCTCAACCATACAGGG-1_1 | False | 2276 | 5434.0 | 38.0 | 0.699301 | 0.0 | 0.000000 | 2275 | 11 |
| ACAAAGATCCACAGGC-1_1 | False | 2310 | 5504.0 | 1.0 | 0.018169 | 1.0 | 0.018169 | 2308 | 2 |
| ATACTTCCAAGGTCTT-1_1 | False | 2122 | 5252.0 | 1.0 | 0.019040 | 4.0 | 0.076161 | 2120 | 2 |
| GGGTCACTCTATTCGT-1_1 | False | 2286 | 5056.0 | 0.0 | 0.000000 | 1.0 | 0.019778 | 2280 | 2 |
| GATGATCCACAACCGC-1_1 | False | 2263 | 4938.0 | 2.0 | 0.040502 | 2.0 | 0.040502 | 2262 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CGCCATTGTTTGCCGG-1_1 | False | 355 | 410.0 | 3.0 | 0.731707 | 0.0 | 0.000000 | 355 | 7 |
| CACTGGGGTCTACGTA-1_1 | False | 346 | 403.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 346 | 1 |
| CATACTTGTAGAGGAA-1_1 | False | 360 | 410.0 | 2.0 | 0.487805 | 0.0 | 0.000000 | 360 | 0 |
| TTTGGTTTCCACGGAC-1_1 | False | 299 | 405.0 | 0.0 | 0.000000 | 2.0 | 0.493827 | 299 | 5 |
| ATGCATGAGTCATGAA-1_1 | False | 351 | 411.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 350 | 1 |
5518 rows × 9 columns
We can now color the UMAP embedding by the Leiden clusters.
[104]:
sc.pl.umap(adata, color=['leiden'])
Save/load checkpoint§
Save current data with cluster information to file.
[105]:
adata.write(f"{output_directory}/adata_clustered.h5ad", compression="gzip")
[106]:
adata
[106]:
AnnData object with n_obs × n_vars = 5518 × 2000
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
Variables can be reloaded if necessary.
[107]:
bdata_clustered = None
bdata_clustered = sc.read_h5ad(f"{output_directory}/adata_clustered.h5ad")
bdata_clustered
[107]:
AnnData object with n_obs × n_vars = 5518 × 2000
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Sample integration§
Batch processing§
See the command line program scpreprocess. It can be installed with
$ pip install -e .
and run, for example
$ scpreprocess --input-directory="data/GSE171524/supplementary/"
where the contents of input-directory were, in this case, created by running download_geo_data.sh on GSE171524. scpreprocess requires about 30 GB of RAM and 30 to 60 minutes to complete with one NVIDIA Tesla T4 GPU.
Filter genes§
[12]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined.h5ad")
adata
[12]:
AnnData object with n_obs × n_vars = 98878 × 34546
obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
[13]:
sc.pp.filter_genes(adata, min_cells = 10)
[14]:
adata.X = csr_matrix(adata.X)
[15]:
adata.write(f"{output_directory}/adata_combined_filtered.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined_filtered.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'n_cells'
[17]:
adata.obs.groupby('Sample').count()
[17]:
| doublet | n_genes | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | |
|---|---|---|---|---|---|---|---|---|
| Sample | ||||||||
| C51ctr | 5501 | 5501 | 5501 | 5501 | 5501 | 5501 | 5501 | 5501 |
| C52ctr | 4055 | 4055 | 4055 | 4055 | 4055 | 4055 | 4055 | 4055 |
| C53ctr | 6241 | 6241 | 6241 | 6241 | 6241 | 6241 | 6241 | 6241 |
| C54ctr | 3961 | 3961 | 3961 | 3961 | 3961 | 3961 | 3961 | 3961 |
| C55ctr | 5088 | 5088 | 5088 | 5088 | 5088 | 5088 | 5088 | 5088 |
| C56ctr | 3613 | 3613 | 3613 | 3613 | 3613 | 3613 | 3613 | 3613 |
| C57ctr | 4242 | 4242 | 4242 | 4242 | 4242 | 4242 | 4242 | 4242 |
| L01cov | 2645 | 2645 | 2645 | 2645 | 2645 | 2645 | 2645 | 2645 |
| L03cov | 3583 | 3583 | 3583 | 3583 | 3583 | 3583 | 3583 | 3583 |
| L04cov | 3062 | 3062 | 3062 | 3062 | 3062 | 3062 | 3062 | 3062 |
| L04covaddon | 3933 | 3933 | 3933 | 3933 | 3933 | 3933 | 3933 | 3933 |
| L05cov | 2382 | 2382 | 2382 | 2382 | 2382 | 2382 | 2382 | 2382 |
| L06cov | 5737 | 5737 | 5737 | 5737 | 5737 | 5737 | 5737 | 5737 |
| L07cov | 4289 | 4289 | 4289 | 4289 | 4289 | 4289 | 4289 | 4289 |
| L08cov | 3474 | 3474 | 3474 | 3474 | 3474 | 3474 | 3474 | 3474 |
| L09cov | 3117 | 3117 | 3117 | 3117 | 3117 | 3117 | 3117 | 3117 |
| L10cov | 1357 | 1357 | 1357 | 1357 | 1357 | 1357 | 1357 | 1357 |
| L11cov | 2682 | 2682 | 2682 | 2682 | 2682 | 2682 | 2682 | 2682 |
| L12cov | 3298 | 3298 | 3298 | 3298 | 3298 | 3298 | 3298 | 3298 |
| L13cov | 4293 | 4293 | 4293 | 4293 | 4293 | 4293 | 4293 | 4293 |
| L15cov | 3635 | 3635 | 3635 | 3635 | 3635 | 3635 | 3635 | 3635 |
| L16cov | 1593 | 1593 | 1593 | 1593 | 1593 | 1593 | 1593 | 1593 |
| L17cov | 3952 | 3952 | 3952 | 3952 | 3952 | 3952 | 3952 | 3952 |
| L18cov | 2440 | 2440 | 2440 | 2440 | 2440 | 2440 | 2440 | 2440 |
| L19cov | 2162 | 2162 | 2162 | 2162 | 2162 | 2162 | 2162 | 2162 |
| L21cov | 2917 | 2917 | 2917 | 2917 | 2917 | 2917 | 2917 | 2917 |
| L22cov | 5626 | 5626 | 5626 | 5626 | 5626 | 5626 | 5626 | 5626 |
[18]:
sc.pp.filter_genes(adata, min_cells = 100)
[19]:
adata
[19]:
AnnData object with n_obs × n_vars = 98878 × 20625
obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'n_cells'
[20]:
adata.layers['counts'] = adata.X.copy()
[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 98878 × 20625
obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'n_cells'
layers: 'counts'
Normalize§
[22]:
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
adata.raw = adata
[23]:
adata.obs.head()
[23]:
| Sample | doublet | n_genes | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | |
|---|---|---|---|---|---|---|---|---|---|
| CTGGCAGCACTCCTGT-1_21 | L15cov | False | 1889 | 1889 | 4812.0 | 0.0 | 0.0 | 9.0 | 0.187032 |
| TCCCACAAGCGACTTT-1_21 | L15cov | False | 2125 | 2125 | 4318.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| CCTCCAAAGACCAACG-1_21 | L15cov | False | 2099 | 2099 | 4279.0 | 0.0 | 0.0 | 5.0 | 0.116850 |
| GGTTGTAAGGTTACCT-1_21 | L15cov | False | 2019 | 2019 | 4194.0 | 0.0 | 0.0 | 1.0 | 0.023844 |
| GGGATGACAGTATTCG-1_21 | L15cov | False | 2077 | 2077 | 4198.0 | 0.0 | 0.0 | 5.0 | 0.119104 |
[24]:
scvi.model.SCVI.setup_anndata(adata, layer = "counts",
categorical_covariate_keys=["Sample"], # batch, technology
continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'])
[25]:
model = scvi.model.SCVI(adata)
[26]:
model.train()
/usr/lib/python3.10/site-packages/scvi/model/base/_training_mixin.py:67: UserWarning: max_epochs=81 is less than n_epochs_kl_warmup=400. The max_kl_weight will not be reached during training.
warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.42s/it, loss=2.76e+03, v_num=1]
`Trainer.fit` stopped: `max_epochs=81` reached.
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.68s/it, loss=2.76e+03, v_num=1]
[28]:
# pickle.dump([model], open(f"{output_directory}/model_combined.p", "wb"))
model.save(f"{output_directory}/model_combined/")
# model = scvi.model.SCVI.load(f"{output_directory}/model_combined/")
[29]:
adata.obsm['X_scVI'] = model.get_latent_representation()
[30]:
adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)
Cluster§
[31]:
sc.pp.neighbors(adata, use_rep = 'X_scVI')
[33]:
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution = 0.5)
[36]:
sc.pl.umap(adata, color = ['leiden'], frameon = True)
[37]:
sc.pl.umap(adata, color = ['Sample'], frameon = True)
Save/load checkpoint§
[38]:
adata.write(f"{output_directory}/adata_integrated.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_integrated.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'n_cells'
Marker identification§
Increase the cluster resolution from 0.5 to the default value of 1.
[39]:
sc.tl.leiden(adata, resolution = 1)
[40]:
sc.tl.rank_genes_groups(adata, 'leiden')
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
self.stats[group_name, 'logfoldchanges'] = np.log2(
Plot genes that distinguish each cluster.
[42]:
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
Filter markers by adjusted p-value and log fold change.
[41]:
markers = sc.get.rank_genes_groups_df(adata, None)  # None retrieves results for all groups
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > 0.5)]
markers
[41]:
| group | names | scores | logfoldchanges | pvals | pvals_adj | |
|---|---|---|---|---|---|---|
| 0 | 0 | CTSB | 157.684479 | 4.972445 | 0.000000 | 0.000000 |
| 1 | 0 | LRMDA | 137.313080 | 2.813497 | 0.000000 | 0.000000 |
| 2 | 0 | PLXDC2 | 135.515442 | 3.123800 | 0.000000 | 0.000000 |
| 3 | 0 | ZEB2 | 133.443405 | 3.335379 | 0.000000 | 0.000000 |
| 4 | 0 | FMN1 | 133.045715 | 3.743495 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 578222 | 28 | CCL3 | 2.141156 | 4.313347 | 0.037839 | 0.048973 |
| 578223 | 28 | VKORC1 | 2.140682 | 3.905211 | 0.037879 | 0.049022 |
| 578224 | 28 | NDUFA2 | 2.137703 | 4.700588 | 0.038135 | 0.049349 |
| 578225 | 28 | ZMPSTE24 | 2.136283 | 2.764338 | 0.038256 | 0.049503 |
| 578226 | 28 | LRPAP1 | 2.135819 | 2.313233 | 0.038296 | 0.049551 |
84872 rows × 6 columns
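With the filtered table in hand, the strongest markers per cluster can be pulled out with an ordinary pandas group-by. A minimal sketch; the toy frame below stands in for the real `markers` table (column names match the output above, values are fabricated):

```python
import pandas as pd

# Toy stand-in for the filtered markers table (same columns, fabricated values)
markers = pd.DataFrame({
    "group": ["0", "0", "0", "1", "1", "1"],
    "names": ["CTSB", "LRMDA", "PLXDC2", "CD3E", "IL7R", "THEMIS"],
    "scores": [157.7, 137.3, 135.5, 120.1, 98.4, 90.2],
})

# rank_genes_groups_df returns rows sorted by score within each group,
# so head() per group yields the top-scoring markers
top2 = markers.groupby("group").head(2)
print(top2["names"].tolist())  # ['CTSB', 'LRMDA', 'CD3E', 'IL7R']
```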
Estimate differential expression for the Leiden clusters using the trained scVI model.
[43]:
markers_scvi = model.differential_expression(groupby='leiden')
markers_scvi
DE...: 100% 29/29 [04:38<00:00, 9.61s/it]
[43]:
| proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | ... | raw_mean1 | raw_mean2 | non_zeros_proportion1 | non_zeros_proportion2 | raw_normalized_mean1 | raw_normalized_mean2 | is_de_fdr_0.05 | comparison | group1 | group2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPP1 | 0.9898 | 0.0102 | 4.575114 | 0.000638 | 0.000048 | 0.0 | 0.25 | 6.495259 | 6.727366 | 3.440886 | ... | 1.076499 | 0.040538 | 0.181072 | 0.009292 | 6.256733 | 0.294361 | True | 0 vs Rest | 0 | Rest |
| CCL18 | 0.9898 | 0.0102 | 4.575114 | 0.000326 | 0.000043 | 0.0 | 0.25 | 6.902542 | 7.447420 | 4.287405 | ... | 0.602506 | 0.042989 | 0.168400 | 0.017725 | 3.187437 | 0.270629 | True | 0 vs Rest | 0 | Rest |
| APOC1 | 0.9896 | 0.0104 | 4.555494 | 0.000148 | 0.000016 | 0.0 | 0.25 | 6.963109 | 7.430366 | 3.897622 | ... | 0.276810 | 0.019038 | 0.135315 | 0.012533 | 1.356255 | 0.119619 | True | 0 vs Rest | 0 | Rest |
| C1QA | 0.9894 | 0.0106 | 4.536244 | 0.000333 | 0.000045 | 0.0 | 0.25 | 7.422565 | 7.997816 | 4.413615 | ... | 0.539626 | 0.045370 | 0.252483 | 0.025287 | 3.077946 | 0.308842 | True | 0 vs Rest | 0 | Rest |
| TSPAN5 | 0.9890 | 0.0110 | 4.498798 | 0.000006 | 0.000235 | 0.0 | 0.25 | -5.034511 | -5.320078 | 3.038231 | ... | 0.002972 | 0.244487 | 0.002738 | 0.142439 | 0.029674 | 2.230635 | True | 0 vs Rest | 0 | Rest |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PCBP2 | 0.6028 | 0.3972 | 0.417146 | 0.000112 | 0.000126 | 0.0 | 0.25 | -0.146613 | -0.167634 | 0.442000 | ... | 0.133333 | 0.161815 | 0.133333 | 0.140631 | 1.616527 | 1.317526 | False | 28 vs Rest | 28 | Rest |
| API5 | 0.6020 | 0.3980 | 0.413805 | 0.000040 | 0.000042 | 0.0 | 0.25 | -0.024984 | -0.050644 | 0.501002 | ... | 0.088889 | 0.053969 | 0.088889 | 0.050580 | 0.924425 | 0.441995 | False | 28 vs Rest | 28 | Rest |
| SEPTIN2 | 0.5982 | 0.4018 | 0.397971 | 0.000164 | 0.000179 | 0.0 | 0.25 | -0.067447 | -0.083257 | 0.518050 | ... | 0.244444 | 0.243477 | 0.222222 | 0.197394 | 2.164344 | 2.037301 | False | 28 vs Rest | 28 | Rest |
| PI4KB | 0.5702 | 0.4298 | 0.282667 | 0.000048 | 0.000047 | 0.0 | 0.25 | 0.065837 | 0.066301 | 0.450776 | ... | 0.022222 | 0.058350 | 0.022222 | 0.053980 | 0.204813 | 0.473826 | False | 28 vs Rest | 28 | Rest |
| YWHAE | 0.5520 | 0.4480 | 0.208755 | 0.000132 | 0.000140 | 0.0 | 0.25 | -0.079388 | -0.062251 | 0.425387 | ... | 0.088889 | 0.171559 | 0.088889 | 0.148847 | 0.608118 | 1.433012 | False | 28 vs Rest | 28 | Rest |
598125 rows × 22 columns
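In the scvi-tools output, `bayes_factor` is the natural-log odds of `proba_de`, so the two columns are redundant views of the same posterior probability. The SPP1 row above can be checked by hand (a numerical sketch, not part of the pipeline):

```python
import math

proba_de = 0.9898  # proba_de reported for SPP1 in cluster 0 vs Rest
bayes_factor = math.log(proba_de / (1.0 - proba_de))  # ln-odds of being DE
print(round(bayes_factor, 4))  # ≈ 4.5751, matching the reported bayes_factor
```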
Filter differentially expressed genes by FDR and log fold change.
[47]:
markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) & (markers_scvi.lfc_mean > 0.5)]
markers_scvi
[47]:
| proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | ... | raw_mean1 | raw_mean2 | non_zeros_proportion1 | non_zeros_proportion2 | raw_normalized_mean1 | raw_normalized_mean2 | is_de_fdr_0.05 | comparison | group1 | group2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPP1 | 0.9898 | 0.0102 | 4.575114 | 0.000638 | 0.000048 | 0.0 | 0.25 | 6.495259 | 6.727366 | 3.440886 | ... | 1.076499 | 0.040538 | 0.181072 | 0.009292 | 6.256733 | 0.294361 | True | 0 vs Rest | 0 | Rest |
| CCL18 | 0.9898 | 0.0102 | 4.575114 | 0.000326 | 0.000043 | 0.0 | 0.25 | 6.902542 | 7.447420 | 4.287405 | ... | 0.602506 | 0.042989 | 0.168400 | 0.017725 | 3.187437 | 0.270629 | True | 0 vs Rest | 0 | Rest |
| APOC1 | 0.9896 | 0.0104 | 4.555494 | 0.000148 | 0.000016 | 0.0 | 0.25 | 6.963109 | 7.430366 | 3.897622 | ... | 0.276810 | 0.019038 | 0.135315 | 0.012533 | 1.356255 | 0.119619 | True | 0 vs Rest | 0 | Rest |
| C1QA | 0.9894 | 0.0106 | 4.536244 | 0.000333 | 0.000045 | 0.0 | 0.25 | 7.422565 | 7.997816 | 4.413615 | ... | 0.539626 | 0.045370 | 0.252483 | 0.025287 | 3.077946 | 0.308842 | True | 0 vs Rest | 0 | Rest |
| MCEMP1 | 0.9888 | 0.0112 | 4.480577 | 0.000039 | 0.000008 | 0.0 | 0.25 | 5.592349 | 5.895313 | 3.543164 | ... | 0.061634 | 0.013300 | 0.045366 | 0.010071 | 0.315246 | 0.083767 | True | 0 vs Rest | 0 | Rest |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AL158206.1 | 0.8942 | 0.1058 | 2.134379 | 0.000003 | 0.000002 | 0.0 | 0.25 | 1.058239 | 0.996449 | 1.515791 | ... | 0.000000 | 0.002145 | 0.000000 | 0.002135 | 0.000000 | 0.016850 | True | 28 vs Rest | 28 | Rest |
| AC004847.1 | 0.8942 | 0.1058 | 2.134379 | 0.000002 | 0.000002 | 0.0 | 0.25 | 0.636359 | 0.664043 | 1.732896 | ... | 0.000000 | 0.001639 | 0.000000 | 0.001609 | 0.000000 | 0.014673 | True | 28 vs Rest | 28 | Rest |
| SNHG17 | 0.8942 | 0.1058 | 2.134379 | 0.000014 | 0.000008 | 0.0 | 0.25 | 1.102903 | 1.107450 | 1.149856 | ... | 0.000000 | 0.011150 | 0.000000 | 0.010796 | 0.000000 | 0.087149 | True | 28 vs Rest | 28 | Rest |
| BAIAP2 | 0.8942 | 0.1058 | 2.134379 | 0.000056 | 0.000046 | 0.0 | 0.25 | 1.501894 | 0.933272 | 2.498250 | ... | 0.066667 | 0.063237 | 0.066667 | 0.052938 | 0.852857 | 0.478457 | True | 28 vs Rest | 28 | Rest |
| ZNF530 | 0.8942 | 0.1058 | 2.134379 | 0.000003 | 0.000002 | 0.0 | 0.25 | 0.892193 | 0.836232 | 1.560018 | ... | 0.000000 | 0.002256 | 0.000000 | 0.002216 | 0.000000 | 0.019131 | True | 28 vs Rest | 28 | Rest |
92424 rows × 22 columns
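The Wilcoxon-based table and the scVI table can be cross-checked by intersecting gene names per cluster; genes flagged by both methods make more trustworthy markers. A minimal sketch with toy frames standing in for `markers` and `markers_scvi` (as in the real tables, the scVI frame keeps genes in its index and the cluster label in `group1`):

```python
import pandas as pd

# Toy stand-ins: Wilcoxon markers ('group'/'names' columns) and scVI markers
# (gene symbols in the index, cluster label in 'group1'); values fabricated
markers = pd.DataFrame({"group": ["0", "0", "1"],
                        "names": ["SPP1", "CTSB", "CD3E"]})
markers_scvi = pd.DataFrame({"group1": ["0", "0", "1"]},
                            index=["SPP1", "APOC1", "CD3E"])

# Genes that both methods call as markers of cluster 0
wilcoxon_0 = set(markers.loc[markers.group == "0", "names"])
scvi_0 = set(markers_scvi.loc[markers_scvi.group1 == "0"].index)
print(sorted(wilcoxon_0 & scvi_0))  # ['SPP1']
```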
Plot the UMAP embedding with the Leiden cluster labels overlaid.
[50]:
sc.pl.umap(adata, color=['leiden'], frameon=False, legend_loc="on data")
Plot canonical immune and epithelial marker genes on the UMAP, colored by scVI-normalized expression.
[ ]:
sc.pl.umap(adata, color=['PTPRC', 'CD3E', 'CD4'], frameon=False, layer='scvi_normalized', vmax=5)
sc.pl.umap(adata, color=['PTPRC', 'CD3E', 'CD8A'], frameon=False, layer='scvi_normalized', vmax=5)
sc.pl.umap(adata, color=['EPCAM', 'MUC1'], frameon=False, layer='scvi_normalized', vmax=5)
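`sc.pl.umap` raises a KeyError when a requested symbol is missing from `adata.var_names`, so it can help to filter a marker wish-list first. A stdlib-only sketch of that filtering step; the gene universe and wish-list here are purely illustrative:

```python
# Illustrative stand-in for the genes present in the object (adata.var_names)
var_names = {"PTPRC", "CD3E", "CD8A", "EPCAM", "MUC1"}

wanted = ["PTPRC", "CD3E", "CD4", "FOXP3"]
present = [g for g in wanted if g in var_names]      # safe to pass to sc.pl.umap
missing = [g for g in wanted if g not in var_names]  # would raise KeyError
print(present, missing)  # ['PTPRC', 'CD3E'] ['CD4', 'FOXP3']
```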