This page was generated from source/02-analysis-review.ipynb.

Analysis review§

Overview§

This notebook follows the tutorial by mousepixels/sanbomics, which has an accompanying screencast.

Analysis is illustrated with single-nucleus RNA sequencing data from the following paper [MBH+21]:

Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al. A molecular single-cell lung atlas of lethal COVID-19. Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0

This paper examined 116,000 nuclei from the lungs of nineteen patients who underwent autopsy following death in association with COVID-19. Findings reported in the abstract of the paper include:

  1. activated monocyte-derived macrophages and alveolar macrophages

  2. impaired T cell activation

  3. monocyte/macrophage-derived interleukin-1β and epithelial cell-derived interleukin-6

  4. alveolar type 2 cells adopted an inflammation-associated transient progenitor cell state and failed to undergo full transition into alveolar type 1 cells

  5. expansion of CTHRC1+ pathological fibroblasts

  6. protein activity and ligand–receptor interactions suggest putative drug targets

This notebook makes extensive use of [WAT18] and [LRC+18], including updates that have been made to the underlying software packages, scanpy and scvi-tools, since their initial publication.

Setup§

Import libraries§

[1]:
from inspect import getmembers
from pprint import pprint
from types import FunctionType
from datetime import datetime
import os

import numpy as np
import pickle
import pandas as pd
import scanpy as sc
from scipy.sparse import csr_matrix
import scvi
import seaborn as sns
Global seed set to 0
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
  return new_rank_zero_deprecation(*args, **kwargs)

Setup plotting§

[2]:
import matplotlib.font_manager
import matplotlib.pyplot as plt

# import matplotlib_inline
[3]:
# fonts_path = "/usr/share/texmf/fonts/opentype/public/lm/" #ubuntu
# fonts_path = "~/Library/Fonts/" # macos
fonts_path = "/usr/share/fonts/OTF/"  # arch
# user_path = "$HOME/" # user
# fonts_path = user_path + "fonts/latinmodern/opentype/public/lm/"  # home
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmsans10-regular.otf")
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmroman10-regular.otf")
[4]:
# https://stackoverflow.com/a/36622238/446907
%config InlineBackend.figure_formats = ['svg']
[5]:
plt.style.use("default")  # reset default parameters
# https://stackoverflow.com/a/3900167/446907
plt.rcParams.update(
    {
        "font.size": 16,
        "font.family": ["sans-serif"],
        "font.serif": ["Latin Modern Roman"] + plt.rcParams["font.serif"],
        "font.sans-serif": ["Latin Modern Sans"] + plt.rcParams["font.sans-serif"],
    }
)

Utility functions§

[6]:
def attributes(obj):
    """
    get object attributes
    """
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


def print_attributes(obj):
    """
    print object attributes
    """
    pprint(attributes(obj))
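As a quick check of these helpers, `attributes` can be applied to a small hand-rolled class; methods defined on the type are excluded, leaving only plain data attributes. This is an illustrative sketch using a hypothetical `Point` class, with the function restated so the example is self-contained:

```python
from inspect import getmembers
from types import FunctionType


def attributes(obj):
    """Collect non-private, non-method attributes of obj (same logic as above)."""
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def norm(self):
        return (self.x**2 + self.y**2) ** 0.5


# The method `norm` is filtered out; only the data attributes remain.
attrs = attributes(Point(3, 4))
```

Applied to an AnnData object, as in the cells below, this surfaces fields such as `X`, `obs`, `var`, and `layers` without printing every bound method.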

Set output folder for session§

[7]:
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
timestamp = "20220930_215746"  # override to reuse an existing session directory
[8]:
output_directory = f"output/{timestamp}"
[9]:
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
    print(f"created {output_directory}")
[10]:
print(f"using {output_directory}")
using output/20220930_215746

Import data§

Here we review how the data were downloaded, and proceed to import and inspect the data.

Data download§

Data with GEO accession GSE171524 was downloaded using ./data/download_geo_data.sh with parameters

./download_geo_data.sh \
       -a GSE171524 \
       -f 'ftp.*RAW.*' \
       -j '..|.supplementary_files?|..|.url?|select(length>0)'

A skeleton of this script that may work in this case is

#!/usr/bin/env bash

#-- debugging (comment to reduce stderr output)
#-- https://wiki.bash-hackers.org/scripting/debuggingtips
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
set -o xtrace

# get metadata
# Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al.
# A molecular single-cell lung atlas of lethal COVID-19.
# Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0
# GSE171524
ffq -l 1 -o GSE171524.json GSE171524

# download raw data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/GSE171524_RAW.tar

# list contents
tar -tvf GSE171524_RAW.tar

# untar
mkdir -p GSE171524 && \
tar -xvf GSE171524_RAW.tar -C GSE171524

Data load§

See the documentation for scanpy.read_csv, which returns an AnnData object.

[11]:
adata = None
adata = sc.read_csv(
    "data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[11]:
AnnData object with n_obs × n_vars = 6099 × 34546

Note the scanpy.read_csv function accepts gzipped files.
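Note also that `sc.read_csv` loads the counts as a dense float32 array, even though single-cell count matrices are mostly zeros. If memory is a concern, the matrix can optionally be converted to compressed sparse row format with the `csr_matrix` imported above (for the real object this would be `adata.X = csr_matrix(adata.X)`). A minimal sketch on a toy matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense count matrix standing in for adata.X (mostly zeros).
X = np.zeros((4, 6), dtype=np.float32)
X[0, 2] = 5.0
X[3, 1] = 2.0

# CSR stores only the non-zero entries plus index arrays.
X_sparse = csr_matrix(X)
```

Here `X_sparse.nnz` is 2, versus 24 stored values for the dense array; the savings scale with the sparsity of the data.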

Data properties§

[12]:
type(adata)
[12]:
anndata._core.anndata.AnnData
[13]:
adata.X.shape
[13]:
(6099, 34546)
[14]:
print_attributes(adata)
{'T': AnnData object with n_obs × n_vars = 34546 × 6099,
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 'file': Backing file manager: no file is set.,
 'filename': None,
 'is_view': False,
 'isbacked': False,
 'isview': False,
 'layers': Layers with keys: ,
 'n_obs': 6099,
 'n_vars': 34546,
 'obs': Empty DataFrame
Columns: []
Index: [TAGGTACCATGGCCAC-1_1, ATTCACTGTAACAGGC-1_1, TAACTTCCAACCACGC-1_1, TTGGGTACACGACAAG-1_1, AGGCCACAGAGTCACG-1_1, CACTGAAGTCGAAGCA-1_1, ACTGATGTCTGCACCT-1_1, TTACCGCCACTCAGAT-1_1, TTGGTTTTCCTAGCTC-1_1, TGGGAAGTCAGTGATC-1_1, CCACGAGTCTCTTAAC-1_1, ACTTCCGCACAACGCC-1_1, GGGAAGTAGCGACCCT-1_1, TGGTAGTTCCCGTGTT-1_1, CGCATAACATGCCGGT-1_1, TCTATCACAAGGCTTT-1_1, ATCCACCAGAGGTATT-1_1, TAACGACAGATGACCG-1_1, TCTTAGTGTATGAGGC-1_1, CACTTCGCAGTACTAC-1_1, GTCAAACAGAACGTGC-1_1, GCAACCGAGGGCAGGA-1_1, CATACTTTCATCACTT-1_1, AAGAACATCGGATTAC-1_1, GGGTATTGTACGATGG-1_1, CTGTAGATCAACGTGT-1_1, GTCATTTGTATCTCGA-1_1, CCTTGTGCAGAGGGTT-1_1, AAGTTCGCAACACGTT-1_1, TCATTCACAAATCAAG-1_1, TCCATGCCAACGACTT-1_1, TCCTTCTCAGTTTCAG-1_1, TGTGAGTCAAATGATG-1_1, AAACGAAGTACAGAGC-1_1, CAACCAAAGTATTCCG-1_1, CTTCTCTCAGAGACTG-1_1, TACAACGGTGGCTGAA-1_1, AACGGGACATGCCGGT-1_1, AACCAACGTTGGGAAC-1_1, TATATCCAGCGTCAGA-1_1, AGACAAACATCCCGTT-1_1, ATGACCAGTCTTCATT-1_1, CTTACCGTCAGACATC-1_1, CGGGACTGTTAGTTCG-1_1, ATTCATCCACTGAGTT-1_1, TCATGAGAGAGGCGGA-1_1, TCCCACATCTAGTACG-1_1, CTTCCTTCATATCTCT-1_1, CCGGACACACTCGATA-1_1, ACACGCGCACCTGTCT-1_1, GAATCGTCAGAAGTGC-1_1, GGTGGCTCAAGCTCTA-1_1, CCTGCATCACATATGC-1_1, GTGGGAAGTTAAAGTG-1_1, CGTTCTGGTACTAGCT-1_1, ACCCTCACAATAGTCC-1_1, GCCCGAACAAACTAAG-1_1, GTGGAAGCACATGACT-1_1, GTTGTGACATCGATAC-1_1, GACAGCCCAGGTCCGT-1_1, TAAGCACGTTGGCTAT-1_1, GGGACAAGTCACCACG-1_1, CTGGCAGGTTCGGTAT-1_1, GACTCAACACTGTGAT-1_1, GCCAGTGGTGTGGTCC-1_1, TCTAACTGTAGGCAGT-1_1, GAAGAATGTAGCTTGT-1_1, TCACTCGCAATCTCTT-1_1, CAAGACTTCCCACAGG-1_1, CAGATACGTGACTCTA-1_1, TGGGATTAGAGGGTCT-1_1, ACCTGAACACTCCTTG-1_1, GACACGCCACTCGATA-1_1, CTCATCGTCACCGCTT-1_1, AGGTAGGGTCCCTGTT-1_1, TACCCGTCAACACTAC-1_1, TGCTGAAAGACGGATC-1_1, ACACCAACACAACGCC-1_1, AAGATAGCAAATGGAT-1_1, CTTCAATGTGACAGGT-1_1, GAAGCGAAGAGTTGAT-1_1, GCCGTGACACAAGCCC-1_1, CCTCAACCATACAGGG-1_1, ACAAAGATCCACAGGC-1_1, CAGATACAGTCCCAGC-1_1, GGCAGTCTCCGGTTCT-1_1, TAAGTCGAGCTGAGCA-1_1, GAGACCCGTCTGTGCG-1_1, TAACACGCATGTGTCA-1_1, TCAATTCGTTCTCGCT-1_1, 
GCTTTCGCACAGTGTT-1_1, AACCAACAGATAACAC-1_1, ATCGGCGCACATCATG-1_1, TCATCCGCACGAGGAT-1_1, CTGATCCTCTTTACAC-1_1, TCACACCCAACTTCTT-1_1, TGAGGGACACCGTACG-1_1, GTGCACGTCATCTGTT-1_1, GGTAATCAGTTGCATC-1_1, ATACTTCCAAGGTCTT-1_1, ...]

[6099 rows x 0 columns],
 'obs_names': Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
       'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
       'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
       'TGGGAAGTCAGTGATC-1_1',
       ...
       'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
       'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
       'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
       'ATGCATGAGTCATGAA-1_1'],
      dtype='object', length=6099),
 'obsm': AxisArrays with keys: ,
 'obsp': PairwiseArrays with keys: ,
 'raw': None,
 'shape': (6099, 34546),
 'uns': OverloadedDict, wrapping:
        OrderedDict()
With overloaded keys:
        ['neighbors'].,
 'var': Empty DataFrame
Columns: []
Index: [AL627309.1, AL627309.5, AL627309.4, AL669831.2, LINC01409, FAM87B, LINC01128, LINC00115, FAM41C, AL645608.6, AL645608.2, LINC02593, SAMD11, NOC2L, KLHL17, PLEKHN1, PERM1, AL645608.7, HES4, ISG15, AL645608.1, AGRN, C1orf159, AL390719.3, LINC01342, AL390719.2, TTLL10-AS1, TTLL10, TNFRSF18, TNFRSF4, SDF4, B3GALT6, C1QTNF12, UBE2J2, LINC01786, SCNN1D, ACAP3, PUSL1, INTS11, AL139287.1, CPTP, DVL1, MXRA8, AURKAIP1, CCNL2, MRPL20-AS1, MRPL20, AL391244.2, ANKRD65, AL391244.1, LINC01770, VWA1, ATAD3C, ATAD3B, ATAD3A, TMEM240, SSU72, AL645728.1, FNDC10, AL691432.4, AL691432.2, MIB2, MMP23B, CDK11B, FO704657.1, SLC35E2B, CDK11A, SLC35E2A, NADK, GNB1, AL109917.1, CALML6, TMEM52, CFAP74, GABRD, AL391845.1, PRKCZ, AL590822.2, PRKCZ-AS1, FAAP20, AL590822.1, SKI, AL590822.3, MORN1, AL589739.1, AL513477.2, RER1, PEX10, PLCH2, AL139246.4, PANK4, HES5, AL139246.5, TNFRSF14-AS1, TNFRSF14, AL139246.3, PRXL2B, MMEL1, TTC34, AC242022.2, ...]

[34546 rows x 0 columns],
 'var_names': Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
       'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
       ...
       'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
       'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
      dtype='object', length=34546),
 'varm': AxisArrays with keys: ,
 'varp': PairwiseArrays with keys: }
/tmp/ipykernel_16232/22660072.py:11: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
/tmp/ipykernel_16232/22660072.py:9: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  name: getattr(obj, name)
[15]:
adata.obs
[15]:
TAGGTACCATGGCCAC-1_1
ATTCACTGTAACAGGC-1_1
TAACTTCCAACCACGC-1_1
TTGGGTACACGACAAG-1_1
AGGCCACAGAGTCACG-1_1
...
CGCCATTGTTTGCCGG-1_1
CACTGGGGTCTACGTA-1_1
CATACTTGTAGAGGAA-1_1
TTTGGTTTCCACGGAC-1_1
ATGCATGAGTCATGAA-1_1

6099 rows × 0 columns

Gene names are saved as adata.var.

[16]:
adata.var
[16]:
AL627309.1
AL627309.5
AL627309.4
AL669831.2
LINC01409
...
VN1R2
AL031676.1
SMIM34A
AL050402.1
AL445072.1

34546 rows × 0 columns

[17]:
adata.obs_names
[17]:
Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
       'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
       'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
       'TGGGAAGTCAGTGATC-1_1',
       ...
       'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
       'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
       'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
       'ATGCATGAGTCATGAA-1_1'],
      dtype='object', length=6099)
[18]:
adata.var_names
[18]:
Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
       'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
       ...
       'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
       'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
      dtype='object', length=34546)

There are no layers in this data set.

[19]:
adata.layers
[19]:
Layers with keys:

There are no multidimensional observations or variables.

[20]:
print(adata.obsm)
print(adata.varm)
print(adata.obsp)
print(adata.varp)
AxisArrays with keys:
AxisArrays with keys:
PairwiseArrays with keys:
PairwiseArrays with keys:

The data appears to contain reads mapped to 6099 cell-associated barcodes and 34546 RNA molecule-associated features.

Doublet removal§

Filter transcripts by minimum number of cells with non-zero counts§

We may choose to filter out transcript types that are detected in a relatively small number of cells. The optimum threshold is not known. Here we use 10 as a base case.
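The effect of such a threshold can be sketched in plain numpy: count, for each gene, the cells with a non-zero value, and keep the genes meeting the minimum. This is a toy illustration of the idea behind `sc.pp.filter_genes`, not its implementation:

```python
import numpy as np

# Toy cells x genes count matrix: 5 cells, 4 genes.
X = np.array([
    [0, 1, 2, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 4, 1],
    [0, 5, 0, 0],
])

min_cells = 3
# Number of cells with a non-zero count for each gene.
n_cells = (X > 0).sum(axis=0)
keep = n_cells >= min_cells
X_filtered = X[:, keep]
```

Genes 0 and 3 (detected in 0 and 1 cells respectively) are dropped, leaving a 5 × 2 matrix; scanpy additionally records the per-gene counts as the `n_cells` annotation seen below.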

[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 6099 × 34546

See the documentation for scanpy pre-processing filter-genes.

[22]:
sc.pp.filter_genes(adata, min_cells=10)
[23]:
adata
[23]:
AnnData object with n_obs × n_vars = 6099 × 19896
    var: 'n_cells'

Following filtration there are 19896 transcript types remaining that are present in at least 10 cells. This means 14650 transcript types were removed at the threshold of 10. Notice that an annotation named n_cells has been added to the genes to indicate the number of cells with non-zero values for that transcript type.

[24]:
adata.var
[24]:
n_cells
AL627309.5 33
LINC01409 274
LINC01128 81
LINC00115 15
SAMD11 16
... ...
MAFIP 34
AC011043.1 11
AL354822.1 133
AL592183.1 1003
AC240274.1 162

19896 rows × 1 columns

[25]:
adata.var.describe()
[25]:
n_cells
count 19896.000000
mean 293.307600
std 451.440237
min 10.000000
25% 34.000000
50% 111.000000
75% 345.000000
max 6090.000000
[26]:
sns.histplot(adata.var['n_cells'])
/usr/lib/python3.10/site-packages/matplotlib_inline/config.py:59: DeprecationWarning: InlineBackend._figure_formats_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
  def _figure_formats_changed(self, name, old, new):
[26]:
<AxesSubplot:xlabel='n_cells', ylabel='Count'>
_images/02-analysis-review_50_2.svg

Select highly variable genes§

It is common to select transcript types with high variability among the cell population under the assumption that this will help to focus on features that distinguish the cells. Again there is no perfect threshold. Here we select the 2000 highest variability genes.
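As a crude illustration of the idea (not the variance-stabilized `seurat_v3` statistic used in the cell below), top-k selection by raw per-gene variance can be sketched as:

```python
import numpy as np

# Toy cells x genes matrix (4 cells x 5 genes) with hand-picked spread:
# genes 0 and 4 are constant, gene 2 varies the most.
X = np.array([
    [1.0, 0.0, 10.0, 2.0, 0.0],
    [1.0, 1.0,  0.0, 2.0, 0.0],
    [1.0, 0.0, 10.0, 4.0, 0.0],
    [1.0, 1.0,  0.0, 4.0, 0.0],
])

n_top_genes = 2
variances = X.var(axis=0)
# Indices of the n_top_genes most variable genes, in column order.
top = np.sort(np.argsort(variances)[-n_top_genes:])
X_hvg = X[:, top]
```

The `seurat_v3` flavor instead ranks genes by variance after a per-gene variance-stabilizing transformation, which avoids selection being dominated by highly expressed genes.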

[27]:
adata
[27]:
AnnData object with n_obs × n_vars = 6099 × 19896
    var: 'n_cells'

See the documentation for scanpy pre-processing highly-variable-genes.

[28]:
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="seurat_v3")
[29]:
adata
[29]:
AnnData object with n_obs × n_vars = 6099 × 2000
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'hvg'

We have filtered down to 2000 transcript types and added 5 annotations to the genes including a binary variable indicating membership in the highly-variable class, the ranking among highly variable genes, and the mean, variance, and normalized variance for each gene across cells.

[30]:
adata.var
[30]:
n_cells highly_variable highly_variable_rank means variances variances_norm
TTLL10 112 True 903.0 0.028857 0.069354 1.901045
TNFRSF18 15 True 1604.0 0.002951 0.004911 1.513808
CFAP74 159 True 1370.0 0.041154 0.087024 1.607399
TTC34 209 True 245.0 0.080341 0.363502 3.044086
AJAP1 31 True 1922.0 0.006886 0.011432 1.421002
... ... ... ... ... ... ...
MT-ND4L 650 True 1212.0 0.191671 0.559353 1.695637
MT-ND4 1328 True 178.0 0.833087 9.742224 3.486984
MT-ND5 886 True 443.0 0.332514 1.669344 2.430210
MT-ND6 821 True 123.0 0.383178 3.357087 3.985056
MT-CYB 1295 True 414.0 0.622397 4.384615 2.487058

2000 rows × 6 columns

Doublet removal§

Here we use an scvi-tools model as an interface to solo from [BFL+20]. scvi adds some unstructured data to adata.uns to interface with torch.

[31]:
adata.uns
[31]:
OverloadedDict, wrapping:
        OrderedDict([('hvg', {'flavor': 'seurat_v3'})])
With overloaded keys:
        ['neighbors'].
[32]:
scvi.model.SCVI.setup_anndata(adata)
[33]:
adata.uns
[33]:
OverloadedDict, wrapping:
        OrderedDict([('hvg', {'flavor': 'seurat_v3'}), ('_scvi_uuid', 'efb3a791-15ff-4864-b57c-7536e8ea9df6'), ('_scvi_manager_uuid', 'ccc84e47-0ab2-4ab3-852b-55f888509f9e')])
With overloaded keys:
        ['neighbors'].

Train the scvi model.

[34]:
vae = scvi.model.SCVI(adata)
vae.train()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00,  1.98it/s, loss=323, v_num=1]
`Trainer.fit` stopped: `max_epochs=400` reached.
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00,  1.85it/s, loss=323, v_num=1]

After training the model, adata.obs now contains _scvi_batch and _scvi_labels annotations.

[35]:
adata
[35]:
AnnData object with n_obs × n_vars = 6099 × 2000
    obs: '_scvi_batch', '_scvi_labels'
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'hvg', '_scvi_uuid', '_scvi_manager_uuid'
[36]:
adata.obs
[36]:
_scvi_batch _scvi_labels
TAGGTACCATGGCCAC-1_1 0 0
ATTCACTGTAACAGGC-1_1 0 0
TAACTTCCAACCACGC-1_1 0 0
TTGGGTACACGACAAG-1_1 0 0
AGGCCACAGAGTCACG-1_1 0 0
... ... ...
CGCCATTGTTTGCCGG-1_1 0 0
CACTGGGGTCTACGTA-1_1 0 0
CATACTTGTAGAGGAA-1_1 0 0
TTTGGTTTCCACGGAC-1_1 0 0
ATGCATGAGTCATGAA-1_1 0 0

6099 rows × 2 columns

Train the solo model.

[37]:
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()
INFO     Creating doublets, preparing SOLO model.
/usr/lib/python3.10/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 207/400:  52%|██████████████████████████████████████████████████▏                                              | 207/400 [01:51<01:44,  1.85it/s, loss=0.299, v_num=1]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.295. Signaling Trainer to stop.
[38]:
type(solo)
[38]:
scvi.external.solo._model.SOLO

We can now use the solo model to add a prediction annotation for doublet or singlet to our data.

[39]:
df = solo.predict()
df["prediction"] = solo.predict(soft=False)

df.index = df.index.map(lambda x: x[:-2])

df
[39]:
doublet singlet prediction
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet
TAACTTCCAACCACGC-1_1 0.735813 -0.259502 doublet
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet
... ... ... ...
CGCCATTGTTTGCCGG-1_1 -0.691126 1.362337 singlet
CACTGGGGTCTACGTA-1_1 -2.632991 2.468916 singlet
CATACTTGTAGAGGAA-1_1 -2.506424 2.567346 singlet
TTTGGTTTCCACGGAC-1_1 -3.308819 2.476630 singlet
ATGCATGAGTCATGAA-1_1 -2.357370 2.206591 singlet

6099 rows × 3 columns

Counting the predicted doublets and singlets reveals about 20% doublets.

[40]:
df.groupby("prediction").count()
[40]:
doublet singlet
prediction
doublet 1206 1206
singlet 4893 4893
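The same tally can be obtained with pandas `value_counts`, which also makes the doublet fraction explicit. A toy sketch with a hypothetical prediction column:

```python
import pandas as pd

# Toy prediction column mimicking the solo output (2 doublets, 8 singlets).
df = pd.DataFrame({
    "prediction": ["doublet"] * 2 + ["singlet"] * 8,
})

counts = df["prediction"].value_counts()
fraction_doublet = counts["doublet"] / counts.sum()
```

For the real data, 1206 of 6099 barcodes (roughly 20%) are called doublets.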

We can assess the magnitude of the prediction by taking the difference between the doublet and singlet scores.

[41]:
df["dif"] = df.doublet - df.singlet
df
[41]:
doublet singlet prediction dif
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet 2.721666
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet 3.914930
TAACTTCCAACCACGC-1_1 0.735813 -0.259502 doublet 0.995315
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet 2.299047
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet 2.071632
... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 -0.691126 1.362337 singlet -2.053463
CACTGGGGTCTACGTA-1_1 -2.632991 2.468916 singlet -5.101907
CATACTTGTAGAGGAA-1_1 -2.506424 2.567346 singlet -5.073770
TTTGGTTTCCACGGAC-1_1 -3.308819 2.476630 singlet -5.785450
ATGCATGAGTCATGAA-1_1 -2.357370 2.206591 singlet -4.563961

6099 rows × 4 columns

Plotting the distribution of doublet-singlet score differences, we see a large fraction that only marginally favor doublet over singlet.

[42]:
sns.histplot(df[df.prediction == "doublet"], x="dif")
[42]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
_images/02-analysis-review_79_1.svg

Since we would like to avoid unnecessarily discarding barcodes, we arbitrarily retain predicted doublets whose doublet-singlet score difference is one or less; this threshold is a convenience rather than a principled choice.

[43]:
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
[43]:
doublet singlet prediction dif
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet 2.721666
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet 3.914930
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet 2.299047
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet 2.071632
CACTGAAGTCGAAGCA-1_1 1.305717 -0.544905 doublet 1.850623
... ... ... ... ...
CAATACGCAATGTGGG-1_1 0.471522 -0.923330 doublet 1.394853
GGGTATTTCAGCGCAC-1_1 1.101736 -0.658175 doublet 1.759911
TTGCTGCAGTGCGACA-1_1 0.720216 -0.624341 doublet 1.344557
CATCCCAAGACGCCAA-1_1 0.937599 -0.486121 doublet 1.423720
CACTGTCAGGGACAGG-1_1 1.190305 -0.678417 doublet 1.868721

451 rows × 4 columns

[44]:
sns.histplot(doublets[doublets.prediction == "doublet"], x="dif")
[44]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
_images/02-analysis-review_82_1.svg

Save/load checkpoint§

Save current variables to file.

[45]:
pickle.dump([df, doublets, solo, vae], open(f"{output_directory}/auxiliary.p", "wb"))

Variables can be reloaded if necessary.

[46]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[47]:
adata.write(f"{output_directory}/adata_doublets.h5ad", compression="gzip")
[48]:
bdata_doublets = None
bdata_doublets = sc.read_h5ad(f"{output_directory}/adata_doublets.h5ad")
bdata_doublets
[48]:
AnnData object with n_obs × n_vars = 6099 × 2000
    obs: '_scvi_batch', '_scvi_labels'
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'hvg'

Reload data and filter doublets§

[49]:
adata = sc.read_csv(
    "data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[49]:
AnnData object with n_obs × n_vars = 6099 × 34546

Annotate adata with doublet column.

[50]:
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
[51]:
adata
[51]:
AnnData object with n_obs × n_vars = 6099 × 34546
    obs: 'doublet'
[52]:
adata.obs
[52]:
doublet
TAGGTACCATGGCCAC-1_1 True
ATTCACTGTAACAGGC-1_1 True
TAACTTCCAACCACGC-1_1 False
TTGGGTACACGACAAG-1_1 True
AGGCCACAGAGTCACG-1_1 True
... ...
CGCCATTGTTTGCCGG-1_1 False
CACTGGGGTCTACGTA-1_1 False
CATACTTGTAGAGGAA-1_1 False
TTTGGTTTCCACGGAC-1_1 False
ATGCATGAGTCATGAA-1_1 False

6099 rows × 1 columns

Use doublet column to filter doublets.

[53]:
adata = adata[~adata.obs.doublet]
[54]:
adata
[54]:
View of AnnData object with n_obs × n_vars = 5648 × 34546
    obs: 'doublet'

We filtered about 7% of the barcodes as putative doublets.

Preprocessing§

Filter mitochondrial transcripts§

We can add a binary annotation flagging mitochondrial transcripts, identified by gene names beginning with MT-.

[55]:
adata.var['mt'] = adata.var.index.str.startswith('MT-')
/tmp/ipykernel_16232/310233584.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var['mt'] = adata.var.index.str.startswith('MT-')
[56]:
adata.var
[56]:
mt
AL627309.1 False
AL627309.5 False
AL627309.4 False
AL669831.2 False
LINC01409 False
... ...
VN1R2 False
AL031676.1 False
SMIM34A False
AL050402.1 False
AL445072.1 False

34546 rows × 1 columns

Filter ribosomal transcripts§

We can download a list of ribosomal genes from MSigDB.

[57]:
ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
ribo_genes
[57]:
0
0 FAU
1 MRPL13
2 RPL10
3 RPL10A
4 RPL10L
... ...
83 RPS9
84 RPSA
85 RSL24D1
86 RSL24D1P11
87 UBA52

88 rows × 1 columns

We add a binary annotation to indicate presence in the list of ribosomal genes.

[58]:
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
[59]:
adata.var
[59]:
mt ribo
AL627309.1 False False
AL627309.5 False False
AL627309.4 False False
AL669831.2 False False
LINC01409 False False
... ... ...
VN1R2 False False
AL031676.1 False False
SMIM34A False False
AL050402.1 False False
AL445072.1 False False

34546 rows × 2 columns

Calculate QC metrics§

Recall the observation/barcode annotation currently just indicates doublets based on our solo model.

[60]:
adata.obs
[60]:
doublet
TAACTTCCAACCACGC-1_1 False
TTACCGCCACTCAGAT-1_1 False
TTGGTTTTCCTAGCTC-1_1 False
TCTATCACAAGGCTTT-1_1 False
ATCCACCAGAGGTATT-1_1 False
... ...
CGCCATTGTTTGCCGG-1_1 False
CACTGGGGTCTACGTA-1_1 False
CATACTTGTAGAGGAA-1_1 False
TTTGGTTTCCACGGAC-1_1 False
ATGCATGAGTCATGAA-1_1 False

5648 rows × 1 columns

We can calculate standard quality control metrics with scanpy. See the documentation for scanpy calculate_qc_metrics. These include the number of genes with non-zero counts per cell, the total counts per cell, and, via the qc_vars argument, the total and percentage of counts attributed to mitochondrial and ribosomal genes.

[61]:
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
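The per-cell metrics added here reduce to simple row sums over the count matrix. A toy numpy sketch of the idea (illustrative, not scanpy's implementation), using a boolean mask for the "mitochondrial" genes:

```python
import numpy as np

# Toy counts: 3 cells x 4 genes; the last gene plays the role of an MT- gene.
X = np.array([
    [5.0, 3.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
    [4.0, 4.0, 4.0, 8.0],
])
mt = np.array([False, False, False, True])

# Per-cell totals, mitochondrial totals, and the derived percentage.
total_counts = X.sum(axis=1)
total_counts_mt = X[:, mt].sum(axis=1)
pct_counts_mt = 100.0 * total_counts_mt / total_counts
# Number of genes detected (non-zero) per cell.
n_genes_by_counts = (X > 0).sum(axis=1)
```

The `ribo` percentage is computed the same way with the ribosomal gene mask.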

Sorting genes by the number of cells with non-zero counts, we see a number of genes that were not found in association with any barcode.

[62]:
adata.var.sort_values("n_cells_by_counts")
[62]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
AL445072.1 False False 0 0.000000 100.000000 0.0
UGT1A5 False False 0 0.000000 100.000000 0.0
AGXT False False 0 0.000000 100.000000 0.0
TEX44 False False 0 0.000000 100.000000 0.0
AC093585.1 False False 0 0.000000 100.000000 0.0
... ... ... ... ... ... ...
AKAP13 False False 4059 2.881374 28.133853 16274.0
NEAT1 False False 4148 5.001062 26.558074 28246.0
MBNL1 False False 4150 2.695822 26.522663 15226.0
ZBTB20 False False 4281 2.425992 24.203258 13702.0
MALAT1 False False 5639 64.417313 0.159348 363829.0

34546 rows × 6 columns

Sorting barcodes by total counts, we see a minimum around 400, suggesting a filter was applied upstream of this pre-processing.

[63]:
adata.obs.sort_values("total_counts")
[63]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
AGGCATTCATCCGTTC-1_1 False 290 401.0 1.0 0.249377 0.0 0.000000
TGGTACAGTTGGTGTT-1_1 False 323 401.0 0.0 0.000000 0.0 0.000000
CAGGGCTTCATGCGGC-1_1 False 330 401.0 7.0 1.745636 1.0 0.249377
GTCGTTCTCCAAGGGA-1_1 False 300 401.0 0.0 0.000000 0.0 0.000000
CGAGAAGGTGAACTAA-1_1 False 308 401.0 0.0 0.000000 0.0 0.000000
... ... ... ... ... ... ... ...
TCTATCACAAGGCTTT-1_1 False 3582 8276.0 183.0 2.211213 5.0 0.060416
ATCCACCAGAGGTATT-1_1 False 3913 8286.0 191.0 2.305093 34.0 0.410331
TTACCGCCACTCAGAT-1_1 False 4144 11369.0 39.0 0.343038 5.0 0.043979
TTGGTTTTCCTAGCTC-1_1 False 3902 11472.0 463.0 4.035913 13.0 0.113319
TAACTTCCAACCACGC-1_1 False 5158 15645.0 221.0 1.412592 211.0 1.348674

5648 rows × 7 columns

[64]:
sns.jointplot(
    data=adata.obs,
    x="total_counts",
    y="n_genes_by_counts",
    kind="hex",
)
[64]:
<seaborn.axisgrid.JointGrid at 0x7fedd44cbbb0>
_images/02-analysis-review_120_1.svg

Filter genes by counts§

We can filter genes by a minimum number of cells in which they are detected; here we arbitrarily choose three.

[65]:
sc.pp.filter_genes(adata, min_cells=3)
[66]:
adata.var.sort_values("n_cells_by_counts")
[66]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells
GSG1 False False 3 0.000531 99.946884 3.0 3
TRAJ47 False False 3 0.000531 99.946884 3.0 3
AC002465.2 False False 3 0.000531 99.946884 3.0 3
LINC02709 False False 3 0.000531 99.946884 3.0 3
AC096536.1 False False 3 0.000531 99.946884 3.0 3
... ... ... ... ... ... ... ...
AKAP13 False False 4059 2.881374 28.133853 16274.0 4059
NEAT1 False False 4148 5.001062 26.558074 28246.0 4148
MBNL1 False False 4150 2.695822 26.522663 15226.0 4150
ZBTB20 False False 4281 2.425992 24.203258 13702.0 4281
MALAT1 False False 5639 64.417313 0.159348 363829.0 5639

24136 rows × 7 columns

Filter cells by number of genes§

As a matter of completeness we filter cells with fewer than 200 genes; however, we see the minimum number of genes in a cell is already greater than 200.

[67]:
adata.obs.sort_values('n_genes_by_counts')
[67]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
TAGGGTTTCTGGCTGG-1_1 False 276 419.0 1.0 0.238663 1.0 0.238663
CGTGCTTCAAAGGGCT-1_1 False 277 432.0 41.0 9.490741 0.0 0.000000
TGACAGTTCTAAACGC-1_1 False 278 414.0 0.0 0.000000 0.0 0.000000
CTCTGGTCACGACGAA-1_1 False 285 407.0 0.0 0.000000 0.0 0.000000
GTAAGTCGTATCGCGC-1_1 False 289 430.0 0.0 0.000000 0.0 0.000000
... ... ... ... ... ... ... ...
TAACGACAGATGACCG-1_1 False 3791 8187.0 94.0 1.148162 4.0 0.048858
TTGGTTTTCCTAGCTC-1_1 False 3902 11472.0 463.0 4.035913 13.0 0.113319
ATCCACCAGAGGTATT-1_1 False 3913 8286.0 191.0 2.305093 34.0 0.410331
TTACCGCCACTCAGAT-1_1 False 4144 11369.0 39.0 0.343038 5.0 0.043979
TAACTTCCAACCACGC-1_1 False 5158 15645.0 221.0 1.412592 211.0 1.348674

5648 rows × 7 columns

[68]:
sc.pp.filter_cells(adata, min_genes=200)
[69]:
adata
[69]:
AnnData object with n_obs × n_vars = 5648 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

Review and filter QC outliers§

We can plot distributions of the number of genes by counts and total counts as well as the percentage of mitochondrial and ribosomal genes.

[70]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'],
             jitter=0.4, multi_panel=True)
_images/02-analysis-review_132_0.svg

If we choose to filter cells beyond the 98th percentile, we can compute the bound via the numpy quantile function.

[71]:
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
upper_lim
[71]:
2310.0599999999995
[72]:
adata = adata[adata.obs.n_genes_by_counts < upper_lim]

This removes slightly more than 100 additional cells (5648 to 5535).
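The quantile cutoff and boolean filter can be checked on toy data; `np.quantile` linearly interpolates between order statistics by default:

```python
import numpy as np

# Toy per-cell gene counts.
n_genes_by_counts = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])

# 98th percentile via linear interpolation between the two largest values.
upper_lim = np.quantile(n_genes_by_counts, 0.98)
keep = n_genes_by_counts < upper_lim
```

Here the cutoff lands at 982, so only the single largest cell is excluded; on the real data the same strict inequality drops the top 2% of cells by detected genes.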

[73]:
adata.obs
[73]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo n_genes
CCTCAACCATACAGGG-1_1 False 2276 5434.0 38.0 0.699301 0.0 0.000000 2275
ACAAAGATCCACAGGC-1_1 False 2310 5504.0 1.0 0.018169 1.0 0.018169 2308
ATACTTCCAAGGTCTT-1_1 False 2122 5252.0 1.0 0.019040 4.0 0.076161 2120
GGGTCACTCTATTCGT-1_1 False 2286 5056.0 0.0 0.000000 1.0 0.019778 2280
GATGATCCACAACCGC-1_1 False 2263 4938.0 2.0 0.040502 2.0 0.040502 2262
... ... ... ... ... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 False 355 410.0 3.0 0.731707 0.0 0.000000 355
CACTGGGGTCTACGTA-1_1 False 346 403.0 0.0 0.000000 0.0 0.000000 346
CATACTTGTAGAGGAA-1_1 False 360 410.0 2.0 0.487805 0.0 0.000000 360
TTTGGTTTCCACGGAC-1_1 False 299 405.0 0.0 0.000000 2.0 0.493827 299
ATGCATGAGTCATGAA-1_1 False 351 411.0 0.0 0.000000 0.0 0.000000 350

5535 rows × 8 columns

Finally, we filter mitochondrial (at 20%) and ribosomal (at 2%) outliers.

[74]:
adata = adata[adata.obs.pct_counts_mt < 20]
[75]:
adata = adata[adata.obs.pct_counts_ribo < 2]
[76]:
adata
[76]:
View of AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

We end our preprocessing with 5518 cells and 24136 genes.
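Taken together, the QC filters amount to a combined boolean mask over the columns of adata.obs. A minimal sketch with a made-up QC table (the values are illustrative, not taken from the data):

```python
import pandas as pd

# toy per-cell QC table mirroring the columns of adata.obs (values made up)
obs = pd.DataFrame({
    "n_genes_by_counts": [2276, 5000, 355, 299],
    "pct_counts_mt":     [0.70, 25.0, 0.73, 0.00],
    "pct_counts_ribo":   [0.00, 0.50, 3.10, 0.49],
})

upper_lim = 2310.06  # the 98th-percentile bound computed earlier
keep = (
    (obs.n_genes_by_counts < upper_lim)
    & (obs.pct_counts_mt < 20)
    & (obs.pct_counts_ribo < 2)
)
# only the first and last toy cells pass all three filters
```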

Save/load checkpoint§

Save the current version of the post-processed data to file.

[77]:
adata.write(f"{output_directory}/adata_post_processed.h5ad", compression="gzip")

Variables can be reloaded if necessary.

[78]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[79]:
bdata_post_processed = None
bdata_post_processed = sc.read_h5ad(f"{output_directory}/adata_post_processed.h5ad")
bdata_post_processed
[79]:
AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

Normalization§

[80]:
adata.X.sum(axis=1)
[80]:
ArrayView([5433., 5502., 5250., ...,  410.,  405.,  410.], dtype=float32)
[81]:
sc.pp.normalize_total(adata, target_sum=1e4)
/usr/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
[82]:
adata.X.sum(axis=1)
[82]:
array([10000., 10000., 10000., ..., 10000., 10000., 10000.], dtype=float32)
[83]:
sc.pp.log1p(adata)
[84]:
adata.X.sum(axis=1)
[84]:
array([3285.3794, 3163.8867, 2858.1602, ..., 1191.2706, 1022.7512,
       1162.8784], dtype=float32)
[85]:
adata.raw = adata
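The effect of normalize_total followed by log1p can be reproduced on a toy count matrix with plain numpy (a sketch of the transformation, not scanpy's implementation):

```python
import numpy as np

counts = np.array([[ 90.,  10.],
                   [300., 100.],
                   [  5.,   5.]])

# normalize each cell (row) to a total of 1e4 counts, as sc.pp.normalize_total does
norm = counts / counts.sum(axis=1, keepdims=True) * 1e4

# log-transform, as sc.pp.log1p does; row sums are no longer equal afterwards,
# matching the adata.X.sum(axis=1) outputs above
logged = np.log1p(norm)
```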

Clustering§

Estimate expression variability§

[86]:
adata.var
[86]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells
AL627309.1 False False 4 0.000708 99.929178 4.0 4
AL627309.5 False False 26 0.004603 99.539660 26.0 26
AL627309.4 False False 4 0.000708 99.929178 4.0 4
LINC01409 False False 233 0.044972 95.874646 254.0 233
FAM87B False False 6 0.001062 99.893768 6.0 6
... ... ... ... ... ... ... ...
AL354822.1 False False 121 0.022132 97.857649 125.0 121
AL592183.1 False False 871 0.182188 84.578612 1029.0 871
AC240274.1 False False 142 0.026027 97.485836 147.0 142
AC007325.4 False False 4 0.000708 99.929178 4.0 4
AC007325.2 False False 6 0.001062 99.893768 6.0 6

24136 rows × 7 columns

Here we use the scanpy pre-processing function for highly-variable genes to annotate genes by their expression variability across cells.

[87]:
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
[88]:
adata.var
[88]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells highly_variable means dispersions dispersions_norm
AL627309.1 False False 4 0.000708 99.929178 4.0 4 True 0.009588 2.873559 1.581447
AL627309.5 False False 26 0.004603 99.539660 26.0 26 False 0.029244 2.300264 -0.151067
AL627309.4 False False 4 0.000708 99.929178 4.0 4 False 0.002517 1.702826 -1.956543
LINC01409 False False 233 0.044972 95.874646 254.0 233 False 0.265776 2.334326 -0.048131
FAM87B False False 6 0.001062 99.893768 6.0 6 False 0.007217 1.947624 -1.216757
... ... ... ... ... ... ... ... ... ... ... ...
AL354822.1 False False 121 0.022132 97.857649 125.0 121 False 0.158275 2.407786 0.173866
AL592183.1 False False 871 0.182188 84.578612 1029.0 871 False 0.868576 2.511827 -0.388573
AC240274.1 False False 142 0.026027 97.485836 147.0 142 False 0.153063 2.300034 -0.151763
AC007325.4 False False 4 0.000708 99.929178 4.0 4 False 0.003025 1.579867 -2.328129
AC007325.2 False False 6 0.001062 99.893768 6.0 6 False 0.006377 1.995220 -1.072921

24136 rows × 11 columns

[89]:
sc.pl.highly_variable_genes(adata)
_images/02-analysis-review_162_0.svg
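The idea behind dispersion-based selection can be sketched in numpy: rank genes by their variance-to-mean ratio and keep the top n. This is a simplified stand-in for scanpy's binned, normalized dispersions, using toy counts:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy counts: 500 cells x 100 genes; inflate the variability of the first 10 genes
X = rng.poisson(1.0, size=(500, 100)).astype(float)
X[:, :10] *= rng.integers(1, 20, size=(500, 10))

means = X.mean(axis=0)
dispersions = X.var(axis=0) / np.maximum(means, 1e-12)  # variance-to-mean ratio

n_top = 10
highly_variable = np.zeros(X.shape[1], dtype=bool)
highly_variable[np.argsort(dispersions)[-n_top:]] = True
# the 10 inflated genes should be the ones selected
```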

Filter highly variable genes§

Finally, we subset adata to the highly variable genes, noting the change in n_vars.

[90]:
adata
[90]:
AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
[91]:
adata = adata[:, adata.var.highly_variable]
[92]:
adata
[92]:
View of AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

Principal components analysis§

Regression and rescaling§

We would like to avoid interpreting variation that derives merely from differences in total counts, percentage of mitochondrial counts, or percentage of ribosomal counts. We can reduce the impact of these covariates using the `scanpy regress out function <https://scanpy.readthedocs.io/en/latest/generated/scanpy.pp.regress_out.html>`__.

[93]:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])

We also want to scale the data to zero mean and unit variance, clipping any remaining outliers above 10.

[94]:
sc.pp.scale(adata, max_value=10)
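A numpy sketch of what scaling with clipping does, assuming (as the scanpy documentation describes) that max_value truncates values after z-scoring; toy data, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(1000, 3))
X[0, 0] = 500.0  # one extreme outlier

# z-score each gene (column) to zero mean and unit variance,
# then truncate extreme scaled values at 10
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs = np.minimum(Xs, 10)
```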

Perform PCA§

PCA can be performed with the `scanpy PCA tool <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.pca.html>`__.

[95]:
sc.tl.pca(adata, svd_solver='arpack')
[96]:
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
_images/02-analysis-review_177_0.svg
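The variance-ratio plot can be understood via the singular value decomposition; a minimal numpy sketch on synthetic rank-5 data (toy data, not the lung atlas):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data with exactly 5 underlying dimensions
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))

Xc = X - X.mean(axis=0)                        # center, as PCA requires
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / (s**2).sum()                # explained variance per component
# the first 5 components capture essentially all of the variance
```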

Clustering§

The scanpy neighbors function computes a neighborhood graph of the cells, using the connectivity-estimation method from the UMAP algorithm.

[97]:
sc.pp.neighbors(adata, n_pcs = 30)

This function adds distances and connectivities as pairwise relationships among the cells in adata.obsp.

[98]:
adata
[98]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
[99]:
adata.obsp
[99]:
PairwiseArrays with keys: distances, connectivities
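A toy sketch of the kind of pairwise structure stored in adata.obsp: Euclidean distances between cells in PC space, reduced to each cell's k nearest neighbours. This uses a dense distance matrix in plain numpy; on the real data, scanpy uses an approximate nearest-neighbour method:

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(6, 3))  # toy PCA coordinates for 6 cells

# dense pairwise Euclidean distance matrix
d = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)    # exclude self-distances

k = 2
knn_idx = np.argsort(d, axis=1)[:, :k]  # indices of each cell's k nearest neighbours
```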

We can now plot the UMAP embedding; note that no clusters of cells have been defined yet, even though groupings may be hypothesized visually.

[100]:
sc.tl.umap(adata)
[101]:
sc.pl.umap(adata)
_images/02-analysis-review_186_0.svg

The `scanpy interface to the Leiden algorithm <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.leiden.html>`__ can be used to cluster cells into subgroups.

[102]:
sc.tl.leiden(adata, resolution = 0.5)
[103]:
adata.obs
[103]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo n_genes leiden
CCTCAACCATACAGGG-1_1 False 2276 5434.0 38.0 0.699301 0.0 0.000000 2275 11
ACAAAGATCCACAGGC-1_1 False 2310 5504.0 1.0 0.018169 1.0 0.018169 2308 2
ATACTTCCAAGGTCTT-1_1 False 2122 5252.0 1.0 0.019040 4.0 0.076161 2120 2
GGGTCACTCTATTCGT-1_1 False 2286 5056.0 0.0 0.000000 1.0 0.019778 2280 2
GATGATCCACAACCGC-1_1 False 2263 4938.0 2.0 0.040502 2.0 0.040502 2262 2
... ... ... ... ... ... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 False 355 410.0 3.0 0.731707 0.0 0.000000 355 7
CACTGGGGTCTACGTA-1_1 False 346 403.0 0.0 0.000000 0.0 0.000000 346 1
CATACTTGTAGAGGAA-1_1 False 360 410.0 2.0 0.487805 0.0 0.000000 360 0
TTTGGTTTCCACGGAC-1_1 False 299 405.0 0.0 0.000000 2.0 0.493827 299 5
ATGCATGAGTCATGAA-1_1 False 351 411.0 0.0 0.000000 0.0 0.000000 350 1

5518 rows × 9 columns

We can now color the UMAP embedding by the Leiden clusters.

[104]:
sc.pl.umap(adata, color=['leiden'])
_images/02-analysis-review_191_0.svg

Save/load checkpoint§

Save current data with cluster information to file.

[105]:
adata.write(f"{output_directory}/adata_clustered.h5ad", compression="gzip")
[106]:
adata
[106]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Variables can be reloaded if necessary.

[107]:
bdata_clustered = None
bdata_clustered = sc.read_h5ad(f"{output_directory}/adata_clustered.h5ad")
bdata_clustered
[107]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Sample integration§

Batch processing§

See the command line program scpreprocess. It can be installed with

$ pip install -e .

and run, for example

$ scpreprocess  --input-directory="data/GSE171524/supplementary/"

where the contents of input-directory were, in this case, created by running download_geo_data.sh on GSE171524. scpreprocess requires about 30 GB of RAM and about 30-60 minutes to complete with one NVIDIA Tesla T4 GPU.

Filter genes§

[12]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined.h5ad")
adata
[12]:
AnnData object with n_obs × n_vars = 98878 × 34546
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
[13]:
sc.pp.filter_genes(adata, min_cells = 10)
[14]:
adata.X = csr_matrix(adata.X)
[15]:
adata.write(f"{output_directory}/adata_combined_filtered.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined_filtered.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
[17]:
adata.obs.groupby('Sample').count()
[17]:
doublet n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
Sample
C51ctr 5501 5501 5501 5501 5501 5501 5501 5501
C52ctr 4055 4055 4055 4055 4055 4055 4055 4055
C53ctr 6241 6241 6241 6241 6241 6241 6241 6241
C54ctr 3961 3961 3961 3961 3961 3961 3961 3961
C55ctr 5088 5088 5088 5088 5088 5088 5088 5088
C56ctr 3613 3613 3613 3613 3613 3613 3613 3613
C57ctr 4242 4242 4242 4242 4242 4242 4242 4242
L01cov 2645 2645 2645 2645 2645 2645 2645 2645
L03cov 3583 3583 3583 3583 3583 3583 3583 3583
L04cov 3062 3062 3062 3062 3062 3062 3062 3062
L04covaddon 3933 3933 3933 3933 3933 3933 3933 3933
L05cov 2382 2382 2382 2382 2382 2382 2382 2382
L06cov 5737 5737 5737 5737 5737 5737 5737 5737
L07cov 4289 4289 4289 4289 4289 4289 4289 4289
L08cov 3474 3474 3474 3474 3474 3474 3474 3474
L09cov 3117 3117 3117 3117 3117 3117 3117 3117
L10cov 1357 1357 1357 1357 1357 1357 1357 1357
L11cov 2682 2682 2682 2682 2682 2682 2682 2682
L12cov 3298 3298 3298 3298 3298 3298 3298 3298
L13cov 4293 4293 4293 4293 4293 4293 4293 4293
L15cov 3635 3635 3635 3635 3635 3635 3635 3635
L16cov 1593 1593 1593 1593 1593 1593 1593 1593
L17cov 3952 3952 3952 3952 3952 3952 3952 3952
L18cov 2440 2440 2440 2440 2440 2440 2440 2440
L19cov 2162 2162 2162 2162 2162 2162 2162 2162
L21cov 2917 2917 2917 2917 2917 2917 2917 2917
L22cov 5626 5626 5626 5626 5626 5626 5626 5626
[18]:
sc.pp.filter_genes(adata, min_cells = 100)
[19]:
adata
[19]:
AnnData object with n_obs × n_vars = 98878 × 20625
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
[20]:
adata.layers['counts'] = adata.X.copy()
[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 98878 × 20625
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
    layers: 'counts'

Normalize§

[22]:
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
adata.raw = adata
[23]:
adata.obs.head()
[23]:
Sample doublet n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
CTGGCAGCACTCCTGT-1_21 L15cov False 1889 1889 4812.0 0.0 0.0 9.0 0.187032
TCCCACAAGCGACTTT-1_21 L15cov False 2125 2125 4318.0 0.0 0.0 0.0 0.000000
CCTCCAAAGACCAACG-1_21 L15cov False 2099 2099 4279.0 0.0 0.0 5.0 0.116850
GGTTGTAAGGTTACCT-1_21 L15cov False 2019 2019 4194.0 0.0 0.0 1.0 0.023844
GGGATGACAGTATTCG-1_21 L15cov False 2077 2077 4198.0 0.0 0.0 5.0 0.119104
[24]:
scvi.model.SCVI.setup_anndata(adata, layer = "counts",
                             categorical_covariate_keys=["Sample"], # batch, technology
                             continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'])
[25]:
model = scvi.model.SCVI(adata)
[26]:
model.train()
/usr/lib/python3.10/site-packages/scvi/model/base/_training_mixin.py:67: UserWarning: max_epochs=81 is less than n_epochs_kl_warmup=400. The max_kl_weight will not be reached during training.
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.42s/it, loss=2.76e+03, v_num=1]
`Trainer.fit` stopped: `max_epochs=81` reached.
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.68s/it, loss=2.76e+03, v_num=1]
[28]:
# pickle.dump([model], open(f"{output_directory}/model_combined.p", "wb"))
model.save(f"{output_directory}/model_combined/")
# model = scvi.model.SCVI.load(f"{output_directory}/model_combined/")
[29]:
adata.obsm['X_scVI'] = model.get_latent_representation()
[30]:
adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)

Cluster§

[31]:
sc.pp.neighbors(adata, use_rep = 'X_scVI')
[33]:
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution = 0.5)
[36]:
sc.pl.umap(adata, color = ['leiden'], frameon = True)
_images/02-analysis-review_224_0.svg
[37]:
sc.pl.umap(adata, color = ['Sample'], frameon = True)
_images/02-analysis-review_225_0.svg

Save/load checkpoint§

[38]:
adata.write(f"{output_directory}/adata_integrated.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_integrated.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'

Marker identification§

Increase the clustering resolution from 0.5 to the default value of 1.

[39]:
sc.tl.leiden(adata, resolution = 1)
[40]:
sc.tl.rank_genes_groups(adata, 'leiden')
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(

Plot the top 20 genes that distinguish each cluster.

[42]:
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
_images/02-analysis-review_234_0.svg

Filter markers by adjusted p-value and log fold change.

[41]:
markers = sc.get.rank_genes_groups_df(adata, None)
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > .5)]
markers
[41]:
group names scores logfoldchanges pvals pvals_adj
0 0 CTSB 157.684479 4.972445 0.000000 0.000000
1 0 LRMDA 137.313080 2.813497 0.000000 0.000000
2 0 PLXDC2 135.515442 3.123800 0.000000 0.000000
3 0 ZEB2 133.443405 3.335379 0.000000 0.000000
4 0 FMN1 133.045715 3.743495 0.000000 0.000000
... ... ... ... ... ... ...
578222 28 CCL3 2.141156 4.313347 0.037839 0.048973
578223 28 VKORC1 2.140682 3.905211 0.037879 0.049022
578224 28 NDUFA2 2.137703 4.700588 0.038135 0.049349
578225 28 ZMPSTE24 2.136283 2.764338 0.038256 0.049503
578226 28 LRPAP1 2.135819 2.313233 0.038296 0.049551

84872 rows × 6 columns
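With the filtered `markers` table in hand, a convenient follow-up is to pull the strongest few markers per cluster with a plain pandas group-by. A minimal sketch on a toy frame with the same columns as `sc.get.rank_genes_groups_df` returns (the gene statistics below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the `markers` frame; values are invented for illustration.
markers = pd.DataFrame({
    'group': ['0', '0', '0', '1', '1'],
    'names': ['CTSB', 'LRMDA', 'PLXDC2', 'CD3E', 'PTPRC'],
    'logfoldchanges': [4.97, 2.81, 3.12, 3.50, 2.10],
    'pvals_adj': [0.0, 0.0, 0.0, 0.001, 0.04],
})

# Same filter as in the notebook: significant, up-regulated markers.
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > 0.5)]

# Top 2 markers per cluster, ranked by log fold change.
top = (markers.sort_values('logfoldchanges', ascending=False)
              .groupby('group')
              .head(2))
print(top[['group', 'names']])
```

On the real `markers` frame the same `sort_values(...).groupby('group').head(n)` chain gives a compact per-cluster marker shortlist for annotation.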

Estimate differential expression for the Leiden clusters with the trained scVI model.

[43]:
markers_scvi = model.differential_expression(groupby = 'leiden')
markers_scvi
DE...: 100% 29/29 [04:38<00:00,  9.61s/it]
[43]:
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
SPP1 0.9898 0.0102 4.575114 0.000638 0.000048 0.0 0.25 6.495259 6.727366 3.440886 ... 1.076499 0.040538 0.181072 0.009292 6.256733 0.294361 True 0 vs Rest 0 Rest
CCL18 0.9898 0.0102 4.575114 0.000326 0.000043 0.0 0.25 6.902542 7.447420 4.287405 ... 0.602506 0.042989 0.168400 0.017725 3.187437 0.270629 True 0 vs Rest 0 Rest
APOC1 0.9896 0.0104 4.555494 0.000148 0.000016 0.0 0.25 6.963109 7.430366 3.897622 ... 0.276810 0.019038 0.135315 0.012533 1.356255 0.119619 True 0 vs Rest 0 Rest
C1QA 0.9894 0.0106 4.536244 0.000333 0.000045 0.0 0.25 7.422565 7.997816 4.413615 ... 0.539626 0.045370 0.252483 0.025287 3.077946 0.308842 True 0 vs Rest 0 Rest
TSPAN5 0.9890 0.0110 4.498798 0.000006 0.000235 0.0 0.25 -5.034511 -5.320078 3.038231 ... 0.002972 0.244487 0.002738 0.142439 0.029674 2.230635 True 0 vs Rest 0 Rest
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PCBP2 0.6028 0.3972 0.417146 0.000112 0.000126 0.0 0.25 -0.146613 -0.167634 0.442000 ... 0.133333 0.161815 0.133333 0.140631 1.616527 1.317526 False 28 vs Rest 28 Rest
API5 0.6020 0.3980 0.413805 0.000040 0.000042 0.0 0.25 -0.024984 -0.050644 0.501002 ... 0.088889 0.053969 0.088889 0.050580 0.924425 0.441995 False 28 vs Rest 28 Rest
SEPTIN2 0.5982 0.4018 0.397971 0.000164 0.000179 0.0 0.25 -0.067447 -0.083257 0.518050 ... 0.244444 0.243477 0.222222 0.197394 2.164344 2.037301 False 28 vs Rest 28 Rest
PI4KB 0.5702 0.4298 0.282667 0.000048 0.000047 0.0 0.25 0.065837 0.066301 0.450776 ... 0.022222 0.058350 0.022222 0.053980 0.204813 0.473826 False 28 vs Rest 28 Rest
YWHAE 0.5520 0.4480 0.208755 0.000132 0.000140 0.0 0.25 -0.079388 -0.062251 0.425387 ... 0.088889 0.171559 0.088889 0.148847 0.608118 1.433012 False 28 vs Rest 28 Rest

598125 rows × 22 columns

Filter differentially expressed genes by FDR and log fold change.

[47]:
markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) & (markers_scvi.lfc_mean > .5)]
markers_scvi
[47]:
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
SPP1 0.9898 0.0102 4.575114 0.000638 0.000048 0.0 0.25 6.495259 6.727366 3.440886 ... 1.076499 0.040538 0.181072 0.009292 6.256733 0.294361 True 0 vs Rest 0 Rest
CCL18 0.9898 0.0102 4.575114 0.000326 0.000043 0.0 0.25 6.902542 7.447420 4.287405 ... 0.602506 0.042989 0.168400 0.017725 3.187437 0.270629 True 0 vs Rest 0 Rest
APOC1 0.9896 0.0104 4.555494 0.000148 0.000016 0.0 0.25 6.963109 7.430366 3.897622 ... 0.276810 0.019038 0.135315 0.012533 1.356255 0.119619 True 0 vs Rest 0 Rest
C1QA 0.9894 0.0106 4.536244 0.000333 0.000045 0.0 0.25 7.422565 7.997816 4.413615 ... 0.539626 0.045370 0.252483 0.025287 3.077946 0.308842 True 0 vs Rest 0 Rest
MCEMP1 0.9888 0.0112 4.480577 0.000039 0.000008 0.0 0.25 5.592349 5.895313 3.543164 ... 0.061634 0.013300 0.045366 0.010071 0.315246 0.083767 True 0 vs Rest 0 Rest
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
AL158206.1 0.8942 0.1058 2.134379 0.000003 0.000002 0.0 0.25 1.058239 0.996449 1.515791 ... 0.000000 0.002145 0.000000 0.002135 0.000000 0.016850 True 28 vs Rest 28 Rest
AC004847.1 0.8942 0.1058 2.134379 0.000002 0.000002 0.0 0.25 0.636359 0.664043 1.732896 ... 0.000000 0.001639 0.000000 0.001609 0.000000 0.014673 True 28 vs Rest 28 Rest
SNHG17 0.8942 0.1058 2.134379 0.000014 0.000008 0.0 0.25 1.102903 1.107450 1.149856 ... 0.000000 0.011150 0.000000 0.010796 0.000000 0.087149 True 28 vs Rest 28 Rest
BAIAP2 0.8942 0.1058 2.134379 0.000056 0.000046 0.0 0.25 1.501894 0.933272 2.498250 ... 0.066667 0.063237 0.066667 0.052938 0.852857 0.478457 True 28 vs Rest 28 Rest
ZNF530 0.8942 0.1058 2.134379 0.000003 0.000002 0.0 0.25 0.892193 0.836232 1.560018 ... 0.000000 0.002256 0.000000 0.002216 0.000000 0.019131 True 28 vs Rest 28 Rest

92424 rows × 22 columns
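Genes called by both the Wilcoxon-style `rank_genes_groups` test and the scVI model are higher-confidence markers, so intersecting the two filtered tables per cluster is a useful sanity check. A minimal sketch using toy stand-ins with the relevant columns (`names`/`group` from `markers`, gene-indexed rows with `group1` from `markers_scvi`; the genes below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the two filtered marker tables; in the notebook,
# `markers` comes from sc.get.rank_genes_groups_df and `markers_scvi`
# from model.differential_expression. Gene names are invented here.
markers = pd.DataFrame({
    'group': ['0', '0', '1'],
    'names': ['CTSB', 'PLXDC2', 'CD3E'],
})
markers_scvi = pd.DataFrame({
    'group1': ['0', '0', '1'],
}, index=['CTSB', 'APOC1', 'CD3E'])

# Per-cluster intersection of the two marker calls.
shared = {
    g: sorted(set(sub.names) & set(markers_scvi.index[markers_scvi.group1 == g]))
    for g, sub in markers.groupby('group')
}
print(shared)  # {'0': ['CTSB'], '1': ['CD3E']}
```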

[50]:
sc.pl.umap(adata, color = ['leiden'], frameon = False, legend_loc = "on data")
_images/02-analysis-review_242_0.svg
Visualize canonical marker genes on the UMAP: PTPRC (immune cells), CD3E with CD4 or CD8A (T-cell subsets), and EPCAM/MUC1 (epithelial cells).

[ ]:
sc.pl.umap(adata, color = ['PTPRC', 'CD3E', 'CD4'], frameon = False, layer = 'scvi_normalized', vmax = 5)
sc.pl.umap(adata, color = ['PTPRC', 'CD3E', 'CD8A'], frameon = False, layer = 'scvi_normalized', vmax = 5)

sc.pl.umap(adata, color = ['EPCAM', 'MUC1'], frameon = False, layer = 'scvi_normalized', vmax = 5)
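Once the marker plots identify each cluster, the usual next step is to attach cell-type labels to the Leiden clusters. A minimal sketch with plain pandas; the cluster-to-label mapping below is hypothetical (the real labels come from inspecting markers such as PTPRC/CD3E or EPCAM/MUC1), and in the notebook it would be applied as `adata.obs['cell_type'] = adata.obs['leiden'].map(cell_dict)`:

```python
import pandas as pd

# Hypothetical cluster-to-cell-type mapping, for illustration only.
cell_dict = {'0': 'Macrophage', '1': 'T cell', '2': 'Epithelial'}

# Stand-in for adata.obs['leiden'].
leiden = pd.Series(['0', '1', '1', '2'])
cell_type = leiden.map(cell_dict)
print(cell_type.tolist())  # ['Macrophage', 'T cell', 'T cell', 'Epithelial']
```

`Series.map` returns `NaN` for any cluster missing from the dict, which makes unlabeled clusters easy to spot before committing the annotation.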