This page was generated from source/02-analysis-review.ipynb.

Analysis review§

Overview§

This notebook follows the tutorial by mousepixels/sanbomics, which has an accompanying screencast.

Analysis is illustrated with single-nucleus RNA sequencing data from the following paper [MBH+21]:

Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al. A molecular single-cell lung atlas of lethal COVID-19. Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0

This paper examined 116,000 nuclei from the lungs of nineteen patients who underwent autopsy following death in association with COVID-19. Findings reported in the abstract of the paper include:

  1. activated monocyte-derived macrophages and alveolar macrophages

  2. impaired T cell activation

  3. monocyte/macrophage-derived interleukin-1β and epithelial cell-derived interleukin-6

  4. alveolar type 2 cells adopted an inflammation-associated transient progenitor cell state and failed to undergo full transition into alveolar type 1 cells

  5. expansion of CTHRC1+ pathological fibroblasts

  6. protein activity and ligand–receptor interactions suggest putative drug targets

This notebook makes extensive use of [WAT18] and [LRC+18], including updates that have been made to the underlying software packages, scanpy and scvi-tools, since their initial publication.

Setup§

Import libraries§

[1]:
from inspect import getmembers
from pprint import pprint
from types import FunctionType
from datetime import datetime
import os

import numpy as np
import pickle
import pandas as pd
import scanpy as sc
from scipy.sparse import csr_matrix
import scvi
import seaborn as sns
Global seed set to 0
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/usr/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
  return new_rank_zero_deprecation(*args, **kwargs)

Setup plotting§

[2]:
import matplotlib.font_manager
import matplotlib.pyplot as plt

# import matplotlib_inline
[3]:
# fonts_path = "/usr/share/texmf/fonts/opentype/public/lm/" #ubuntu
# fonts_path = "~/Library/Fonts/" # macos
fonts_path = "/usr/share/fonts/OTF/"  # arch
# user_path = "$HOME/" # user
# fonts_path = user_path + "fonts/latinmodern/opentype/public/lm/"  # home
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmsans10-regular.otf")
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmroman10-regular.otf")
[4]:
# https://stackoverflow.com/a/36622238/446907
%config InlineBackend.figure_formats = ['svg']
[5]:
plt.style.use("default")  # reset default parameters
# https://stackoverflow.com/a/3900167/446907
plt.rcParams.update(
    {
        "font.size": 16,
        "font.family": ["sans-serif"],
        "font.serif": ["Latin Modern Roman"] + plt.rcParams["font.serif"],
        "font.sans-serif": ["Latin Modern Sans"] + plt.rcParams["font.sans-serif"],
    }
)

Utility functions§

[6]:
def attributes(obj):
    """
    get object attributes
    """
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


def print_attributes(obj):
    """
    print object attributes
    """
    pprint(attributes(obj))
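As a quick check of these helpers, `attributes` can be applied to a small hand-rolled class; methods defined on the type are excluded, leaving only plain data attributes. This is an illustrative sketch using a hypothetical `Point` class, with the function restated so the example is self-contained:

```python
from inspect import getmembers
from types import FunctionType


def attributes(obj):
    """Collect non-private, non-method attributes of obj (same logic as above)."""
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def norm(self):
        return (self.x**2 + self.y**2) ** 0.5


# The method `norm` is filtered out; only the data attributes remain.
attrs = attributes(Point(3, 4))
```

Applied to an AnnData object, as in the cells below, this surfaces fields such as `X`, `obs`, `var`, and `layers` without printing every bound method.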

Set output folder for session§

[7]:
now = datetime.now()
timestamp = now.strftime("%Y%m%d_%H%M%S")
timestamp = "20220930_215746"  # override to reuse an existing session directory
[8]:
output_directory = f"output/{timestamp}"
[9]:
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
    print(f"created {output_directory}")
[10]:
print(f"using {output_directory}")
using output/20220930_215746

Import data§

Here we review how the data were downloaded, and proceed to import and inspect the data.

Data download§

Data with GEO accession GSE171524 was downloaded using ./data/download_geo_data.sh with parameters

./download_geo_data.sh \
       -a GSE171524 \
       -f 'ftp.*RAW.*' \
       -j '..|.supplementary_files?|..|.url?|select(length>0)'

A skeleton of this script that may work in this case is

#!/usr/bin/env bash

#-- debugging (comment to reduce stderr output)
#-- https://wiki.bash-hackers.org/scripting/debuggingtips
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
set -o xtrace

# get metadata
# Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al.
# A molecular single-cell lung atlas of lethal COVID-19.
# Nature. 2021;595: 114–119. doi:10.1038/s41586-021-03569-0
# GSE171524
ffq -l 1 -o GSE171524.json GSE171524

# download raw data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/GSE171524_RAW.tar

# list contents
tar -tvf GSE171524_RAW.tar

# untar
mkdir -p GSE171524 && \
tar -xvf GSE171524_RAW.tar -C GSE171524

Data load§

See the documentation for scanpy.read_csv, which returns an AnnData object.

[11]:
adata = None
adata = sc.read_csv(
    "data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[11]:
AnnData object with n_obs × n_vars = 6099 × 34546

Note the scanpy.read_csv function accepts gzipped files.
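Note also that `sc.read_csv` loads the counts as a dense float32 array, even though single-cell count matrices are mostly zeros. If memory is a concern, the matrix can optionally be converted to compressed sparse row format with the `csr_matrix` imported above (for the real object this would be `adata.X = csr_matrix(adata.X)`). A minimal sketch on a toy matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense count matrix standing in for adata.X (mostly zeros).
X = np.zeros((4, 6), dtype=np.float32)
X[0, 2] = 5.0
X[3, 1] = 2.0

# CSR stores only the non-zero entries plus index arrays.
X_sparse = csr_matrix(X)
```

Here `X_sparse.nnz` is 2, versus 24 stored values for the dense array; the savings scale with the sparsity of the data.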

Data properties§

[12]:
type(adata)
[12]:
anndata._core.anndata.AnnData
[13]:
adata.X.shape
[13]:
(6099, 34546)
[14]:
print_attributes(adata)
{'T': AnnData object with n_obs × n_vars = 34546 × 6099,
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 'file': Backing file manager: no file is set.,
 'filename': None,
 'is_view': False,
 'isbacked': False,
 'isview': False,
 'layers': Layers with keys: ,
 'n_obs': 6099,
 'n_vars': 34546,
 'obs': Empty DataFrame
Columns: []
Index: [TAGGTACCATGGCCAC-1_1, ATTCACTGTAACAGGC-1_1, TAACTTCCAACCACGC-1_1, TTGGGTACACGACAAG-1_1, AGGCCACAGAGTCACG-1_1, CACTGAAGTCGAAGCA-1_1, ACTGATGTCTGCACCT-1_1, TTACCGCCACTCAGAT-1_1, TTGGTTTTCCTAGCTC-1_1, TGGGAAGTCAGTGATC-1_1, CCACGAGTCTCTTAAC-1_1, ACTTCCGCACAACGCC-1_1, GGGAAGTAGCGACCCT-1_1, TGGTAGTTCCCGTGTT-1_1, CGCATAACATGCCGGT-1_1, TCTATCACAAGGCTTT-1_1, ATCCACCAGAGGTATT-1_1, TAACGACAGATGACCG-1_1, TCTTAGTGTATGAGGC-1_1, CACTTCGCAGTACTAC-1_1, GTCAAACAGAACGTGC-1_1, GCAACCGAGGGCAGGA-1_1, CATACTTTCATCACTT-1_1, AAGAACATCGGATTAC-1_1, GGGTATTGTACGATGG-1_1, CTGTAGATCAACGTGT-1_1, GTCATTTGTATCTCGA-1_1, CCTTGTGCAGAGGGTT-1_1, AAGTTCGCAACACGTT-1_1, TCATTCACAAATCAAG-1_1, TCCATGCCAACGACTT-1_1, TCCTTCTCAGTTTCAG-1_1, TGTGAGTCAAATGATG-1_1, AAACGAAGTACAGAGC-1_1, CAACCAAAGTATTCCG-1_1, CTTCTCTCAGAGACTG-1_1, TACAACGGTGGCTGAA-1_1, AACGGGACATGCCGGT-1_1, AACCAACGTTGGGAAC-1_1, TATATCCAGCGTCAGA-1_1, AGACAAACATCCCGTT-1_1, ATGACCAGTCTTCATT-1_1, CTTACCGTCAGACATC-1_1, CGGGACTGTTAGTTCG-1_1, ATTCATCCACTGAGTT-1_1, TCATGAGAGAGGCGGA-1_1, TCCCACATCTAGTACG-1_1, CTTCCTTCATATCTCT-1_1, CCGGACACACTCGATA-1_1, ACACGCGCACCTGTCT-1_1, GAATCGTCAGAAGTGC-1_1, GGTGGCTCAAGCTCTA-1_1, CCTGCATCACATATGC-1_1, GTGGGAAGTTAAAGTG-1_1, CGTTCTGGTACTAGCT-1_1, ACCCTCACAATAGTCC-1_1, GCCCGAACAAACTAAG-1_1, GTGGAAGCACATGACT-1_1, GTTGTGACATCGATAC-1_1, GACAGCCCAGGTCCGT-1_1, TAAGCACGTTGGCTAT-1_1, GGGACAAGTCACCACG-1_1, CTGGCAGGTTCGGTAT-1_1, GACTCAACACTGTGAT-1_1, GCCAGTGGTGTGGTCC-1_1, TCTAACTGTAGGCAGT-1_1, GAAGAATGTAGCTTGT-1_1, TCACTCGCAATCTCTT-1_1, CAAGACTTCCCACAGG-1_1, CAGATACGTGACTCTA-1_1, TGGGATTAGAGGGTCT-1_1, ACCTGAACACTCCTTG-1_1, GACACGCCACTCGATA-1_1, CTCATCGTCACCGCTT-1_1, AGGTAGGGTCCCTGTT-1_1, TACCCGTCAACACTAC-1_1, TGCTGAAAGACGGATC-1_1, ACACCAACACAACGCC-1_1, AAGATAGCAAATGGAT-1_1, CTTCAATGTGACAGGT-1_1, GAAGCGAAGAGTTGAT-1_1, GCCGTGACACAAGCCC-1_1, CCTCAACCATACAGGG-1_1, ACAAAGATCCACAGGC-1_1, CAGATACAGTCCCAGC-1_1, GGCAGTCTCCGGTTCT-1_1, TAAGTCGAGCTGAGCA-1_1, GAGACCCGTCTGTGCG-1_1, TAACACGCATGTGTCA-1_1, TCAATTCGTTCTCGCT-1_1, 
GCTTTCGCACAGTGTT-1_1, AACCAACAGATAACAC-1_1, ATCGGCGCACATCATG-1_1, TCATCCGCACGAGGAT-1_1, CTGATCCTCTTTACAC-1_1, TCACACCCAACTTCTT-1_1, TGAGGGACACCGTACG-1_1, GTGCACGTCATCTGTT-1_1, GGTAATCAGTTGCATC-1_1, ATACTTCCAAGGTCTT-1_1, ...]

[6099 rows x 0 columns],
 'obs_names': Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
       'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
       'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
       'TGGGAAGTCAGTGATC-1_1',
       ...
       'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
       'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
       'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
       'ATGCATGAGTCATGAA-1_1'],
      dtype='object', length=6099),
 'obsm': AxisArrays with keys: ,
 'obsp': PairwiseArrays with keys: ,
 'raw': None,
 'shape': (6099, 34546),
 'uns': OverloadedDict, wrapping:
        OrderedDict()
With overloaded keys:
        ['neighbors'].,
 'var': Empty DataFrame
Columns: []
Index: [AL627309.1, AL627309.5, AL627309.4, AL669831.2, LINC01409, FAM87B, LINC01128, LINC00115, FAM41C, AL645608.6, AL645608.2, LINC02593, SAMD11, NOC2L, KLHL17, PLEKHN1, PERM1, AL645608.7, HES4, ISG15, AL645608.1, AGRN, C1orf159, AL390719.3, LINC01342, AL390719.2, TTLL10-AS1, TTLL10, TNFRSF18, TNFRSF4, SDF4, B3GALT6, C1QTNF12, UBE2J2, LINC01786, SCNN1D, ACAP3, PUSL1, INTS11, AL139287.1, CPTP, DVL1, MXRA8, AURKAIP1, CCNL2, MRPL20-AS1, MRPL20, AL391244.2, ANKRD65, AL391244.1, LINC01770, VWA1, ATAD3C, ATAD3B, ATAD3A, TMEM240, SSU72, AL645728.1, FNDC10, AL691432.4, AL691432.2, MIB2, MMP23B, CDK11B, FO704657.1, SLC35E2B, CDK11A, SLC35E2A, NADK, GNB1, AL109917.1, CALML6, TMEM52, CFAP74, GABRD, AL391845.1, PRKCZ, AL590822.2, PRKCZ-AS1, FAAP20, AL590822.1, SKI, AL590822.3, MORN1, AL589739.1, AL513477.2, RER1, PEX10, PLCH2, AL139246.4, PANK4, HES5, AL139246.5, TNFRSF14-AS1, TNFRSF14, AL139246.3, PRXL2B, MMEL1, TTC34, AC242022.2, ...]

[34546 rows x 0 columns],
 'var_names': Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
       'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
       ...
       'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
       'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
      dtype='object', length=34546),
 'varm': AxisArrays with keys: ,
 'varp': PairwiseArrays with keys: }
/tmp/ipykernel_16232/22660072.py:11: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
/tmp/ipykernel_16232/22660072.py:9: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  name: getattr(obj, name)
[15]:
adata.obs
[15]:
TAGGTACCATGGCCAC-1_1
ATTCACTGTAACAGGC-1_1
TAACTTCCAACCACGC-1_1
TTGGGTACACGACAAG-1_1
AGGCCACAGAGTCACG-1_1
...
CGCCATTGTTTGCCGG-1_1
CACTGGGGTCTACGTA-1_1
CATACTTGTAGAGGAA-1_1
TTTGGTTTCCACGGAC-1_1
ATGCATGAGTCATGAA-1_1

6099 rows × 0 columns

Gene names are saved as adata.var.

[16]:
adata.var
[16]:
AL627309.1
AL627309.5
AL627309.4
AL669831.2
LINC01409
...
VN1R2
AL031676.1
SMIM34A
AL050402.1
AL445072.1

34546 rows × 0 columns

[17]:
adata.obs_names
[17]:
Index(['TAGGTACCATGGCCAC-1_1', 'ATTCACTGTAACAGGC-1_1', 'TAACTTCCAACCACGC-1_1',
       'TTGGGTACACGACAAG-1_1', 'AGGCCACAGAGTCACG-1_1', 'CACTGAAGTCGAAGCA-1_1',
       'ACTGATGTCTGCACCT-1_1', 'TTACCGCCACTCAGAT-1_1', 'TTGGTTTTCCTAGCTC-1_1',
       'TGGGAAGTCAGTGATC-1_1',
       ...
       'AAGTCGTGTGTGAATA-1_1', 'GTCGTTCTCCAAGGGA-1_1', 'GTTTGGATCGGCCTTT-1_1',
       'GTACAGTCACGTATAC-1_1', 'TCATGCCCAAGAGGTC-1_1', 'CGCCATTGTTTGCCGG-1_1',
       'CACTGGGGTCTACGTA-1_1', 'CATACTTGTAGAGGAA-1_1', 'TTTGGTTTCCACGGAC-1_1',
       'ATGCATGAGTCATGAA-1_1'],
      dtype='object', length=6099)
[18]:
adata.var_names
[18]:
Index(['AL627309.1', 'AL627309.5', 'AL627309.4', 'AL669831.2', 'LINC01409',
       'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'AL645608.6',
       ...
       'AC087190.2', 'AC136428.1', 'AC019183.1', 'AC105094.1', 'AC010485.1',
       'VN1R2', 'AL031676.1', 'SMIM34A', 'AL050402.1', 'AL445072.1'],
      dtype='object', length=34546)

There are no layers in this data set.

[19]:
adata.layers
[19]:
Layers with keys:

There are no multidimensional observations or variables.

[20]:
print(adata.obsm)
print(adata.varm)
print(adata.obsp)
print(adata.varp)
AxisArrays with keys:
AxisArrays with keys:
PairwiseArrays with keys:
PairwiseArrays with keys:

The data appears to contain reads mapped to 6099 cell-associated barcodes and 34546 RNA molecule-associated features.

Doublet removal§

Filter transcripts by minimum number of cells with non-zero counts§

We may choose to filter out transcript types that are detected in a relatively small number of cells. The optimum threshold is not known. Here we use 10 as a base case.
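The effect of such a threshold can be sketched in plain numpy: count, for each gene, the cells with a non-zero value, and keep the genes meeting the minimum. This is a toy illustration of the idea behind `sc.pp.filter_genes`, not its implementation:

```python
import numpy as np

# Toy cells x genes count matrix: 5 cells, 4 genes.
X = np.array([
    [0, 1, 2, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 4, 1],
    [0, 5, 0, 0],
])

min_cells = 3
# Number of cells with a non-zero count for each gene.
n_cells = (X > 0).sum(axis=0)
keep = n_cells >= min_cells
X_filtered = X[:, keep]
```

Genes 0 and 3 (detected in 0 and 1 cells respectively) are dropped, leaving a 5 × 2 matrix; scanpy additionally records the per-gene counts as the `n_cells` annotation seen below.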

[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 6099 × 34546

See the documentation for scanpy pre-processing filter-genes.

[22]:
sc.pp.filter_genes(adata, min_cells=10)
[23]:
adata
[23]:
AnnData object with n_obs × n_vars = 6099 × 19896
    var: 'n_cells'

Following filtration there are 19896 transcript types remaining that are present in at least 10 cells. This means 14650 transcript types were removed at the threshold of 10. Notice that an annotation named n_cells has been added to the genes to indicate the number of cells with non-zero values for that transcript type.

[24]:
adata.var
[24]:
n_cells
AL627309.5 33
LINC01409 274
LINC01128 81
LINC00115 15
SAMD11 16
... ...
MAFIP 34
AC011043.1 11
AL354822.1 133
AL592183.1 1003
AC240274.1 162

19896 rows × 1 columns

[25]:
adata.var.describe()
[25]:
n_cells
count 19896.000000
mean 293.307600
std 451.440237
min 10.000000
25% 34.000000
50% 111.000000
75% 345.000000
max 6090.000000
[26]:
sns.histplot(adata.var['n_cells'])
/usr/lib/python3.10/site-packages/matplotlib_inline/config.py:59: DeprecationWarning: InlineBackend._figure_formats_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
  def _figure_formats_changed(self, name, old, new):
[26]:
<AxesSubplot:xlabel='n_cells', ylabel='Count'>
_images/02-analysis-review_50_2.svg

Select highly variable genes§

It is common to select transcript types with high variability among the cell population under the assumption that this will help to focus on features that distinguish the cells. Again there is no perfect threshold. Here we select the 2000 highest variability genes.
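As a crude illustration of the idea (not the variance-stabilized `seurat_v3` statistic used in the cell below), top-k selection by raw per-gene variance can be sketched as:

```python
import numpy as np

# Toy cells x genes matrix (4 cells x 5 genes) with hand-picked spread:
# genes 0 and 4 are constant, gene 2 varies the most.
X = np.array([
    [1.0, 0.0, 10.0, 2.0, 0.0],
    [1.0, 1.0,  0.0, 2.0, 0.0],
    [1.0, 0.0, 10.0, 4.0, 0.0],
    [1.0, 1.0,  0.0, 4.0, 0.0],
])

n_top_genes = 2
variances = X.var(axis=0)
# Indices of the n_top_genes most variable genes, in column order.
top = np.sort(np.argsort(variances)[-n_top_genes:])
X_hvg = X[:, top]
```

The `seurat_v3` flavor instead ranks genes by variance after a per-gene variance-stabilizing transformation, which avoids selection being dominated by highly expressed genes.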

[27]:
adata
[27]:
AnnData object with n_obs × n_vars = 6099 × 19896
    var: 'n_cells'

See the documentation for scanpy pre-processing highly-variable-genes.

[28]:
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="seurat_v3")
[29]:
adata
[29]:
AnnData object with n_obs × n_vars = 6099 × 2000
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'hvg'

We have filtered down to 2000 transcript types and added 5 annotations to the genes including a binary variable indicating membership in the highly-variable class, the ranking among highly variable genes, and the mean, variance, and normalized variance for each gene across cells.

[30]:
adata.var
[30]:
n_cells highly_variable highly_variable_rank means variances variances_norm
TTLL10 112 True 903.0 0.028857 0.069354 1.901045
TNFRSF18 15 True 1604.0 0.002951 0.004911 1.513808
CFAP74 159 True 1370.0 0.041154 0.087024 1.607399
TTC34 209 True 245.0 0.080341 0.363502 3.044086
AJAP1 31 True 1922.0 0.006886 0.011432 1.421002
... ... ... ... ... ... ...
MT-ND4L 650 True 1212.0 0.191671 0.559353 1.695637
MT-ND4 1328 True 178.0 0.833087 9.742224 3.486984
MT-ND5 886 True 443.0 0.332514 1.669344 2.430210
MT-ND6 821 True 123.0 0.383178 3.357087 3.985056
MT-CYB 1295 True 414.0 0.622397 4.384615 2.487058

2000 rows × 6 columns

Doublet removal§

Here we use an scvi-tools model as an interface to solo from [BFL+20]. scvi adds some unstructured data to adata.uns to interface with torch.

[31]:
adata.uns
[31]:
OverloadedDict, wrapping:
        OrderedDict([('hvg', {'flavor': 'seurat_v3'})])
With overloaded keys:
        ['neighbors'].
[32]:
scvi.model.SCVI.setup_anndata(adata)
[33]:
adata.uns
[33]:
OverloadedDict, wrapping:
        OrderedDict([('hvg', {'flavor': 'seurat_v3'}), ('_scvi_uuid', 'efb3a791-15ff-4864-b57c-7536e8ea9df6'), ('_scvi_manager_uuid', 'ccc84e47-0ab2-4ab3-852b-55f888509f9e')])
With overloaded keys:
        ['neighbors'].

Train the scvi model.

[34]:
vae = scvi.model.SCVI(adata)
vae.train()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00,  1.98it/s, loss=323, v_num=1]
`Trainer.fit` stopped: `max_epochs=400` reached.
Epoch 400/400: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:36<00:00,  1.85it/s, loss=323, v_num=1]

After training the model, adata.obs now contains _scvi_batch and _scvi_labels annotations.

[35]:
adata
[35]:
AnnData object with n_obs × n_vars = 6099 × 2000
    obs: '_scvi_batch', '_scvi_labels'
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'hvg', '_scvi_uuid', '_scvi_manager_uuid'
[36]:
adata.obs
[36]:
_scvi_batch _scvi_labels
TAGGTACCATGGCCAC-1_1 0 0
ATTCACTGTAACAGGC-1_1 0 0
TAACTTCCAACCACGC-1_1 0 0
TTGGGTACACGACAAG-1_1 0 0
AGGCCACAGAGTCACG-1_1 0 0
... ... ...
CGCCATTGTTTGCCGG-1_1 0 0
CACTGGGGTCTACGTA-1_1 0 0
CATACTTGTAGAGGAA-1_1 0 0
TTTGGTTTCCACGGAC-1_1 0 0
ATGCATGAGTCATGAA-1_1 0 0

6099 rows × 2 columns

Train the solo model.

[37]:
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()
INFO     Creating doublets, preparing SOLO model.
/usr/lib/python3.10/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 207/400:  52%|██████████████████████████████████████████████████▏                                              | 207/400 [01:51<01:44,  1.85it/s, loss=0.299, v_num=1]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.295. Signaling Trainer to stop.
[38]:
type(solo)
[38]:
scvi.external.solo._model.SOLO

We can now use the solo model to add a prediction annotation for doublet or singlet to our data.

[39]:
df = solo.predict()
df["prediction"] = solo.predict(soft=False)

df.index = df.index.map(lambda x: x[:-2])

df
[39]:
doublet singlet prediction
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet
TAACTTCCAACCACGC-1_1 0.735813 -0.259502 doublet
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet
... ... ... ...
CGCCATTGTTTGCCGG-1_1 -0.691126 1.362337 singlet
CACTGGGGTCTACGTA-1_1 -2.632991 2.468916 singlet
CATACTTGTAGAGGAA-1_1 -2.506424 2.567346 singlet
TTTGGTTTCCACGGAC-1_1 -3.308819 2.476630 singlet
ATGCATGAGTCATGAA-1_1 -2.357370 2.206591 singlet

6099 rows × 3 columns

Counting the predicted doublets and singlets reveals about 20% doublets.

[40]:
df.groupby("prediction").count()
[40]:
doublet singlet
prediction
doublet 1206 1206
singlet 4893 4893
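The same tally can be obtained with pandas `value_counts`, which also makes the doublet fraction explicit. A toy sketch with a hypothetical prediction column:

```python
import pandas as pd

# Toy prediction column mimicking the solo output (2 doublets, 8 singlets).
df = pd.DataFrame({
    "prediction": ["doublet"] * 2 + ["singlet"] * 8,
})

counts = df["prediction"].value_counts()
fraction_doublet = counts["doublet"] / counts.sum()
```

For the real data, 1206 of 6099 barcodes (roughly 20%) are called doublets.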

We can assess the magnitude of the prediction by taking the difference between the doublet and singlet scores.

[41]:
df["dif"] = df.doublet - df.singlet
df
[41]:
doublet singlet prediction dif
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet 2.721666
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet 3.914930
TAACTTCCAACCACGC-1_1 0.735813 -0.259502 doublet 0.995315
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet 2.299047
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet 2.071632
... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 -0.691126 1.362337 singlet -2.053463
CACTGGGGTCTACGTA-1_1 -2.632991 2.468916 singlet -5.101907
CATACTTGTAGAGGAA-1_1 -2.506424 2.567346 singlet -5.073770
TTTGGTTTCCACGGAC-1_1 -3.308819 2.476630 singlet -5.785450
ATGCATGAGTCATGAA-1_1 -2.357370 2.206591 singlet -4.563961

6099 rows × 4 columns

Plotting the distribution of doublet-singlet score differences, we see a large fraction that only marginally favor doublet over singlet.

[42]:
sns.histplot(df[df.prediction == "doublet"], x="dif")
[42]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
_images/02-analysis-review_79_1.svg

Since we would like to avoid unnecessarily discarding barcodes, we arbitrarily retain predicted doublets whose doublet-singlet score difference is one or less; this threshold is a convenience rather than a principled choice.

[43]:
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
[43]:
doublet singlet prediction dif
TAGGTACCATGGCCAC-1_1 1.729540 -0.992126 doublet 2.721666
ATTCACTGTAACAGGC-1_1 2.275840 -1.639090 doublet 3.914930
TTGGGTACACGACAAG-1_1 1.618703 -0.680344 doublet 2.299047
AGGCCACAGAGTCACG-1_1 1.398426 -0.673206 doublet 2.071632
CACTGAAGTCGAAGCA-1_1 1.305717 -0.544905 doublet 1.850623
... ... ... ... ...
CAATACGCAATGTGGG-1_1 0.471522 -0.923330 doublet 1.394853
GGGTATTTCAGCGCAC-1_1 1.101736 -0.658175 doublet 1.759911
TTGCTGCAGTGCGACA-1_1 0.720216 -0.624341 doublet 1.344557
CATCCCAAGACGCCAA-1_1 0.937599 -0.486121 doublet 1.423720
CACTGTCAGGGACAGG-1_1 1.190305 -0.678417 doublet 1.868721

451 rows × 4 columns

[44]:
sns.histplot(doublets[doublets.prediction == "doublet"], x="dif")
[44]:
<AxesSubplot:xlabel='dif', ylabel='Count'>
_images/02-analysis-review_82_1.svg

Save/load checkpoint§

Save current variables to file.

[45]:
pickle.dump([df, doublets, solo, vae], open(f"{output_directory}/auxiliary.p", "wb"))

Variables can be reloaded if necessary.

[46]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[47]:
adata.write(f"{output_directory}/adata_doublets.h5ad", compression="gzip")
[48]:
bdata_doublets = None
bdata_doublets = sc.read_h5ad(f"{output_directory}/adata_doublets.h5ad")
bdata_doublets
[48]:
AnnData object with n_obs × n_vars = 6099 × 2000
    obs: '_scvi_batch', '_scvi_labels'
    var: 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'hvg'

Reload data and filter doublets§

[49]:
adata = sc.read_csv(
    "data/GSE171524/supplementary/GSM5226574_C51ctr_raw_counts.csv.gz"
).T
adata
[49]:
AnnData object with n_obs × n_vars = 6099 × 34546

Annotate adata with doublet column.

[50]:
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
[51]:
adata
[51]:
AnnData object with n_obs × n_vars = 6099 × 34546
    obs: 'doublet'
[52]:
adata.obs
[52]:
doublet
TAGGTACCATGGCCAC-1_1 True
ATTCACTGTAACAGGC-1_1 True
TAACTTCCAACCACGC-1_1 False
TTGGGTACACGACAAG-1_1 True
AGGCCACAGAGTCACG-1_1 True
... ...
CGCCATTGTTTGCCGG-1_1 False
CACTGGGGTCTACGTA-1_1 False
CATACTTGTAGAGGAA-1_1 False
TTTGGTTTCCACGGAC-1_1 False
ATGCATGAGTCATGAA-1_1 False

6099 rows × 1 columns

Use doublet column to filter doublets.

[53]:
adata = adata[~adata.obs.doublet]
[54]:
adata
[54]:
View of AnnData object with n_obs × n_vars = 5648 × 34546
    obs: 'doublet'

We filtered about 7% of the barcodes as putative doublets.

Preprocessing§

Filter mitochondrial transcripts§

We can add a binary annotation flagging mitochondrial transcripts, identified by gene names beginning with MT-.

[55]:
adata.var['mt'] = adata.var.index.str.startswith('MT-')
/tmp/ipykernel_16232/310233584.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var['mt'] = adata.var.index.str.startswith('MT-')
[56]:
adata.var
[56]:
mt
AL627309.1 False
AL627309.5 False
AL627309.4 False
AL669831.2 False
LINC01409 False
... ...
VN1R2 False
AL031676.1 False
SMIM34A False
AL050402.1 False
AL445072.1 False

34546 rows × 1 columns

Filter ribosomal transcripts§

We can download a list of ribosomal genes from MSigDB.

[57]:
ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
ribo_genes
[57]:
0
0 FAU
1 MRPL13
2 RPL10
3 RPL10A
4 RPL10L
... ...
83 RPS9
84 RPSA
85 RSL24D1
86 RSL24D1P11
87 UBA52

88 rows × 1 columns

We add a binary annotation to indicate presence in the list of ribosomal genes.

[58]:
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
[59]:
adata.var
[59]:
mt ribo
AL627309.1 False False
AL627309.5 False False
AL627309.4 False False
AL669831.2 False False
LINC01409 False False
... ... ...
VN1R2 False False
AL031676.1 False False
SMIM34A False False
AL050402.1 False False
AL445072.1 False False

34546 rows × 2 columns

Calculate QC metrics§

Recall the observation/barcode annotation currently just indicates doublets based on our solo model.

[60]:
adata.obs
[60]:
doublet
TAACTTCCAACCACGC-1_1 False
TTACCGCCACTCAGAT-1_1 False
TTGGTTTTCCTAGCTC-1_1 False
TCTATCACAAGGCTTT-1_1 False
ATCCACCAGAGGTATT-1_1 False
... ...
CGCCATTGTTTGCCGG-1_1 False
CACTGGGGTCTACGTA-1_1 False
CATACTTGTAGAGGAA-1_1 False
TTTGGTTTCCACGGAC-1_1 False
ATGCATGAGTCATGAA-1_1 False

5648 rows × 1 columns

We can calculate standard quality control metrics with scanpy. See the documentation for scanpy calculate_qc_metrics. These include the number of genes with non-zero counts per cell, the total counts per cell, and, via the qc_vars argument, the total and percentage of counts attributed to mitochondrial and ribosomal genes.

[61]:
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
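The per-cell metrics added here reduce to simple row sums over the count matrix. A toy numpy sketch of the idea (illustrative, not scanpy's implementation), using a boolean mask for the "mitochondrial" genes:

```python
import numpy as np

# Toy counts: 3 cells x 4 genes; the last gene plays the role of an MT- gene.
X = np.array([
    [5.0, 3.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
    [4.0, 4.0, 4.0, 8.0],
])
mt = np.array([False, False, False, True])

# Per-cell totals, mitochondrial totals, and the derived percentage.
total_counts = X.sum(axis=1)
total_counts_mt = X[:, mt].sum(axis=1)
pct_counts_mt = 100.0 * total_counts_mt / total_counts
# Number of genes detected (non-zero) per cell.
n_genes_by_counts = (X > 0).sum(axis=1)
```

The `ribo` percentage is computed the same way with the ribosomal gene mask.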

Sorting genes by the number of cells with non-zero counts, we see a number of genes that were not found in association with any barcode.

[62]:
adata.var.sort_values("n_cells_by_counts")
[62]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
AL445072.1 False False 0 0.000000 100.000000 0.0
UGT1A5 False False 0 0.000000 100.000000 0.0
AGXT False False 0 0.000000 100.000000 0.0
TEX44 False False 0 0.000000 100.000000 0.0
AC093585.1 False False 0 0.000000 100.000000 0.0
... ... ... ... ... ... ...
AKAP13 False False 4059 2.881374 28.133853 16274.0
NEAT1 False False 4148 5.001062 26.558074 28246.0
MBNL1 False False 4150 2.695822 26.522663 15226.0
ZBTB20 False False 4281 2.425992 24.203258 13702.0
MALAT1 False False 5639 64.417313 0.159348 363829.0

34546 rows × 6 columns

Sorting barcodes by total counts, we see a minimum around 400, suggesting a filter was applied upstream of this pre-processing.

[63]:
adata.obs.sort_values("total_counts")
[63]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
AGGCATTCATCCGTTC-1_1 False 290 401.0 1.0 0.249377 0.0 0.000000
TGGTACAGTTGGTGTT-1_1 False 323 401.0 0.0 0.000000 0.0 0.000000
CAGGGCTTCATGCGGC-1_1 False 330 401.0 7.0 1.745636 1.0 0.249377
GTCGTTCTCCAAGGGA-1_1 False 300 401.0 0.0 0.000000 0.0 0.000000
CGAGAAGGTGAACTAA-1_1 False 308 401.0 0.0 0.000000 0.0 0.000000
... ... ... ... ... ... ... ...
TCTATCACAAGGCTTT-1_1 False 3582 8276.0 183.0 2.211213 5.0 0.060416
ATCCACCAGAGGTATT-1_1 False 3913 8286.0 191.0 2.305093 34.0 0.410331
TTACCGCCACTCAGAT-1_1 False 4144 11369.0 39.0 0.343038 5.0 0.043979
TTGGTTTTCCTAGCTC-1_1 False 3902 11472.0 463.0 4.035913 13.0 0.113319
TAACTTCCAACCACGC-1_1 False 5158 15645.0 221.0 1.412592 211.0 1.348674

5648 rows × 7 columns

[64]:
sns.jointplot(
    data=adata.obs,
    x="total_counts",
    y="n_genes_by_counts",
    kind="hex",
)
[64]:
<seaborn.axisgrid.JointGrid at 0x7fedd44cbbb0>
_images/02-analysis-review_120_1.svg

Filter genes by counts§

We can filter genes by a minimum number of cells in which they are detected; here we arbitrarily choose three.

[65]:
sc.pp.filter_genes(adata, min_cells=3)
[66]:
adata.var.sort_values("n_cells_by_counts")
[66]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells
GSG1 False False 3 0.000531 99.946884 3.0 3
TRAJ47 False False 3 0.000531 99.946884 3.0 3
AC002465.2 False False 3 0.000531 99.946884 3.0 3
LINC02709 False False 3 0.000531 99.946884 3.0 3
AC096536.1 False False 3 0.000531 99.946884 3.0 3
... ... ... ... ... ... ... ...
AKAP13 False False 4059 2.881374 28.133853 16274.0 4059
NEAT1 False False 4148 5.001062 26.558074 28246.0 4148
MBNL1 False False 4150 2.695822 26.522663 15226.0 4150
ZBTB20 False False 4281 2.425992 24.203258 13702.0 4281
MALAT1 False False 5639 64.417313 0.159348 363829.0 5639

24136 rows × 7 columns

Filter cells by number of genes§

As a matter of completeness we filter cells with fewer than 200 genes; however, we see the minimum number of genes in a cell is already greater than 200.

[67]:
adata.obs.sort_values('n_genes_by_counts')
[67]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
TAGGGTTTCTGGCTGG-1_1 False 276 419.0 1.0 0.238663 1.0 0.238663
CGTGCTTCAAAGGGCT-1_1 False 277 432.0 41.0 9.490741 0.0 0.000000
TGACAGTTCTAAACGC-1_1 False 278 414.0 0.0 0.000000 0.0 0.000000
CTCTGGTCACGACGAA-1_1 False 285 407.0 0.0 0.000000 0.0 0.000000
GTAAGTCGTATCGCGC-1_1 False 289 430.0 0.0 0.000000 0.0 0.000000
... ... ... ... ... ... ... ...
TAACGACAGATGACCG-1_1 False 3791 8187.0 94.0 1.148162 4.0 0.048858
TTGGTTTTCCTAGCTC-1_1 False 3902 11472.0 463.0 4.035913 13.0 0.113319
ATCCACCAGAGGTATT-1_1 False 3913 8286.0 191.0 2.305093 34.0 0.410331
TTACCGCCACTCAGAT-1_1 False 4144 11369.0 39.0 0.343038 5.0 0.043979
TAACTTCCAACCACGC-1_1 False 5158 15645.0 221.0 1.412592 211.0 1.348674

5648 rows × 7 columns

[68]:
sc.pp.filter_cells(adata, min_genes=200)
[69]:
adata
[69]:
AnnData object with n_obs × n_vars = 5648 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

Review and filter QC outliers§

We can plot distributions of the number of genes by counts and total counts as well as the percentage of mitochondrial and ribosomal genes.

[70]:
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'],
             jitter=0.4, multi_panel=True)
_images/02-analysis-review_132_0.svg

If we choose to filter cells beyond the 98th percentile, we can compute the bound via the numpy quantile function.

[71]:
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
upper_lim
[71]:
2310.0599999999995
[72]:
adata = adata[adata.obs.n_genes_by_counts < upper_lim]

This removes slightly more than 100 additional cells (5648 to 5535).
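The quantile cutoff and boolean filter can be checked on toy data; `np.quantile` linearly interpolates between order statistics by default:

```python
import numpy as np

# Toy per-cell gene counts.
n_genes_by_counts = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])

# 98th percentile via linear interpolation between the two largest values.
upper_lim = np.quantile(n_genes_by_counts, 0.98)
keep = n_genes_by_counts < upper_lim
```

Here the cutoff lands at 982, so only the single largest cell is excluded; on the real data the same strict inequality drops the top 2% of cells by detected genes.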

[73]:
adata.obs
[73]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo n_genes
CCTCAACCATACAGGG-1_1 False 2276 5434.0 38.0 0.699301 0.0 0.000000 2275
ACAAAGATCCACAGGC-1_1 False 2310 5504.0 1.0 0.018169 1.0 0.018169 2308
ATACTTCCAAGGTCTT-1_1 False 2122 5252.0 1.0 0.019040 4.0 0.076161 2120
GGGTCACTCTATTCGT-1_1 False 2286 5056.0 0.0 0.000000 1.0 0.019778 2280
GATGATCCACAACCGC-1_1 False 2263 4938.0 2.0 0.040502 2.0 0.040502 2262
... ... ... ... ... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 False 355 410.0 3.0 0.731707 0.0 0.000000 355
CACTGGGGTCTACGTA-1_1 False 346 403.0 0.0 0.000000 0.0 0.000000 346
CATACTTGTAGAGGAA-1_1 False 360 410.0 2.0 0.487805 0.0 0.000000 360
TTTGGTTTCCACGGAC-1_1 False 299 405.0 0.0 0.000000 2.0 0.493827 299
ATGCATGAGTCATGAA-1_1 False 351 411.0 0.0 0.000000 0.0 0.000000 350

5535 rows × 8 columns

Finally, we filter mitochondrial (at 20%) and ribosomal (at 2%) outliers.

[74]:
adata = adata[adata.obs.pct_counts_mt < 20]
[75]:
adata = adata[adata.obs.pct_counts_ribo < 2]
[76]:
adata
[76]:
View of AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

We end our preprocessing with 5518 cells and 24136 genes.
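Taken together, the QC filters amount to a combined boolean mask over the columns of adata.obs. A minimal sketch with a made-up QC table (the values are illustrative, not taken from the data):

```python
import pandas as pd

# toy per-cell QC table mirroring the columns of adata.obs (values made up)
obs = pd.DataFrame({
    "n_genes_by_counts": [2276, 5000, 355, 299],
    "pct_counts_mt":     [0.70, 25.0, 0.73, 0.00],
    "pct_counts_ribo":   [0.00, 0.50, 3.10, 0.49],
})

upper_lim = 2310.06  # the 98th-percentile bound computed earlier
keep = (
    (obs.n_genes_by_counts < upper_lim)
    & (obs.pct_counts_mt < 20)
    & (obs.pct_counts_ribo < 2)
)
# only the first and last toy cells pass all three filters
```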

Save/load checkpoint§

Save the current version of the post-processed data to file.

[77]:
adata.write(f"{output_directory}/adata_post_processed.h5ad", compression="gzip")

Variables can be reloaded if necessary.

[78]:
df, doublets, solo, vae = pickle.load(open(f"{output_directory}/auxiliary.p", "rb"))
[79]:
bdata_post_processed = None
bdata_post_processed = sc.read_h5ad(f"{output_directory}/adata_post_processed.h5ad")
bdata_post_processed
[79]:
AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

Normalization§

[80]:
adata.X.sum(axis=1)
[80]:
ArrayView([5433., 5502., 5250., ...,  410.,  405.,  410.], dtype=float32)
[81]:
sc.pp.normalize_total(adata, target_sum=1e4)
/usr/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
[82]:
adata.X.sum(axis=1)
[82]:
array([10000., 10000., 10000., ..., 10000., 10000., 10000.], dtype=float32)
[83]:
sc.pp.log1p(adata)
[84]:
adata.X.sum(axis=1)
[84]:
array([3285.3794, 3163.8867, 2858.1602, ..., 1191.2706, 1022.7512,
       1162.8784], dtype=float32)
[85]:
adata.raw = adata
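The effect of normalize_total followed by log1p can be reproduced on a toy count matrix with plain numpy (a sketch of the transformation, not scanpy's implementation):

```python
import numpy as np

counts = np.array([[ 90.,  10.],
                   [300., 100.],
                   [  5.,   5.]])

# normalize each cell (row) to a total of 1e4 counts, as sc.pp.normalize_total does
norm = counts / counts.sum(axis=1, keepdims=True) * 1e4

# log-transform, as sc.pp.log1p does; row sums are no longer equal afterwards,
# matching the adata.X.sum(axis=1) outputs above
logged = np.log1p(norm)
```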

Clustering§

Estimate expression variability§

[86]:
adata.var
[86]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells
AL627309.1 False False 4 0.000708 99.929178 4.0 4
AL627309.5 False False 26 0.004603 99.539660 26.0 26
AL627309.4 False False 4 0.000708 99.929178 4.0 4
LINC01409 False False 233 0.044972 95.874646 254.0 233
FAM87B False False 6 0.001062 99.893768 6.0 6
... ... ... ... ... ... ... ...
AL354822.1 False False 121 0.022132 97.857649 125.0 121
AL592183.1 False False 871 0.182188 84.578612 1029.0 871
AC240274.1 False False 142 0.026027 97.485836 147.0 142
AC007325.4 False False 4 0.000708 99.929178 4.0 4
AC007325.2 False False 6 0.001062 99.893768 6.0 6

24136 rows × 7 columns

Here we use the scanpy pre-processing function for highly-variable genes to annotate genes by their expression variability across cells.

[87]:
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
[88]:
adata.var
[88]:
mt ribo n_cells_by_counts mean_counts pct_dropout_by_counts total_counts n_cells highly_variable means dispersions dispersions_norm
AL627309.1 False False 4 0.000708 99.929178 4.0 4 True 0.009588 2.873559 1.581447
AL627309.5 False False 26 0.004603 99.539660 26.0 26 False 0.029244 2.300264 -0.151067
AL627309.4 False False 4 0.000708 99.929178 4.0 4 False 0.002517 1.702826 -1.956543
LINC01409 False False 233 0.044972 95.874646 254.0 233 False 0.265776 2.334326 -0.048131
FAM87B False False 6 0.001062 99.893768 6.0 6 False 0.007217 1.947624 -1.216757
... ... ... ... ... ... ... ... ... ... ... ...
AL354822.1 False False 121 0.022132 97.857649 125.0 121 False 0.158275 2.407786 0.173866
AL592183.1 False False 871 0.182188 84.578612 1029.0 871 False 0.868576 2.511827 -0.388573
AC240274.1 False False 142 0.026027 97.485836 147.0 142 False 0.153063 2.300034 -0.151763
AC007325.4 False False 4 0.000708 99.929178 4.0 4 False 0.003025 1.579867 -2.328129
AC007325.2 False False 6 0.001062 99.893768 6.0 6 False 0.006377 1.995220 -1.072921

24136 rows × 11 columns

[89]:
sc.pl.highly_variable_genes(adata)
_images/02-analysis-review_162_0.svg
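The idea behind dispersion-based selection can be sketched in numpy: rank genes by their variance-to-mean ratio and keep the top n. This is a simplified stand-in for scanpy's binned, normalized dispersions, using toy counts:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy counts: 500 cells x 100 genes; inflate the variability of the first 10 genes
X = rng.poisson(1.0, size=(500, 100)).astype(float)
X[:, :10] *= rng.integers(1, 20, size=(500, 10))

means = X.mean(axis=0)
dispersions = X.var(axis=0) / np.maximum(means, 1e-12)  # variance-to-mean ratio

n_top = 10
highly_variable = np.zeros(X.shape[1], dtype=bool)
highly_variable[np.argsort(dispersions)[-n_top:]] = True
# the 10 inflated genes should be the ones selected
```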

Filter highly variable genes§

Finally, we subset adata to the highly variable genes, noting the change in n_vars.

[90]:
adata
[90]:
AnnData object with n_obs × n_vars = 5518 × 24136
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
[91]:
adata = adata[:, adata.var.highly_variable]
[92]:
adata
[92]:
View of AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

Principal components analysis§

Regression and rescaling§

We would like to avoid interpreting variation that derives merely from differences in total counts, percentage of mitochondrial counts, or percentage of ribosomal counts. We can reduce the impact of these covariates using the `scanpy regress out function <https://scanpy.readthedocs.io/en/latest/generated/scanpy.pp.regress_out.html>`__.

[93]:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])

We also want to scale the data to zero mean and unit variance, clipping any remaining outliers above 10.

[94]:
sc.pp.scale(adata, max_value=10)
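A numpy sketch of what scaling with clipping does, assuming (as the scanpy documentation describes) that max_value truncates values after z-scoring; toy data, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(1000, 3))
X[0, 0] = 500.0  # one extreme outlier

# z-score each gene (column) to zero mean and unit variance,
# then truncate extreme scaled values at 10
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs = np.minimum(Xs, 10)
```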

Perform PCA§

PCA can be performed with the `scanpy PCA tool <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.pca.html>`__.

[95]:
sc.tl.pca(adata, svd_solver='arpack')
[96]:
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
_images/02-analysis-review_177_0.svg
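The variance-ratio plot can be understood via the singular value decomposition; a minimal numpy sketch on synthetic rank-5 data (toy data, not the lung atlas):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data with exactly 5 underlying dimensions
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))

Xc = X - X.mean(axis=0)                        # center, as PCA requires
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / (s**2).sum()                # explained variance per component
# the first 5 components capture essentially all of the variance
```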

Clustering§

The scanpy neighbors function computes a neighborhood graph of the cells, using the connectivity-estimation method from the UMAP algorithm.

[97]:
sc.pp.neighbors(adata, n_pcs = 30)

This function adds distances and connectivities as pairwise relationships among the cells in adata.obsp.

[98]:
adata
[98]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
[99]:
adata.obsp
[99]:
PairwiseArrays with keys: distances, connectivities
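A toy sketch of the kind of pairwise structure stored in adata.obsp: Euclidean distances between cells in PC space, reduced to each cell's k nearest neighbours. This uses a dense distance matrix in plain numpy; on the real data, scanpy uses an approximate nearest-neighbour method:

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(6, 3))  # toy PCA coordinates for 6 cells

# dense pairwise Euclidean distance matrix
d = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)    # exclude self-distances

k = 2
knn_idx = np.argsort(d, axis=1)[:, :k]  # indices of each cell's k nearest neighbours
```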

We can now plot the UMAP embedding; note that no clusters of cells have been defined yet, even though groupings may be hypothesized visually.

[100]:
sc.tl.umap(adata)
[101]:
sc.pl.umap(adata)
_images/02-analysis-review_186_0.svg

The `scanpy interface to the Leiden algorithm <https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.leiden.html>`__ can be used to cluster cells into subgroups.

[102]:
sc.tl.leiden(adata, resolution = 0.5)
[103]:
adata.obs
[103]:
doublet n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo n_genes leiden
CCTCAACCATACAGGG-1_1 False 2276 5434.0 38.0 0.699301 0.0 0.000000 2275 11
ACAAAGATCCACAGGC-1_1 False 2310 5504.0 1.0 0.018169 1.0 0.018169 2308 2
ATACTTCCAAGGTCTT-1_1 False 2122 5252.0 1.0 0.019040 4.0 0.076161 2120 2
GGGTCACTCTATTCGT-1_1 False 2286 5056.0 0.0 0.000000 1.0 0.019778 2280 2
GATGATCCACAACCGC-1_1 False 2263 4938.0 2.0 0.040502 2.0 0.040502 2262 2
... ... ... ... ... ... ... ... ... ...
CGCCATTGTTTGCCGG-1_1 False 355 410.0 3.0 0.731707 0.0 0.000000 355 7
CACTGGGGTCTACGTA-1_1 False 346 403.0 0.0 0.000000 0.0 0.000000 346 1
CATACTTGTAGAGGAA-1_1 False 360 410.0 2.0 0.487805 0.0 0.000000 360 0
TTTGGTTTCCACGGAC-1_1 False 299 405.0 0.0 0.000000 2.0 0.493827 299 5
ATGCATGAGTCATGAA-1_1 False 351 411.0 0.0 0.000000 0.0 0.000000 350 1

5518 rows × 9 columns

We can now color the UMAP embedding by the Leiden clusters.

[104]:
sc.pl.umap(adata, color=['leiden'])
_images/02-analysis-review_191_0.svg

Save/load checkpoint§

Save current data with cluster information to file.

[105]:
adata.write(f"{output_directory}/adata_clustered.h5ad", compression="gzip")
[106]:
adata
[106]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Variables can be reloaded if necessary.

[107]:
bdata_clustered = None
bdata_clustered = sc.read_h5ad(f"{output_directory}/adata_clustered.h5ad")
bdata_clustered
[107]:
AnnData object with n_obs × n_vars = 5518 × 2000
    obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'leiden'
    var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Sample integration§

Batch processing§

See the command line program scpreprocess. It can be installed with

$ pip install -e .

and run, for example

$ scpreprocess  --input-directory="data/GSE171524/supplementary/"

where the contents of input-directory were, in this case, created by running download_geo_data.sh on GSE171524. scpreprocess requires about 30 GB of RAM and about 30-60 minutes to complete with one NVIDIA Tesla T4 GPU.

Filter genes§

[12]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined.h5ad")
adata
[12]:
AnnData object with n_obs × n_vars = 98878 × 34546
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
[13]:
sc.pp.filter_genes(adata, min_cells = 10)
[14]:
adata.X = csr_matrix(adata.X)
[15]:
adata.write(f"{output_directory}/adata_combined_filtered.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_combined_filtered.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
[17]:
adata.obs.groupby('Sample').count()
[17]:
doublet n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
Sample
C51ctr 5501 5501 5501 5501 5501 5501 5501 5501
C52ctr 4055 4055 4055 4055 4055 4055 4055 4055
C53ctr 6241 6241 6241 6241 6241 6241 6241 6241
C54ctr 3961 3961 3961 3961 3961 3961 3961 3961
C55ctr 5088 5088 5088 5088 5088 5088 5088 5088
C56ctr 3613 3613 3613 3613 3613 3613 3613 3613
C57ctr 4242 4242 4242 4242 4242 4242 4242 4242
L01cov 2645 2645 2645 2645 2645 2645 2645 2645
L03cov 3583 3583 3583 3583 3583 3583 3583 3583
L04cov 3062 3062 3062 3062 3062 3062 3062 3062
L04covaddon 3933 3933 3933 3933 3933 3933 3933 3933
L05cov 2382 2382 2382 2382 2382 2382 2382 2382
L06cov 5737 5737 5737 5737 5737 5737 5737 5737
L07cov 4289 4289 4289 4289 4289 4289 4289 4289
L08cov 3474 3474 3474 3474 3474 3474 3474 3474
L09cov 3117 3117 3117 3117 3117 3117 3117 3117
L10cov 1357 1357 1357 1357 1357 1357 1357 1357
L11cov 2682 2682 2682 2682 2682 2682 2682 2682
L12cov 3298 3298 3298 3298 3298 3298 3298 3298
L13cov 4293 4293 4293 4293 4293 4293 4293 4293
L15cov 3635 3635 3635 3635 3635 3635 3635 3635
L16cov 1593 1593 1593 1593 1593 1593 1593 1593
L17cov 3952 3952 3952 3952 3952 3952 3952 3952
L18cov 2440 2440 2440 2440 2440 2440 2440 2440
L19cov 2162 2162 2162 2162 2162 2162 2162 2162
L21cov 2917 2917 2917 2917 2917 2917 2917 2917
L22cov 5626 5626 5626 5626 5626 5626 5626 5626
[18]:
sc.pp.filter_genes(adata, min_cells = 100)
[19]:
adata
[19]:
AnnData object with n_obs × n_vars = 98878 × 20625
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
[20]:
adata.layers['counts'] = adata.X.copy()
[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 98878 × 20625
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'
    layers: 'counts'

Normalize§

[22]:
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
adata.raw = adata
[23]:
adata.obs.head()
[23]:
Sample doublet n_genes n_genes_by_counts total_counts total_counts_mt pct_counts_mt total_counts_ribo pct_counts_ribo
CTGGCAGCACTCCTGT-1_21 L15cov False 1889 1889 4812.0 0.0 0.0 9.0 0.187032
TCCCACAAGCGACTTT-1_21 L15cov False 2125 2125 4318.0 0.0 0.0 0.0 0.000000
CCTCCAAAGACCAACG-1_21 L15cov False 2099 2099 4279.0 0.0 0.0 5.0 0.116850
GGTTGTAAGGTTACCT-1_21 L15cov False 2019 2019 4194.0 0.0 0.0 1.0 0.023844
GGGATGACAGTATTCG-1_21 L15cov False 2077 2077 4198.0 0.0 0.0 5.0 0.119104
[24]:
scvi.model.SCVI.setup_anndata(adata, layer = "counts",
                             categorical_covariate_keys=["Sample"], # batch, technology
                             continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'])
[25]:
model = scvi.model.SCVI(adata)
[26]:
model.train()
/usr/lib/python3.10/site-packages/scvi/model/base/_training_mixin.py:67: UserWarning: max_epochs=81 is less than n_epochs_kl_warmup=400. The max_kl_weight will not be reached during training.
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.42s/it, loss=2.76e+03, v_num=1]
`Trainer.fit` stopped: `max_epochs=81` reached.
Epoch 81/81: 100% 81/81 [17:06<00:00, 12.68s/it, loss=2.76e+03, v_num=1]
[28]:
# pickle.dump([model], open(f"{output_directory}/model_combined.p", "wb"))
model.save(f"{output_directory}/model_combined/")
# model = scvi.model.SCVI.load(f"{output_directory}/model_combined/")
[29]:
adata.obsm['X_scVI'] = model.get_latent_representation()
[30]:
adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)

Cluster§

[31]:
sc.pp.neighbors(adata, use_rep = 'X_scVI')
[33]:
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution = 0.5)
[36]:
sc.pl.umap(adata, color = ['leiden'], frameon = True)
_images/02-analysis-review_224_0.svg
[37]:
sc.pl.umap(adata, color = ['Sample'], frameon = True)
_images/02-analysis-review_225_0.svg

Save/load checkpoint§

[38]:
adata.write(f"{output_directory}/adata_integrated.h5ad", compression="gzip")
[16]:
adata = None
adata = sc.read_h5ad(f"{output_directory}/adata_integrated.h5ad")
adata
[16]:
AnnData object with n_obs × n_vars = 98878 × 29029
    obs: 'Sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'n_cells'

Marker identification§

Increase the clustering resolution from 0.5 to the default value of 1.

[39]:
sc.tl.leiden(adata, resolution = 1)
[40]:
sc.tl.rank_genes_groups(adata, 'leiden')
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:394: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'names'] = self.var_names[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:396: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'scores'] = scores[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:399: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals'] = pvals[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:409: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'pvals_adj'] = pvals_adj[global_indices]
/usr/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  self.stats[group_name, 'logfoldchanges'] = np.log2(

Plot the top 20 genes that distinguish each cluster.

[42]:
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
_images/02-analysis-review_234_0.svg

Filter markers by adjusted p-value and log fold change.

[41]:
markers = sc.get.rank_genes_groups_df(adata, None)
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > .5)]
markers
[41]:
group names scores logfoldchanges pvals pvals_adj
0 0 CTSB 157.684479 4.972445 0.000000 0.000000
1 0 LRMDA 137.313080 2.813497 0.000000 0.000000
2 0 PLXDC2 135.515442 3.123800 0.000000 0.000000
3 0 ZEB2 133.443405 3.335379 0.000000 0.000000
4 0 FMN1 133.045715 3.743495 0.000000 0.000000
... ... ... ... ... ... ...
578222 28 CCL3 2.141156 4.313347 0.037839 0.048973
578223 28 VKORC1 2.140682 3.905211 0.037879 0.049022
578224 28 NDUFA2 2.137703 4.700588 0.038135 0.049349
578225 28 ZMPSTE24 2.136283 2.764338 0.038256 0.049503
578226 28 LRPAP1 2.135819 2.313233 0.038296 0.049551

84872 rows × 6 columns
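With the filtered `markers` table in hand, a convenient follow-up is to pull the strongest few markers per cluster with a plain pandas group-by. A minimal sketch on a toy frame with the same columns as `sc.get.rank_genes_groups_df` returns (the gene statistics below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the `markers` frame; values are invented for illustration.
markers = pd.DataFrame({
    'group': ['0', '0', '0', '1', '1'],
    'names': ['CTSB', 'LRMDA', 'PLXDC2', 'CD3E', 'PTPRC'],
    'logfoldchanges': [4.97, 2.81, 3.12, 3.50, 2.10],
    'pvals_adj': [0.0, 0.0, 0.0, 0.001, 0.04],
})

# Same filter as in the notebook: significant, up-regulated markers.
markers = markers[(markers.pvals_adj < 0.05) & (markers.logfoldchanges > 0.5)]

# Top 2 markers per cluster, ranked by log fold change.
top = (markers.sort_values('logfoldchanges', ascending=False)
              .groupby('group')
              .head(2))
print(top[['group', 'names']])
```

On the real `markers` frame the same `sort_values(...).groupby('group').head(n)` chain gives a compact per-cluster marker shortlist for annotation.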

Estimate differential expression for the Leiden clusters with the trained scVI model.

[43]:
markers_scvi = model.differential_expression(groupby = 'leiden')
markers_scvi
DE...: 100% 29/29 [04:38<00:00,  9.61s/it]
[43]:
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
SPP1 0.9898 0.0102 4.575114 0.000638 0.000048 0.0 0.25 6.495259 6.727366 3.440886 ... 1.076499 0.040538 0.181072 0.009292 6.256733 0.294361 True 0 vs Rest 0 Rest
CCL18 0.9898 0.0102 4.575114 0.000326 0.000043 0.0 0.25 6.902542 7.447420 4.287405 ... 0.602506 0.042989 0.168400 0.017725 3.187437 0.270629 True 0 vs Rest 0 Rest
APOC1 0.9896 0.0104 4.555494 0.000148 0.000016 0.0 0.25 6.963109 7.430366 3.897622 ... 0.276810 0.019038 0.135315 0.012533 1.356255 0.119619 True 0 vs Rest 0 Rest
C1QA 0.9894 0.0106 4.536244 0.000333 0.000045 0.0 0.25 7.422565 7.997816 4.413615 ... 0.539626 0.045370 0.252483 0.025287 3.077946 0.308842 True 0 vs Rest 0 Rest
TSPAN5 0.9890 0.0110 4.498798 0.000006 0.000235 0.0 0.25 -5.034511 -5.320078 3.038231 ... 0.002972 0.244487 0.002738 0.142439 0.029674 2.230635 True 0 vs Rest 0 Rest
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PCBP2 0.6028 0.3972 0.417146 0.000112 0.000126 0.0 0.25 -0.146613 -0.167634 0.442000 ... 0.133333 0.161815 0.133333 0.140631 1.616527 1.317526 False 28 vs Rest 28 Rest
API5 0.6020 0.3980 0.413805 0.000040 0.000042 0.0 0.25 -0.024984 -0.050644 0.501002 ... 0.088889 0.053969 0.088889 0.050580 0.924425 0.441995 False 28 vs Rest 28 Rest
SEPTIN2 0.5982 0.4018 0.397971 0.000164 0.000179 0.0 0.25 -0.067447 -0.083257 0.518050 ... 0.244444 0.243477 0.222222 0.197394 2.164344 2.037301 False 28 vs Rest 28 Rest
PI4KB 0.5702 0.4298 0.282667 0.000048 0.000047 0.0 0.25 0.065837 0.066301 0.450776 ... 0.022222 0.058350 0.022222 0.053980 0.204813 0.473826 False 28 vs Rest 28 Rest
YWHAE 0.5520 0.4480 0.208755 0.000132 0.000140 0.0 0.25 -0.079388 -0.062251 0.425387 ... 0.088889 0.171559 0.088889 0.148847 0.608118 1.433012 False 28 vs Rest 28 Rest

598125 rows × 22 columns

Filter differentially expressed genes by FDR and log fold change.

[47]:
markers_scvi = markers_scvi[(markers_scvi['is_de_fdr_0.05']) & (markers_scvi.lfc_mean > .5)]
markers_scvi
[47]:
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
SPP1 0.9898 0.0102 4.575114 0.000638 0.000048 0.0 0.25 6.495259 6.727366 3.440886 ... 1.076499 0.040538 0.181072 0.009292 6.256733 0.294361 True 0 vs Rest 0 Rest
CCL18 0.9898 0.0102 4.575114 0.000326 0.000043 0.0 0.25 6.902542 7.447420 4.287405 ... 0.602506 0.042989 0.168400 0.017725 3.187437 0.270629 True 0 vs Rest 0 Rest
APOC1 0.9896 0.0104 4.555494 0.000148 0.000016 0.0 0.25 6.963109 7.430366 3.897622 ... 0.276810 0.019038 0.135315 0.012533 1.356255 0.119619 True 0 vs Rest 0 Rest
C1QA 0.9894 0.0106 4.536244 0.000333 0.000045 0.0 0.25 7.422565 7.997816 4.413615 ... 0.539626 0.045370 0.252483 0.025287 3.077946 0.308842 True 0 vs Rest 0 Rest
MCEMP1 0.9888 0.0112 4.480577 0.000039 0.000008 0.0 0.25 5.592349 5.895313 3.543164 ... 0.061634 0.013300 0.045366 0.010071 0.315246 0.083767 True 0 vs Rest 0 Rest
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
AL158206.1 0.8942 0.1058 2.134379 0.000003 0.000002 0.0 0.25 1.058239 0.996449 1.515791 ... 0.000000 0.002145 0.000000 0.002135 0.000000 0.016850 True 28 vs Rest 28 Rest
AC004847.1 0.8942 0.1058 2.134379 0.000002 0.000002 0.0 0.25 0.636359 0.664043 1.732896 ... 0.000000 0.001639 0.000000 0.001609 0.000000 0.014673 True 28 vs Rest 28 Rest
SNHG17 0.8942 0.1058 2.134379 0.000014 0.000008 0.0 0.25 1.102903 1.107450 1.149856 ... 0.000000 0.011150 0.000000 0.010796 0.000000 0.087149 True 28 vs Rest 28 Rest
BAIAP2 0.8942 0.1058 2.134379 0.000056 0.000046 0.0 0.25 1.501894 0.933272 2.498250 ... 0.066667 0.063237 0.066667 0.052938 0.852857 0.478457 True 28 vs Rest 28 Rest
ZNF530 0.8942 0.1058 2.134379 0.000003 0.000002 0.0 0.25 0.892193 0.836232 1.560018 ... 0.000000 0.002256 0.000000 0.002216 0.000000 0.019131 True 28 vs Rest 28 Rest

92424 rows × 22 columns
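Genes called by both the Wilcoxon-style `rank_genes_groups` test and the scVI model are higher-confidence markers, so intersecting the two filtered tables per cluster is a useful sanity check. A minimal sketch using toy stand-ins with the relevant columns (`names`/`group` from `markers`, gene-indexed rows with `group1` from `markers_scvi`; the genes below are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the two filtered marker tables; in the notebook,
# `markers` comes from sc.get.rank_genes_groups_df and `markers_scvi`
# from model.differential_expression. Gene names are invented here.
markers = pd.DataFrame({
    'group': ['0', '0', '1'],
    'names': ['CTSB', 'PLXDC2', 'CD3E'],
})
markers_scvi = pd.DataFrame({
    'group1': ['0', '0', '1'],
}, index=['CTSB', 'APOC1', 'CD3E'])

# Per-cluster intersection of the two marker calls.
shared = {
    g: sorted(set(sub.names) & set(markers_scvi.index[markers_scvi.group1 == g]))
    for g, sub in markers.groupby('group')
}
print(shared)  # {'0': ['CTSB'], '1': ['CD3E']}
```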

[50]:
sc.pl.umap(adata, color = ['leiden'], frameon = False, legend_loc = "on data")
_images/02-analysis-review_242_0.svg
Visualize canonical marker genes on the UMAP: PTPRC (immune cells), CD3E with CD4 or CD8A (T-cell subsets), and EPCAM/MUC1 (epithelial cells).

[ ]:
sc.pl.umap(adata, color = ['PTPRC', 'CD3E', 'CD4'], frameon = False, layer = 'scvi_normalized', vmax = 5)
sc.pl.umap(adata, color = ['PTPRC', 'CD3E', 'CD8A'], frameon = False, layer = 'scvi_normalized', vmax = 5)

sc.pl.umap(adata, color = ['EPCAM', 'MUC1'], frameon = False, layer = 'scvi_normalized', vmax = 5)
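Once the marker plots identify each cluster, the usual next step is to attach cell-type labels to the Leiden clusters. A minimal sketch with plain pandas; the cluster-to-label mapping below is hypothetical (the real labels come from inspecting markers such as PTPRC/CD3E or EPCAM/MUC1), and in the notebook it would be applied as `adata.obs['cell_type'] = adata.obs['leiden'].map(cell_dict)`:

```python
import pandas as pd

# Hypothetical cluster-to-cell-type mapping, for illustration only.
cell_dict = {'0': 'Macrophage', '1': 'T cell', '2': 'Epithelial'}

# Stand-in for adata.obs['leiden'].
leiden = pd.Series(['0', '1', '1', '2'])
cell_type = leiden.map(cell_dict)
print(cell_type.tolist())  # ['Macrophage', 'T cell', 'T cell', 'Epithelial']
```

`Series.map` returns `NaN` for any cluster missing from the dict, which makes unlabeled clusters easy to spot before committing the annotation.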