This page was generated from source/03-velocity-review.ipynb.

Velocity review§

Here we review basic RNA velocity analysis.

This is a modified version of the scvelo basic velocity analysis tutorial.

The primary data set is development of the endocrine pancreas, which demonstrates lineage commitment to four distinct fates: α, β, δ and ε-cells. See here for more details.

Setup§

Import libraries§

[1]:

# update to the latest version, if not done yet.
# !pip install scvelo --upgrade --quiet

[1]:

from inspect import getmembers
from pprint import pprint
from types import FunctionType

import scanpy as sc
import scvelo as scv

[2]:

scv.logging.print_version()

Running scvelo 0.2.4 (python 3.10.7) on 2022-09-27 02:48.

ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.

[3]:

scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization

Setup plotting§

[4]:

import matplotlib.font_manager
import matplotlib.pyplot as plt

# import matplotlib_inline

[5]:

# fonts_path = "/usr/share/texmf/fonts/opentype/public/lm/" #ubuntu
# fonts_path = "~/Library/Fonts/" # macos
fonts_path = "/usr/share/fonts/OTF/"  # arch
# user_path = "$HOME/" # user
# fonts_path = user_path + "fonts/latinmodern/opentype/public/lm/"  # home
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmsans10-regular.otf")
matplotlib.font_manager.fontManager.addfont(fonts_path + "lmroman10-regular.otf")

[6]:

# https://stackoverflow.com/a/36622238/446907
%config InlineBackend.figure_formats = ['svg']

[7]:

plt.style.use("default")  # reset default parameters
# https://stackoverflow.com/a/3900167/446907
plt.rcParams.update(
    {
        "font.size": 16,
        "font.family": ["sans-serif"],
        "font.serif": ["Latin Modern Roman"] + plt.rcParams["font.serif"],
        "font.sans-serif": ["Latin Modern Sans"] + plt.rcParams["font.sans-serif"],
    }
)

Utility functions§

[8]:

def attributes(obj):
    """
    get object attributes
    """
    disallowed_names = {
        name for name, value in getmembers(type(obj)) if isinstance(value, FunctionType)
    }
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
    }


def print_attributes(obj):
    """
    print object attributes
    """
    pprint(attributes(obj))

Load data§

load an example dataset into an AnnData object
- the data set used here contains precomputed spliced/unspliced reads
- spliced/unspliced reads can be computed with velocyto.py and kb-python
view the detailed structure of the object
estimate relative proportions of spliced/unspliced reads in the data

Load AnnData object§

The analysis is based on the in-built pancreas data. To run velocity analysis on your own data, read your file (loom, h5ad, csv …) to an AnnData object with adata = scv.read('path/file.loom', cache=True). If you want to merge your loom file into an already existing AnnData object, use scv.utils.merge(adata, adata_loom).

[9]:

adata = scv.datasets.pancreas()
adata

[9]:

AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'distances', 'connectivities'

scVelo is based on adata, an object that stores a data matrix adata.X, annotation of observations adata.obs, variables adata.var, and unstructured annotations adata.uns. Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively. AnnData objects can be sliced like dataframes, for example, adata_subset = adata[:, list_of_gene_names]. For more details, see the anndata docs.

Detailed view of adata§

Observations are indexed by barcodes that are intended to correspond to cells after preprocessing. There are four components to the annotation vector. Two of these are fine- and coarse-grained clusters into cell types. The others are scores S_score and G2M_score that assess expression of G2/M and S phase marker genes.

[11]:

adata.obs

[11]:

	clusters_coarse	clusters	S_score	G2M_score
index
AAACCTGAGAGGGATA	Pre-endocrine	Pre-endocrine	-0.224902	-0.252071
AAACCTGAGCCTTGAT	Ductal	Ductal	-0.014707	-0.232610
AAACCTGAGGCAATTA	Endocrine	Alpha	-0.171255	-0.286834
AAACCTGCATCATCCC	Ductal	Ductal	0.599244	0.191243
AAACCTGGTAAGTGGC	Ngn3 high EP	Ngn3 high EP	-0.179981	-0.126030
...	...	...	...	...
TTTGTCAAGTGACATA	Pre-endocrine	Pre-endocrine	-0.235896	-0.266101
TTTGTCAAGTGTGGCA	Ngn3 high EP	Ngn3 high EP	0.279374	-0.204047
TTTGTCAGTTGTTTGG	Ductal	Ductal	-0.045692	-0.208907
TTTGTCATCGAATGCT	Endocrine	Alpha	-0.240576	-0.206865
TTTGTCATCTGTTTGT	Endocrine	Epsilon	-0.136407	-0.184763

3696 rows × 4 columns

Gene names are saved

[12]:

adata.var

[12]:

	highly_variable_genes
index
Xkr4	False
Gm37381	NaN
Rp1	NaN
Rp1-1	NaN
Sox17	NaN
...	...
Gm28672	NaN
Gm28670	NaN
Gm29504	NaN
Gm20837	NaN
Erdr1	True

27998 rows × 1 columns

[13]:

adata.obs_names

[13]:

Index(['AAACCTGAGAGGGATA', 'AAACCTGAGCCTTGAT', 'AAACCTGAGGCAATTA',
       'AAACCTGCATCATCCC', 'AAACCTGGTAAGTGGC', 'AAACCTGGTATTAGCC',
       'AAACCTGTCCCTCTTT', 'AAACCTGTCTTTCCTC', 'AAACGGGAGACAATAC',
       'AAACGGGAGATATGGT',
       ...
       'TTTGGTTCACCAGATT', 'TTTGGTTCACGAAGCA', 'TTTGGTTTCACTTACT',
       'TTTGGTTTCCTTTCGG', 'TTTGTCAAGAATGTGT', 'TTTGTCAAGTGACATA',
       'TTTGTCAAGTGTGGCA', 'TTTGTCAGTTGTTTGG', 'TTTGTCATCGAATGCT',
       'TTTGTCATCTGTTTGT'],
      dtype='object', name='index', length=3696)

[14]:

adata.var_names

[14]:

Index(['Xkr4', 'Gm37381', 'Rp1', 'Rp1-1', 'Sox17', 'Gm37323', 'Mrpl15',
       'Rgs20', 'Npbwr1', '4732440D04Rik',
       ...
       'Gm28406', 'Gm29436', 'Gm28407', 'Gm29393', 'Gm21294', 'Gm28672',
       'Gm28670', 'Gm29504', 'Gm20837', 'Erdr1'],
      dtype='object', name='index', length=27998)

There are two layers corresponding to spliced and unspliced transcripts respectively.

[15]:

adata.layers['spliced']

[15]:

<3696x27998 sparse matrix of type '<class 'numpy.float32'>'
        with 9298890 stored elements in Compressed Sparse Row format>

[16]:

adata.layers['unspliced']

[16]:

<3696x27998 sparse matrix of type '<class 'numpy.float32'>'
        with 3156504 stored elements in Compressed Sparse Row format>

PCA and UMAP have retained 50 and 2 dimensions respectively.

[17]:

print(adata.obsm)
print(adata.obsm['X_pca'].shape)
print(adata.obsm)
print(adata.obsm['X_umap'].shape)

AxisArrays with keys: X_pca, X_umap
(3696, 50)
AxisArrays with keys: X_pca, X_umap
(3696, 2)

[18]:

print(adata.varm)

AxisArrays with keys:

[19]:

print(adata.obsp)
print(adata.obsp['distances'].shape)
print(adata.obsp)
print(adata.obsp['connectivities'].shape)

PairwiseArrays with keys: distances, connectivities
(3696, 3696)
PairwiseArrays with keys: distances, connectivities
(3696, 3696)

[20]:

print(adata.varp)

PairwiseArrays with keys:

[21]:

print_attributes(adata)

{'T': AnnData object with n_obs × n_vars = 27998 × 3696
    obs: 'highly_variable_genes'
    var: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    varm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    varp: 'distances', 'connectivities',
 'X': <3696x27998 sparse matrix of type '<class 'numpy.float32'>'
        with 9298890 stored elements in Compressed Sparse Row format>,
 'file': Backing file manager: no file is set.,
 'filename': None,
 'is_view': False,
 'isbacked': False,
 'isview': False,
 'layers': Layers with keys: spliced, unspliced,
 'n_obs': 3696,
 'n_vars': 27998,
 'obs':                  clusters_coarse       clusters   S_score  G2M_score
index
AAACCTGAGAGGGATA   Pre-endocrine  Pre-endocrine -0.224902  -0.252071
AAACCTGAGCCTTGAT          Ductal         Ductal -0.014707  -0.232610
AAACCTGAGGCAATTA       Endocrine          Alpha -0.171255  -0.286834
AAACCTGCATCATCCC          Ductal         Ductal  0.599244   0.191243
AAACCTGGTAAGTGGC    Ngn3 high EP   Ngn3 high EP -0.179981  -0.126030
...                          ...            ...       ...        ...
TTTGTCAAGTGACATA   Pre-endocrine  Pre-endocrine -0.235896  -0.266101
TTTGTCAAGTGTGGCA    Ngn3 high EP   Ngn3 high EP  0.279374  -0.204047
TTTGTCAGTTGTTTGG          Ductal         Ductal -0.045692  -0.208907
TTTGTCATCGAATGCT       Endocrine          Alpha -0.240576  -0.206865
TTTGTCATCTGTTTGT       Endocrine        Epsilon -0.136407  -0.184763

[3696 rows x 4 columns],
 'obs_names': Index(['AAACCTGAGAGGGATA', 'AAACCTGAGCCTTGAT', 'AAACCTGAGGCAATTA',
       'AAACCTGCATCATCCC', 'AAACCTGGTAAGTGGC', 'AAACCTGGTATTAGCC',
       'AAACCTGTCCCTCTTT', 'AAACCTGTCTTTCCTC', 'AAACGGGAGACAATAC',
       'AAACGGGAGATATGGT',
       ...
       'TTTGGTTCACCAGATT', 'TTTGGTTCACGAAGCA', 'TTTGGTTTCACTTACT',
       'TTTGGTTTCCTTTCGG', 'TTTGTCAAGAATGTGT', 'TTTGTCAAGTGACATA',
       'TTTGTCAAGTGTGGCA', 'TTTGTCAGTTGTTTGG', 'TTTGTCATCGAATGCT',
       'TTTGTCATCTGTTTGT'],
      dtype='object', name='index', length=3696),
 'obsm': AxisArrays with keys: X_pca, X_umap,
 'obsp': PairwiseArrays with keys: distances, connectivities,
 'raw': None,
 'shape': (3696, 27998),
 'uns': OverloadedDict, wrapping:
        {'clusters_coarse_colors': array(['#031A5C', '#fa9fb5', '#fdbf6f', '#ff7f00', '#b2df8a'],
      dtype=object), 'clusters_colors': array(['#8fbc8f', '#f4a460', '#fdbf6f', '#ff7f00', '#b2df8a', '#1f78b4',
       '#6a3d9a', '#cab2d6'], dtype=object), 'day_colors': array(['#d62728'], dtype=object), 'neighbors': {'params': {'method': array(['umap'], dtype=object), 'n_neighbors': array([15])}}, 'pca': {'variance': array([84.01987   , 56.14541   , 24.902763  , 18.046164  , 13.804973  ,
       11.931583  ,  9.848511  ,  5.831464  ,  4.772083  ,  4.364034  ,
        3.9239557 ,  3.4221544 ,  2.7514822 ,  2.4077806 ,  2.2492352 ,
        2.1247294 ,  2.067351  ,  1.7702986 ,  1.7164644 ,  1.5288209 ,
        1.4669604 ,  1.3977634 ,  1.341725  ,  1.2898475 ,  1.1913112 ,
        1.1588011 ,  1.1220304 ,  1.087722  ,  1.0535891 ,  0.999956  ,
        0.9759477 ,  0.96313083,  0.92936826,  0.91443396,  0.89059216,
        0.8624986 ,  0.85675824,  0.807511  ,  0.79329145,  0.779374  ,
        0.7724183 ,  0.7326805 ,  0.7189729 ,  0.69933045,  0.67954636,
        0.67046547,  0.63618135,  0.63056374,  0.6286348 ,  0.6084896 ],
      dtype=float32), 'variance_ratio': array([0.12779653, 0.08539871, 0.03787779, 0.02744871, 0.02099774,
       0.01814827, 0.01497986, 0.00886982, 0.00725847, 0.00663782,
       0.00596844, 0.00520519, 0.00418508, 0.0036623 , 0.00342115,
       0.00323177, 0.0031445 , 0.00269267, 0.00261079, 0.00232538,
       0.00223129, 0.00212604, 0.0020408 , 0.00196189, 0.00181202,
       0.00176257, 0.00170664, 0.00165445, 0.00160254, 0.00152096,
       0.00148444, 0.00146495, 0.00141359, 0.00139088, 0.00135462,
       0.00131188, 0.00130315, 0.00122825, 0.00120662, 0.00118545,
       0.00117487, 0.00111443, 0.00109358, 0.0010637 , 0.00103361,
       0.0010198 , 0.00096765, 0.0009591 , 0.00095617, 0.00092553],
      dtype=float32)}}
With overloaded keys:
        ['neighbors'].,
 'var':         highly_variable_genes
index
Xkr4                    False
Gm37381                   NaN
Rp1                       NaN
Rp1-1                     NaN
Sox17                     NaN
...                       ...
Gm28672                   NaN
Gm28670                   NaN
Gm29504                   NaN
Gm20837                   NaN
Erdr1                    True

[27998 rows x 1 columns],
 'var_names': Index(['Xkr4', 'Gm37381', 'Rp1', 'Rp1-1', 'Sox17', 'Gm37323', 'Mrpl15',
       'Rgs20', 'Npbwr1', '4732440D04Rik',
       ...
       'Gm28406', 'Gm29436', 'Gm28407', 'Gm29393', 'Gm21294', 'Gm28672',
       'Gm28670', 'Gm29504', 'Gm20837', 'Erdr1'],
      dtype='object', name='index', length=27998),
 'varm': AxisArrays with keys: ,
 'varp': PairwiseArrays with keys: }

/tmp/ipykernel_131/22660072.py:11: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  if name[0] != "_" and name not in disallowed_names and hasattr(obj, name)
/tmp/ipykernel_131/22660072.py:9: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  name: getattr(obj, name)

Show spliced/unspliced§

Here, the proportions of spliced/unspliced counts are displayed. Depending on the protocol used (Drop-Seq, Smart-Seq), we typically have between 10%-25% of unspliced molecules containing intronic sequences. We also advice you to examine the variations on cluster level to verify consistency in splicing efficiency. Here, we find variations as expected, with slightly lower unspliced proportions at cycling ductal cells, then higher proportion at cell fate commitment in Ngn3-high and Pre-endocrine cells where many genes start to be transcribed.

[22]:

scv.pl.proportions(adata)

Preprocess the Data§

Preprocessing requisites consist of gene selection by detection (with a minimum number of counts) and high variability (dispersion), normalizing every cell by its total size and logarithmizing X. Filtering and normalization is applied in the same vein to spliced/unspliced counts and X. Logarithmizing is only applied to X. If X is already preprocessed from former analysis, it will not be touched.

All of this is summarized in a single function scv.pp.filter_and_normalize, which essentially runs the following:

scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)

Further, we need the first and second order moments (means and uncentered variances) computed among nearest neighbors in PCA space, summarized in scv.pp.moments, which internally computes scv.pp.pca and scv.pp.neighbors. First order is needed for deterministic velocity estimation, while stochastic estimation also requires second order moments.

[23]:

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

Filtered out 20801 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
computing neighbors
    finished (0:00:15) --> added
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:00) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

Further preprocessing (such as batch effect correction) may be used to remove unwanted sources of variability. See the best practices for further details. Note, that any additional preprocessing step only affects X and is not applied to spliced/unspliced counts.

Compute RNA velocity§

Estimate RNA velocity§

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting mode='deterministic'). For each gene, a steady-state-ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

The solution to the full dynamical model is obtained by setting mode='dynamical', which requires to run scv.tl.recover_dynamics(adata) beforehand. We will elaborate more on the dynamical model in the next tutorial.

[24]:

scv.tl.velocity(adata)

computing velocities
    finished (0:00:01) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)

The computed velocities are stored in adata.layers just like the count matrices.

The combination of velocities across genes can then be used to estimate the future state of an individual cell. In order to project the velocities into a lower-dimensional embedding, transition probabilities of cell-to-cell transitions are estimated. That is, for each velocity vector we find the likely cell transitions that are accordance with that direction. The transition probabilities are computed using cosine correlation between the potential cell-to-cell transitions and the velocity vector, and are stored in a matrix denoted as velocity graph. The resulting velocity graph has dimension \(n_{obs} \times n_{obs}\) and summarizes the possible cell state changes that are well explained through the velocity vectors (for runtime speedup it can also be computed on reduced PCA space by setting approx=True).

[25]:

scv.tl.velocity_graph(adata)

computing velocity graph (using 1/4 cores)

    finished (0:00:15) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

For a variety of applications, the velocity graph can be converted to a transition matrix by applying a Gaussian kernel to transform the cosine correlations into actual transition probabilities. You can access the Markov transition matrix via scv.utils.get_transition_matrix.

As mentioned, it is internally used to project the velocities into a low-dimensional embedding by applying the mean transition with respect to the transition probabilities, obtained with scv.tl.velocity_embedding. Further, we can trace cells along the Markov chain to their origins and potential fates, thereby getting root cells and end points within a trajectory, obtained via scv.tl.terminal_states.

Project the velocities§

Finally, the velocities are projected onto any embedding, specified by basis, and visualized in one of these ways: - on cellular level with scv.pl.velocity_embedding, - as gridlines with scv.pl.velocity_embedding_grid, - or as streamlines with scv.pl.velocity_embedding_stream.

Note, that the data has an already pre-computed UMAP embedding, and annotated clusters. When applying to your own data, these can be obtained with scv.tl.umap and scv.tl.louvain. For more details, see the scanpy tutorial. Further, all plotting functions are defaulted to using basis='umap' and color='clusters', which you can set accordingly.

[26]:

scv.pl.velocity_embedding_stream(adata, basis='umap')

computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

The velocity vector field displayed as streamlines yields fine-grained insights into the developmental processes. It accurately delineates the cycling population of ductal cells and endocrine progenitors. Further, it illuminates cell states of lineage commitment, cell-cycle exit, and endocrine cell differentiation.

The most fine-grained resolution of the velocity vector field we get at single-cell level, with each arrow showing the direction and speed of movement of an individual cell. That reveals, e.g., the early endocrine commitment of Ngn3-cells (yellow) and a clear-cut difference between near-terminal α-cells (blue) and transient β-cells (green).

[27]:

scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120)

Interpret the velocities§

This is perhaps the most important part as we advise the user not to limit biological conclusions to the projected velocities, but to examine individual gene dynamics via phase portraits to understand how inferred directions are supported by particular genes.

See the gif here to get an idea of how to interpret a spliced vs. unspliced phase portrait. Gene activity is orchestrated by transcriptional regulation. Transcriptional induction for a particular gene results in an increase of (newly transcribed) precursor unspliced mRNAs while, conversely, repression or absence of transcription results in a decrease of unspliced mRNAs. Spliced mRNAs is produced from unspliced mRNA and follows the same trend with a time lag. Time is a hidden/latent variable. Thus, the dynamics needs to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.

Now, let us examine the phase portraits of some marker genes, visualized with scv.pl.velocity(adata, gene_names) or scv.pl.scatter(adata, gene_names).

[28]:

scv.pl.velocity(adata, ['Cpe',  'Gnao1', 'Ins2', 'Adk'], ncols=2)

The black line corresponds to the estimated ‘steady-state’ ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

For instance Cpe explains the directionality in the up-regulated Ngn3 (yellow) to Pre-endocrine (orange) to β-cells (green), while Adk explains the directionality in the down-regulated Ductal (dark green) to Ngn3 (yellow) to the remaining endocrine cells.

[29]:

scv.pl.scatter(adata, 'Cpe', color=['clusters', 'velocity'],
               add_outline='Ngn3 high EP, Pre-endocrine, Beta')

Postprocessing§

Identify important genes§

We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population. The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test on a selection of gene candidates.

[30]:

scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)

df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()

ranking velocity genes

/usr/lib/python3.10/site-packages/scvelo/tools/utils.py:501: DeprecationWarning: Please use `rankdata` from the `scipy.stats` namespace, the `scipy.stats.stats` namespace is deprecated.
  from scipy.stats.stats import rankdata

    finished (0:00:02) --> added
    'rank_velocity_genes', sorted scores by group ids (adata.uns)
    'spearmans_score', spearmans correlation scores (adata.var)

/tmp/ipykernel_131/1614465797.py:3: DeprecationWarning: `scvelo.read_load.get_df` is deprecated since scVelo v0.2.4 and will be removed in a future version. Please use `scvelo.core.get_df` instead.
  df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])

[30]:

	Ductal	Ngn3 low EP	Ngn3 high EP	Pre-endocrine	Beta	Alpha	Delta	Epsilon
0	Notch2	Ptpn3	Pde1c	Pam	Pax6	Zcchc16	Zdbf2	Tmcc3
1	Sox5	Hacd1	Ptprs	Sdk1	Unc5c	Nlgn1	Spock3	Heg1
2	Krt19	Hspa8	Pclo	Baiap3	Nnat	Nell1	Akr1c19	Gpr179
3	Hspa8	Gm8113	Rap1gap2	Abcc8	Tmem108	Prune2	Ptprt	Ica1
4	Ano6	Kcnq1	Ttyh2	Gnas	Ptprt	Ksr2	Snap25	Ncoa7

[31]:

kwargs = dict(frameon=False, size=10, linewidth=1.5,
              add_outline='Ngn3 high EP, Pre-endocrine, Beta')

scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs)
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs)

The genes Ptprs, Pclo, Pam, Abcc8, Gnas, for instance, support the directionality from Ngn3 high EP (yellow) to Pre-endocrine (orange) to Beta (green).

Velocities in cycling progenitors§

The cell cycle detected by RNA velocity, is biologically affirmed by cell cycle scores (standardized scores of mean expression levels of phase marker genes).

[32]:

scv.tl.score_genes_cell_cycle(adata)
scv.pl.scatter(adata, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])

calculating cell cycle phase
-->     'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)

For the cycling Ductal cells, we may screen through S and G2M phase markers. The previous module also computed a spearmans correlation score, which we can use to rank/sort the phase marker genes to then display their phase portraits.

[33]:

s_genes, g2m_genes = scv.utils.get_phase_marker_genes(adata)
s_genes = scv.get_df(adata[:, s_genes], 'spearmans_score', sort_values=True).index
g2m_genes = scv.get_df(adata[:, g2m_genes], 'spearmans_score', sort_values=True).index

kwargs = dict(frameon=False, ylabel='cell cycle genes')
scv.pl.scatter(adata, list(s_genes[:2]) + list(g2m_genes[:3]), **kwargs)

Particularly Hells and Top2a are well-suited to explain the vector field in the cycling progenitors. Top2a gets assigned a high velocity shortly before it actually peaks in the G2M phase. There, the negative velocity then perfectly matches the immediately following down-regulation.

[34]:

scv.pl.velocity(adata, ['Hells', 'Top2a'], ncols=2, add_outline=True)

Speed and coherence§

Two more useful stats: - The speed or rate of differentiation is given by the length of the velocity vector. - The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.

[35]:

scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])

--> added 'velocity_length' (adata.obs)
--> added 'velocity_confidence' (adata.obs)
--> added 'velocity_confidence_transition' (adata.obs)

These provide insights where cells differentiate at a slower/faster pace, and where the direction is un-/determined.

On cluster-level, we find that differentiation substantially speeds up after cell cycle exit (Ngn3 low EP), keeping the pace during Beta cell production while slowing down during Alpha cell production.

[36]:

df = adata.obs.groupby('clusters')[keys].mean().T
df.style.background_gradient(cmap='coolwarm', axis=1)

[36]:

clusters	Ductal	Ngn3 low EP	Ngn3 high EP	Pre-endocrine	Beta	Alpha	Delta	Epsilon
velocity_length	5.707904	6.227023	13.257741	13.429206	14.219882	10.491913	7.623143	10.477606
velocity_confidence	0.775042	0.756474	0.896944	0.830438	0.725973	0.752715	0.691188	0.784467

Velocity graph and pseudotime§

We can visualize the velocity graph to portray all velocity-inferred cell-to-cell connections/transitions. It can be confined to high-probability transitions by setting a threshold. The graph, for instance, indicates two phases of Epsilon cell production, coming from early and late Pre-endocrine cells.

[37]:

scv.pl.velocity_graph(adata, threshold=.1)

Further, the graph can be used to draw descendents/anscestors coming from a specified cell. Here, a pre-endocrine cell is traced to its potential fate.

[38]:

x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)

Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells.

Contrarily to diffusion pseudotime, it implicitly infers the root cells and is based on the directed velocity graph instead of the similarity-based diffusion kernel.

[39]:

scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')

computing terminal states
    identified 2 regions of root cells and 1 region of end points .
    finished (0:00:00) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)

PAGA velocity graph§

PAGA graph abstraction has benchmarked as top-performing method for trajectory inference. It provides a graph-like map of the data topology with weighted edges corresponding to the connectivity between two clusters. Here, PAGA is extended by velocity-inferred directionality.

[ ]:

# PAGA requires to install igraph, if not done yet.
# !pip install python-igraph --upgrade --quiet

[41]:

# this is needed due to a current bug - bugfix is coming soon.
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']

scv.tl.paga(adata, groups='clusters')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')

running PAGA using priors: ['velocity_pseudotime']
    finished (0:00:00) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)
    'paga/transitions_confidence', velocity transitions (adata.uns)

---------------------------------------------------------------------------
OptionError                               Traceback (most recent call last)
Cell In [41], line 6
      3 adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']
      5 scv.tl.paga(adata, groups='clusters')
----> 6 df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
      7 df.style.background_gradient(cmap='Blues').format('{:.2g}')

File /usr/lib/python3.10/site-packages/scvelo/core/_anndata.py:196, in get_df(data, keys, layer, index, columns, sort_values, dropna, precision)
    164 """Get dataframe for a specified adata key.
    165
    166 Return values for specified key
   (...)
    192     A dataframe.
    193 """
    195 if precision is not None:
--> 196     pd.set_option("precision", precision)
    198 if isinstance(data, AnnData):
    199     keys, keys_split = (
    200         keys.split("*") if isinstance(keys, str) and "*" in keys else (keys, None)
    201     )

File /usr/lib/python3.10/site-packages/pandas/_config/config.py:256, in CallableDynamicDoc.__call__(self, *args, **kwds)
    255 def __call__(self, *args, **kwds):
--> 256     return self.__func__(*args, **kwds)

File /usr/lib/python3.10/site-packages/pandas/_config/config.py:149, in _set_option(*args, **kwargs)
    146     raise TypeError(f'_set_option() got an unexpected keyword argument "{kwarg}"')
    148 for k, v in zip(args[::2], args[1::2]):
--> 149     key = _get_single_key(k, silent)
    151     o = _get_registered_option(key)
    152     if o and o.validator:

File /usr/lib/python3.10/site-packages/pandas/_config/config.py:116, in _get_single_key(pat, silent)
    114     raise OptionError(f"No such keys(s): {repr(pat)}")
    115 if len(keys) > 1:
--> 116     raise OptionError("Pattern matched multiple keys")
    117 key = keys[0]
    119 if not silent:

OptionError: Pattern matched multiple keys

This reads from left/row to right/column, thus e.g. assigning a confident transition from Ductal to Ngn3 low EP.

This table can be summarized by a directed graph superimposed onto the UMAP embedding.

[23]:

scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5)