Exploring all data from a tissue

This tutorial provides a series of examples of how to explore and query the Census in the context of a single tissue, lung. We will summarize cell and gene metadata, then fetch the single-cell expression counts and perform some basic data exploration with Scanpy.

Contents

  1. Learning about the human lung data.

    1. Learning about cells of the lung.

    2. Learning about genes of the lung.

  2. Fetching all single-cell human lung data from the Census.

  3. Calculating QC metrics of the lung data.

  4. Creating a normalized expression layer and embeddings.

Learning about the lung data in the Census

First we will open the Census. If you are not familiar with the basics of the Census API, you should take a look at the notebook Learning about the CZ CELLxGENE Census.

[1]:
import cellxgene_census
import scanpy as sc
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix

census = cellxgene_census.open_soma()

Let’s first take a look at the number of cells from human lung:

[2]:
summary_table = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

summary_table.query("organism == 'Homo sapiens' & category == 'tissue_general' & label =='lung'")
[2]:
soma_joinid organism category ontology_term_id unique_cell_count total_cell_count label
944 944 Homo sapiens tissue_general UBERON:0002048 2295237 4573212 lung

There you can see the total number of cells under total_cell_count and the number of unique cells under unique_cell_count (i.e. after removing cells that were included in multiple datasets).
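
To make the distinction concrete, here is a minimal sketch on made-up data (the toy_obs table and its values are hypothetical, not Census content). It counts every row for the total and only the rows flagged by is_primary_data for the unique count, then applies the same subtraction to the lung numbers from the summary table.

```python
import pandas as pd

# Hypothetical toy metadata: cells "a" and "b" were re-published in a second dataset.
toy_obs = pd.DataFrame(
    {
        "cell": ["a", "b", "c", "a", "b"],
        "dataset_id": ["d1", "d1", "d1", "d2", "d2"],
        "is_primary_data": [True, True, True, False, False],
    }
)

total_cell_count = len(toy_obs)  # every row, duplicates included
unique_cell_count = int(toy_obs["is_primary_data"].sum())  # primary copies only

# Same arithmetic with the lung values from the summary table.
lung_total, lung_unique = 4573212, 2295237
lung_duplicates = lung_total - lung_unique

print(total_cell_count, unique_cell_count, lung_duplicates)
```

In other words, roughly half of the lung rows in the Census are repeated copies of cells that also appear in another dataset.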

Let’s now take a look at the cell and gene information of this slice of the Census.

Learning about cells of lung data

Let’s load the cell metadata for all lung cells and select only the unique cells using is_primary_data.

[3]:
lung_obs = (
    census["census_data"]["homo_sapiens"]
    .obs.read(value_filter="tissue_general == 'lung' and is_primary_data == True")
    .concat()
    .to_pandas()
)
lung_obs
[3]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... is_primary_data self_reported_ethnicity self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id
0 145215 d8da613f-e681-4c69-b463-e94f5e66847f 10x 3' v3 EFO:0009922 lung ciliated cell CL:1000271 70-year-old human stage HsapDv:0000164 normal PATO:0000461 ... True European HANCESTRO:0005 female PATO:0000383 nucleus lung UBERON:0002048 lung UBERON:0002048
1 145216 d8da613f-e681-4c69-b463-e94f5e66847f 10x 3' v3 EFO:0009922 alveolar macrophage CL:0000583 70-year-old human stage HsapDv:0000164 normal PATO:0000461 ... True European HANCESTRO:0005 female PATO:0000383 nucleus lung UBERON:0002048 lung UBERON:0002048
2 145217 d8da613f-e681-4c69-b463-e94f5e66847f 10x 3' v3 EFO:0009922 lung ciliated cell CL:1000271 70-year-old human stage HsapDv:0000164 normal PATO:0000461 ... True European HANCESTRO:0005 female PATO:0000383 nucleus lung UBERON:0002048 lung UBERON:0002048
3 145218 d8da613f-e681-4c69-b463-e94f5e66847f 10x 3' v3 EFO:0009922 lung ciliated cell CL:1000271 70-year-old human stage HsapDv:0000164 normal PATO:0000461 ... True European HANCESTRO:0005 female PATO:0000383 nucleus lung UBERON:0002048 lung UBERON:0002048
4 145219 d8da613f-e681-4c69-b463-e94f5e66847f 10x 3' v3 EFO:0009922 lung ciliated cell CL:1000271 70-year-old human stage HsapDv:0000164 normal PATO:0000461 ... True European HANCESTRO:0005 female PATO:0000383 nucleus lung UBERON:0002048 lung UBERON:0002048
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2295232 45561670 f64e1be1-de15-4d27-8da4-82225cd4c035 10x 3' v3 EFO:0009922 T cell CL:0000084 78-year-old human stage HsapDv:0000172 lung adenocarcinoma MONDO:0005061 ... True European HANCESTRO:0005 male PATO:0000384 cell lung UBERON:0002048 lung UBERON:0002048
2295233 45562154 f64e1be1-de15-4d27-8da4-82225cd4c035 10x 3' v2 EFO:0009899 T cell CL:0000084 67-year-old human stage HsapDv:0000161 lung adenocarcinoma MONDO:0005061 ... True European HANCESTRO:0005 female PATO:0000383 cell lung UBERON:0002048 lung UBERON:0002048
2295234 45565290 f64e1be1-de15-4d27-8da4-82225cd4c035 10x 3' v2 EFO:0009899 T cell CL:0000084 65-year-old human stage HsapDv:0000159 lung adenocarcinoma MONDO:0005061 ... True European HANCESTRO:0005 female PATO:0000383 cell lung UBERON:0002048 lung UBERON:0002048
2295235 45566613 f64e1be1-de15-4d27-8da4-82225cd4c035 10x 3' v2 EFO:0009899 T cell CL:0000084 80-year-old human stage HsapDv:0000206 lung adenocarcinoma MONDO:0005061 ... True European HANCESTRO:0005 male PATO:0000384 cell lung UBERON:0002048 lung UBERON:0002048
2295236 45567413 f64e1be1-de15-4d27-8da4-82225cd4c035 10x 3' v3 EFO:0009922 T cell CL:0000084 68-year-old human stage HsapDv:0000162 lung adenocarcinoma MONDO:0005061 ... True European HANCESTRO:0005 male PATO:0000384 cell lung UBERON:0002048 lung UBERON:0002048

2295237 rows × 21 columns

You can see that the number of rows represents the total number of unique lung cells in the Census. Now let’s take a deeper dive into the characteristics of these cells.

Datasets

First let’s look at the datasets and collections from CELLxGENE Discover that contribute lung cells. For this we will use the dataset table at census["census_info"]["datasets"], which contains metadata for all datasets used to build this Census.

[4]:
census_datasets = (
    census["census_info"]["datasets"]
    .read(column_names=["collection_name", "dataset_title", "dataset_id", "soma_joinid"])
    .concat()
    .to_pandas()
)
census_datasets = census_datasets.set_index("dataset_id")
census_datasets
[4]:
collection_name dataset_title soma_joinid
dataset_id
f512b8b6-369d-4a85-a695-116e0806857f Transcriptional Programming of Normal and Infl... Skin 0
36c867a7-be10-4e69-9b39-5de12b0af6da Single-Cell Analysis of Crohn’s Disease Lesion... Ileum 1
58b01044-c5e5-4b0f-8a2d-6ebf951e01ff Humoral immunity at the brain borders in homeo... A scRNA-seq atlas of immune cells at the CNS b... 2
456e8b9b-f872-488b-871d-94534090a865 Single-cell atlas of peripheral immune respons... Single-cell atlas of peripheral immune respons... 3
d8da613f-e681-4c69-b463-e94f5e66847f A molecular single-cell lung atlas of lethal C... A molecular single-cell lung atlas of lethal C... 4
... ... ... ...
30498543-4fdd-4f86-9e1b-05c1a1454a6a COVID-19 mRNA vaccine elicits a potent adaptiv... B cells -- CV19 infection, vaccination and HC 524
b5191f01-f67d-44b8-bc8d-511a4ecd07bb COVID-19 mRNA vaccine elicits a potent adaptiv... innate T cells -- CV19 infection, vaccination ... 525
e463dae9-3fc1-476d-870e-d98a04c56cd6 COVID-19 mRNA vaccine elicits a potent adaptiv... M cells -- CV19 infection, vaccination and HC 526
1b699e04-1127-42ea-998b-011ace4a5b81 COVID-19 mRNA vaccine elicits a potent adaptiv... T cells -- CV19 infection, vaccination and HC 527
87ce26ed-e5d1-44b4-81cc-cc5b709a169f COVID-19 mRNA vaccine elicits a potent adaptiv... Circulating Immune cells -- CV19 infection, va... 528

529 rows × 3 columns

The obs cell metadata pandas.DataFrame contains a column dataset_id that can be used for joining to the census_datasets pandas.DataFrame we just created.

So let’s take a look at the cell counts per dataset_id of the lung slice and then join to the dataset table to append the human-readable labels.

[5]:
dataset_cell_counts = pd.DataFrame(lung_obs[["dataset_id"]].value_counts())
dataset_cell_counts = dataset_cell_counts.rename(columns={0: "cell_counts"})
dataset_cell_counts = dataset_cell_counts.merge(census_datasets, on="dataset_id")

dataset_cell_counts
[5]:
count collection_name dataset_title soma_joinid
dataset_id
1e6a6ef9-7ec9-4c90-bbfb-2ad3c3165fd1 1028006 High-resolution single-cell atlas reveals dive... The single-cell lung cancer atlas (LuCA) -- ex... 157
066943a2-fdac-4b29-b348-40cede398e4e 254931 The integrated Human Lung Cell Atlas The integrated Human Lung Cell Atlas 79
f7c1c579-2dc0-47e2-ba19-8165c5a0e353 217738 A human cell atlas of fetal gene expression Survey of human embryonic development 94
d8da613f-e681-4c69-b463-e94f5e66847f 116313 A molecular single-cell lung atlas of lethal C... A molecular single-cell lung atlas of lethal C... 4
576f193c-75d0-4a11-bd25-8676587e6dc2 90384 HTAN MSK - Single cell profiling reveals novel... Combined samples 519
3dc61ca1-ce40-46b6-8337-f27260fd9a03 71752 A human fetal lung cell atlas uncovers proxima... All cells 75
8c42cfd0-0b0a-46d5-910c-fc833d83c45e 60993 A molecular cell atlas of the human lung from ... Krasnow Lab Human Lung Cell Atlas, 10X 177
2672b679-8048-4f5e-9786-f1b196ccfd08 57019 scRNA-seq assessment of the human lung, spleen... Lung Parenchyma 317
9dbab10c-118d-496b-966a-67f1763a6b7d 49014 COVID-19 immune features revealed by a large-s... Large-scale single-cell analysis reveals criti... 63
9968be68-ab65-4a38-9e1a-c9b6abece194 47909 Charting human development using a multi-endod... Developing Human Atlas 508
3de0ad6d-4378-4f62-b37b-ec0b75a50d94 46500 LungMAP — Human data from a broad age healthy ... Single-cell multiomic profiling of human lungs... 188
2f132ec9-24b5-422f-9be0-ccef03b4fe28 39778 SARS-CoV-2 receptor ACE2 and TMPRSS2 are prima... Lung 104
1e5bd3b8-6a0e-4959-8d69-cafed30fe814 35699 Emphysema Cell Atlas immune cells 51
53d208b0-2cfd-4366-9866-c3c6114081bc 35682 Tabula Sapiens Tabula Sapiens - All Cells 339
1b9d8702-5af8-4142-85ed-020eb06ec4f6 35419 Cross-tissue immune cell analysis reveals tiss... Global 314
4ed927e9-c099-49af-b8ce-a2652d069333 35284 Single-nucleus cross-tissue molecular referenc... Single-nucleus cross-tissue molecular referenc... 62
2adb1f8a-a6b1-4909-8ee8-484814e2d4bf 33698 Construction of a human cell landscape at sing... Construction of a human cell landscape at sing... 244
4b6af54a-4a21-46e0-bc8d-673c0561a836 15495 Emphysema Cell Atlas non-immune cells 53
01209dce-3575-4bed-b1df-129f57fbc031 11059 Single-cell transcriptomics of human T cells r... Single-cell transcriptomics of human T cells r... 186
e04daea4-4412-45b5-989e-76a9be070a89 8657 A molecular cell atlas of the human lung from ... Krasnow Lab Human Lung Cell Atlas, Smart-seq2 176
214bf9eb-93db-48c8-8e3c-9bb22fa3bc63 3662 Emphysema Cell Atlas AT2 cells 52
f9846bb4-784d-4582-92c1-3f279e4c6f0c 176 A human fetal lung cell atlas uncovers proxima... Fibroblast and smooth muscle 71
f64e1be1-de15-4d27-8da4-82225cd4c035 55 HTAN MSK - Single cell profiling reveals novel... Immune cells 520
810ac45f-8969-4698-b42c-652f802f75c2 10 A human fetal lung cell atlas uncovers proxima... Endothelium 77
0ba16f4b-cb87-4fa3-9363-19fc51eec6e7 4 A human fetal lung cell atlas uncovers proxima... Myeloid 74

These are all the datasets that contain lung cells, with the number of cells per dataset in the column count. Collections contributing large numbers of lung cells include:

  1. The integrated Human Lung Cell Atlas.

  2. A human cell atlas of fetal gene expression.

  3. High-resolution single-cell atlas reveals diversity and plasticity of tumor-associated neutrophils in non-small cell lung cancer.

  4. HTAN MSK - Single cell profiling reveals novel tumor and myeloid subpopulations in small cell lung cancer.

  5. A human fetal lung cell atlas uncovers proximal-distal gradients of differentiation and key regulators of epithelial fates.
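
The count-and-join step used for dataset_cell_counts can be sketched on toy data as follows. The tables and titles here are hypothetical stand-ins for lung_obs and census_datasets. Naming the count column explicitly is a design choice made for this sketch: the default column name produced by value_counts() has changed between pandas versions, so an explicit name keeps the result predictable.

```python
import pandas as pd

# Hypothetical stand-ins for lung_obs and census_datasets.
toy_obs = pd.DataFrame({"dataset_id": ["d1", "d1", "d2", "d1", "d3", "d2"]})
toy_datasets = pd.DataFrame(
    {
        "dataset_id": ["d1", "d2", "d3"],
        "dataset_title": ["Atlas A", "Atlas B", "Atlas C"],
        "soma_joinid": [0, 1, 2],
    }
).set_index("dataset_id")

# Count cells per dataset and name the resulting column explicitly.
toy_counts = toy_obs["dataset_id"].value_counts().rename("cell_counts").to_frame()

# Both tables are indexed by dataset id, so an index join adds the labels.
toy_counts = toy_counts.join(toy_datasets)
print(toy_counts)
```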

Assays

Let’s use similar logic to take a look at all the assays available for human lung data. The output shows that most cells were profiled with 10x technologies, followed by sci-RNA-seq.

[6]:
lung_obs[["assay"]].value_counts()
[6]:
assay
10x 3' v2                                   1087211
10x 3' v3                                    436291
10x 5' v1                                    218556
sci-RNA-seq                                  217738
BD Rhapsody Whole Transcriptome Analysis     122902
single cell library construction              58981
10x 5' v2                                     45798
microwell-seq                                 33698
Smart-seq2                                    25662
inDrop                                        25652
10x 3' v1                                     11558
Drop-seq                                       7333
10x 3' transcription profiling                 3857
Name: count, dtype: int64

Disease

And now let’s take a look at diseased cell counts, with normal indicating non-diseased cells.

[7]:
lung_obs[["disease"]].value_counts()
[7]:
disease
normal                                   1111566
lung adenocarcinoma                       626322
squamous cell lung carcinoma              189044
COVID-19                                  128650
non-small cell lung carcinoma             120796
chronic obstructive pulmonary disease      55254
pulmonary emphysema                        32065
small cell lung carcinoma                  31540
Name: count, dtype: int64

Sex

There doesn’t seem to be a strong sex bias.

[8]:
lung_obs[["sex"]].value_counts()
[8]:
sex
male       1267756
female     1019958
unknown       7523
Name: count, dtype: int64

Cell vs nucleus

The majority of the data come from cells rather than nuclei.

[9]:
lung_obs[["suspension_type"]].value_counts()
[9]:
suspension_type
cell               1879402
nucleus             415835
Name: count, dtype: int64

Cell types

Let’s take a look at the counts of the top 20 cell types.

[10]:
lung_obs[["cell_type"]].value_counts().head(20)
[10]:
cell_type
alveolar macrophage                           244502
CD4-positive, alpha-beta T cell               185787
macrophage                                    154415
CD8-positive, alpha-beta T cell               152746
type II pneumocyte                            143252
epithelial cell of lower respiratory tract    101959
classical monocyte                             90969
natural killer cell                            90715
stromal cell                                   87587
malignant cell                                 75917
B cell                                         61804
T cell                                         53703
fibroblast                                     41934
plasma cell                                    41907
fibroblast of lung                             41294
epithelial cell of lung                        38426
epithelial cell                                37498
regulatory T cell                              36381
mesenchymal cell                               33235
monocyte                                       33154
Name: count, dtype: int64

Sub-tissues

We can look at the original tissue annotations that were mapped to “lung”.

[11]:
lung_obs[["tissue"]].value_counts()
[11]:
tissue
lung                       1882088
lung parenchyma             311950
alveolus of lung             54856
lingula of left lung         35284
lower lobe of left lung      11059
Name: count, dtype: int64

Learning about genes of lung data

Let’s load the gene metadata of the Census.

[12]:
lung_var = census["census_data"]["homo_sapiens"].ms["RNA"].var.read().concat().to_pandas()
lung_var
[12]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000238009 RP11-34P13.7 3726
1 1 ENSG00000279457 WASH9P 1397
2 2 ENSG00000228463 AP006222.1 8224
3 3 ENSG00000237094 RP4-669L17.4 6204
4 4 ENSG00000230021 RP11-206L10.17 5495
... ... ... ... ...
60659 60659 ENSG00000288719 RP4-669P10.21 4252
60660 60660 ENSG00000288720 RP11-852E15.3 7007
60661 60661 ENSG00000288721 RP5-973N23.5 7765
60662 60662 ENSG00000288723 RP11-553N16.6 1015
60663 60663 ENSG00000288724 RP13-546I2.2 625

60664 rows × 4 columns

You can see the total number of genes represented by the number of rows. This number is actually misleading because it is the union of all genes in the Census. However, we know that the lung data comes from a subset of datasets.

So let’s take a look at the number of genes that were measured in each of those datasets.

To accomplish this we can use the “dataset presence matrix” at census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]. This is a boolean matrix N x M where N is the number of datasets and M is the number of genes in the Census.

So we can select the rows corresponding to the lung datasets and perform a row-wise sum.

[13]:
presence_matrix = cellxgene_census.get_presence_matrix(census, "Homo sapiens", "RNA")
presence_matrix = presence_matrix[dataset_cell_counts.soma_joinid, :]
[14]:
presence_matrix.sum(axis=1).A1
[14]:
array([17811, 28024, 44150, 34265, 22447, 26347, 20921, 24672, 27705,
       27243, 26323, 27181, 23203, 57030, 32610, 29620, 26454, 23705,
       38676, 47307, 21641, 23740, 22552, 20594, 19952], dtype=uint64)
[15]:
genes_measured = presence_matrix.sum(axis=1).A1
dataset_cell_counts["genes_measured"] = genes_measured
dataset_cell_counts
[15]:
count collection_name dataset_title soma_joinid genes_measured
dataset_id
1e6a6ef9-7ec9-4c90-bbfb-2ad3c3165fd1 1028006 High-resolution single-cell atlas reveals dive... The single-cell lung cancer atlas (LuCA) -- ex... 157 17811
066943a2-fdac-4b29-b348-40cede398e4e 254931 The integrated Human Lung Cell Atlas The integrated Human Lung Cell Atlas 79 28024
f7c1c579-2dc0-47e2-ba19-8165c5a0e353 217738 A human cell atlas of fetal gene expression Survey of human embryonic development 94 44150
d8da613f-e681-4c69-b463-e94f5e66847f 116313 A molecular single-cell lung atlas of lethal C... A molecular single-cell lung atlas of lethal C... 4 34265
576f193c-75d0-4a11-bd25-8676587e6dc2 90384 HTAN MSK - Single cell profiling reveals novel... Combined samples 519 22447
3dc61ca1-ce40-46b6-8337-f27260fd9a03 71752 A human fetal lung cell atlas uncovers proxima... All cells 75 26347
8c42cfd0-0b0a-46d5-910c-fc833d83c45e 60993 A molecular cell atlas of the human lung from ... Krasnow Lab Human Lung Cell Atlas, 10X 177 20921
2672b679-8048-4f5e-9786-f1b196ccfd08 57019 scRNA-seq assessment of the human lung, spleen... Lung Parenchyma 317 24672
9dbab10c-118d-496b-966a-67f1763a6b7d 49014 COVID-19 immune features revealed by a large-s... Large-scale single-cell analysis reveals criti... 63 27705
9968be68-ab65-4a38-9e1a-c9b6abece194 47909 Charting human development using a multi-endod... Developing Human Atlas 508 27243
3de0ad6d-4378-4f62-b37b-ec0b75a50d94 46500 LungMAP — Human data from a broad age healthy ... Single-cell multiomic profiling of human lungs... 188 26323
2f132ec9-24b5-422f-9be0-ccef03b4fe28 39778 SARS-CoV-2 receptor ACE2 and TMPRSS2 are prima... Lung 104 27181
1e5bd3b8-6a0e-4959-8d69-cafed30fe814 35699 Emphysema Cell Atlas immune cells 51 23203
53d208b0-2cfd-4366-9866-c3c6114081bc 35682 Tabula Sapiens Tabula Sapiens - All Cells 339 57030
1b9d8702-5af8-4142-85ed-020eb06ec4f6 35419 Cross-tissue immune cell analysis reveals tiss... Global 314 32610
4ed927e9-c099-49af-b8ce-a2652d069333 35284 Single-nucleus cross-tissue molecular referenc... Single-nucleus cross-tissue molecular referenc... 62 29620
2adb1f8a-a6b1-4909-8ee8-484814e2d4bf 33698 Construction of a human cell landscape at sing... Construction of a human cell landscape at sing... 244 26454
4b6af54a-4a21-46e0-bc8d-673c0561a836 15495 Emphysema Cell Atlas non-immune cells 53 23705
01209dce-3575-4bed-b1df-129f57fbc031 11059 Single-cell transcriptomics of human T cells r... Single-cell transcriptomics of human T cells r... 186 38676
e04daea4-4412-45b5-989e-76a9be070a89 8657 A molecular cell atlas of the human lung from ... Krasnow Lab Human Lung Cell Atlas, Smart-seq2 176 47307
214bf9eb-93db-48c8-8e3c-9bb22fa3bc63 3662 Emphysema Cell Atlas AT2 cells 52 21641
f9846bb4-784d-4582-92c1-3f279e4c6f0c 176 A human fetal lung cell atlas uncovers proxima... Fibroblast and smooth muscle 71 23740
f64e1be1-de15-4d27-8da4-82225cd4c035 55 HTAN MSK - Single cell profiling reveals novel... Immune cells 520 22552
810ac45f-8969-4698-b42c-652f802f75c2 10 A human fetal lung cell atlas uncovers proxima... Endothelium 77 20594
0ba16f4b-cb87-4fa3-9363-19fc51eec6e7 4 A human fetal lung cell atlas uncovers proxima... Myeloid 74 19952

You can see the number of genes measured in each dataset in genes_measured. Now let’s get the genes that were measured in all datasets.

[16]:
var_somaid = np.nonzero(presence_matrix.sum(axis=0).A1 == presence_matrix.shape[0])[0].tolist()
[17]:
lung_var = lung_var.query(f"soma_joinid in {var_somaid}")
lung_var
[17]:
soma_joinid feature_id feature_name feature_length
6 6 ENSG00000177757 FAM87B 1947
8 8 ENSG00000230368 FAM41C 1971
10 10 ENSG00000187634 SAMD11 4172
11 11 ENSG00000188976 NOC2L 5540
12 12 ENSG00000187961 KLHL17 3402
... ... ... ... ...
27743 27743 ENSG00000160838 LRRC71 2649
30433 30433 ENSG00000136918 WDR38 1387
30816 30816 ENSG00000149021 SCGB1A1 584
31272 31272 ENSG00000187848 P2RX2 1975
32538 32538 ENSG00000160472 TMEM190 605

11516 rows × 4 columns

The number of rows represents the genes that were measured in all lung datasets.
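
The two sums used on the presence matrix can be illustrated with a tiny made-up example. The matrix below is hypothetical, and a dense NumPy array stands in for the SciPy sparse matrix that the Census returns; the logic is the same: a row-wise sum gives the genes measured per dataset, and a column-wise sum equal to the number of datasets identifies the genes measured in all of them.

```python
import numpy as np

# Hypothetical presence matrix: 3 datasets (rows) x 4 genes (columns).
# The Census returns a SciPy sparse matrix; a dense array is used here for brevity.
toy_presence = np.array(
    [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
    ],
    dtype=np.uint8,
)

# Row-wise sum: number of genes measured in each dataset.
genes_per_dataset = toy_presence.sum(axis=1)

# Column-wise sum equal to the number of rows: genes measured in every dataset.
n_datasets = toy_presence.shape[0]
shared_gene_ids = np.nonzero(toy_presence.sum(axis=0) == n_datasets)[0]

print(genes_per_dataset.tolist(), shared_gene_ids.tolist())
```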

Summary of lung metadata

In the previous sections, using the Census we learned the following information:

  • The total number of unique lung cells and their composition for:

    • Number of datasets.

    • Number of sequencing technologies, most of which are 10x.

    • Roughly half of the cells are normal; the rest come from diseased samples, most commonly “lung adenocarcinoma”, followed by “squamous cell lung carcinoma” and “COVID-19”.

    • No strong sex bias.

    • Mostly data from cells (~80%) rather than nuclei (~20%).

  • A total of ~11.5k genes were measured in every lung dataset.
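
The approximate percentages in this summary follow directly from the value_counts() outputs shown earlier. As a quick check, using only numbers copied from those outputs (the variable names are chosen for this sketch):

```python
# Counts copied from the value_counts() outputs shown earlier in this tutorial.
n_unique_cells = 2295237
suspension_counts = {"cell": 1879402, "nucleus": 415835}
disease_normal = 1111566

# Share of each suspension type and of normal cells, in percent.
suspension_pct = {k: 100 * v / n_unique_cells for k, v in suspension_counts.items()}
normal_pct = 100 * disease_normal / n_unique_cells

for label, pct in suspension_pct.items():
    print(f"{label}: {pct:.1f}%")
print(f"normal: {normal_pct:.1f}%")
```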

Fetching all single-cell human lung data from the Census

Since loading the entire lung data is resource-intensive, for the sake of this exercise let’s load a subset of the lung data into an anndata.AnnData object and perform some exploratory analysis.

We will subset to 100,000 random unique cells using the lung_obs pandas.DataFrame we previously created.

[18]:
lung_cell_subsampled_n = 100000
lung_cell_subsampled_ids = lung_obs["soma_joinid"].sample(lung_cell_subsampled_n, random_state=1).tolist()
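
As a small illustration of why random_state is set, the sketch below draws the same sample twice from a hypothetical series of IDs (toy_ids is made up for this example). It also caps the sample size at the number of available rows, a guard that the tutorial does not need but that may be useful for smaller tissues.

```python
import pandas as pd

# Hypothetical stand-in for lung_obs["soma_joinid"].
toy_ids = pd.Series(range(1000, 1010), name="soma_joinid")

# Guard against asking for more cells than are available.
n_requested = 4
n_sample = min(n_requested, len(toy_ids))

first_draw = toy_ids.sample(n_sample, random_state=1).tolist()
second_draw = toy_ids.sample(n_sample, random_state=1).tolist()

# Sampling is without replacement by default, and a fixed seed repeats the draw.
print(first_draw == second_draw, len(set(first_draw)) == n_sample)
```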

Now we can directly use the values of soma_joinid for querying the Census data and obtaining an AnnData object.

[19]:
lung_gene_ids = lung_var["soma_joinid"].to_numpy()
lung_adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_coords=lung_cell_subsampled_ids,
    var_coords=lung_gene_ids,
)

lung_adata.var_names = lung_adata.var["feature_name"]
[20]:
lung_adata
[20]:
AnnData object with n_obs × n_vars = 100000 × 11516
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

We are done with the Census, so let’s close it.

[21]:
census.close()
del census

Calculating QC metrics of the lung data

Now let’s take a look at some QC metrics.

Top genes per cell

[22]:
sc.pl.highest_expr_genes(lung_adata, n_top=20)
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_39_0.png

Number of sequenced genes by assay

[23]:
sc.pp.calculate_qc_metrics(lung_adata, percent_top=None, log1p=False, inplace=True)
sc.pl.violin(lung_adata, "n_genes_by_counts", groupby="assay", jitter=0.4, rotation=90)
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_41_0.png
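
For readers unfamiliar with the two per-cell metrics plotted in this section, the following NumPy sketch shows what they measure on a made-up count matrix. It is an illustration of the definitions only, not Scanpy's implementation: n_genes_by_counts is the number of genes with at least one count in a cell, and total_counts is the sum of all counts in that cell.

```python
import numpy as np

# Hypothetical count matrix: 3 cells (rows) x 4 genes (columns).
toy_counts = np.array(
    [
        [0, 3, 0, 1],
        [2, 0, 0, 0],
        [5, 1, 4, 2],
    ]
)

# Genes with at least one count in each cell.
n_genes_by_counts = (toy_counts > 0).sum(axis=1)

# Sum of all counts in each cell.
total_counts = toy_counts.sum(axis=1)

print(n_genes_by_counts.tolist(), total_counts.tolist())
```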

Total counts by assay

[24]:
sc.pl.violin(lung_adata, "total_counts", groupby="assay", jitter=0.4, rotation=90)
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_43_0.png

You can see that Smart-seq2 is an outlier for the total counts per cell, so let’s exclude it to see what the rest of the assays look like.

[25]:
sc.pl.violin(
    lung_adata[lung_adata.obs["assay"] != "Smart-seq2",],
    "total_counts",
    groupby="assay",
    jitter=0.4,
    rotation=90,
)
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_45_0.png

Creating a normalized expression layer and embeddings

Let’s perform a bread-and-butter normalization and take a look at UMAP embeddings. For all the data below we’ll exclude Smart-seq2, as it requires an extra step to normalize based on gene lengths.

[26]:
lung_adata = lung_adata[lung_adata.obs["assay"] != "Smart-seq2",].copy()
lung_adata.layers["counts"] = lung_adata.X.copy()  # copy so the raw counts survive in-place normalization
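
The gene-length step mentioned for Smart-seq2 is not performed in this tutorial. One common approach, shown here only as a sketch on made-up numbers, is to divide each gene's counts by that gene's length before depth normalization. In the Census, gene lengths are available in the feature_length column of var; the arrays below are hypothetical.

```python
import numpy as np

# Hypothetical full-length counts: 2 cells x 3 genes, with made-up gene lengths in bases.
toy_counts = np.array([[100.0, 400.0, 50.0], [0.0, 800.0, 25.0]])
toy_feature_length = np.array([1000.0, 4000.0, 500.0])

# Broadcasting divides every column (gene) by that gene's length.
length_normalized = toy_counts / toy_feature_length

print(length_normalized)
```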

Now let’s do some basic normalization:

  • Normalize by sequencing depth

  • Transform to log-scale

  • Select 500 highly variable genes

  • Scale values across the gene axis
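
The first two steps of this list can be written out in plain NumPy to show the arithmetic. This is a simplified sketch on a hypothetical matrix; it mirrors what depth normalization to a target sum followed by a log(1 + x) transform does, and ignores details such as how Scanpy treats cells with zero counts.

```python
import numpy as np

# Hypothetical raw counts: 2 cells x 3 genes.
toy_counts = np.array([[1.0, 3.0, 0.0], [10.0, 30.0, 0.0]])
target_sum = 1e4

# Depth normalization: rescale each cell so its counts add up to target_sum.
counts_per_cell = toy_counts.sum(axis=1, keepdims=True)
depth_normalized = toy_counts / counts_per_cell * target_sum

# Log transform: log(1 + x), which keeps zeros at zero.
log_normalized = np.log1p(depth_normalized)

print(depth_normalized.sum(axis=1), log_normalized.round(2))
```

The two toy cells differ tenfold in sequencing depth but have the same gene proportions, so they become identical after depth normalization.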

[27]:
sc.pp.normalize_total(lung_adata, target_sum=1e4)
sc.pp.log1p(lung_adata)
sc.pp.highly_variable_genes(lung_adata, n_top_genes=500, flavor="seurat_v3", layer="counts")
lung_adata = lung_adata[:, lung_adata.var.highly_variable].copy()  # copy so scaling does not run on a view
sc.pp.scale(lung_adata, max_value=10)

And reduce dimensionality by obtaining UMAP embeddings.

[28]:
sc.tl.pca(lung_adata)
sc.pp.neighbors(lung_adata)
sc.tl.umap(lung_adata)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

And plot these embeddings.

[29]:
from random import randint

# One random hex color per distinct cell type.
n_cell_types = len(lung_adata.obs["cell_type"].drop_duplicates())
colors = ["#%06X" % randint(0, 0xFFFFFF) for _ in range(n_cell_types)]
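
Because the palette is drawn at random, the colors change every time the cell is run. If reproducible figures matter, one option is a dedicated seeded generator, sketched below. The seed value and the names n_colors and toy_colors are arbitrary choices for this example.

```python
from random import Random

# Hypothetical number of categories; in the tutorial this would be n_cell_types.
n_colors = 5

# A dedicated, seeded generator makes the palette reproducible across runs.
rng = Random(0)
toy_colors = ["#%06X" % rng.randint(0, 0xFFFFFF) for _ in range(n_colors)]

print(toy_colors)
```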
[30]:
sc.pl.umap(lung_adata, color="cell_type", palette=colors, legend_loc=None)
/Users/pgarcianieto/opt/miniconda3/envs/census_latest/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_54_1.png

Let’s color by assay.

[31]:
sc.pl.umap(lung_adata, color="assay")
/Users/pgarcianieto/opt/miniconda3/envs/census_latest/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_56_1.png

The high number of cell types makes the plot hard to read, so let’s look at the 20 most abundant cell types.

[32]:
top_cell_types = lung_adata.obs["cell_type"].value_counts()
top_cell_types = list(top_cell_types.reset_index().head(20)["cell_type"])
[33]:
lung_adata_top_cell_types = lung_adata[[i in top_cell_types for i in lung_adata.obs["cell_type"]], :]
sc.pl.umap(lung_adata_top_cell_types, color="cell_type")
/Users/pgarcianieto/opt/miniconda3/envs/census_latest/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_59_1.png
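
The selection of the most abundant cell types can also be expressed with value_counts() and isin(), which avoids the Python-level loop over cells. The sketch below uses a hypothetical series of labels in place of lung_adata.obs["cell_type"].

```python
import pandas as pd

# Hypothetical cell type labels standing in for lung_adata.obs["cell_type"].
toy_cell_type = pd.Series(
    ["T cell", "B cell", "T cell", "macrophage", "T cell", "B cell", "mast cell"],
    name="cell_type",
)

# value_counts() sorts by abundance, so head(n) keeps the n most frequent labels.
top_n = 2
toy_top_types = toy_cell_type.value_counts().head(top_n).index.tolist()

# Boolean mask of cells whose label is among the top n.
toy_mask = toy_cell_type.isin(toy_top_types)

print(toy_top_types, int(toy_mask.sum()))
```

With an AnnData object, a mask like this could be used as the first index, for example adata[mask.to_numpy(), :], assuming the mask is aligned with adata.obs.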

Let’s color by assay of this subset of the data.

[34]:
sc.pl.umap(lung_adata_top_cell_types, color="assay")
/Users/pgarcianieto/opt/miniconda3/envs/census_latest/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_61_1.png