Census datasets presence

Goal: demonstrate basic use of the datasets_presence_matrix array.

The presence matrix is a sparse array that indicates which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var] and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe; the second is indexed by the soma_joinid in the var dataframe of the measurement.
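To make the layout concrete, here is a minimal sketch that builds a tiny presence matrix by hand. Everything in it is hypothetical toy data (the names toy_datasets_df, toy_var_df, toy_presence and the IDs "ds-a", "gene-w", and so on are illustrative assumptions, not Census content); it only mirrors the [n_datasets, n_var] shape and the soma_joinid indexing described above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy stand-ins for the datasets and var dataframes.
toy_datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)

# (dataset soma_joinid, feature soma_joinid) pairs where the feature is present.
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([0, 1, 1, 0, 2, 3])

# One stored uint8 value of 1 per present pair; absent pairs are simply not stored.
toy_presence = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.uint8), (rows, cols)),
    shape=(len(toy_datasets_df), len(toy_var_df)),
)

print(toy_presence.shape)  # (n_datasets, n_var)
print(toy_presence.toarray())
```

Only the present (dataset, feature) pairs are stored, which is why a sparse format suits a matrix of this kind.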

[1]:
import numpy as np
from scipy import sparse
import cellxgene_census

census = cellxgene_census.open_soma()
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df
[1]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
0 0 43d4bb39-21af-4d05-b973-4c1fed7b916c Transcriptional Programming of Normal and Infl... 10.1016/j.celrep.2018.09.006 f512b8b6-369d-4a85-a695-116e0806857f Skin f512b8b6-369d-4a85-a695-116e0806857f.h5ad 68036
1 1 d36ca85c-3e8b-444c-ba3e-a645040c6185 A molecular atlas of the human postmenopausal ... 10.1101/2022.08.04.502826 90d4a63b-5c02-43eb-acde-c49345681601 Fallopian tube RNA 90d4a63b-5c02-43eb-acde-c49345681601.h5ad 60574
2 2 d36ca85c-3e8b-444c-ba3e-a645040c6185 A molecular atlas of the human postmenopausal ... 10.1101/2022.08.04.502826 d1207c81-7309-43a7-a5a0-f4283670b62b Ovary RNA d1207c81-7309-43a7-a5a0-f4283670b62b.h5ad 26134
3 3 2b02dff7-e427-4cdc-96fb-c0f354c099aa Single-Cell Analysis of Crohn’s Disease Lesion... 10.1016/j.cell.2019.08.008 36c867a7-be10-4e69-9b39-5de12b0af6da Ileum 36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad 32458
4 4 e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03 Humoral immunity at the brain borders in homeo... 10.1016/j.coi.2022.102188 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff A scRNA-seq atlas of immune cells at the CNS b... 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad 130908
... ... ... ... ... ... ... ... ...
517 517 a18474f4-ff1e-4864-af69-270b956cee5b Single-cell RNA sequencing unifies development... 10.1158/2159-8290.cd-22-0824 63bb6359-3945-4658-92eb-3072419953e4 UMAP of T-Cells cells 63bb6359-3945-4658-92eb-3072419953e4.h5ad 14970
518 518 a18474f4-ff1e-4864-af69-270b956cee5b Single-cell RNA sequencing unifies development... 10.1158/2159-8290.cd-22-0824 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 UMAP of Myeloid cells 94423ec1-21f8-40e8-b5c9-c3ea82350ca4.h5ad 3282
519 519 a18474f4-ff1e-4864-af69-270b956cee5b Single-cell RNA sequencing unifies development... 10.1158/2159-8290.cd-22-0824 773b9b2e-70c8-40be-8cbb-e7b5abab360d UMAP of Columnar cells 773b9b2e-70c8-40be-8cbb-e7b5abab360d.h5ad 79522
520 520 a18474f4-ff1e-4864-af69-270b956cee5b Single-cell RNA sequencing unifies development... 10.1158/2159-8290.cd-22-0824 e5233a94-9e43-418c-8209-6f1400c31530 UMAP of all data e5233a94-9e43-418c-8209-6f1400c31530.h5ad 146583
521 521 10bf5c50-8d85-4c5f-94b4-22c1363d9f31 Intratumoral heterogeneity in recurrent pediat... be884a28-0eac-4eca-9c23-1294a93cf56e Intratumoral heterogeneity in recurrent pediat... be884a28-0eac-4eca-9c23-1294a93cf56e.h5ad 35637

522 rows × 8 columns

For convenience, read the entire presence matrix (for Homo sapiens) into memory. The convenience API get_presence_matrix provides this capability, returning the matrix as a SciPy sparse matrix:

[2]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix
[2]:
<522x60664 sparse matrix of type '<class 'numpy.uint8'>'
        with 13844504 stored elements in Compressed Sparse Row format>

We also need the var dataframe, which is read into a Pandas DataFrame for convenient manipulation:

[3]:
var_df = human_rna.var.read().concat().to_pandas()

var_df
[3]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000238009 RP11-34P13.7 3726
1 1 ENSG00000279457 WASH9P 1397
2 2 ENSG00000228463 AP006222.1 8224
3 3 ENSG00000237094 RP4-669L17.4 6204
4 4 ENSG00000230021 RP11-206L10.17 5495
... ... ... ... ...
60659 60659 ENSG00000288719 RP4-669P10.21 4252
60660 60660 ENSG00000288720 RP11-852E15.3 7007
60661 60661 ENSG00000288721 RP5-973N23.5 7765
60662 60662 ENSG00000288723 RP11-553N16.6 1015
60663 60663 ENSG00000288724 RP13-546I2.2 625

60664 rows × 4 columns

Is a feature present in a dataset?

Goal: test if a given feature is present in a given dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

* the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
* the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
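The distinction matters as soon as a dataframe has been filtered or re-indexed. The sketch below uses hypothetical toy data (all names and IDs are illustrative assumptions) to show how a row's position in a filtered var dataframe can point at the wrong column, while its soma_joinid still points at the right one.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy var dataframe and presence matrix (3 datasets x 4 features).
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1]], dtype=np.uint8)
)

# Filtering and resetting the index changes row positions, but not soma_joinid.
subset = toy_var_df[toy_var_df.feature_id.isin(["gene-y", "gene-z"])].reset_index(drop=True)

position = 0                                     # gene-y is now the first row of the subset
joinid = int(subset.soma_joinid.iloc[position])  # its soma_joinid is unchanged

dataset_joinid = 0
by_position = toy_presence[dataset_joinid, position]  # actually reads the gene-w column
by_joinid = toy_presence[dataset_joinid, joinid]      # reads the gene-y column

print(f"by position: {by_position}, by soma_joinid: {by_joinid}")
```

In this toy case the positional lookup reports the feature as present even though gene-y is absent from the dataset, which is the kind of silent error the note above warns about.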

[4]:
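# Look up the soma_joinid of the feature and of the dataset (each lookup returns a one-element Series)
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
# Index the presence matrix as [dataset joinid, feature joinid]; the result is 1x1, so read element [0, 0]
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]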
print(f'Feature is {"present" if is_present else "not present"}.')
Feature is present.
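The lookup above assumes that both IDs exist. One possible way to wrap it in a reusable function that fails loudly on an unknown ID is sketched below. The function name is_feature_present and the toy data are hypothetical and not part of the cellxgene_census API; the sketch assumes dataframes with soma_joinid, dataset_id and feature_id columns, as shown earlier.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def is_feature_present(presence, datasets_df, var_df, dataset_id, feature_id):
    """Return True if feature_id is present in dataset_id (hypothetical helper)."""
    dataset_joinids = datasets_df.loc[datasets_df.dataset_id == dataset_id, "soma_joinid"]
    var_joinids = var_df.loc[var_df.feature_id == feature_id, "soma_joinid"]
    # Fail loudly instead of indexing with an empty or ambiguous selection.
    if len(dataset_joinids) != 1:
        raise KeyError(f"expected exactly one dataset with id {dataset_id!r}")
    if len(var_joinids) != 1:
        raise KeyError(f"expected exactly one feature with id {feature_id!r}")
    # Scalar indexing: [dataset soma_joinid, feature soma_joinid]
    return bool(presence[int(dataset_joinids.iloc[0]), int(var_joinids.iloc[0])])


# Hypothetical toy data standing in for datasets_df, var_df and presence_matrix.
toy_datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1]], dtype=np.uint8)
)

print(is_feature_present(toy_presence, toy_datasets_df, toy_var_df, "ds-a", "gene-x"))
print(is_feature_present(toy_presence, toy_datasets_df, toy_var_df, "ds-b", "gene-w"))
```

With the real objects from this notebook, the same function would be called with presence_matrix, datasets_df and var_df in place of the toy arguments.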

What datasets contain a feature?

Goal: look up all datasets that have a feature_id present.

[5]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
[5]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
89 89 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 07b1d7c8-5c2e-42f7-9246-26f746cd6013 Dissection: Myelencephalon (medulla oblongata)... 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad 27210
102 102 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7c1c3d47-3166-43e5-9a95-65ceb2d45f78 Dissection: Pons (Pn) - Pontine reticular form... 7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad 49512
103 103 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 9372df2d-13d6-4fac-980b-919a5b7eb483 Dissection: Midbrain (M) - Periaqueductal gray... 9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad 33794
131 131 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 dd03ce70-3243-4c96-9561-330cc461e4d7 Dissection: Cerebral cortex (Cx) - Perirhinal ... dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad 23732
145 145 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7a0a8891-9a22-4549-a55b-c2aca23c3a2a Supercluster: Hippocampal CA1-3 7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad 74979
147 147 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 d2b5efc1-14c6-4b5f-bd98-40f9084872d7 Dissection: Tail of Hippocampus (HiT) - Caudal... d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad 36886
151 151 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 f8dda921-5fb4-4c94-a654-c6fc346bfd6d Dissection: Cerebral cortex (Cx) - Occipitotem... f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad 31899
154 154 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c Supercluster: Splatter 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad 291833
156 156 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 bdb26abd-f4ba-4ea3-8862-c2340e7a4f55 Supercluster: CGE interneuron bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad 227671
158 158 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4 Dissection: Midbrain (M) - Inferior colliculus... 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad 32306
166 166 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 35c8a04c-8639-4d15-8228-765d8d93fc96 Dissection: Hypothalamus (HTH) - supraoptic re... 35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad 16753
168 168 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2 Dissection: Head of hippocampus (HiH) - CA1-3 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad 37911
184 184 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 93131426-0124-4ab4-a013-9dfbcd99d467 Dissection: Epithalamus - ETH 93131426-0124-4ab4-a013-9dfbcd99d467.h5ad 24327
190 190 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 acae7679-d077-461c-b857-ee6ccfeb267f Dissection: Head of hippocampus (HiH) - CA1 acae7679-d077-461c-b857-ee6ccfeb267f.h5ad 39147
192 192 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c202b243-1aa1-4b16-bc9a-b36241f3b1e3 Supercluster: Amygdala excitatory c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad 109452
194 194 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 Supercluster: Upper-layer intratelencephalic 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad 455006
196 196 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771 Dissection: Cerebral cortex (Cx) - Temporal po... c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad 37642
197 197 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c4b03352-af8d-492a-8d6b-40f304e0a122 Supercluster: Medium spiny neuron c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad 152189
214 214 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 fe1a73ab-a203-45fd-84e9-0f7fd19efcbd Dissection: Amygdaloid complex (AMY) - basolat... fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad 35285
487 487 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 a68b64d8-aee3-4947-81b7-36b8fe5a44d2 Tabula Sapiens - Stromal a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad 82478
488 488 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 97a17473-e2b1-4f31-a544-44a60773e2dd Tabula Sapiens - Epithelial 97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad 104148
489 489 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 c5d88abe-f23a-45fa-a534-788985e93dad Tabula Sapiens - Immune c5d88abe-f23a-45fa-a534-788985e93dad.h5ad 264824
490 490 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 5a11f879-d1ef-458a-910c-9b0bdfca5ebf Tabula Sapiens - Endothelial 5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad 31691
491 491 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 53d208b0-2cfd-4366-9866-c3c6114081bc Tabula Sapiens - All Cells 53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad 483152
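The same column-wise view can answer the question for every feature at once: summing each column of the presence matrix gives the number of datasets that contain each feature. The sketch below is illustrative only and uses hypothetical toy data and names; it assumes, as in the dataframes shown above, that soma_joinid values run from 0 to n_var - 1.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy var dataframe and presence matrix (3 datasets x 4 features).
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1]], dtype=np.uint8)
)

# Column sums: one count per feature soma_joinid. An explicit int64 dtype avoids
# any concern about overflowing the uint8 storage type.
n_datasets_per_feature = np.asarray(toy_presence.sum(axis=0, dtype=np.int64)).ravel()

# Attach the counts to the var dataframe, looking them up by soma_joinid.
ranked = toy_var_df.assign(
    n_datasets=n_datasets_per_feature[toy_var_df.soma_joinid.to_numpy()]
).sort_values("n_datasets", ascending=False)

print(ranked)
```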

What features are in a dataset?

Goal: look up the features present in a given dataset.

This example also shows that the same query works on multiple datasets at once.

[6]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]
[6]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000238009 RP11-34P13.7 3726
1 1 ENSG00000279457 WASH9P 1397
2 2 ENSG00000228463 AP006222.1 8224
3 3 ENSG00000237094 RP4-669L17.4 6204
4 4 ENSG00000230021 RP11-206L10.17 5495
... ... ... ... ...
56689 56689 ENSG00000283063 TRBV6-2 424
56698 56698 ENSG00000283095 ABC11-4932300O16.1 1535
56705 56705 ENSG00000283117 MGC4859 3118
56706 56706 ENSG00000283118 RP11-107E5.4 644
56709 56709 ENSG00000283125 RP11-299P2.2 547

27211 rows × 4 columns
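When several datasets are selected, the query above returns every feature present in at least one of them (the union). If the features shared by all selected datasets are wanted instead, one possible approach is to compare the per-feature count with the number of selected datasets. This is a sketch on hypothetical toy data, not a Census API; all names in it are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy presence matrix (3 datasets x 4 features).
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 0, 0], [1, 0, 1, 1]], dtype=np.uint8)
)

# soma_joinids of the datasets of interest
selected_dataset_joinids = np.array([0, 1])

# Slice by dataset, then count, per feature, how many selected datasets contain it.
selected = toy_presence[selected_dataset_joinids, :]
counts = np.asarray(selected.sum(axis=0, dtype=np.int64)).ravel()

var_joinids_in_any = np.flatnonzero(counts > 0)                               # union
var_joinids_in_all = np.flatnonzero(counts == len(selected_dataset_joinids))  # intersection

print("in any selected dataset:", var_joinids_in_any)
print("in all selected datasets:", var_joinids_in_all)
```

Either array of joinids can then be passed to var_df.soma_joinid.isin(...) in the same way as var_joinids above.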