Census datasets presence

Goal: demonstrate basic use of the datasets_presence_matrix array.

The presence matrix is a sparse array indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the Census datasets dataframe (census["census_info"]["datasets"]). The second is indexed by the soma_joinid in the var dataframe of the measurement.
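To make the layout concrete, here is a minimal sketch built on toy data (the values are made up for illustration and are not Census content). It assumes only NumPy and SciPy, and constructs a 3 × 4 matrix in which row i stands for the dataset with soma_joinid i and column j for the feature with soma_joinid j.

```python
import numpy as np
from scipy import sparse

# Toy data, assumed for illustration only: 3 datasets x 4 features.
# Row i <-> dataset soma_joinid i; column j <-> feature (var) soma_joinid j.
toy_presence = sparse.csr_matrix(
    np.array(
        [
            [1, 0, 1, 0],
            [0, 0, 1, 1],
            [1, 1, 1, 0],
        ],
        dtype=np.uint8,
    )
)

n_datasets, n_var = toy_presence.shape
print(n_datasets, n_var)  # 3 4
print(toy_presence.nnz)  # 7 stored (present) entries

# Dataset 1 lacks feature 0 but has feature 2
print(bool(toy_presence[1, 0]), bool(toy_presence[1, 2]))  # False True
```

Only the present (dataset, feature) pairs are stored, which is why a sparse format suits this kind of data.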

[1]:

import numpy as np
from scipy import sparse
import cellxgene_census

census = cellxgene_census.open_soma()
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df

[1]:

	soma_joinid	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
0	0	43d4bb39-21af-4d05-b973-4c1fed7b916c	Transcriptional Programming of Normal and Infl...	10.1016/j.celrep.2018.09.006	f512b8b6-369d-4a85-a695-116e0806857f	Skin	f512b8b6-369d-4a85-a695-116e0806857f.h5ad	68036
1	1	d36ca85c-3e8b-444c-ba3e-a645040c6185	A molecular atlas of the human postmenopausal ...	10.1101/2022.08.04.502826	90d4a63b-5c02-43eb-acde-c49345681601	Fallopian tube RNA	90d4a63b-5c02-43eb-acde-c49345681601.h5ad	60574
2	2	d36ca85c-3e8b-444c-ba3e-a645040c6185	A molecular atlas of the human postmenopausal ...	10.1101/2022.08.04.502826	d1207c81-7309-43a7-a5a0-f4283670b62b	Ovary RNA	d1207c81-7309-43a7-a5a0-f4283670b62b.h5ad	26134
3	3	2b02dff7-e427-4cdc-96fb-c0f354c099aa	Single-Cell Analysis of Crohn’s Disease Lesion...	10.1016/j.cell.2019.08.008	36c867a7-be10-4e69-9b39-5de12b0af6da	Ileum	36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad	32458
4	4	e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03	Humoral immunity at the brain borders in homeo...	10.1016/j.coi.2022.102188	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff	A scRNA-seq atlas of immune cells at the CNS b...	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad	130908
...	...	...	...	...	...	...	...	...
517	517	a18474f4-ff1e-4864-af69-270b956cee5b	Single-cell RNA sequencing unifies development...	10.1158/2159-8290.cd-22-0824	63bb6359-3945-4658-92eb-3072419953e4	UMAP of T-Cells cells	63bb6359-3945-4658-92eb-3072419953e4.h5ad	14970
518	518	a18474f4-ff1e-4864-af69-270b956cee5b	Single-cell RNA sequencing unifies development...	10.1158/2159-8290.cd-22-0824	94423ec1-21f8-40e8-b5c9-c3ea82350ca4	UMAP of Myeloid cells	94423ec1-21f8-40e8-b5c9-c3ea82350ca4.h5ad	3282
519	519	a18474f4-ff1e-4864-af69-270b956cee5b	Single-cell RNA sequencing unifies development...	10.1158/2159-8290.cd-22-0824	773b9b2e-70c8-40be-8cbb-e7b5abab360d	UMAP of Columnar cells	773b9b2e-70c8-40be-8cbb-e7b5abab360d.h5ad	79522
520	520	a18474f4-ff1e-4864-af69-270b956cee5b	Single-cell RNA sequencing unifies development...	10.1158/2159-8290.cd-22-0824	e5233a94-9e43-418c-8209-6f1400c31530	UMAP of all data	e5233a94-9e43-418c-8209-6f1400c31530.h5ad	146583
521	521	10bf5c50-8d85-4c5f-94b4-22c1363d9f31	Intratumoral heterogeneity in recurrent pediat...		be884a28-0eac-4eca-9c23-1294a93cf56e	Intratumoral heterogeneity in recurrent pediat...	be884a28-0eac-4eca-9c23-1294a93cf56e.h5ad	35637

522 rows × 8 columns

For convenience, read the entire presence matrix (for Homo sapiens) into a SciPy sparse matrix. The convenience API cellxgene_census.get_presence_matrix provides this capability, returning the matrix as a scipy.sparse.csr_matrix:

[2]:

presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix

[2]:

<522x60664 sparse matrix of type '<class 'numpy.uint8'>'
        with 13844504 stored elements in Compressed Sparse Row format>
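Because only present entries are stored, simple reductions over the matrix give quick summaries. The sketch below uses a toy matrix (hypothetical values, not Census output) to count features per dataset and datasets per feature. It assumes the matrix holds no explicitly stored zeros; under that assumption the same calls should also apply to the presence_matrix returned above.

```python
import numpy as np
from scipy import sparse

# Toy presence matrix (assumed values): 3 datasets x 4 features
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]], dtype=np.uint8)
)

# getnnz counts stored entries along an axis.
# axis=1 -> one count per row (dataset); axis=0 -> one count per column (feature).
features_per_dataset = toy_presence.getnnz(axis=1)
datasets_per_feature = toy_presence.getnnz(axis=0)

print(features_per_dataset.tolist())  # [2, 2, 3]
print(datasets_per_feature.tolist())  # [2, 1, 3, 1]
```

Both results are positionally aligned with soma_joinid, so element i of features_per_dataset describes the dataset whose soma_joinid is i.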

We also need the var dataframe, which is read into a pandas DataFrame for convenient manipulation:

[3]:

var_df = human_rna.var.read().concat().to_pandas()

var_df

[3]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000238009	RP11-34P13.7	3726
1	1	ENSG00000279457	WASH9P	1397
2	2	ENSG00000228463	AP006222.1	8224
3	3	ENSG00000237094	RP4-669L17.4	6204
4	4	ENSG00000230021	RP11-206L10.17	5495
...	...	...	...	...
60659	60659	ENSG00000288719	RP4-669P10.21	4252
60660	60660	ENSG00000288720	RP11-852E15.3	7007
60661	60661	ENSG00000288721	RP5-973N23.5	7765
60662	60662	ENSG00000288723	RP11-553N16.6	1015
60663	60663	ENSG00000288724	RP13-546I2.2	625

60664 rows × 4 columns

Is a feature present in a dataset?

Goal: test if a given feature is present in a given dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

* the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
* the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.

[4]:

# Look up the soma_joinid of the feature and of the dataset
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
# Index the matrix by joinid; the result is 1x1, so take its single element
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')

Feature is present.
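The difference between joinid-based and positional lookup is easiest to see once a dataframe has been filtered. The following sketch uses toy stand-ins with made-up identifiers; is_feature_present is a hypothetical helper written for this illustration, not part of cellxgene_census. It reads the soma_joinid values from the dataframes and uses them as matrix coordinates, and raises KeyError when an ID is not found.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins with hypothetical IDs (not real Census content)
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]}
)
toy_datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]], dtype=np.uint8)
)


def is_feature_present(presence, datasets_df, var_df, dataset_id, feature_id):
    """Hypothetical helper: True if feature_id is present in dataset_id."""
    dataset_rows = datasets_df.loc[datasets_df.dataset_id == dataset_id]
    var_rows = var_df.loc[var_df.feature_id == feature_id]
    if dataset_rows.empty or var_rows.empty:
        raise KeyError("unknown dataset_id or feature_id")
    # Use the soma_joinid values, not the DataFrame row positions, as coordinates
    dataset_joinid = int(dataset_rows.soma_joinid.iloc[0])
    var_joinid = int(var_rows.soma_joinid.iloc[0])
    return bool(presence[dataset_joinid, var_joinid])


# After filtering and resetting the index, ENSG_D sits at row position 1,
# but its soma_joinid is still 3, which is what addresses the matrix.
subset_var_df = toy_var_df.loc[toy_var_df.soma_joinid >= 2].reset_index(drop=True)
print(is_feature_present(toy_presence, toy_datasets_df, subset_var_df, "ds-b", "ENSG_D"))  # True
print(is_feature_present(toy_presence, toy_datasets_df, toy_var_df, "ds-b", "ENSG_A"))  # False
```

Using the row position of ENSG_D in the filtered dataframe (1) instead of its soma_joinid (3) would read a different column and report the feature as absent.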

What datasets contain a feature?

Goal: look up all datasets that have a feature_id present.

[5]:

# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]

[5]:

	soma_joinid	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
89	89	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	07b1d7c8-5c2e-42f7-9246-26f746cd6013	Dissection: Myelencephalon (medulla oblongata)...	07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad	27210
102	102	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	7c1c3d47-3166-43e5-9a95-65ceb2d45f78	Dissection: Pons (Pn) - Pontine reticular form...	7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad	49512
103	103	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	9372df2d-13d6-4fac-980b-919a5b7eb483	Dissection: Midbrain (M) - Periaqueductal gray...	9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad	33794
131	131	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	dd03ce70-3243-4c96-9561-330cc461e4d7	Dissection: Cerebral cortex (Cx) - Perirhinal ...	dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad	23732
145	145	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	7a0a8891-9a22-4549-a55b-c2aca23c3a2a	Supercluster: Hippocampal CA1-3	7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad	74979
147	147	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	d2b5efc1-14c6-4b5f-bd98-40f9084872d7	Dissection: Tail of Hippocampus (HiT) - Caudal...	d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad	36886
151	151	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	f8dda921-5fb4-4c94-a654-c6fc346bfd6d	Dissection: Cerebral cortex (Cx) - Occipitotem...	f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad	31899
154	154	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c	Supercluster: Splatter	3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad	291833
156	156	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	bdb26abd-f4ba-4ea3-8862-c2340e7a4f55	Supercluster: CGE interneuron	bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad	227671
158	158	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	5e5ab909-f73f-4b57-98a0-6d2c5662f6a4	Dissection: Midbrain (M) - Inferior colliculus...	5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad	32306
166	166	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	35c8a04c-8639-4d15-8228-765d8d93fc96	Dissection: Hypothalamus (HTH) - supraoptic re...	35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad	16753
168	168	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	3f56901c-dd4a-47d6-b60b-7b0c0111cfb2	Dissection: Head of hippocampus (HiH) - CA1-3	3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad	37911
184	184	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	93131426-0124-4ab4-a013-9dfbcd99d467	Dissection: Epithalamus - ETH	93131426-0124-4ab4-a013-9dfbcd99d467.h5ad	24327
190	190	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	acae7679-d077-461c-b857-ee6ccfeb267f	Dissection: Head of hippocampus (HiH) - CA1	acae7679-d077-461c-b857-ee6ccfeb267f.h5ad	39147
192	192	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c202b243-1aa1-4b16-bc9a-b36241f3b1e3	Supercluster: Amygdala excitatory	c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad	109452
194	194	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	0325478a-9b52-45b5-b40a-2e2ab0d72eb1	Supercluster: Upper-layer intratelencephalic	0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad	455006
196	196	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771	Dissection: Cerebral cortex (Cx) - Temporal po...	c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad	37642
197	197	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c4b03352-af8d-492a-8d6b-40f304e0a122	Supercluster: Medium spiny neuron	c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad	152189
214	214	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	fe1a73ab-a203-45fd-84e9-0f7fd19efcbd	Dissection: Amygdaloid complex (AMY) - basolat...	fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad	35285
487	487	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	a68b64d8-aee3-4947-81b7-36b8fe5a44d2	Tabula Sapiens - Stromal	a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad	82478
488	488	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	97a17473-e2b1-4f31-a544-44a60773e2dd	Tabula Sapiens - Epithelial	97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad	104148
489	489	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	c5d88abe-f23a-45fa-a534-788985e93dad	Tabula Sapiens - Immune	c5d88abe-f23a-45fa-a534-788985e93dad.h5ad	264824
490	490	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	5a11f879-d1ef-458a-910c-9b0bdfca5ebf	Tabula Sapiens - Endothelial	5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad	31691
491	491	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	53d208b0-2cfd-4366-9866-c3c6114081bc	Tabula Sapiens - All Cells	53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad	483152
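The same column slice extends to several features at once. As a rough sketch on toy data (the matrix values and joinids below are assumptions for illustration), slicing multiple columns makes it possible to distinguish datasets containing any of the features from datasets containing all of them.

```python
import numpy as np
from scipy import sparse

# Toy presence matrix (assumed values): 3 datasets x 4 features
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]], dtype=np.uint8)
)

# Hypothetical feature joinids of interest
var_joinids = np.array([0, 2])

# Keep only the chosen feature columns; shape is [n_datasets, len(var_joinids)]
sub = toy_presence[:, var_joinids]

# Datasets with at least one of the features: rows that have any stored entry
any_joinids = np.unique(sub.tocoo().row)

# Datasets with every feature: rows whose stored-entry count equals the feature count
all_joinids = np.flatnonzero(sub.getnnz(axis=1) == len(var_joinids))

print(any_joinids.tolist())  # [0, 1, 2]
print(all_joinids.tolist())  # [0, 2]
```

Either array of joinids can then be passed to datasets_df.soma_joinid.isin(...) in the same way as the single-feature example.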

What features are in a dataset?

Goal: look up the features present in a given dataset.

This example also shows how to run the query across multiple datasets; the result is the union of the features present in any of the selected datasets.

[6]:

# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]

[6]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000238009	RP11-34P13.7	3726
1	1	ENSG00000279457	WASH9P	1397
2	2	ENSG00000228463	AP006222.1	8224
3	3	ENSG00000237094	RP4-669L17.4	6204
4	4	ENSG00000230021	RP11-206L10.17	5495
...	...	...	...	...
56689	56689	ENSG00000283063	TRBV6-2	424
56698	56698	ENSG00000283095	ABC11-4932300O16.1	1535
56705	56705	ENSG00000283117	MGC4859	3118
56706	56706	ENSG00000283118	RP11-107E5.4	644
56709	56709	ENSG00000283125	RP11-299P2.2	547

27211 rows × 4 columns
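The query above returns every feature found in at least one of the selected datasets. When the features shared by all selected datasets are wanted instead, one option is to count, per column, how many of the selected rows contain the feature. This is a sketch on toy data with made-up identifiers, not a Census API.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins with hypothetical IDs (not real Census content)
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]], dtype=np.uint8)
)

# Hypothetical dataset joinids of interest
dataset_joinids = np.array([0, 2])

# Keep only the chosen dataset rows; shape is [len(dataset_joinids), n_var]
sub = toy_presence[dataset_joinids, :]

# Union: features present in at least one selected dataset
union_var_joinids = np.unique(sub.tocoo().col)

# Intersection: features whose column has a stored entry in every selected row
shared_var_joinids = np.flatnonzero(sub.getnnz(axis=0) == len(dataset_joinids))

print(union_var_joinids.tolist())  # [0, 1, 2]
shared_features = toy_var_df.loc[toy_var_df.soma_joinid.isin(shared_var_joinids)]
print(shared_features.feature_id.tolist())  # ['ENSG_A', 'ENSG_C']
```

The intersection is the more conservative choice when the selected datasets will be analyzed jointly, since every returned feature was measured in each of them.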