Genes measured in each cell (dataset presence matrix)¶

The Census is a compilation of cells from multiple datasets that may differ in the sets of genes they measure. This notebook describes how to identify the genes measured in each dataset.

The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.

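As a reminder, the obs dataframe has a dataset_id column that can be used to link any cell in the Census to the presence matrix: look up the dataset_id in the datasets table to get the dataset's soma_joinid, which is the row index into the presence matrix.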
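To make the linkage concrete, here is a minimal sketch on toy data. The names toy_datasets_df, toy_var_df, and toy_presence and all values in them are hypothetical stand-ins, not real Census content; only the lookup pattern mirrors the description above.

```python
import pandas as pd
from scipy import sparse

# Toy stand-ins for the datasets table, the var table, and the presence matrix
# (hypothetical values, not real Census data).
toy_datasets_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2],
    "dataset_id": ["ds-a", "ds-b", "ds-c"],
})
toy_var_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2, 3],
    "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"],
})
# Rows are datasets, columns are features; 1 means "measured".
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

# A cell's obs row carries a dataset_id; map it to the dataset's soma_joinid.
cell_dataset_id = "ds-b"
row = int(
    toy_datasets_df.loc[toy_datasets_df.dataset_id == cell_dataset_id, "soma_joinid"].iloc[0]
)

# Column indices of the stored entries in that row are feature soma_joinids.
measured_joinids = toy_presence[[row], :].tocoo().col
measured_genes = toy_var_df.loc[
    toy_var_df.soma_joinid.isin(measured_joinids), "feature_id"
].tolist()
print(measured_genes)
```

The same three steps (dataset_id to soma_joinid, row slice, joinid to feature) are what the later sections of this notebook perform on the real Census tables.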

Contents

Opening the Census.
Fetching the IDs of the Census datasets.
Fetching the dataset presence matrix.
Identifying genes measured in a specific dataset.
Identifying datasets that measured specific genes.
Identifying all genes measured in a dataset.

Opening the Census¶

The cellxgene_census Python package contains a convenient API to open the latest version of the Census.

[1]:

import numpy as np
from scipy import sparse
import cellxgene_census

census = cellxgene_census.open_soma()

Fetching the IDs of the Census datasets¶

Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.

[2]:

# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df

[2]:

	soma_joinid	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
0	0	43d4bb39-21af-4d05-b973-4c1fed7b916c	Transcriptional Programming of Normal and Infl...	10.1016/j.celrep.2018.09.006	f512b8b6-369d-4a85-a695-116e0806857f	Skin	f512b8b6-369d-4a85-a695-116e0806857f.h5ad	68036
1	1	2b02dff7-e427-4cdc-96fb-c0f354c099aa	Single-Cell Analysis of Crohn’s Disease Lesion...	10.1016/j.cell.2019.08.008	36c867a7-be10-4e69-9b39-5de12b0af6da	Ileum	36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad	32458
2	2	e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03	Humoral immunity at the brain borders in homeo...	10.1016/j.coi.2022.102188	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff	A scRNA-seq atlas of immune cells at the CNS b...	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad	130908
3	3	a72afd53-ab92-4511-88da-252fb0e26b9a	Single-cell atlas of peripheral immune respons...	10.1038/s41591-020-0944-y	456e8b9b-f872-488b-871d-94534090a865	Single-cell atlas of peripheral immune respons...	456e8b9b-f872-488b-871d-94534090a865.h5ad	44721
4	4	e4c9ed14-e560-4900-a3bf-b0f8d2ce6a10	A molecular single-cell lung atlas of lethal C...	10.1038/s41586-021-03569-1	d8da613f-e681-4c69-b463-e94f5e66847f	A molecular single-cell lung atlas of lethal C...	d8da613f-e681-4c69-b463-e94f5e66847f.h5ad	116313
...	...	...	...	...	...	...	...	...
524	524	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		30498543-4fdd-4f86-9e1b-05c1a1454a6a	B cells -- CV19 infection, vaccination and HC	30498543-4fdd-4f86-9e1b-05c1a1454a6a.h5ad	20727
525	525	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		b5191f01-f67d-44b8-bc8d-511a4ecd07bb	innate T cells -- CV19 infection, vaccination ...	b5191f01-f67d-44b8-bc8d-511a4ecd07bb.h5ad	33415
526	526	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		e463dae9-3fc1-476d-870e-d98a04c56cd6	M cells -- CV19 infection, vaccination and HC	e463dae9-3fc1-476d-870e-d98a04c56cd6.h5ad	41130
527	527	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		1b699e04-1127-42ea-998b-011ace4a5b81	T cells -- CV19 infection, vaccination and HC	1b699e04-1127-42ea-998b-011ace4a5b81.h5ad	98068
528	528	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		87ce26ed-e5d1-44b4-81cc-cc5b709a169f	Circulating Immune cells -- CV19 infection, va...	87ce26ed-e5d1-44b4-81cc-cc5b709a169f.h5ad	195632

529 rows × 8 columns

Fetching the dataset presence matrix¶

Now let’s fetch the dataset presence matrix.

For convenience, read the entire presence matrix (for Homo sapiens) into memory. The cellxgene_census package provides a convenience API for this, returning the matrix as a SciPy sparse matrix in Compressed Sparse Row (CSR) format.

[3]:

presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix

[3]:

<529x60664 sparse matrix of type '<class 'numpy.uint8'>'
        with 14006372 stored elements in Compressed Sparse Row format>
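Because each row is a dataset and each column is a feature, simple row and column sums summarize the matrix. The sketch below uses a small hypothetical matrix (toy_presence) rather than the real one; the cast to int64 before summing is a defensive choice on my part, since the stored dtype is uint8.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3-dataset x 4-feature presence matrix (not real Census data).
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

n_datasets, n_var = toy_presence.shape

# Number of features measured by each dataset (one value per row).
genes_per_dataset = np.asarray(toy_presence.astype(np.int64).sum(axis=1)).ravel()

# Number of datasets that measured each feature (one value per column).
datasets_per_gene = np.asarray(toy_presence.astype(np.int64).sum(axis=0)).ravel()

print(n_datasets, n_var)
print(genes_per_dataset)
print(datasets_per_gene)
```

Position i in genes_per_dataset corresponds to the dataset with soma_joinid i, and position j in datasets_per_gene to the feature with soma_joinid j.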

We also need the var dataframe, which is read into a Pandas DataFrame for convenient manipulation:

[4]:

var_df = human_rna.var.read().concat().to_pandas()

var_df

[4]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000238009	RP11-34P13.7	3726
1	1	ENSG00000279457	WASH9P	1397
2	2	ENSG00000228463	AP006222.1	8224
3	3	ENSG00000237094	RP4-669L17.4	6204
4	4	ENSG00000230021	RP11-206L10.17	5495
...	...	...	...	...
60659	60659	ENSG00000288719	RP4-669P10.21	4252
60660	60660	ENSG00000288720	RP11-852E15.3	7007
60661	60661	ENSG00000288721	RP5-973N23.5	7765
60662	60662	ENSG00000288723	RP11-553N16.6	1015
60663	60663	ENSG00000288724	RP13-546I2.2	625

60664 rows × 4 columns

Identifying genes measured in a specific dataset¶

Now that we have the dataset table, the gene metadata table, and the dataset presence matrix, we can check whether a gene or set of genes was measured in a specific dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

* the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
* the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
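The following sketch illustrates why this matters, using hypothetical toy tables (toy_var_df, toy_presence). Once a dataframe has been filtered and re-indexed, a row's position no longer matches its soma_joinid, so the matrix must be indexed with the soma_joinid column.

```python
import pandas as pd
from scipy import sparse

# Hypothetical var table and presence matrix (not real Census data).
toy_var_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2, 3],
    "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"],
})
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

# Filtering and resetting the index changes row positions but not soma_joinid.
subset_df = toy_var_df[toy_var_df.feature_id.isin(["gene-y", "gene-z"])].reset_index(drop=True)

# Position of gene-y within the filtered table versus its soma_joinid.
position = subset_df.feature_id.tolist().index("gene-y")
joinid = int(subset_df.loc[subset_df.feature_id == "gene-y", "soma_joinid"].iloc[0])

dataset_joinid = 0
# Indexing by position reads the column of a different gene (gene-w here).
by_position = bool(toy_presence[dataset_joinid, position])
# Indexing by soma_joinid reads the column that belongs to gene-y.
by_joinid = bool(toy_presence[dataset_joinid, joinid])

print(position, joinid, by_position, by_joinid)
```

In this toy case the positional lookup reports the gene as present while the joinid lookup correctly reports it as absent.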

Let’s find out whether the gene "ENSG00000286096" was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd".

[5]:

var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')

Feature is present.
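The same lookup extends to a set of genes. The sketch below runs on hypothetical toy tables (toy_datasets_df, toy_var_df, toy_presence) and is one possible way to do it, not necessarily the approach the Census authors would recommend: slice the dataset's row, then the columns of the selected features, and pair each flag with its feature_id.

```python
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the Census tables (not real data).
toy_datasets_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2],
    "dataset_id": ["ds-a", "ds-b", "ds-c"],
})
toy_var_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2, 3],
    "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"],
})
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

genes_of_interest = ["gene-w", "gene-y", "gene-z"]

# Resolve the dataset and the genes to their soma_joinids.
dataset_joinid = int(
    toy_datasets_df.loc[toy_datasets_df.dataset_id == "ds-b", "soma_joinid"].iloc[0]
)
selected = toy_var_df.loc[toy_var_df.feature_id.isin(genes_of_interest)]
var_joinids = selected.soma_joinid.to_numpy()

# Slice one row, then the selected columns, and densify the small result.
flags = toy_presence[[dataset_joinid], :][:, var_joinids].toarray().ravel().astype(bool)

# Pair each feature_id with its presence flag (same order as `selected`).
presence_by_gene = dict(zip(selected.feature_id, flags.tolist()))
print(presence_by_gene)
```

Keeping the feature_id order from the selected rows, rather than from genes_of_interest, avoids mismatches if the requested genes are not sorted by soma_joinid.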

Identifying datasets that measured specific genes¶

Similarly, we can determine the datasets that measured a specific gene or set of genes.

[6]:

# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]

[6]:

	soma_joinid	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
335	335	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	a68b64d8-aee3-4947-81b7-36b8fe5a44d2	Tabula Sapiens - Stromal	a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad	82478
336	336	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	97a17473-e2b1-4f31-a544-44a60773e2dd	Tabula Sapiens - Epithelial	97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad	104148
337	337	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	c5d88abe-f23a-45fa-a534-788985e93dad	Tabula Sapiens - Immune	c5d88abe-f23a-45fa-a534-788985e93dad.h5ad	264824
338	338	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	5a11f879-d1ef-458a-910c-9b0bdfca5ebf	Tabula Sapiens - Endothelial	5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad	31691
339	339	e5f58829-1a66-40b5-a624-9046778e74f5	Tabula Sapiens	10.1126/science.abl4896	53d208b0-2cfd-4366-9866-c3c6114081bc	Tabula Sapiens - All Cells	53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad	483152
358	358	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	07b1d7c8-5c2e-42f7-9246-26f746cd6013	Dissection: Myelencephalon (medulla oblongata)...	07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad	27210
371	371	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	7c1c3d47-3166-43e5-9a95-65ceb2d45f78	Dissection: Pons (Pn) - Pontine reticular form...	7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad	49512
372	372	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	9372df2d-13d6-4fac-980b-919a5b7eb483	Dissection: Midbrain (M) - Periaqueductal gray...	9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad	33794
399	399	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	dd03ce70-3243-4c96-9561-330cc461e4d7	Dissection: Cerebral cortex (Cx) - Perirhinal ...	dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad	23732
413	413	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	7a0a8891-9a22-4549-a55b-c2aca23c3a2a	Supercluster: Hippocampal CA1-3	7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad	74979
415	415	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	d2b5efc1-14c6-4b5f-bd98-40f9084872d7	Dissection: Tail of Hippocampus (HiT) - Caudal...	d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad	36886
419	419	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	f8dda921-5fb4-4c94-a654-c6fc346bfd6d	Dissection: Cerebral cortex (Cx) - Occipitotem...	f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad	31899
422	422	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c	Supercluster: Splatter	3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad	291833
424	424	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	bdb26abd-f4ba-4ea3-8862-c2340e7a4f55	Supercluster: CGE interneuron	bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad	227671
426	426	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	5e5ab909-f73f-4b57-98a0-6d2c5662f6a4	Dissection: Midbrain (M) - Inferior colliculus...	5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad	32306
434	434	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	35c8a04c-8639-4d15-8228-765d8d93fc96	Dissection: Hypothalamus (HTH) - supraoptic re...	35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad	16753
436	436	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	3f56901c-dd4a-47d6-b60b-7b0c0111cfb2	Dissection: Head of hippocampus (HiH) - CA1-3	3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad	37911
452	452	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	93131426-0124-4ab4-a013-9dfbcd99d467	Dissection: Epithalamus - ETH	93131426-0124-4ab4-a013-9dfbcd99d467.h5ad	24327
458	458	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	acae7679-d077-461c-b857-ee6ccfeb267f	Dissection: Head of hippocampus (HiH) - CA1	acae7679-d077-461c-b857-ee6ccfeb267f.h5ad	39147
460	460	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c202b243-1aa1-4b16-bc9a-b36241f3b1e3	Supercluster: Amygdala excitatory	c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad	109452
462	462	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	0325478a-9b52-45b5-b40a-2e2ab0d72eb1	Supercluster: Upper-layer intratelencephalic	0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad	455006
464	464	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771	Dissection: Cerebral cortex (Cx) - Temporal po...	c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad	37642
465	465	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	c4b03352-af8d-492a-8d6b-40f304e0a122	Supercluster: Medium spiny neuron	c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad	152189
482	482	283d65eb-dd53-496d-adb7-7570c7caa443	Transcriptomic diversity of cell types across ...	10.1101/2022.10.12.511898	fe1a73ab-a203-45fd-84e9-0f7fd19efcbd	Dissection: Amygdaloid complex (AMY) - basolat...	fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad	35285
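For a set of genes, one often wants the datasets that measured all of them, or at least one of them. A minimal sketch on hypothetical toy data (toy_datasets_df, toy_var_df, toy_presence) counts, per dataset, how many of the requested genes are present and compares that count to the number requested.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the Census tables (not real data).
toy_datasets_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2],
    "dataset_id": ["ds-a", "ds-b", "ds-c"],
})
toy_var_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2, 3],
    "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"],
})
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

genes_of_interest = ["gene-w", "gene-y"]
var_joinids = toy_var_df.loc[
    toy_var_df.feature_id.isin(genes_of_interest), "soma_joinid"
].to_numpy()

# Per dataset, count how many of the requested genes were measured.
n_measured = np.asarray(
    toy_presence[:, var_joinids].astype(np.int64).sum(axis=1)
).ravel()

# Row positions in the presence matrix are dataset soma_joinids.
all_joinids = np.flatnonzero(n_measured == len(var_joinids))
any_joinids = np.flatnonzero(n_measured > 0)

measured_all = toy_datasets_df.loc[
    toy_datasets_df.soma_joinid.isin(all_joinids), "dataset_id"
].tolist()
measured_any = toy_datasets_df.loc[
    toy_datasets_df.soma_joinid.isin(any_joinids), "dataset_id"
].tolist()
print(measured_all, measured_any)
```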

Identifying all genes measured in a dataset¶

Finally, we can find the set of genes that were measured in the cells of a given dataset.

[7]:

# Slice the dataset(s) of interest (here, every dataset in one collection), and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]

[7]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000238009	RP11-34P13.7	3726
1	1	ENSG00000279457	WASH9P	1397
2	2	ENSG00000228463	AP006222.1	8224
3	3	ENSG00000237094	RP4-669L17.4	6204
4	4	ENSG00000230021	RP11-206L10.17	5495
...	...	...	...	...
40210	40210	ENSG00000255669	RP11-885B4.2	1023
40219	40219	ENSG00000255618	LINC02440	741
40352	40352	ENSG00000261627	RP11-91I8.1	508
40367	40367	ENSG00000233376	RP5-881L22.6	403
40376	40376	ENSG00000267465	AC011525.4	3363

27211 rows × 4 columns
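When the selection covers several datasets, as with a collection_id filter, slicing by rows and reading the stored column indices yields the union of genes measured in any of those datasets. The sketch below, on hypothetical toy data (toy_datasets_df with an invented collection_id column, toy_presence), separates that union from the intersection, meaning genes measured in every selected dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the Census tables (not real data).
toy_datasets_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2],
    "collection_id": ["col-1", "col-1", "col-2"],
    "dataset_id": ["ds-a", "ds-b", "ds-c"],
})
toy_presence = sparse.csr_matrix(
    [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    dtype="uint8",
)

# All datasets that belong to one collection.
dataset_joinids = toy_datasets_df.loc[
    toy_datasets_df.collection_id == "col-1", "soma_joinid"
].to_numpy()

# Per feature, count how many of the selected datasets measured it.
counts = np.asarray(
    toy_presence[dataset_joinids, :].astype(np.int64).sum(axis=0)
).ravel()

# Column positions in the presence matrix are feature soma_joinids.
union_joinids = np.flatnonzero(counts > 0)
intersection_joinids = np.flatnonzero(counts == len(dataset_joinids))
print(union_joinids, intersection_joinids)
```

Either array can then be passed to var_df.soma_joinid.isin(...) in the same way as var_joinids in the cell above.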