Genes measured in each cell (dataset presence matrix)

The Census compiles cells from multiple datasets, and those datasets may differ in the sets of genes they measure. This notebook shows how to identify the genes measured in each dataset.

The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.

As a reminder, the obs data frame has a dataset_id column that can be used to link any cell in the Census to the presence matrix.
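A minimal sketch of that link, using small made-up tables rather than real Census data: each cell's dataset_id is mapped to the dataset's soma_joinid, which is the row of the presence matrix to consult for that cell. The toy values and the helper name dataset_joinid_per_cell are illustrative assumptions, not part of the cellxgene_census API.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the Census tables (all values are made up).
datasets_df = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
obs_df = pd.DataFrame({"soma_joinid": [0, 1, 2], "dataset_id": ["ds-b", "ds-a", "ds-b"]})

# Toy presence matrix: 2 datasets x 3 features.
presence_matrix = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 1]], dtype=np.uint8))


def dataset_joinid_per_cell(obs_df, datasets_df):
    """Hypothetical helper: map each cell's dataset_id to the dataset soma_joinid."""
    lookup = datasets_df.set_index("dataset_id")["soma_joinid"]
    return obs_df["dataset_id"].map(lookup).to_numpy()


cell_dataset_joinids = dataset_joinid_per_cell(obs_df, datasets_df)

# One presence row per cell: row i marks the features measured for cell i.
presence_per_cell = presence_matrix[cell_dataset_joinids, :]
```

On real data this per-cell expansion has one row per cell, so it is usually better applied to a filtered subset of obs than to the whole Census.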

Contents

  1. Opening the Census.

  2. Fetching the IDs of the Census datasets.

  3. Fetching the dataset presence matrix.

  4. Identifying genes measured in a specific dataset.

  5. Identifying datasets that measured specific genes.

  6. Identifying all genes measured in a dataset.

Opening the Census

The cellxgene_census Python package contains a convenient API to open the latest version of the Census.

[1]:
import numpy as np
from scipy import sparse
import cellxgene_census

census = cellxgene_census.open_soma()

Fetching the IDs of the Census datasets

Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.

[2]:
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df
[2]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
0 0 43d4bb39-21af-4d05-b973-4c1fed7b916c Transcriptional Programming of Normal and Infl... 10.1016/j.celrep.2018.09.006 f512b8b6-369d-4a85-a695-116e0806857f Skin f512b8b6-369d-4a85-a695-116e0806857f.h5ad 68036
1 1 2b02dff7-e427-4cdc-96fb-c0f354c099aa Single-Cell Analysis of Crohn’s Disease Lesion... 10.1016/j.cell.2019.08.008 36c867a7-be10-4e69-9b39-5de12b0af6da Ileum 36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad 32458
2 2 e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03 Humoral immunity at the brain borders in homeo... 10.1016/j.coi.2022.102188 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff A scRNA-seq atlas of immune cells at the CNS b... 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad 130908
3 3 a72afd53-ab92-4511-88da-252fb0e26b9a Single-cell atlas of peripheral immune respons... 10.1038/s41591-020-0944-y 456e8b9b-f872-488b-871d-94534090a865 Single-cell atlas of peripheral immune respons... 456e8b9b-f872-488b-871d-94534090a865.h5ad 44721
4 4 e4c9ed14-e560-4900-a3bf-b0f8d2ce6a10 A molecular single-cell lung atlas of lethal C... 10.1038/s41586-021-03569-1 d8da613f-e681-4c69-b463-e94f5e66847f A molecular single-cell lung atlas of lethal C... d8da613f-e681-4c69-b463-e94f5e66847f.h5ad 116313
... ... ... ... ... ... ... ... ...
524 524 e3f391f6-5a75-4e96-8450-da47c3d2a939 COVID-19 mRNA vaccine elicits a potent adaptiv... 30498543-4fdd-4f86-9e1b-05c1a1454a6a B cells -- CV19 infection, vaccination and HC 30498543-4fdd-4f86-9e1b-05c1a1454a6a.h5ad 20727
525 525 e3f391f6-5a75-4e96-8450-da47c3d2a939 COVID-19 mRNA vaccine elicits a potent adaptiv... b5191f01-f67d-44b8-bc8d-511a4ecd07bb innate T cells -- CV19 infection, vaccination ... b5191f01-f67d-44b8-bc8d-511a4ecd07bb.h5ad 33415
526 526 e3f391f6-5a75-4e96-8450-da47c3d2a939 COVID-19 mRNA vaccine elicits a potent adaptiv... e463dae9-3fc1-476d-870e-d98a04c56cd6 M cells -- CV19 infection, vaccination and HC e463dae9-3fc1-476d-870e-d98a04c56cd6.h5ad 41130
527 527 e3f391f6-5a75-4e96-8450-da47c3d2a939 COVID-19 mRNA vaccine elicits a potent adaptiv... 1b699e04-1127-42ea-998b-011ace4a5b81 T cells -- CV19 infection, vaccination and HC 1b699e04-1127-42ea-998b-011ace4a5b81.h5ad 98068
528 528 e3f391f6-5a75-4e96-8450-da47c3d2a939 COVID-19 mRNA vaccine elicits a potent adaptiv... 87ce26ed-e5d1-44b4-81cc-cc5b709a169f Circulating Immune cells -- CV19 infection, va... 87ce26ed-e5d1-44b4-81cc-cc5b709a169f.h5ad 195632

529 rows × 8 columns

Fetching the dataset presence matrix

Now let’s fetch the dataset presence matrix.

For convenience, read the entire presence matrix (for Homo sapiens) into memory. The cellxgene_census package provides a convenience API for this, get_presence_matrix, which returns the matrix as a SciPy sparse matrix.

[3]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix
[3]:
<529x60664 sparse matrix of type '<class 'numpy.uint8'>'
        with 14006372 stored elements in Compressed Sparse Row format>
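Because the matrix holds one row per dataset and one column per feature, row and column sums give quick summaries. The sketch below uses a small made-up matrix (toy_presence is an assumption, not Census data) to count features per dataset and datasets per feature; the same two lines should apply to the real presence_matrix.

```python
import numpy as np
from scipy import sparse

# Toy presence matrix (made-up values): 3 datasets x 4 features.
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]], dtype=np.uint8)
)

# Number of features measured by each dataset (row sums).
# Cast to int64 first to get plain signed integer counts.
genes_per_dataset = np.asarray(toy_presence.astype(np.int64).sum(axis=1)).ravel()

# Number of datasets that measured each feature (column sums).
datasets_per_gene = np.asarray(toy_presence.astype(np.int64).sum(axis=0)).ravel()
```

Position i of genes_per_dataset corresponds to the dataset with soma_joinid i, and position j of datasets_per_gene to the feature with soma_joinid j.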

We also need the var dataframe, which is read into a Pandas DataFrame for convenient manipulation:

[4]:
var_df = human_rna.var.read().concat().to_pandas()

var_df
[4]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000238009 RP11-34P13.7 3726
1 1 ENSG00000279457 WASH9P 1397
2 2 ENSG00000228463 AP006222.1 8224
3 3 ENSG00000237094 RP4-669L17.4 6204
4 4 ENSG00000230021 RP11-206L10.17 5495
... ... ... ... ...
60659 60659 ENSG00000288719 RP4-669P10.21 4252
60660 60660 ENSG00000288720 RP11-852E15.3 7007
60661 60661 ENSG00000288721 RP5-973N23.5 7765
60662 60662 ENSG00000288723 RP11-553N16.6 1015
60663 60663 ENSG00000288724 RP13-546I2.2 625

60664 rows × 4 columns

Identifying genes measured in a specific dataset

Now that we have the dataset table, the genes metadata table, and the dataset presence matrix, we can check if a gene or set of genes were measured in a specific dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

  * the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.

  * the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
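To illustrate why the distinction matters, here is a small sketch on a made-up var table (the gene names g0 to g3 are placeholders). Once a dataframe is filtered and its index reset, a row's position no longer equals its soma_joinid, and only the soma_joinid is a valid index into the presence matrix.

```python
import pandas as pd

# Toy var table (made-up values) where the pandas index equals soma_joinid.
var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)

# After filtering and resetting the index, row positions no longer match joinids.
subset = var_df[var_df.feature_id.isin(["g1", "g3"])].reset_index(drop=True)

# Row position of g3 within the filtered table.
position_of_g3 = int(subset.index[subset.feature_id == "g3"][0])

# soma_joinid of g3: this is the column to use in the presence matrix.
joinid_of_g3 = int(subset.loc[subset.feature_id == "g3", "soma_joinid"].iloc[0])
```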

Let’s find out whether the gene "ENSG00000286096" was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd".

[5]:
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')
Feature is present.
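The lookup above assumes both IDs exist; if either filter matches nothing, the indexing fails with an unhelpful error. One possible way to package the check is sketched below with made-up tables. The function is_feature_present and the toy_* names are hypothetical and are not part of cellxgene_census.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def is_feature_present(presence_matrix, datasets_df, var_df, dataset_id, feature_id):
    """Hypothetical helper: True if feature_id was measured in dataset_id."""
    dataset_rows = datasets_df.loc[datasets_df.dataset_id == dataset_id, "soma_joinid"]
    var_rows = var_df.loc[var_df.feature_id == feature_id, "soma_joinid"]
    if dataset_rows.empty:
        raise KeyError(f"unknown dataset_id: {dataset_id}")
    if var_rows.empty:
        raise KeyError(f"unknown feature_id: {feature_id}")
    # Use plain Python integers so the lookup returns a single scalar.
    return bool(presence_matrix[int(dataset_rows.iloc[0]), int(var_rows.iloc[0])])


# Toy stand-ins for the Census tables (all values are made up).
toy_datasets = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
toy_var = pd.DataFrame({"soma_joinid": [0, 1], "feature_id": ["g0", "g1"]})
toy_presence = sparse.csr_matrix(np.array([[1, 0], [0, 1]], dtype=np.uint8))

g0_in_a = is_feature_present(toy_presence, toy_datasets, toy_var, "ds-a", "g0")
g1_in_a = is_feature_present(toy_presence, toy_datasets, toy_var, "ds-a", "g1")
```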

Identifying datasets that measured specific genes

Similarly, we can determine the datasets that measured a specific gene or set of genes.

[6]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
[6]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
335 335 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 a68b64d8-aee3-4947-81b7-36b8fe5a44d2 Tabula Sapiens - Stromal a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad 82478
336 336 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 97a17473-e2b1-4f31-a544-44a60773e2dd Tabula Sapiens - Epithelial 97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad 104148
337 337 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 c5d88abe-f23a-45fa-a534-788985e93dad Tabula Sapiens - Immune c5d88abe-f23a-45fa-a534-788985e93dad.h5ad 264824
338 338 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 5a11f879-d1ef-458a-910c-9b0bdfca5ebf Tabula Sapiens - Endothelial 5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad 31691
339 339 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 53d208b0-2cfd-4366-9866-c3c6114081bc Tabula Sapiens - All Cells 53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad 483152
358 358 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 07b1d7c8-5c2e-42f7-9246-26f746cd6013 Dissection: Myelencephalon (medulla oblongata)... 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad 27210
371 371 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7c1c3d47-3166-43e5-9a95-65ceb2d45f78 Dissection: Pons (Pn) - Pontine reticular form... 7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad 49512
372 372 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 9372df2d-13d6-4fac-980b-919a5b7eb483 Dissection: Midbrain (M) - Periaqueductal gray... 9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad 33794
399 399 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 dd03ce70-3243-4c96-9561-330cc461e4d7 Dissection: Cerebral cortex (Cx) - Perirhinal ... dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad 23732
413 413 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7a0a8891-9a22-4549-a55b-c2aca23c3a2a Supercluster: Hippocampal CA1-3 7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad 74979
415 415 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 d2b5efc1-14c6-4b5f-bd98-40f9084872d7 Dissection: Tail of Hippocampus (HiT) - Caudal... d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad 36886
419 419 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 f8dda921-5fb4-4c94-a654-c6fc346bfd6d Dissection: Cerebral cortex (Cx) - Occipitotem... f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad 31899
422 422 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c Supercluster: Splatter 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad 291833
424 424 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 bdb26abd-f4ba-4ea3-8862-c2340e7a4f55 Supercluster: CGE interneuron bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad 227671
426 426 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4 Dissection: Midbrain (M) - Inferior colliculus... 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad 32306
434 434 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 35c8a04c-8639-4d15-8228-765d8d93fc96 Dissection: Hypothalamus (HTH) - supraoptic re... 35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad 16753
436 436 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2 Dissection: Head of hippocampus (HiH) - CA1-3 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad 37911
452 452 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 93131426-0124-4ab4-a013-9dfbcd99d467 Dissection: Epithalamus - ETH 93131426-0124-4ab4-a013-9dfbcd99d467.h5ad 24327
458 458 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 acae7679-d077-461c-b857-ee6ccfeb267f Dissection: Head of hippocampus (HiH) - CA1 acae7679-d077-461c-b857-ee6ccfeb267f.h5ad 39147
460 460 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c202b243-1aa1-4b16-bc9a-b36241f3b1e3 Supercluster: Amygdala excitatory c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad 109452
462 462 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 Supercluster: Upper-layer intratelencephalic 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad 455006
464 464 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771 Dissection: Cerebral cortex (Cx) - Temporal po... c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad 37642
465 465 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c4b03352-af8d-492a-8d6b-40f304e0a122 Supercluster: Medium spiny neuron c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad 152189
482 482 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 fe1a73ab-a203-45fd-84e9-0f7fd19efcbd Dissection: Amygdaloid complex (AMY) - basolat... fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad 35285
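For a set of genes, the same slicing pattern can separate datasets that measured every requested gene from those that measured at least one. The sketch below runs on made-up tables (dataset and gene names are placeholders) and is one possible approach, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the Census tables (all values are made up).
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
var_df = pd.DataFrame({"soma_joinid": [0, 1, 2], "feature_id": ["g0", "g1", "g2"]})
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=np.uint8)
)

genes = ["g0", "g1"]
var_joinids = var_df.loc[var_df.feature_id.isin(genes), "soma_joinid"].to_numpy()

# Count, per dataset, how many of the requested genes were measured.
hits = np.asarray(presence_matrix[:, var_joinids].astype(np.int64).sum(axis=1)).ravel()

# Row positions of the presence matrix are dataset joinids.
all_joinids = np.flatnonzero(hits == len(var_joinids))
any_joinids = np.flatnonzero(hits > 0)

measured_all = datasets_df.loc[datasets_df.soma_joinid.isin(all_joinids)]
measured_any = datasets_df.loc[datasets_df.soma_joinid.isin(any_joinids)]
```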

Identifying all genes measured in a dataset

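Finally, we can find the set of genes that were measured in the cells of a given dataset. The example selects datasets by collection_id, so when the collection contains several datasets the result is the union of the genes measured across them.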

[7]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]
[7]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000238009 RP11-34P13.7 3726
1 1 ENSG00000279457 WASH9P 1397
2 2 ENSG00000228463 AP006222.1 8224
3 3 ENSG00000237094 RP4-669L17.4 6204
4 4 ENSG00000230021 RP11-206L10.17 5495
... ... ... ... ...
40210 40210 ENSG00000255669 RP11-885B4.2 1023
40219 40219 ENSG00000255618 LINC02440 741
40352 40352 ENSG00000261627 RP11-91I8.1 508
40367 40367 ENSG00000233376 RP5-881L22.6 403
40376 40376 ENSG00000267465 AC011525.4 3363

27211 rows × 4 columns
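When several datasets are selected, it can be useful to distinguish genes measured in any of them from genes measured in all of them. The following sketch shows one way to compute both on made-up tables; the collection, dataset, and gene names are placeholders.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for the Census tables (all values are made up).
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "collection_id": ["col-x", "col-x", "col-y"]}
)
var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Joinids of every dataset in the collection of interest.
dataset_joinids = datasets_df.loc[
    datasets_df.collection_id == "col-x", "soma_joinid"
].to_numpy()

# For each feature, count how many of the selected datasets measured it.
counts = np.asarray(
    presence_matrix[dataset_joinids, :].astype(np.int64).sum(axis=0)
).ravel()

# Column positions of the presence matrix are feature joinids.
in_any = var_df.loc[var_df.soma_joinid.isin(np.flatnonzero(counts > 0))]
in_all = var_df.loc[
    var_df.soma_joinid.isin(np.flatnonzero(counts == len(dataset_joinids)))
]
```

The in_all set is the safer choice when comparing expression across the selected datasets, since a gene absent from one dataset cannot be told apart from a gene with zero counts there.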