Genes measured in each cell (dataset presence matrix)
The Census is a compilation of cells from multiple datasets, and those datasets may measure different sets of genes. This notebook shows how to identify the genes measured in each dataset.
The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.
As a reminder, the obs dataframe has a dataset_id column that can be used to link any cell in the Census to the presence matrix.
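To make this layout concrete, the following is a minimal sketch built on toy stand-ins rather than Census data. Every ID and value in it is made up for illustration; only the column names (soma_joinid, dataset_id) follow the description above. It shows one way to go from a cell's dataset_id to the matching row of a presence matrix.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values): 3 datasets, 4 features.
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
# Rows follow the dataset soma_joinid, columns follow the feature soma_joinid.
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# A hypothetical slice of obs: each cell records the dataset it came from.
obs_df = pd.DataFrame({"soma_joinid": [10, 11], "dataset_id": ["ds-b", "ds-a"]})

# Map dataset_id -> dataset soma_joinid, then select those rows of the matrix.
dataset_joinid_by_id = datasets_df.set_index("dataset_id")["soma_joinid"]
cell_dataset_joinids = obs_df["dataset_id"].map(dataset_joinid_by_id).to_numpy()
cell_presence = presence_matrix[cell_dataset_joinids, :].toarray().astype(bool)

# One row per cell, one column per feature: True where the cell's dataset measured it.
print(cell_presence)
```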
Contents
Opening the Census.
Fetching the IDs of the Census datasets.
Fetching the dataset presence matrix.
Identifying genes measured in a specific dataset.
Identifying datasets that measured specific genes.
Identifying all genes measured in a dataset.
Opening the Census
The cellxgene_census Python package contains a convenient API to open the latest version of the Census.
[1]:
import numpy as np
from scipy import sparse
import cellxgene_census
census = cellxgene_census.open_soma()
Fetching the IDs of the Census datasets
Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.
[2]:
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]
# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
datasets_df
[2]:
| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 43d4bb39-21af-4d05-b973-4c1fed7b916c | Transcriptional Programming of Normal and Infl... | 10.1016/j.celrep.2018.09.006 | f512b8b6-369d-4a85-a695-116e0806857f | Skin | f512b8b6-369d-4a85-a695-116e0806857f.h5ad | 68036 |
| 1 | 1 | 2b02dff7-e427-4cdc-96fb-c0f354c099aa | Single-Cell Analysis of Crohn’s Disease Lesion... | 10.1016/j.cell.2019.08.008 | 36c867a7-be10-4e69-9b39-5de12b0af6da | Ileum | 36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad | 32458 |
| 2 | 2 | e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03 | Humoral immunity at the brain borders in homeo... | 10.1016/j.coi.2022.102188 | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | A scRNA-seq atlas of immune cells at the CNS b... | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad | 130908 |
| 3 | 3 | a72afd53-ab92-4511-88da-252fb0e26b9a | Single-cell atlas of peripheral immune respons... | 10.1038/s41591-020-0944-y | 456e8b9b-f872-488b-871d-94534090a865 | Single-cell atlas of peripheral immune respons... | 456e8b9b-f872-488b-871d-94534090a865.h5ad | 44721 |
| 4 | 4 | e4c9ed14-e560-4900-a3bf-b0f8d2ce6a10 | A molecular single-cell lung atlas of lethal C... | 10.1038/s41586-021-03569-1 | d8da613f-e681-4c69-b463-e94f5e66847f | A molecular single-cell lung atlas of lethal C... | d8da613f-e681-4c69-b463-e94f5e66847f.h5ad | 116313 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 524 | 524 | e3f391f6-5a75-4e96-8450-da47c3d2a939 | COVID-19 mRNA vaccine elicits a potent adaptiv... | | 30498543-4fdd-4f86-9e1b-05c1a1454a6a | B cells -- CV19 infection, vaccination and HC | 30498543-4fdd-4f86-9e1b-05c1a1454a6a.h5ad | 20727 |
| 525 | 525 | e3f391f6-5a75-4e96-8450-da47c3d2a939 | COVID-19 mRNA vaccine elicits a potent adaptiv... | | b5191f01-f67d-44b8-bc8d-511a4ecd07bb | innate T cells -- CV19 infection, vaccination ... | b5191f01-f67d-44b8-bc8d-511a4ecd07bb.h5ad | 33415 |
| 526 | 526 | e3f391f6-5a75-4e96-8450-da47c3d2a939 | COVID-19 mRNA vaccine elicits a potent adaptiv... | | e463dae9-3fc1-476d-870e-d98a04c56cd6 | M cells -- CV19 infection, vaccination and HC | e463dae9-3fc1-476d-870e-d98a04c56cd6.h5ad | 41130 |
| 527 | 527 | e3f391f6-5a75-4e96-8450-da47c3d2a939 | COVID-19 mRNA vaccine elicits a potent adaptiv... | | 1b699e04-1127-42ea-998b-011ace4a5b81 | T cells -- CV19 infection, vaccination and HC | 1b699e04-1127-42ea-998b-011ace4a5b81.h5ad | 98068 |
| 528 | 528 | e3f391f6-5a75-4e96-8450-da47c3d2a939 | COVID-19 mRNA vaccine elicits a potent adaptiv... | | 87ce26ed-e5d1-44b4-81cc-cc5b709a169f | Circulating Immune cells -- CV19 infection, va... | 87ce26ed-e5d1-44b4-81cc-cc5b709a169f.h5ad | 195632 |
529 rows × 8 columns
Fetching the dataset presence matrix
Now let’s fetch the dataset presence matrix.
For convenience, read the entire presence matrix (for Homo sapiens) into memory. The get_presence_matrix convenience API does this and returns the matrix as a SciPy sparse matrix.
[3]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")
presence_matrix
[3]:
<529x60664 sparse matrix of type '<class 'numpy.uint8'>'
with 14006372 stored elements in Compressed Sparse Row format>
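Because each row is a dataset and each column is a feature, row and column sums give quick summaries. The sketch below uses a small toy matrix in place of the Census presence matrix (the values are invented), and casts to a wider integer type before summing as a defensive choice given the uint8 storage.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the presence matrix: 3 datasets x 4 features (hypothetical values).
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Number of features measured by each dataset (row sums).
genes_per_dataset = np.asarray(presence_matrix.astype(np.int64).sum(axis=1)).ravel()

# Number of datasets that measured each feature (column sums).
datasets_per_gene = np.asarray(presence_matrix.astype(np.int64).sum(axis=0)).ravel()

print(genes_per_dataset)
print(datasets_per_gene)
```

With the real matrix, the positions in both result vectors correspond to the dataset and feature soma_joinid values, respectively.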
We also need the var dataframe, which we read into a pandas DataFrame for convenient manipulation:
[4]:
var_df = human_rna.var.read().concat().to_pandas()
var_df
[4]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 0 | ENSG00000238009 | RP11-34P13.7 | 3726 |
| 1 | 1 | ENSG00000279457 | WASH9P | 1397 |
| 2 | 2 | ENSG00000228463 | AP006222.1 | 8224 |
| 3 | 3 | ENSG00000237094 | RP4-669L17.4 | 6204 |
| 4 | 4 | ENSG00000230021 | RP11-206L10.17 | 5495 |
| ... | ... | ... | ... | ... |
| 60659 | 60659 | ENSG00000288719 | RP4-669P10.21 | 4252 |
| 60660 | 60660 | ENSG00000288720 | RP11-852E15.3 | 7007 |
| 60661 | 60661 | ENSG00000288721 | RP5-973N23.5 | 7765 |
| 60662 | 60662 | ENSG00000288723 | RP11-553N16.6 | 1015 |
| 60663 | 60663 | ENSG00000288724 | RP13-546I2.2 | 625 |
60664 rows × 4 columns
Identifying genes measured in a specific dataset
Now that we have the dataset table, the genes metadata table, and the dataset presence matrix, we can check if a gene or set of genes were measured in a specific dataset.
Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:
* the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
* the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
Let’s find out if the gene "ENSG00000286096" was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd".
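The distinction matters as soon as a dataframe is filtered or reordered, because a row's position then stops matching its soma_joinid. The following sketch illustrates this with toy data (all IDs and values are hypothetical): using the position picks up the wrong dataset's row, while using the soma_joinid picks up the intended one.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values): 3 datasets, 2 features.
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
presence_matrix = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 1]], dtype=np.uint8))

# After filtering and resetting the index, "ds-b" sits at position 0,
# but its soma_joinid is still 1.
subset = datasets_df[datasets_df.dataset_id.isin(["ds-b", "ds-c"])].reset_index(drop=True)
position = 0
joinid = int(subset.loc[position, "soma_joinid"])

wrong_row = presence_matrix[[position], :].toarray().ravel()  # this is the row for "ds-a"
right_row = presence_matrix[[joinid], :].toarray().ravel()    # this is the row for "ds-b"

print(wrong_row, right_row)
```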
[5]:
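# Look up the soma_joinid of the feature and of the dataset
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
# Index the presence matrix by (dataset joinid, feature joinid); a nonzero entry means the gene was measured
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')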
Feature is present.
Identifying datasets that measured specific genes
Similarly, we can determine the datasets that measured a specific gene or set of genes.
[6]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row
# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
[6]:
| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|
| 335 | 335 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | a68b64d8-aee3-4947-81b7-36b8fe5a44d2 | Tabula Sapiens - Stromal | a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad | 82478 |
| 336 | 336 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 97a17473-e2b1-4f31-a544-44a60773e2dd | Tabula Sapiens - Epithelial | 97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad | 104148 |
| 337 | 337 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | c5d88abe-f23a-45fa-a534-788985e93dad | Tabula Sapiens - Immune | c5d88abe-f23a-45fa-a534-788985e93dad.h5ad | 264824 |
| 338 | 338 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 5a11f879-d1ef-458a-910c-9b0bdfca5ebf | Tabula Sapiens - Endothelial | 5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad | 31691 |
| 339 | 339 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 53d208b0-2cfd-4366-9866-c3c6114081bc | Tabula Sapiens - All Cells | 53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad | 483152 |
| 358 | 358 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 07b1d7c8-5c2e-42f7-9246-26f746cd6013 | Dissection: Myelencephalon (medulla oblongata)... | 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad | 27210 |
| 371 | 371 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 7c1c3d47-3166-43e5-9a95-65ceb2d45f78 | Dissection: Pons (Pn) - Pontine reticular form... | 7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad | 49512 |
| 372 | 372 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 9372df2d-13d6-4fac-980b-919a5b7eb483 | Dissection: Midbrain (M) - Periaqueductal gray... | 9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad | 33794 |
| 399 | 399 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | dd03ce70-3243-4c96-9561-330cc461e4d7 | Dissection: Cerebral cortex (Cx) - Perirhinal ... | dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad | 23732 |
| 413 | 413 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 7a0a8891-9a22-4549-a55b-c2aca23c3a2a | Supercluster: Hippocampal CA1-3 | 7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad | 74979 |
| 415 | 415 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | d2b5efc1-14c6-4b5f-bd98-40f9084872d7 | Dissection: Tail of Hippocampus (HiT) - Caudal... | d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad | 36886 |
| 419 | 419 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | f8dda921-5fb4-4c94-a654-c6fc346bfd6d | Dissection: Cerebral cortex (Cx) - Occipitotem... | f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad | 31899 |
| 422 | 422 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c | Supercluster: Splatter | 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad | 291833 |
| 424 | 424 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | bdb26abd-f4ba-4ea3-8862-c2340e7a4f55 | Supercluster: CGE interneuron | bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad | 227671 |
| 426 | 426 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4 | Dissection: Midbrain (M) - Inferior colliculus... | 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad | 32306 |
| 434 | 434 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 35c8a04c-8639-4d15-8228-765d8d93fc96 | Dissection: Hypothalamus (HTH) - supraoptic re... | 35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad | 16753 |
| 436 | 436 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2 | Dissection: Head of hippocampus (HiH) - CA1-3 | 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad | 37911 |
| 452 | 452 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 93131426-0124-4ab4-a013-9dfbcd99d467 | Dissection: Epithalamus - ETH | 93131426-0124-4ab4-a013-9dfbcd99d467.h5ad | 24327 |
| 458 | 458 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | acae7679-d077-461c-b857-ee6ccfeb267f | Dissection: Head of hippocampus (HiH) - CA1 | acae7679-d077-461c-b857-ee6ccfeb267f.h5ad | 39147 |
| 460 | 460 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c202b243-1aa1-4b16-bc9a-b36241f3b1e3 | Supercluster: Amygdala excitatory | c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad | 109452 |
| 462 | 462 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 | Supercluster: Upper-layer intratelencephalic | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad | 455006 |
| 464 | 464 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771 | Dissection: Cerebral cortex (Cx) - Temporal po... | c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad | 37642 |
| 465 | 465 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c4b03352-af8d-492a-8d6b-40f304e0a122 | Supercluster: Medium spiny neuron | c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad | 152189 |
| 482 | 482 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd | Dissection: Amygdaloid complex (AMY) - basolat... | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad | 35285 |
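The same pattern extends to a set of genes. As a sketch under toy assumptions (the IDs and matrix values below are invented, not Census data), one can count how many of the requested features each dataset measured and then keep the datasets that measured all of them, or at least one of them.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values): 3 datasets, 4 features.
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Joinids of the features of interest.
feature_ids = ["gene-x", "gene-y"]
var_joinids = var_df.loc[var_df.feature_id.isin(feature_ids), "soma_joinid"].to_numpy()

# For each dataset, count how many of the requested features it measured.
n_measured = np.asarray(
    presence_matrix[:, var_joinids].astype(np.int64).sum(axis=1)
).ravel()

# Row positions in the presence matrix are dataset soma_joinids.
joinids_all = np.flatnonzero(n_measured == len(var_joinids))
joinids_any = np.flatnonzero(n_measured > 0)

measured_all = datasets_df.loc[datasets_df.soma_joinid.isin(joinids_all), "dataset_id"].tolist()
measured_any = datasets_df.loc[datasets_df.soma_joinid.isin(joinids_any), "dataset_id"].tolist()
print(measured_all, measured_any)
```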
Identifying all genes measured in a dataset
Finally, we can find the set of genes that were measured in the cells of a given dataset.
[7]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid
# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col
# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]
[7]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 0 | ENSG00000238009 | RP11-34P13.7 | 3726 |
| 1 | 1 | ENSG00000279457 | WASH9P | 1397 |
| 2 | 2 | ENSG00000228463 | AP006222.1 | 8224 |
| 3 | 3 | ENSG00000237094 | RP4-669L17.4 | 6204 |
| 4 | 4 | ENSG00000230021 | RP11-206L10.17 | 5495 |
| ... | ... | ... | ... | ... |
| 40210 | 40210 | ENSG00000255669 | RP11-885B4.2 | 1023 |
| 40219 | 40219 | ENSG00000255618 | LINC02440 | 741 |
| 40352 | 40352 | ENSG00000261627 | RP11-91I8.1 | 508 |
| 40367 | 40367 | ENSG00000233376 | RP5-881L22.6 | 403 |
| 40376 | 40376 | ENSG00000267465 | AC011525.4 | 3363 |
27211 rows × 4 columns
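When the selection matches more than one dataset, as a filter on collection_id can, the cell above returns every gene measured by at least one of the selected datasets (their union). If the genes shared by all selected datasets are needed instead, column sums over the selected rows can distinguish the two cases. The sketch below assumes toy data with invented IDs and values.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values): 3 datasets in 2 collections, 4 features.
datasets_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "collection_id": ["col-1", "col-1", "col-2"]}
)
var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["gene-w", "gene-x", "gene-y", "gene-z"]}
)
presence_matrix = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Joinids of every dataset in the collection of interest.
dataset_joinids = datasets_df.loc[
    datasets_df.collection_id == "col-1", "soma_joinid"
].to_numpy()

# For each feature, count how many of the selected datasets measured it.
n_datasets = np.asarray(
    presence_matrix[dataset_joinids, :].astype(np.int64).sum(axis=0)
).ravel()

# Column positions in the presence matrix are feature soma_joinids.
union_joinids = np.flatnonzero(n_datasets > 0)
shared_joinids = np.flatnonzero(n_datasets == len(dataset_joinids))

union_genes = var_df.loc[var_df.soma_joinid.isin(union_joinids), "feature_id"].tolist()
shared_genes = var_df.loc[var_df.soma_joinid.isin(shared_joinids), "feature_id"].tolist()
print(union_genes, shared_genes)
```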