Census datasets presence
Goal: demonstrate basic use of the datasets_presence_matrix array.
The presence matrix is a sparse array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.
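To make the layout concrete, here is a minimal sketch of a presence matrix built by hand. The dataset and feature names (`ds-a`, `g0`, and so on) and the `toy_*` variables are hypothetical and exist only for illustration; they are not part of the Census.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: 3 datasets and 4 features, soma_joinid starting at 0.
toy_datasets = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_var = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)

# Rows are dataset joinids, columns are feature (var) joinids; 1 marks "present".
dense = np.array(
    [
        [1, 1, 0, 0],  # ds-a measured g0 and g1
        [0, 1, 1, 0],  # ds-b measured g1 and g2
        [1, 1, 1, 1],  # ds-c measured all four features
    ],
    dtype=np.uint8,
)
toy_presence = sparse.csr_matrix(dense)

n_datasets, n_var = toy_presence.shape
print(n_datasets, n_var)  # 3 4
print(toy_presence.nnz)   # 8 stored (nonzero) entries
```

The shape matches the lengths of the two dataframes, which is the [n_datasets, n_var] relationship described above.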
[1]:
import numpy as np
from scipy import sparse
import cellxgene_census
census = cellxgene_census.open_soma()
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]
# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
datasets_df
[1]:
| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 43d4bb39-21af-4d05-b973-4c1fed7b916c | Transcriptional Programming of Normal and Infl... | 10.1016/j.celrep.2018.09.006 | f512b8b6-369d-4a85-a695-116e0806857f | Skin | f512b8b6-369d-4a85-a695-116e0806857f.h5ad | 68036 |
| 1 | 1 | d36ca85c-3e8b-444c-ba3e-a645040c6185 | A molecular atlas of the human postmenopausal ... | 10.1101/2022.08.04.502826 | 90d4a63b-5c02-43eb-acde-c49345681601 | Fallopian tube RNA | 90d4a63b-5c02-43eb-acde-c49345681601.h5ad | 60574 |
| 2 | 2 | d36ca85c-3e8b-444c-ba3e-a645040c6185 | A molecular atlas of the human postmenopausal ... | 10.1101/2022.08.04.502826 | d1207c81-7309-43a7-a5a0-f4283670b62b | Ovary RNA | d1207c81-7309-43a7-a5a0-f4283670b62b.h5ad | 26134 |
| 3 | 3 | 2b02dff7-e427-4cdc-96fb-c0f354c099aa | Single-Cell Analysis of Crohn’s Disease Lesion... | 10.1016/j.cell.2019.08.008 | 36c867a7-be10-4e69-9b39-5de12b0af6da | Ileum | 36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad | 32458 |
| 4 | 4 | e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03 | Humoral immunity at the brain borders in homeo... | 10.1016/j.coi.2022.102188 | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | A scRNA-seq atlas of immune cells at the CNS b... | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad | 130908 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517 | 517 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 63bb6359-3945-4658-92eb-3072419953e4 | UMAP of T-Cells cells | 63bb6359-3945-4658-92eb-3072419953e4.h5ad | 14970 |
| 518 | 518 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 | UMAP of Myeloid cells | 94423ec1-21f8-40e8-b5c9-c3ea82350ca4.h5ad | 3282 |
| 519 | 519 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 773b9b2e-70c8-40be-8cbb-e7b5abab360d | UMAP of Columnar cells | 773b9b2e-70c8-40be-8cbb-e7b5abab360d.h5ad | 79522 |
| 520 | 520 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | e5233a94-9e43-418c-8209-6f1400c31530 | UMAP of all data | e5233a94-9e43-418c-8209-6f1400c31530.h5ad | 146583 |
| 521 | 521 | 10bf5c50-8d85-4c5f-94b4-22c1363d9f31 | Intratumoral heterogeneity in recurrent pediat... | | be884a28-0eac-4eca-9c23-1294a93cf56e | Intratumoral heterogeneity in recurrent pediat... | be884a28-0eac-4eca-9c23-1294a93cf56e.h5ad | 35637 |
522 rows × 8 columns
For convenience, read the entire presence matrix for Homo sapiens into memory. The cellxgene_census.get_presence_matrix API does this in one call and returns a SciPy sparse matrix (in Compressed Sparse Row format, as the output shows):
[2]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")
presence_matrix
[2]:
<522x60664 sparse matrix of type '<class 'numpy.uint8'>'
with 13844504 stored elements in Compressed Sparse Row format>
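As a quick sanity check, the figures in this repr give a sense of how full the matrix is. The sketch below only redoes the arithmetic on the numbers printed above, and it assumes every stored element is a nonzero presence flag.

```python
# Figures copied from the matrix repr shown above.
n_datasets, n_var = 522, 60664
stored_elements = 13844504

# Fraction of all (dataset, feature) pairs that are marked present.
possible_entries = n_datasets * n_var
density = stored_elements / possible_entries

# Average number of features marked present per dataset.
mean_features_per_dataset = stored_elements / n_datasets

print(f"{possible_entries:,} possible entries, density = {density:.3f}")
# 31,666,608 possible entries, density = 0.437
print(f"about {mean_features_per_dataset:,.0f} features per dataset on average")
# about 26,522 features per dataset on average
```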
We also need the var dataframe, which is read into a Pandas DataFrame for convenient manipulation:
[3]:
var_df = human_rna.var.read().concat().to_pandas()
var_df
[3]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 0 | ENSG00000238009 | RP11-34P13.7 | 3726 |
| 1 | 1 | ENSG00000279457 | WASH9P | 1397 |
| 2 | 2 | ENSG00000228463 | AP006222.1 | 8224 |
| 3 | 3 | ENSG00000237094 | RP4-669L17.4 | 6204 |
| 4 | 4 | ENSG00000230021 | RP11-206L10.17 | 5495 |
| ... | ... | ... | ... | ... |
| 60659 | 60659 | ENSG00000288719 | RP4-669P10.21 | 4252 |
| 60660 | 60660 | ENSG00000288720 | RP11-852E15.3 | 7007 |
| 60661 | 60661 | ENSG00000288721 | RP5-973N23.5 | 7765 |
| 60662 | 60662 | ENSG00000288723 | RP11-553N16.6 | 1015 |
| 60663 | 60663 | ENSG00000288724 | RP13-546I2.2 | 625 |
60664 rows × 4 columns
Is a feature present in a dataset?
Goal: test if a given feature is present in a given dataset.
Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:
* the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
* the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
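The difference matters as soon as a dataframe has been filtered or re-indexed. The following sketch uses a hypothetical four-feature var table (`toy_var`, with made-up feature IDs) to show a row position and a soma_joinid diverging.

```python
import pandas as pd

# Hypothetical var table with four features.
toy_var = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)

# After filtering and resetting the index, row position no longer equals soma_joinid.
subset = toy_var[toy_var.feature_id.isin(["g1", "g3"])].reset_index(drop=True)

# Row position of g3 within the filtered table (wrong value for the presence matrix).
position = subset.index[subset.feature_id == "g3"][0]
# soma_joinid of g3 (the value the presence matrix expects).
joinid = subset.loc[subset.feature_id == "g3", "soma_joinid"].iloc[0]

print(position, joinid)  # 1 3
```

Using `position` to index the presence matrix would silently query a different feature, so always read the value from the soma_joinid column.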
[4]:
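# Look up the soma_joinid of the feature and of the dataset (each is a one-element Series)
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
# Index the presence matrix by joinid, then pull the single flag out of the result with [0, 0]
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]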
print(f'Feature is {"present" if is_present else "not present"}.')
Feature is present.
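If either ID is misspelled, the lookups above return empty Series and the failure surfaces later as an indexing error. One possible way to make this more explicit is sketched below. The helpers `lookup_joinid` and `is_feature_present` are hypothetical names introduced here, not part of cellxgene_census, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def lookup_joinid(df: pd.DataFrame, column: str, value: str) -> int:
    """Return the soma_joinid of the single row where df[column] == value."""
    matches = df.loc[df[column] == value, "soma_joinid"]
    if len(matches) != 1:
        raise KeyError(
            f"expected exactly one row with {column}={value!r}, found {len(matches)}"
        )
    return int(matches.iloc[0])


def is_feature_present(presence, datasets_df, var_df, dataset_id, feature_id) -> bool:
    """True if the presence matrix has a nonzero entry for (dataset, feature)."""
    row = lookup_joinid(datasets_df, "dataset_id", dataset_id)
    col = lookup_joinid(var_df, "feature_id", feature_id)
    return bool(presence[row, col])


# Hypothetical toy data: two datasets, two features.
toy_datasets = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
toy_var = pd.DataFrame({"soma_joinid": [0, 1], "feature_id": ["g0", "g1"]})
toy_presence = sparse.csr_matrix(np.array([[1, 0], [0, 1]], dtype=np.uint8))

print(is_feature_present(toy_presence, toy_datasets, toy_var, "ds-a", "g0"))  # True
print(is_feature_present(toy_presence, toy_datasets, toy_var, "ds-a", "g1"))  # False
```

With the real Census objects, the same helper could be called with `presence_matrix`, `datasets_df`, and `var_df` from the cells above.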
What datasets contain a feature?
Goal: look up all datasets that have a feature_id present.
[5]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row
# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
[5]:
| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|
| 89 | 89 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 07b1d7c8-5c2e-42f7-9246-26f746cd6013 | Dissection: Myelencephalon (medulla oblongata)... | 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad | 27210 |
| 102 | 102 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 7c1c3d47-3166-43e5-9a95-65ceb2d45f78 | Dissection: Pons (Pn) - Pontine reticular form... | 7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad | 49512 |
| 103 | 103 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 9372df2d-13d6-4fac-980b-919a5b7eb483 | Dissection: Midbrain (M) - Periaqueductal gray... | 9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad | 33794 |
| 131 | 131 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | dd03ce70-3243-4c96-9561-330cc461e4d7 | Dissection: Cerebral cortex (Cx) - Perirhinal ... | dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad | 23732 |
| 145 | 145 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 7a0a8891-9a22-4549-a55b-c2aca23c3a2a | Supercluster: Hippocampal CA1-3 | 7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad | 74979 |
| 147 | 147 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | d2b5efc1-14c6-4b5f-bd98-40f9084872d7 | Dissection: Tail of Hippocampus (HiT) - Caudal... | d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad | 36886 |
| 151 | 151 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | f8dda921-5fb4-4c94-a654-c6fc346bfd6d | Dissection: Cerebral cortex (Cx) - Occipitotem... | f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad | 31899 |
| 154 | 154 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c | Supercluster: Splatter | 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad | 291833 |
| 156 | 156 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | bdb26abd-f4ba-4ea3-8862-c2340e7a4f55 | Supercluster: CGE interneuron | bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad | 227671 |
| 158 | 158 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4 | Dissection: Midbrain (M) - Inferior colliculus... | 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad | 32306 |
| 166 | 166 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 35c8a04c-8639-4d15-8228-765d8d93fc96 | Dissection: Hypothalamus (HTH) - supraoptic re... | 35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad | 16753 |
| 168 | 168 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2 | Dissection: Head of hippocampus (HiH) - CA1-3 | 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad | 37911 |
| 184 | 184 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 93131426-0124-4ab4-a013-9dfbcd99d467 | Dissection: Epithalamus - ETH | 93131426-0124-4ab4-a013-9dfbcd99d467.h5ad | 24327 |
| 190 | 190 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | acae7679-d077-461c-b857-ee6ccfeb267f | Dissection: Head of hippocampus (HiH) - CA1 | acae7679-d077-461c-b857-ee6ccfeb267f.h5ad | 39147 |
| 192 | 192 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c202b243-1aa1-4b16-bc9a-b36241f3b1e3 | Supercluster: Amygdala excitatory | c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad | 109452 |
| 194 | 194 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 | Supercluster: Upper-layer intratelencephalic | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad | 455006 |
| 196 | 196 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771 | Dissection: Cerebral cortex (Cx) - Temporal po... | c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad | 37642 |
| 197 | 197 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | c4b03352-af8d-492a-8d6b-40f304e0a122 | Supercluster: Medium spiny neuron | c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad | 152189 |
| 214 | 214 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Transcriptomic diversity of cell types across ... | 10.1101/2022.10.12.511898 | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd | Dissection: Amygdaloid complex (AMY) - basolat... | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad | 35285 |
| 487 | 487 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | a68b64d8-aee3-4947-81b7-36b8fe5a44d2 | Tabula Sapiens - Stromal | a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad | 82478 |
| 488 | 488 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 97a17473-e2b1-4f31-a544-44a60773e2dd | Tabula Sapiens - Epithelial | 97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad | 104148 |
| 489 | 489 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | c5d88abe-f23a-45fa-a534-788985e93dad | Tabula Sapiens - Immune | c5d88abe-f23a-45fa-a534-788985e93dad.h5ad | 264824 |
| 490 | 490 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 5a11f879-d1ef-458a-910c-9b0bdfca5ebf | Tabula Sapiens - Endothelial | 5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad | 31691 |
| 491 | 491 | e5f58829-1a66-40b5-a624-9046778e74f5 | Tabula Sapiens | 10.1126/science.abl4896 | 53d208b0-2cfd-4366-9866-c3c6114081bc | Tabula Sapiens - All Cells | 53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad | 483152 |
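The same column slicing extends to several features at once. The sketch below, on hypothetical toy data, separates datasets that contain all of the requested features from those that contain at least one. It assumes the presence values are 0/1 flags, so a row sum counts how many requested features a dataset has.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: 3 datasets and 4 features.
toy_datasets = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_var = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Joinids of the features of interest.
wanted = ["g1", "g2"]
var_joinids = toy_var.loc[toy_var.feature_id.isin(wanted), "soma_joinid"].to_numpy()

# Keep only those columns, then count how many of them each dataset has.
hits_per_dataset = np.asarray(toy_presence[:, var_joinids].sum(axis=1)).ravel()

# Row positions of the presence matrix are dataset joinids.
joinids_with_all = np.flatnonzero(hits_per_dataset == len(var_joinids))
joinids_with_any = np.flatnonzero(hits_per_dataset > 0)

all_ids = toy_datasets.loc[toy_datasets.soma_joinid.isin(joinids_with_all), "dataset_id"].tolist()
any_ids = toy_datasets.loc[toy_datasets.soma_joinid.isin(joinids_with_any), "dataset_id"].tolist()
print(all_ids)  # ['ds-b', 'ds-c']
print(any_ids)  # ['ds-a', 'ds-b', 'ds-c']
```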
What features are in a dataset?
Goal: look up the features present in a given dataset.
This example also demonstrates the ability to do the query on multiple datasets.
[6]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid
# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col
# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]
[6]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 0 | ENSG00000238009 | RP11-34P13.7 | 3726 |
| 1 | 1 | ENSG00000279457 | WASH9P | 1397 |
| 2 | 2 | ENSG00000228463 | AP006222.1 | 8224 |
| 3 | 3 | ENSG00000237094 | RP4-669L17.4 | 6204 |
| 4 | 4 | ENSG00000230021 | RP11-206L10.17 | 5495 |
| ... | ... | ... | ... | ... |
| 56689 | 56689 | ENSG00000283063 | TRBV6-2 | 424 |
| 56698 | 56698 | ENSG00000283095 | ABC11-4932300O16.1 | 1535 |
| 56705 | 56705 | ENSG00000283117 | MGC4859 | 3118 |
| 56706 | 56706 | ENSG00000283118 | RP11-107E5.4 | 644 |
| 56709 | 56709 | ENSG00000283125 | RP11-299P2.2 | 547 |
27211 rows × 4 columns
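Note that when several datasets are selected, the query above returns the union of their features: a feature is listed if any selected dataset has it. The sketch below contrasts that with the intersection (features present in every selected dataset), using hypothetical toy data and assuming the presence values are 0/1 flags.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: 3 datasets and 4 features.
toy_var = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["g0", "g1", "g2", "g3"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Select the first two datasets by joinid.
dataset_joinids = np.array([0, 1])
selected = toy_presence[dataset_joinids, :]

# Union: any column with at least one stored entry among the selected rows.
union_joinids = np.unique(selected.tocoo().col)

# Intersection: columns whose flags sum to the number of selected datasets.
counts = np.asarray(selected.sum(axis=0)).ravel()
intersection_joinids = np.flatnonzero(counts == len(dataset_joinids))

union_ids = toy_var.loc[toy_var.soma_joinid.isin(union_joinids), "feature_id"].tolist()
shared_ids = toy_var.loc[toy_var.soma_joinid.isin(intersection_joinids), "feature_id"].tolist()
print(union_ids)   # ['g0', 'g1', 'g2']
print(shared_ids)  # ['g1']
```

The intersection is often the more useful set when comparing expression across datasets, since it avoids features that were simply not measured in some of them.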