Census Datasets example
Goal: demonstrate basic use of the census_datasets dataframe.
Each Census contains a top-level dataframe that itemizes the datasets it contains. You can read it into a pandas DataFrame:
[1]:
import cellxgene_census
import tiledbsoma as soma
census = cellxgene_census.open_soma()
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# For convenience, index on soma_joinid, which links this table to other Census data.
census_datasets = census_datasets.set_index("soma_joinid")
census_datasets
[1]:
| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|
| 0 | 43d4bb39-21af-4d05-b973-4c1fed7b916c | Transcriptional Programming of Normal and Infl... | 10.1016/j.celrep.2018.09.006 | f512b8b6-369d-4a85-a695-116e0806857f | Skin | f512b8b6-369d-4a85-a695-116e0806857f.h5ad | 68036 |
| 1 | d36ca85c-3e8b-444c-ba3e-a645040c6185 | A molecular atlas of the human postmenopausal ... | 10.1101/2022.08.04.502826 | 90d4a63b-5c02-43eb-acde-c49345681601 | Fallopian tube RNA | 90d4a63b-5c02-43eb-acde-c49345681601.h5ad | 60574 |
| 2 | d36ca85c-3e8b-444c-ba3e-a645040c6185 | A molecular atlas of the human postmenopausal ... | 10.1101/2022.08.04.502826 | d1207c81-7309-43a7-a5a0-f4283670b62b | Ovary RNA | d1207c81-7309-43a7-a5a0-f4283670b62b.h5ad | 26134 |
| 3 | 2b02dff7-e427-4cdc-96fb-c0f354c099aa | Single-Cell Analysis of Crohn’s Disease Lesion... | 10.1016/j.cell.2019.08.008 | 36c867a7-be10-4e69-9b39-5de12b0af6da | Ileum | 36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad | 32458 |
| 4 | e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03 | Humoral immunity at the brain borders in homeo... | 10.1016/j.coi.2022.102188 | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | A scRNA-seq atlas of immune cells at the CNS b... | 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad | 130908 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 517 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 63bb6359-3945-4658-92eb-3072419953e4 | UMAP of T-Cells cells | 63bb6359-3945-4658-92eb-3072419953e4.h5ad | 14970 |
| 518 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 | UMAP of Myeloid cells | 94423ec1-21f8-40e8-b5c9-c3ea82350ca4.h5ad | 3282 |
| 519 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | 773b9b2e-70c8-40be-8cbb-e7b5abab360d | UMAP of Columnar cells | 773b9b2e-70c8-40be-8cbb-e7b5abab360d.h5ad | 79522 |
| 520 | a18474f4-ff1e-4864-af69-270b956cee5b | Single-cell RNA sequencing unifies development... | 10.1158/2159-8290.cd-22-0824 | e5233a94-9e43-418c-8209-6f1400c31530 | UMAP of all data | e5233a94-9e43-418c-8209-6f1400c31530.h5ad | 146583 |
| 521 | 10bf5c50-8d85-4c5f-94b4-22c1363d9f31 | Intratumoral heterogeneity in recurrent pediat... | | be884a28-0eac-4eca-9c23-1294a93cf56e | Intratumoral heterogeneity in recurrent pediat... | be884a28-0eac-4eca-9c23-1294a93cf56e.h5ad | 35637 |
522 rows × 7 columns
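Because this is an ordinary pandas DataFrame, it can be summarized with the usual pandas operations. The following is a minimal sketch, not part of the original notebook: it builds a small stand-in frame from three rows copied out of the table above (so it runs without opening the Census) and counts datasets and cells per collection. The helper name `summarize_by_collection` is hypothetical.

```python
import pandas as pd

# Stand-in for census_datasets: three rows copied from the table above.
sample = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2],
        "collection_id": [
            "43d4bb39-21af-4d05-b973-4c1fed7b916c",
            "d36ca85c-3e8b-444c-ba3e-a645040c6185",
            "d36ca85c-3e8b-444c-ba3e-a645040c6185",
        ],
        "dataset_title": ["Skin", "Fallopian tube RNA", "Ovary RNA"],
        "dataset_total_cell_count": [68036, 60574, 26134],
    }
).set_index("soma_joinid")


def summarize_by_collection(datasets: pd.DataFrame) -> pd.DataFrame:
    """Count datasets and total cells per collection, largest collection first.

    Hypothetical helper; it only assumes the column names shown above.
    """
    return (
        datasets.groupby("collection_id")
        .agg(
            n_datasets=("dataset_title", "count"),
            total_cells=("dataset_total_cell_count", "sum"),
        )
        .sort_values("total_cells", ascending=False)
    )


summary = summarize_by_collection(sample)
print(summary)
```

With the real data, the same call would be `summarize_by_collection(census_datasets)`.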
The sum of cells across all datasets should match the number of cells across all SOMA experiments (human, mouse).
[2]:
# Count cells across all experiments
experiments_total_cells = 0
print("Count by experiment:")
for organism_name, organism_experiment in census["census_data"].items():
    # Read only soma_joinid to keep the per-organism obs read small
    num_cells = len(organism_experiment.obs.read(column_names=["soma_joinid"]).concat().to_pandas())
    print(f"\t{num_cells} cells in {organism_name}")
    experiments_total_cells += num_cells

print(f"\nFound {experiments_total_cells} cells in all experiments.")

# Count cells across all datasets
print(f"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.")
Count by experiment:
45501425 cells in homo_sapiens
3922090 cells in mus_musculus
Found 49423515 cells in all experiments.
Found 49423515 cells in all datasets.
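The totals agree, but the same comparison can also be made per dataset by counting `obs` rows per `dataset_id` and comparing against `dataset_total_cell_count`. The sketch below illustrates the idea on toy frames; the values `a`, `b`, `c` and their counts are invented for illustration and are not Census data, and `find_count_mismatches` is a hypothetical helper. With the real Census, `obs` would presumably come from reading the `dataset_id` column of each experiment's `obs`.

```python
import pandas as pd


def find_count_mismatches(obs: pd.DataFrame, datasets: pd.DataFrame) -> pd.DataFrame:
    """Return datasets whose observed cell count differs from the recorded total.

    Hypothetical helper: `obs` needs a `dataset_id` column (one row per cell);
    `datasets` needs `dataset_id` and `dataset_total_cell_count` columns.
    """
    # Number of obs rows (cells) seen for each dataset_id
    observed = obs["dataset_id"].value_counts().rename("observed_cells")
    merged = datasets.set_index("dataset_id")[["dataset_total_cell_count"]].join(observed)
    # Datasets with no cells in obs get a count of zero
    merged["observed_cells"] = merged["observed_cells"].fillna(0).astype(int)
    return merged[merged["observed_cells"] != merged["dataset_total_cell_count"]]


# Toy inputs (invented): dataset "c" claims 4 cells but none appear in obs.
toy_obs = pd.DataFrame({"dataset_id": ["a", "a", "b"]})
toy_datasets = pd.DataFrame(
    {"dataset_id": ["a", "b", "c"], "dataset_total_cell_count": [2, 1, 4]}
)

mismatches = find_count_mismatches(toy_obs, toy_datasets)
print(mismatches)
```

An empty result means every dataset's recorded total matches the number of cells found for it.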
Let's pick one dataset to slice out of the Census and turn it into an in-memory AnnData object, which can be used with the Scanpy toolchain. You can also save this AnnData locally using the AnnData [write](https://anndata.readthedocs.io/en/latest/api.html#writing) API.
[3]:
census_datasets[census_datasets.dataset_id == "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"]
[3]:
| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|
| 338 | 0b9d8a04-bb9d-44da-aa27-705bb65b54eb | Tabula Muris Senis | 10.1038/s41586-020-2496-1 | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149 | Bone marrow - A single-cell transcriptomic atl... | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad | 40220 |
Create a query on the mouse experiment, “RNA” measurement, for the dataset_id.
[4]:
mouse = census["census_data"]["mus_musculus"]
with mouse.axis_query(
    "RNA",
    obs_query=soma.AxisQuery(value_filter="dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'"),
) as query:
    adata = query.to_anndata("raw")
adata
[4]:
AnnData object with n_obs × n_vars = 40220 × 52392
obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
You can also use the cellxgene_census.get_source_h5ad_uri() API to fetch a URI pointing to the H5AD associated with this dataset_id. This is the same H5AD you can download from the CELLxGENE Portal, and it may contain additional submitter-provided information that was not included in the Census.
The “locator” returned by this API includes a uri and any additional information needed to use the URI (e.g., the S3 region).
You will need a download API, such as [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), to fetch this H5AD.
[5]:
uri = cellxgene_census.get_source_h5ad_uri("0bd1a1de-3aee-40e0-b2ec-86c7a30c7149")
uri
[5]:
{'uri': 's3://cellxgene-data-public/cell-census/2023-04-04/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',
's3_region': 'us-west-2'}
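As a rough illustration of how such a locator might be consumed, the sketch below splits the URI into bucket and key using only the standard library, and shows one possible fsspec-based download. Both helper names are hypothetical. The download function assumes the `fsspec` and `s3fs` packages are installed and that the bucket allows anonymous reads (`anon=True`); neither assumption is confirmed by this notebook, and the function is defined but not called here.

```python
from urllib.parse import urlparse


def split_s3_locator(locator: dict) -> dict:
    """Split a locator like the one printed above into bucket, key, and region.

    Hypothetical helper; it only assumes the 'uri' and 's3_region' keys shown above.
    """
    parsed = urlparse(locator["uri"])
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {locator['uri']!r}")
    return {
        "bucket": parsed.netloc,
        "key": parsed.path.lstrip("/"),
        "region": locator.get("s3_region"),
    }


def download_h5ad(locator: dict, to_path: str) -> None:
    """Copy the H5AD to a local path with fsspec (assumes fsspec + s3fs are installed)."""
    import fsspec  # third-party; imported lazily so the parsing helper works without it

    fs = fsspec.filesystem(
        "s3",
        anon=True,  # assumption: the bucket is publicly readable
        client_kwargs={"region_name": locator.get("s3_region")},
    )
    fs.get(locator["uri"], to_path)


# The locator printed in the cell above
locator = {
    "uri": "s3://cellxgene-data-public/cell-census/2023-04-04/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad",
    "s3_region": "us-west-2",
}
parts = split_s3_locator(locator)
print(parts["bucket"], parts["key"], parts["region"])
```

Calling `download_h5ad(locator, "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad")` would then attempt the actual transfer.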
Close the Census.
[6]:
census.close()