Exploring the Census Datasets table

This tutorial demonstrates basic use of the census_datasets dataframe, which contains metadata about the Census source datasets. This metadata can be joined to the cell metadata dataframe (obs) via the dataset_id column.

Contents

Fetching the datasets table.
Fetching the expression data from a single dataset.
Downloading the original source H5AD file of a dataset.

Fetching the datasets table

Each Census contains a top-level dataframe itemizing the datasets contained therein. You can read this into a pandas.DataFrame.

[1]:

import cellxgene_census

census = cellxgene_census.open_soma()
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# For convenience, index on soma_joinid, which links this table to other Census data.
census_datasets = census_datasets.set_index("soma_joinid")

census_datasets

[1]:

	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
soma_joinid
0	43d4bb39-21af-4d05-b973-4c1fed7b916c	Transcriptional Programming of Normal and Infl...	10.1016/j.celrep.2018.09.006	f512b8b6-369d-4a85-a695-116e0806857f	Skin	f512b8b6-369d-4a85-a695-116e0806857f.h5ad	68036
1	2b02dff7-e427-4cdc-96fb-c0f354c099aa	Single-Cell Analysis of Crohn’s Disease Lesion...	10.1016/j.cell.2019.08.008	36c867a7-be10-4e69-9b39-5de12b0af6da	Ileum	36c867a7-be10-4e69-9b39-5de12b0af6da.h5ad	32458
2	e9eec7f5-8519-42f6-99b4-6dbd9cc5ef03	Humoral immunity at the brain borders in homeo...	10.1016/j.coi.2022.102188	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff	A scRNA-seq atlas of immune cells at the CNS b...	58b01044-c5e5-4b0f-8a2d-6ebf951e01ff.h5ad	130908
3	a72afd53-ab92-4511-88da-252fb0e26b9a	Single-cell atlas of peripheral immune respons...	10.1038/s41591-020-0944-y	456e8b9b-f872-488b-871d-94534090a865	Single-cell atlas of peripheral immune respons...	456e8b9b-f872-488b-871d-94534090a865.h5ad	44721
4	e4c9ed14-e560-4900-a3bf-b0f8d2ce6a10	A molecular single-cell lung atlas of lethal C...	10.1038/s41586-021-03569-1	d8da613f-e681-4c69-b463-e94f5e66847f	A molecular single-cell lung atlas of lethal C...	d8da613f-e681-4c69-b463-e94f5e66847f.h5ad	116313
...	...	...	...	...	...	...	...
524	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		30498543-4fdd-4f86-9e1b-05c1a1454a6a	B cells -- CV19 infection, vaccination and HC	30498543-4fdd-4f86-9e1b-05c1a1454a6a.h5ad	20727
525	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		b5191f01-f67d-44b8-bc8d-511a4ecd07bb	innate T cells -- CV19 infection, vaccination ...	b5191f01-f67d-44b8-bc8d-511a4ecd07bb.h5ad	33415
526	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		e463dae9-3fc1-476d-870e-d98a04c56cd6	M cells -- CV19 infection, vaccination and HC	e463dae9-3fc1-476d-870e-d98a04c56cd6.h5ad	41130
527	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		1b699e04-1127-42ea-998b-011ace4a5b81	T cells -- CV19 infection, vaccination and HC	1b699e04-1127-42ea-998b-011ace4a5b81.h5ad	98068
528	e3f391f6-5a75-4e96-8450-da47c3d2a939	COVID-19 mRNA vaccine elicits a potent adaptiv...		87ce26ed-e5d1-44b4-81cc-cc5b709a169f	Circulating Immune cells -- CV19 infection, va...	87ce26ed-e5d1-44b4-81cc-cc5b709a169f.h5ad	195632

529 rows × 7 columns
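The join on dataset_id mentioned in the introduction can be sketched with plain pandas. The two frames below are small, made-up stand-ins for the datasets table and a slice of obs (the IDs, titles, and cell types are illustrative, not real Census rows); only the column names dataset_id, dataset_title, collection_name, soma_joinid, and cell_type come from the Census tables.

```python
import pandas as pd

# Toy stand-in for the datasets table (made-up IDs and titles, not real Census rows).
datasets = pd.DataFrame(
    {
        "dataset_id": ["ds-a", "ds-b"],
        "dataset_title": ["Toy skin dataset", "Toy ileum dataset"],
        "collection_name": ["Toy collection 1", "Toy collection 2"],
    }
)

# Toy stand-in for a slice of the cell metadata (obs).
obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2],
        "dataset_id": ["ds-a", "ds-b", "ds-a"],
        "cell_type": ["keratinocyte", "enterocyte", "fibroblast"],
    }
)

# A left join keeps every cell and attaches its dataset-level metadata.
# validate="many_to_one" raises if dataset_id is not unique in the datasets table.
obs_with_dataset = obs.merge(datasets, on="dataset_id", how="left", validate="many_to_one")
print(obs_with_dataset[["soma_joinid", "cell_type", "dataset_title"]])
```

With the real tables, note that both obs and the datasets table carry a soma_joinid, and the two identify different things (cells versus datasets). Merging on dataset_id with soma_joinid set as the index of census_datasets, as done above, avoids mixing them up.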

The sum of cells across all datasets should match the number of cells across all SOMA experiments (human, mouse).

[2]:

# Count cells across all experiments
all_experiments = census["census_data"].items()
experiments_total_cells = 0
print("Count by experiment:")
for organism_name, organism_experiment in all_experiments:
    num_cells = len(organism_experiment.obs.read(column_names=["soma_joinid"]).concat().to_pandas())
    print(f"\t{num_cells} cells in {organism_name}")
    experiments_total_cells += num_cells

print(f"\nFound {experiments_total_cells} cells in all experiments.")

# Count cells across all datasets
print(f"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.")

Count by experiment:
        46050829 cells in homo_sapiens
        3922090 cells in mus_musculus

Found 49972919 cells in all experiments.
Found 49972919 cells in all datasets.
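The same consistency check could also be done per dataset rather than in aggregate. The sketch below uses made-up frames and assumes that dataset_total_cell_count equals the number of obs rows carrying that dataset_id; the tutorial only confirms that the totals match, so treat the per-dataset equality as an assumption to verify.

```python
import pandas as pd

# Toy stand-ins (made-up IDs and counts, not real Census rows).
datasets = pd.DataFrame(
    {"dataset_id": ["ds-a", "ds-b"], "dataset_total_cell_count": [3, 2]}
)
obs = pd.DataFrame({"dataset_id": ["ds-a", "ds-b", "ds-a", "ds-a", "ds-b"]})

# Count the cells observed for each dataset_id in obs.
observed = obs["dataset_id"].value_counts().rename("observed_cell_count")

# Put the declared and observed counts side by side, keyed by dataset_id.
comparison = datasets.set_index("dataset_id").join(observed)

# Keep only the datasets whose two counts disagree.
mismatches = comparison[
    comparison["dataset_total_cell_count"] != comparison["observed_cell_count"]
]
print(comparison)
print(f"{len(mismatches)} datasets with mismatched counts")
```

On the real Census, obs would be read with column_names=["dataset_id"] so that only the column needed for counting is loaded.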

Fetching the expression data from a single dataset

Let's pick one dataset to slice out of the Census and turn into an in-memory AnnData object. This object can be used with the Scanpy toolchain. You can also save it locally using the AnnData write API (https://anndata.readthedocs.io/en/latest/api.html#writing).

[3]:

census_datasets[census_datasets.dataset_id == "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"]

[3]:

	collection_id	collection_name	collection_doi	dataset_id	dataset_title	dataset_h5ad_path	dataset_total_cell_count
soma_joinid
197	0b9d8a04-bb9d-44da-aa27-705bb65b54eb	Tabula Muris Senis	10.1038/s41586-020-2496-1	0bd1a1de-3aee-40e0-b2ec-86c7a30c7149	Bone marrow - A single-cell transcriptomic atl...	0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad	40220

Create a query on the mouse experiment, “RNA” measurement, for the dataset_id.

[4]:

adata = cellxgene_census.get_anndata(
    census, organism="Mus musculus", obs_value_filter="dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'"
)

adata

[4]:

AnnData object with n_obs × n_vars = 40220 × 52392
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
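The obs_value_filter string used above can be generated when you want to select one or several datasets. The helper below, build_dataset_filter, is a hypothetical convenience function (not part of cellxgene_census); it relies on the value filter syntax accepting `==` for a single value and `in [...]` for a list.

```python
def build_dataset_filter(dataset_ids):
    """Return a value-filter string selecting cells from any of the given datasets.

    Hypothetical helper for illustration; it is not part of cellxgene_census.
    """
    ids = list(dataset_ids)
    if not ids:
        raise ValueError("at least one dataset_id is required")
    if len(ids) == 1:
        # A single ID uses a plain equality test, as in the cell above.
        return f"dataset_id == '{ids[0]}'"
    # Several IDs use list membership.
    quoted = ", ".join(f"'{dataset_id}'" for dataset_id in ids)
    return f"dataset_id in [{quoted}]"


single = build_dataset_filter(["0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"])
several = build_dataset_filter(["placeholder-id-1", "placeholder-id-2"])  # made-up IDs
print(single)
print(several)
```

The resulting string can be passed as obs_value_filter to cellxgene_census.get_anndata. All IDs in one query must belong to the organism being queried, since each organism is stored in its own experiment.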

Downloading the original source H5AD file of a dataset

You can download the original H5AD file for any given dataset. This is the same H5AD you can download from CZ CELLxGENE Discover, and it may contain additional submitter-provided information that was not included in the Census.

To do this, you can either fetch the file's location in the cloud or download it directly to your system using the cellxgene-census package.

[5]:

# Option 1: Direct download
cellxgene_census.download_source_h5ad(
    "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149", to_path="Tabula_Muris_Senis-bone_marrow.h5ad"
)

[6]:

# Option 2: Get location and download via preferred method
uri = cellxgene_census.get_source_h5ad_uri("0bd1a1de-3aee-40e0-b2ec-86c7a30c7149")
uri

# You can now download the H5AD from a shell with the AWS CLI, e.g. `aws s3 cp <uri> ./`

[6]:

{'uri': 's3://cellxgene-data-public/cell-census/2023-04-10/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',
 's3_region': 'us-west-2'}
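The returned dictionary can be turned into a ready-to-run AWS CLI command. This is a minimal sketch: the dictionary is copied from the output above, and the `--no-sign-request` flag assumes the bucket allows anonymous reads; drop it if you prefer to use your own AWS credentials.

```python
import shlex

# Location info as returned by get_source_h5ad_uri (copied from the output above).
source = {
    "uri": "s3://cellxgene-data-public/cell-census/2023-04-10/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad",
    "s3_region": "us-west-2",
}

# Build the command as a list, then quote it safely for a shell.
# --no-sign-request is an assumption: it only works for anonymously readable buckets.
command = shlex.join(
    ["aws", "s3", "cp", "--no-sign-request", "--region", source["s3_region"], source["uri"], "./"]
)
print(command)
```

Building the command from a list and quoting it with shlex.join keeps the URI intact even if it ever contains characters that a shell would otherwise interpret.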

Finally, close the Census when you are done.

[7]:

census.close()