Census query & extract subsets

Goal: demonstrate the ability to query subsets of the Census based upon user-defined obs/var metadata, and extract those slices into in-memory data structures for further analysis.

NOTE: all examples in this notebook assume that sufficient memory exists on the host machine to store query results. There are other notebooks which provide examples for out-of-core processing.

[1]:

import cellxgene_census

census = cellxgene_census.open_soma(census_version="latest")

The Census includes SOMA Experiments for both human and mouse. These experiments can be queried based upon metadata values (e.g., tissue type), and the query result can be extracted into a variety of formats.

⚠️ NOTE: The following is experimental query code. It is built upon SOMA, but is not (yet) part of SOMA. If it becomes sufficiently useful, we plan to propose it as a SOMA extension.

Basic idea:

define per-axis (i.e., obs, var) query criteria
specify the experiment and measurement name to be queried
specify the column names you want as part of the results
and read the query result into an in-memory format.

This utilizes the SOMA value_filter query language. Keep in mind that the results must fit into memory, so it is best to define a selective query and only fetch those axis metadata columns which are necessary.
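As a minimal sketch of what such a filter string looks like, the snippet below assembles one from a plain Python dict. The helper `build_value_filter` is hypothetical (it is not part of cellxgene_census or SOMA) and only handles simple equality and membership tests joined by `and`; values that themselves contain quote characters are not handled.

```python
def build_value_filter(criteria):
    """Combine {column: value or list of values} into one value_filter string.

    Hypothetical helper for illustration only. A single value becomes
    `column == value`; a list or tuple becomes `column in [...]`.
    All clauses are joined with `and`.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


obs_filter = build_value_filter(
    {
        "tissue_ontology_term_id": "UBERON:0002048",
        "sex_ontology_term_id": "PATO:0000383",
        "cell_type_ontology_term_id": ["CL:0002063", "CL:0000499"],
    }
)
print(obs_filter)
```

The resulting string can be passed as `obs_value_filter` (or, for gene metadata, `var_value_filter`). Writing the filter by hand is equally valid; the helper merely keeps the quoting consistent when criteria come from variables.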

The cellxgene_census package includes a convenience function to extract a slice of the Census and read it into an AnnData object, for use with Scanpy. This function accepts a variety of arguments, including:

* the organism to slice
* the per-axis slice criteria
* the columns to fetch and include in the AnnData

For more complex query scenarios, there is an advanced query API demonstrated in other notebooks.

[2]:

# Define a simple obs-axis query for all cells where tissue is UBERON:0002048, sex is PATO:0000383, and cell type is CL:0002063 or CL:0000499.
adata = cellxgene_census.get_anndata(
    census,
    "Homo sapiens",
    obs_value_filter="tissue_ontology_term_id=='UBERON:0002048' and sex_ontology_term_id=='PATO:0000383' and cell_type_ontology_term_id in ['CL:0002063', 'CL:0000499']",
)

display(adata)

AnnData object with n_obs × n_vars = 129240 × 60664
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
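To get a feel for why query results need to fit in memory, the following back-of-the-envelope sketch estimates the size of a matrix with the shape reported above. It assumes 4-byte (float32) values and 4-byte indices, and the density value is a hypothetical placeholder, not a number measured from the Census; both helper functions are illustrative and not part of any library.

```python
def dense_bytes(n_obs, n_vars, value_bytes=4):
    """Bytes needed to hold the full matrix densely (one value per cell/gene pair)."""
    return n_obs * n_vars * value_bytes


def csr_bytes(n_obs, n_vars, density, value_bytes=4, index_bytes=4):
    """Approximate bytes for a CSR sparse matrix with the given fraction of non-zeros."""
    nnz = round(n_obs * n_vars * density)
    # data array + column-index array, plus one row pointer per row (+1)
    return nnz * (value_bytes + index_bytes) + (n_obs + 1) * index_bytes


n_obs, n_vars = 129240, 60664  # shape reported for the query above
assumed_density = 0.05  # hypothetical placeholder, not measured from the Census

print(f"dense float32: {dense_bytes(n_obs, n_vars) / 1e9:.1f} GB")
print(f"CSR at {assumed_density:.0%} density: {csr_bytes(n_obs, n_vars, assumed_density) / 1e9:.1f} GB")
```

Under these assumptions the dense form alone would need roughly 31 GB, which is why a selective obs filter, a var filter, or both are worth defining before fetching.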

[3]:

# You can also query on both axes. This example adds a var-axis query for a handful of genes, and queries the mouse experiment.
adata = cellxgene_census.get_anndata(
    census,
    "Mus musculus",
    obs_value_filter="tissue == 'brain'",
    var_value_filter="feature_name in ['Gm16259', 'Dcaf5', 'Gm53058']",
    column_names={"obs": ["tissue", "cell_type", "sex"]},
)

display(adata)

AnnData object with n_obs × n_vars = 133674 × 3
    obs: 'tissue', 'cell_type', 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

Close the census

[4]:

census.close()
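As an alternative to calling `close()` explicitly, the handle returned by `open_soma` can be used as a context manager, so it is closed even if the query raises. The sketch below wraps the mouse query from this notebook in a function; the function name `fetch_mouse_brain_subset` is hypothetical, and the import is placed inside the function only so that defining it requires neither the package nor network access.

```python
def fetch_mouse_brain_subset(census_version="latest"):
    """Open the Census, run the mouse brain query, and close the Census on exit."""
    # Imported lazily so that defining this function has no side effects.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        # The AnnData is fully read into memory before the `with` block exits,
        # so it remains usable after the Census handle is closed.
        return cellxgene_census.get_anndata(
            census,
            "Mus musculus",
            obs_value_filter="tissue == 'brain'",
            var_value_filter="feature_name in ['Gm16259', 'Dcaf5', 'Gm53058']",
            column_names={"obs": ["tissue", "cell_type", "sex"]},
        )
```

Calling `adata = fetch_mouse_brain_subset()` would then return the same kind of AnnData object as the cell above, without a separate close step.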