Census query & extract subsets

Goal: demonstrate the ability to query subsets of the Census based upon user-defined obs/var metadata, and extract those slices into in-memory data structures for further analysis.

NOTE: all examples in this notebook assume that sufficient memory exists on the host machine to store query results. There are other notebooks which provide examples for out-of-core processing.

[1]:
import cellxgene_census

census = cellxgene_census.open_soma(census_version="latest")

The Census includes SOMA Experiments for both human and mouse. These experiments can be queried based upon metadata values (e.g., tissue type), and the query result can be extracted into a variety of formats.

⚠️ NOTE: The following is experimental query code. It is built upon SOMA, but not (yet) part of SOMA. If it becomes sufficiently useful, we plan to propose it as a SOMA extension.

Basic idea:

  • define per-axis (i.e., obs, var) query criteria

  • specify the experiment and measurement name to be queried

  • specify the column names you want as part of the results

  • and read the query result into an in-memory format.

This utilizes the SOMA value_filter query language. Keep in mind that the results must fit into memory, so it is best to define a selective query and only fetch those axis metadata columns which are necessary.
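
As a minimal sketch of how such filter strings are put together, the helper below assembles a value_filter expression from a dictionary of column criteria. Note that `build_value_filter` is a hypothetical name used for illustration and is not part of cellxgene_census or SOMA. It assumes only `==` and `in` comparisons joined with `and`, and simple string values.

```python
def build_value_filter(criteria):
    """Join per-column criteria into a single value_filter string.

    Hypothetical helper for illustration; not part of cellxgene_census.
    A list or tuple becomes an `in` clause; any other value becomes `==`.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


# Column names and ontology term IDs are taken from the query in this notebook.
obs_filter = build_value_filter(
    {
        "tissue_ontology_term_id": "UBERON:0002048",
        "cell_type_ontology_term_id": ["CL:0002063", "CL:0000499"],
    }
)
print(obs_filter)
```

The resulting string could then be passed as the obs_value_filter argument of cellxgene_census.get_anndata. Keeping the criteria in a dictionary makes it easier to see how selective a query is before running it.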

The cellxgene_census package includes a convenience function to extract a slice of the Census and read it into an AnnData, for use with Scanpy. This function accepts a variety of arguments, including:

  • the organism to slice

  • the per-axis slice criteria

  • the columns to fetch and include in the AnnData

For more complex query scenarios, there is an advanced query API demonstrated in other notebooks.

[2]:
# Define a simple obs-axis query for all cells where tissue is UBERON:0002048, sex is PATO:0000383, and cell type is CL:0002063 or CL:0000499.
adata = cellxgene_census.get_anndata(
    census,
    "Homo sapiens",
    obs_value_filter="tissue_ontology_term_id=='UBERON:0002048' and sex_ontology_term_id=='PATO:0000383' and cell_type_ontology_term_id in ['CL:0002063', 'CL:0000499']",
)

display(adata)
AnnData object with n_obs × n_vars = 129240 × 60664
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
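
Because the result must fit into memory, the reported shape can be turned into a rough size estimate. The sketch below computes the footprint of a fully dense matrix, assuming 4-byte values (an assumption, not something stated by the Census). The function names are hypothetical. If the expression matrix is held in a sparse format, the real footprint is usually far smaller, so treat the number as a pessimistic upper bound only.

```python
def dense_float32_bytes(n_obs, n_vars):
    """Bytes needed for a dense n_obs x n_vars matrix of 4-byte values."""
    return n_obs * n_vars * 4


def to_gib(n_bytes):
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# Shape taken from the AnnData summary printed above.
upper_bound = dense_float32_bytes(129240, 60664)
print(f"Dense upper bound: {to_gib(upper_bound):.1f} GiB")
```

Adding a var-axis filter shrinks n_vars and therefore this bound, which is one reason to keep queries selective.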
[3]:
# You can also query on both axes. This example adds a var-axis query for a handful of genes, and queries the mouse experiment.
adata = cellxgene_census.get_anndata(
    census,
    "Mus musculus",
    obs_value_filter="tissue == 'brain'",
    var_value_filter="feature_name in ['Gm16259', 'Dcaf5', 'Gm53058']",
    column_names={"obs": ["tissue", "cell_type", "sex"]},
)

display(adata)
AnnData object with n_obs × n_vars = 133674 × 3
    obs: 'tissue', 'cell_type', 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

Close the census

[4]:
census.close()