Querying and fetching the single-cell data and cell/gene metadata

This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including pandas.DataFrame and anndata.AnnData.

Contents

Opening the census.
Querying cell metadata (obs).
Querying gene metadata (var).
Querying expression data.

Opening the census

The cellxgene_census Python package provides a convenient API to open the latest version of the Census.

[1]:

import cellxgene_census

census = cellxgene_census.open_soma()

You can learn more about the cellxgene_census methods by reading their documentation via help(), for example help(cellxgene_census.open_soma).

Querying cell metadata (obs)

The human cell metadata of the Census, for RNA assays, is located at census["census_data"]["homo_sapiens"].obs. This is a SOMADataFrame, and as such it can be materialized as a pandas.DataFrame via the method chain read().concat().to_pandas().

The mouse cell metadata is at census["census_data"]["mus_musculus"].obs.

To slice the cell metadata, two relevant arguments can be passed to read():

column_names: list of strings indicating which metadata columns to fetch.
value_filter: Python expression with selection conditions to fetch rows. It is similar to pandas.DataFrame.query() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html); for full details see tiledb.QueryCondition (https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition). In short:
- Expressions are one or more comparisons
- Comparisons are one of <column> <op> <value> or <column> <op> <column>
- Expressions can combine comparisons using and, or, & or |
- op is one of < | > | <= | >= | == | != or in
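
The grammar above can be illustrated with a small string-building sketch. The helper build_value_filter is hypothetical (it is not part of cellxgene_census); it only assembles equality and membership comparisons joined with and, and it relies on Python's repr() to quote the values.

```python
def build_value_filter(conditions):
    """Join equality / membership tests into one value_filter string.

    `conditions` maps a column name to either a single value (rendered as
    `column == value`) or a list/tuple of values (rendered as `column in [...]`).
    This is an illustrative helper, not a cellxgene_census function.
    """
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Membership test, e.g. feature_id in ['A', 'B']
            parts.append(f"{column} in {list(value)!r}")
        else:
            # Equality test, e.g. sex == 'unknown'
            parts.append(f"{column} == {value!r}")
    # Every comparison must hold, so combine them with "and".
    return " and ".join(parts)


print(build_value_filter({"cell_type": "B cell", "tissue_general": "lung"}))
# prints: cell_type == 'B cell' and tissue_general == 'lung'
```

The resulting string can then be passed as value_filter to read(). Other operators (<, >=, or, and so on) are left out of this sketch to keep it short.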

To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.

[2]:

keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys

[2]:

['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']

soma_joinid is a special SOMADataFrame column that is used for join operations. The definition for all other columns can be found at the Census schema.

All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.

For example, let’s see what values are available for sex. To do this we can read all of the cell metadata but fetch only the sex column.

[3]:

sex_cell_metadata = census["census_data"]["homo_sapiens"].obs.read(column_names=["sex"]).concat().to_pandas()

sex_cell_metadata.drop_duplicates()

[3]:

	sex
0	male
24982	female
154744	unknown

As you can see, there are only three distinct values for sex: "male", "female" and "unknown".

With this information we can fetch all cell metadata for a specific sex value, for example "unknown".

[4]:

cell_metadata_all_unknown_sex = (
    census["census_data"]["homo_sapiens"].obs.read(value_filter="sex == 'unknown'").concat().to_pandas()
)

cell_metadata_all_unknown_sex

[4]:

	soma_joinid	dataset_id	assay	assay_ontology_term_id	cell_type	cell_type_ontology_term_id	development_stage	development_stage_ontology_term_id	disease	disease_ontology_term_id	...	is_primary_data	self_reported_ethnicity	self_reported_ethnicity_ontology_term_id	sex	sex_ontology_term_id	suspension_type	tissue	tissue_ontology_term_id	tissue_general	tissue_general_ontology_term_id
0	154744	36c867a7-be10-4e69-9b39-5de12b0af6da	10x 3' v1	EFO:0009901	plasma cell	CL:0000786	human early adulthood stage	HsapDv:0000088	Crohn ileitis	MONDO:0000709	...	True	unknown	unknown	unknown	unknown	cell	ileum	UBERON:0002116	small intestine	UBERON:0002108
1	154745	36c867a7-be10-4e69-9b39-5de12b0af6da	10x 3' v1	EFO:0009901	native cell	CL:0000003	human early adulthood stage	HsapDv:0000088	Crohn ileitis	MONDO:0000709	...	True	unknown	unknown	unknown	unknown	cell	ileum	UBERON:0002116	small intestine	UBERON:0002108
2	154746	36c867a7-be10-4e69-9b39-5de12b0af6da	10x 3' v1	EFO:0009901	alpha-beta T cell	CL:0000789	human early adulthood stage	HsapDv:0000088	Crohn ileitis	MONDO:0000709	...	True	unknown	unknown	unknown	unknown	cell	ileum	UBERON:0002116	small intestine	UBERON:0002108
3	154747	36c867a7-be10-4e69-9b39-5de12b0af6da	10x 3' v1	EFO:0009901	native cell	CL:0000003	human early adulthood stage	HsapDv:0000088	Crohn ileitis	MONDO:0000709	...	True	unknown	unknown	unknown	unknown	cell	ileum	UBERON:0002116	small intestine	UBERON:0002108
4	154748	36c867a7-be10-4e69-9b39-5de12b0af6da	10x 3' v1	EFO:0009901	plasma cell	CL:0000786	human early adulthood stage	HsapDv:0000088	Crohn ileitis	MONDO:0000709	...	True	unknown	unknown	unknown	unknown	cell	ileum	UBERON:0002116	small intestine	UBERON:0002108
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2082502	45465783	e5233a94-9e43-418c-8209-6f1400c31530	10x 3' v2	EFO:0009899	activated CD8-positive, alpha-beta T cell	CL:0000906	unknown	unknown	normal	PATO:0000461	...	False	unknown	unknown	unknown	unknown	cell	body of stomach	UBERON:0001161	stomach	UBERON:0000945
2082503	45465784	e5233a94-9e43-418c-8209-6f1400c31530	10x 3' v2	EFO:0009899	gamma-delta T cell	CL:0000798	unknown	unknown	normal	PATO:0000461	...	False	unknown	unknown	unknown	unknown	cell	body of stomach	UBERON:0001161	stomach	UBERON:0000945
2082504	45465785	e5233a94-9e43-418c-8209-6f1400c31530	10x 3' v2	EFO:0009899	activated CD8-positive, alpha-beta T cell	CL:0000906	unknown	unknown	normal	PATO:0000461	...	False	unknown	unknown	unknown	unknown	cell	body of stomach	UBERON:0001161	stomach	UBERON:0000945
2082505	45465786	e5233a94-9e43-418c-8209-6f1400c31530	10x 3' v2	EFO:0009899	natural T-regulatory cell	CL:0000903	unknown	unknown	normal	PATO:0000461	...	False	unknown	unknown	unknown	unknown	cell	body of stomach	UBERON:0001161	stomach	UBERON:0000945
2082506	45465787	e5233a94-9e43-418c-8209-6f1400c31530	10x 3' v2	EFO:0009899	fibroblast	CL:0000057	unknown	unknown	normal	PATO:0000461	...	False	unknown	unknown	unknown	unknown	cell	body of stomach	UBERON:0001161	stomach	UBERON:0000945

2082507 rows × 21 columns

You can use both column_names and value_filter to perform specific queries. For example, let’s fetch the disease column for cells with cell_type "B cell" in the tissue_general "lung".

[5]:

cell_metadata_b_cell = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        value_filter="cell_type == 'B cell' and tissue_general == 'lung'",
        column_names=["disease"],
    )
    .concat()
    .to_pandas()
)

cell_metadata_b_cell.value_counts()

[5]:

disease                                cell_type  tissue_general
lung adenocarcinoma                    B cell     lung              50228
non-small cell lung carcinoma          B cell     lung              17484
normal                                 B cell     lung              16801
squamous cell lung carcinoma           B cell     lung              11584
chronic obstructive pulmonary disease  B cell     lung               7147
interstitial lung disease 2            B cell     lung               5141
interstitial lung disease              B cell     lung               1655
pulmonary emphysema                    B cell     lung               1512
COVID-19                               B cell     lung               1445
small cell lung carcinoma              B cell     lung                583
non-specific interstitial pneumonia    B cell     lung                284
hypersensitivity pneumonitis           B cell     lung                 13
sarcoidosis                            B cell     lung                  6
Name: count, dtype: int64
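
In the output above, the columns used in value_filter (cell_type and tissue_general) appear alongside the requested disease column, so value_counts() tallies combinations of all three. If only the disease tally is wanted, one option is to select that column first. The sketch below uses a made-up toy DataFrame in place of the real query result.

```python
import pandas as pd

# Toy stand-in for the query result; the rows are invented for illustration.
toy = pd.DataFrame(
    {
        "disease": ["normal", "COVID-19", "normal"],
        "cell_type": ["B cell"] * 3,
        "tissue_general": ["lung"] * 3,
    }
)

# Counting a single column gives a Series keyed by disease only.
disease_counts = toy["disease"].value_counts()
print(disease_counts.to_dict())
# prints: {'normal': 2, 'COVID-19': 1}
```

The same selection would apply to the real result, for example cell_metadata_b_cell["disease"].value_counts().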

Querying gene metadata (var)

The human gene metadata of the Census is located at census["census_data"]["homo_sapiens"].ms["RNA"].var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().

The mouse gene metadata is at census["census_data"]["mus_musculus"].ms["RNA"].var.

Let’s take a look at the metadata available for column selection and row filtering.

[6]:

keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys

[6]:

['soma_joinid', 'feature_id', 'feature_name', 'feature_length']

With the exception of soma_joinid these columns are defined in the Census schema. Similarly to the cell metadata, we can use the same operations to learn and fetch gene metadata.

For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.

[7]:

gene_metadata = (
    census["census_data"]["homo_sapiens"]
    .ms["RNA"]
    .var.read(
        value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        column_names=["feature_name", "feature_length"],
    )
    .concat()
    .to_pandas()
)

gene_metadata

[7]:

	feature_name	feature_length	feature_id
0	TUBB4B	2037	ENSG00000188229
1	AQP5	1884	ENSG00000161798

Querying expression data

A convenient way to query and fetch expression data is to use the get_anndata method of the cellxgene_census API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.

The method returns an anndata.AnnData object. It takes as input a census object and a string naming the organism. For both cell and gene metadata, filters and column selection are specified as described above, but with the following arguments:

column_names: a dictionary with two keys, obs and var, whose values are lists of strings indicating the columns to select for cell and gene metadata respectively.
obs_value_filter: Python expression with selection conditions to fetch cells meeting the given criteria. For full details see tiledb.QueryCondition (https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).
var_value_filter: Python expression with selection conditions to fetch genes meeting the given criteria. Details as above.

For example, let’s fetch the expression data for:

Genes "ENSG00000161798" and "ENSG00000188229".
All cells of cell_type "B cell" in "lung" with "COVID-19".
All gene metadata, plus the sex cell metadata.

[8]:

adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19'",
    column_names={"obs": ["sex"]},
)

And now we can take a look at the results.

[9]:

adata

[9]:

AnnData object with n_obs × n_vars = 1445 × 2
    obs: 'sex', 'cell_type', 'tissue_general', 'disease'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

[10]:

adata.obs

[10]:

	sex	cell_type	tissue_general	disease
0	male	B cell	lung	COVID-19
1	male	B cell	lung	COVID-19
2	male	B cell	lung	COVID-19
3	male	B cell	lung	COVID-19
4	male	B cell	lung	COVID-19
...	...	...	...	...
1440	male	B cell	lung	COVID-19
1441	male	B cell	lung	COVID-19
1442	male	B cell	lung	COVID-19
1443	male	B cell	lung	COVID-19
1444	male	B cell	lung	COVID-19

1445 rows × 4 columns

[11]:

adata.var

[11]:

	soma_joinid	feature_id	feature_name	feature_length
0	9938	ENSG00000188229	TUBB4B	2037
1	12150	ENSG00000161798	AQP5	1884

For a full description of get_anndata(), refer to help(cellxgene_census.get_anndata).

Don’t forget to close the census!

[12]:

census.close()
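
As an alternative to calling close() explicitly, the census can be opened in a with block so that it is closed automatically. The sketch below wraps the obs query pattern from this tutorial in a helper; read_obs is a hypothetical name used only for illustration, and calling it requires the cellxgene_census package and network access.

```python
def read_obs(organism, value_filter=None, column_names=None):
    """Open the Census, read a slice of obs into pandas, then close the Census.

    Illustrative helper, not part of the cellxgene_census API.
    """
    # Imported inside the function so that defining the helper does not
    # itself require the package or a network connection.
    import cellxgene_census

    # The with block closes the census on exit, even if the read fails.
    with cellxgene_census.open_soma() as census:
        obs = census["census_data"][organism].obs
        return (
            obs.read(value_filter=value_filter, column_names=column_names)
            .concat()
            .to_pandas()
        )
```

For example, read_obs("homo_sapiens", value_filter="sex == 'unknown'", column_names=["cell_type"]) would return a pandas.DataFrame of the matching cells.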