Querying and fetching the single-cell data and cell/gene metadata.

This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including pandas.DataFrame and anndata.AnnData.

Contents

  1. Opening the census.

  2. Querying cell metadata (obs).

  3. Querying gene metadata (var).

  4. Querying expression data.

Opening the census

The cellxgene_census Python package provides a convenient API to open the latest version of the Census.

[1]:
import cellxgene_census

census = cellxgene_census.open_soma()

You can learn more about the cellxgene_census methods by accessing their documentation via help(), for example help(cellxgene_census.open_soma).

Querying cell metadata (obs)

The human cell metadata of the Census, for RNA assays, is located at census["census_data"]["homo_sapiens"].obs. This is a SOMADataFrame and as such it can be materialized as a pandas.DataFrame by chaining the methods read().concat().to_pandas().

The mouse cell metadata is at census["census_data"]["mus_musculus"].obs.

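For slicing the cell metadata there are two relevant arguments that can be passed through read():

  • column_names: a list of strings naming the metadata columns to fetch.

  • value_filter: a string with the conditions that rows must meet to be fetched, for example "sex == 'unknown'".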

To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.

[2]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys
[2]:
['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']

soma_joinid is a special SOMADataFrame column that is used for join operations. The definitions of all other columns can be found in the Census schema.

All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.

For example, let’s see what values are available for sex. To do this, we can read the cell metadata for all cells while fetching only the sex column.

[3]:
sex_cell_metadata = census["census_data"]["homo_sapiens"].obs.read(column_names=["sex"]).concat().to_pandas()

sex_cell_metadata.drop_duplicates()
[3]:
sex
0 male
24982 female
154744 unknown

As you can see there are only three different values for sex, that is "male", "female" and "unknown".

With this information we can fetch all cell metadata for a specific sex value, for example "unknown".

[4]:
cell_metadata_all_unknown_sex = (
    census["census_data"]["homo_sapiens"].obs.read(value_filter="sex == 'unknown'").concat().to_pandas()
)

cell_metadata_all_unknown_sex
[4]:
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id ... is_primary_data self_reported_ethnicity self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id
0 154744 36c867a7-be10-4e69-9b39-5de12b0af6da 10x 3' v1 EFO:0009901 plasma cell CL:0000786 human early adulthood stage HsapDv:0000088 Crohn ileitis MONDO:0000709 ... True unknown unknown unknown unknown cell ileum UBERON:0002116 small intestine UBERON:0002108
1 154745 36c867a7-be10-4e69-9b39-5de12b0af6da 10x 3' v1 EFO:0009901 native cell CL:0000003 human early adulthood stage HsapDv:0000088 Crohn ileitis MONDO:0000709 ... True unknown unknown unknown unknown cell ileum UBERON:0002116 small intestine UBERON:0002108
2 154746 36c867a7-be10-4e69-9b39-5de12b0af6da 10x 3' v1 EFO:0009901 alpha-beta T cell CL:0000789 human early adulthood stage HsapDv:0000088 Crohn ileitis MONDO:0000709 ... True unknown unknown unknown unknown cell ileum UBERON:0002116 small intestine UBERON:0002108
3 154747 36c867a7-be10-4e69-9b39-5de12b0af6da 10x 3' v1 EFO:0009901 native cell CL:0000003 human early adulthood stage HsapDv:0000088 Crohn ileitis MONDO:0000709 ... True unknown unknown unknown unknown cell ileum UBERON:0002116 small intestine UBERON:0002108
4 154748 36c867a7-be10-4e69-9b39-5de12b0af6da 10x 3' v1 EFO:0009901 plasma cell CL:0000786 human early adulthood stage HsapDv:0000088 Crohn ileitis MONDO:0000709 ... True unknown unknown unknown unknown cell ileum UBERON:0002116 small intestine UBERON:0002108
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2082502 45465783 e5233a94-9e43-418c-8209-6f1400c31530 10x 3' v2 EFO:0009899 activated CD8-positive, alpha-beta T cell CL:0000906 unknown unknown normal PATO:0000461 ... False unknown unknown unknown unknown cell body of stomach UBERON:0001161 stomach UBERON:0000945
2082503 45465784 e5233a94-9e43-418c-8209-6f1400c31530 10x 3' v2 EFO:0009899 gamma-delta T cell CL:0000798 unknown unknown normal PATO:0000461 ... False unknown unknown unknown unknown cell body of stomach UBERON:0001161 stomach UBERON:0000945
2082504 45465785 e5233a94-9e43-418c-8209-6f1400c31530 10x 3' v2 EFO:0009899 activated CD8-positive, alpha-beta T cell CL:0000906 unknown unknown normal PATO:0000461 ... False unknown unknown unknown unknown cell body of stomach UBERON:0001161 stomach UBERON:0000945
2082505 45465786 e5233a94-9e43-418c-8209-6f1400c31530 10x 3' v2 EFO:0009899 natural T-regulatory cell CL:0000903 unknown unknown normal PATO:0000461 ... False unknown unknown unknown unknown cell body of stomach UBERON:0001161 stomach UBERON:0000945
2082506 45465787 e5233a94-9e43-418c-8209-6f1400c31530 10x 3' v2 EFO:0009899 fibroblast CL:0000057 unknown unknown normal PATO:0000461 ... False unknown unknown unknown unknown cell body of stomach UBERON:0001161 stomach UBERON:0000945

2082507 rows × 21 columns
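As a rough analogy, value_filter selects rows and column_names selects columns, much like a query followed by a column selection on a pandas.DataFrame. The sketch below mimics this on a small made-up table (the rows are invented for illustration and are not Census data). It is only an analogy: the Census applies the filter as part of the read, whereas here the whole table is already in memory.

```python
import pandas as pd

# Toy stand-in for the obs table; these rows are invented for illustration.
obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3],
        "sex": ["male", "female", "unknown", "unknown"],
        "tissue_general": ["lung", "lung", "stomach", "lung"],
    }
)

# Rough pandas analogue of
#   read(value_filter="sex == 'unknown'", column_names=["soma_joinid", "sex"])
subset = obs.query("sex == 'unknown'")[["soma_joinid", "sex"]]

print(subset)
```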

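You can combine column_names and value_filter to perform specific queries. For example, let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung". Note that the columns used in value_filter (cell_type and tissue_general) are also returned, as the output shows.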

[5]:
cell_metadata_b_cell = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        value_filter="cell_type == 'B cell' and tissue_general == 'lung'",
        column_names=["disease"],
    )
    .concat()
    .to_pandas()
)

cell_metadata_b_cell.value_counts()
[5]:
disease                                cell_type  tissue_general
lung adenocarcinoma                    B cell     lung              50228
non-small cell lung carcinoma          B cell     lung              17484
normal                                 B cell     lung              16801
squamous cell lung carcinoma           B cell     lung              11584
chronic obstructive pulmonary disease  B cell     lung               7147
interstitial lung disease 2            B cell     lung               5141
interstitial lung disease              B cell     lung               1655
pulmonary emphysema                    B cell     lung               1512
COVID-19                               B cell     lung               1445
small cell lung carcinoma              B cell     lung                583
non-specific interstitial pneumonia    B cell     lung                284
hypersensitivity pneumonitis           B cell     lung                 13
sarcoidosis                            B cell     lung                  6
Name: count, dtype: int64
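Filter strings like the one above can also be assembled programmatically when the conditions come from variables. The helper below is a minimal sketch: build_value_filter is a hypothetical name and is not part of cellxgene_census. It only covers equality and membership conditions joined with "and", and relies on Python's repr() to quote the values.

```python
def build_value_filter(conditions):
    """Build a value_filter string from a {column: value} mapping.

    Hypothetical helper, not part of cellxgene_census.
    A list or tuple value becomes an "in" clause, anything else an "==" clause.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


value_filter = build_value_filter({"cell_type": "B cell", "tissue_general": "lung"})
print(value_filter)
```

The resulting string is the same as the value_filter passed to read() in the query above, so it could be used in its place.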

Querying gene metadata (var)

The human gene metadata of the Census is located at census["census_data"]["homo_sapiens"].ms["RNA"].var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().

The mouse gene metadata is at census["census_data"]["mus_musculus"].ms["RNA"].var.

Let’s take a look at the metadata available for column selection and row filtering.

[6]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys
[6]:
['soma_joinid', 'feature_id', 'feature_name', 'feature_length']

With the exception of soma_joinid, these columns are defined in the Census schema. As with the cell metadata, we can use the same operations to explore and fetch gene metadata.

For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.

[7]:
gene_metadata = (
    census["census_data"]["homo_sapiens"]
    .ms["RNA"]
    .var.read(
        value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        column_names=["feature_name", "feature_length"],
    )
    .concat()
    .to_pandas()
)

gene_metadata
[7]:
feature_name feature_length feature_id
0 TUBB4B 2037 ENSG00000188229
1 AQP5 1884 ENSG00000161798
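In the output above the rows do not follow the order in which the genes were listed in the filter. If the order matters downstream, the result can be reindexed with pandas. The sketch below rebuilds the two-row result shown above by hand so that it runs without a Census connection; the variable names requested and ordered are introduced here for illustration.

```python
import pandas as pd

# The two rows returned by the query above, re-typed by hand.
gene_metadata = pd.DataFrame(
    {
        "feature_name": ["TUBB4B", "AQP5"],
        "feature_length": [2037, 1884],
        "feature_id": ["ENSG00000188229", "ENSG00000161798"],
    }
)

# Gene IDs in the order they were written in the value_filter.
requested = ["ENSG00000161798", "ENSG00000188229"]

# Index by feature_id, pick rows in the requested order, restore the column.
ordered = gene_metadata.set_index("feature_id").loc[requested].reset_index()

print(ordered)
```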

Querying expression data

A convenient way to query and fetch expression data is to use the get_anndata method of the cellxgene_census API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.

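The method returns an anndata.AnnData object. It takes as input a census object and the string for an organism. For both cell and gene metadata, filters and column selection can be specified as described above, using the following arguments:

  • column_names: a dictionary with the keys "obs" and "var", whose values are lists of the cell and gene metadata columns to select, respectively.

  • obs_value_filter: a string with the conditions that cells must meet to be fetched.

  • var_value_filter: a string with the conditions that genes must meet to be fetched.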

For example, the call below fetches the expression data for:

  • Genes "ENSG00000161798" and "ENSG00000188229".

  • All "B cells" of "lung" with "COVID-19".

  • All gene metadata, plus the sex cell metadata column.

[8]:
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19'",
    column_names={"obs": ["sex"]},
)

And now we can take a look at the results.

[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 1445 × 2
    obs: 'sex', 'cell_type', 'tissue_general', 'disease'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
[10]:
adata.obs
[10]:
sex cell_type tissue_general disease
0 male B cell lung COVID-19
1 male B cell lung COVID-19
2 male B cell lung COVID-19
3 male B cell lung COVID-19
4 male B cell lung COVID-19
... ... ... ... ...
1440 male B cell lung COVID-19
1441 male B cell lung COVID-19
1442 male B cell lung COVID-19
1443 male B cell lung COVID-19
1444 male B cell lung COVID-19

1445 rows × 4 columns

[11]:
adata.var
[11]:
soma_joinid feature_id feature_name feature_length
0 9938 ENSG00000188229 TUBB4B 2037
1 12150 ENSG00000161798 AQP5 1884

For a full description of get_anndata(), refer to help(cellxgene_census.get_anndata).

Don’t forget to close the census!

[12]:
census.close()
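As an alternative to calling close() explicitly, the object returned by open_soma() can be used as a context manager, which closes the Census when the block exits, even if an error occurs. The function below is a minimal sketch: count_cells is a hypothetical helper, not part of the package, and calling it requires network access to the Census.

```python
def count_cells(value_filter, organism="homo_sapiens"):
    """Count the cells matching value_filter (hypothetical helper).

    The Census is opened inside a with block, so it is closed
    automatically when the block exits.
    """
    # Imported here so that defining the function does not require the package.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        obs = census["census_data"][organism].obs
        # Fetch only soma_joinid to keep the result small.
        table = obs.read(value_filter=value_filter, column_names=["soma_joinid"]).concat()
    return table.num_rows


# Example call (needs network access to the Census):
# count_cells("sex == 'unknown'")
```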