Querying and fetching the single-cell data and cell/gene metadata.
This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including pandas.DataFrame and anndata.AnnData.
Contents
Opening the census.
Querying cell metadata (obs).
Querying gene metadata (var).
Querying expression data.
Opening the census
The cellxgene_census Python package provides a convenient API to open the latest version of the Census.
[1]:
import cellxgene_census
census = cellxgene_census.open_soma()
You can learn more about the cellxgene_census methods by reading their documentation via help(). For example, help(cellxgene_census.open_soma).
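The census can also be opened inside a with block, so that it is closed automatically when the block ends. The sketch below assumes that open_soma() accepts a census_version argument (for example "stable") and that the object it returns works as a context manager. The helper name list_organisms is hypothetical and only serves as illustration.

```python
def list_organisms(census_version="stable"):
    """Return the organism keys under census["census_data"] (hypothetical helper)."""
    # Imported lazily so the helper can be defined without the package installed.
    import cellxgene_census

    # The with block closes the census on exit, even if an error is raised.
    with cellxgene_census.open_soma(census_version=census_version) as census:
        return sorted(census["census_data"].keys())
```

Calling list_organisms() needs network access. Pinning census_version instead of relying on the default can make an analysis easier to reproduce later.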
Querying cell metadata (obs)
The human cell metadata of the Census, for RNA assays, is located at census["census_data"]["homo_sapiens"].obs. This is a SOMADataFrame and as such it can be materialized as a pandas.DataFrame by chaining the methods read().concat().to_pandas().
The mouse cell metadata is at census["census_data"]["mus_musculus"].obs.
For slicing the cell metadata there are two relevant arguments that can be passed through read():
- column_names: list of strings indicating which metadata columns to fetch.
- value_filter: Python expression with selection conditions to fetch rows. It is similar to [pandas.DataFrame.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html); for full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition). In short:
  - Expressions are one or more comparisons.
  - Comparisons are one of <column> <op> <value> or <column> <op> <column>.
  - Expressions can combine comparisons using and, or, & or |.
  - op is one of < | > | <= | >= | == | != or in.
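Since a value_filter is a plain string, it can be assembled programmatically from the rules above. The following is a minimal sketch: build_value_filter is a hypothetical helper, not part of cellxgene_census, and it assumes that values contain no quote characters that would need escaping.

```python
def build_value_filter(conditions):
    """Join column conditions with 'and' into a value_filter string.

    A single value becomes `column == value`; a list or tuple becomes
    `column in [...]`. Hypothetical helper, not part of cellxgene_census.
    """
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Membership test against several allowed values.
            parts.append(f"{column} in {list(value)!r}")
        else:
            # Equality test; repr() adds the quotes around strings.
            parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


obs_filter = build_value_filter({"cell_type": "B cell", "tissue_general": "lung"})
print(obs_filter)  # cell_type == 'B cell' and tissue_general == 'lung'
```

The resulting string can be passed as value_filter to read().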
To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.
[2]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())
keys
[2]:
['soma_joinid',
'dataset_id',
'assay',
'assay_ontology_term_id',
'cell_type',
'cell_type_ontology_term_id',
'development_stage',
'development_stage_ontology_term_id',
'disease',
'disease_ontology_term_id',
'donor_id',
'is_primary_data',
'self_reported_ethnicity',
'self_reported_ethnicity_ontology_term_id',
'sex',
'sex_ontology_term_id',
'suspension_type',
'tissue',
'tissue_ontology_term_id',
'tissue_general',
'tissue_general_ontology_term_id']
soma_joinid is a special SOMADataFrame column that is used for join operations. The definitions of all other columns can be found in the Census schema.
All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.
For example, let’s see which values are available for sex. To do this we can read the metadata of all cells but fetch only the sex column.
[3]:
sex_cell_metadata = census["census_data"]["homo_sapiens"].obs.read(column_names=["sex"]).concat().to_pandas()
sex_cell_metadata.drop_duplicates()
[3]:
| | sex |
|---|---|
| 0 | male |
| 24982 | female |
| 154744 | unknown |
As you can see there are only three different values for sex, that is "male", "female" and "unknown".
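Loading a whole column just to list its distinct values can use a lot of memory. A possible alternative, sketched below, iterates over the result of read() instead of calling concat(), assuming that each item it yields is a pyarrow.Table. The helper name unique_values is hypothetical.

```python
def unique_values(soma_dataframe, column):
    """Collect the distinct values of one column, one chunk at a time."""
    seen = set()
    # read() is iterated directly instead of calling concat(), so only one
    # chunk (assumed to be a pyarrow.Table) is held in memory at a time.
    for table in soma_dataframe.read(column_names=[column]):
        seen.update(table.column(column).unique().to_pylist())
    return seen
```

With an open census this could be called as unique_values(census["census_data"]["homo_sapiens"].obs, "sex").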
With this information we can fetch all cell metadata for a specific sex value, for example "unknown".
[4]:
cell_metadata_all_unknown_sex = (
    census["census_data"]["homo_sapiens"].obs.read(value_filter="sex == 'unknown'").concat().to_pandas()
)
cell_metadata_all_unknown_sex
[4]:
| | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | is_primary_data | self_reported_ethnicity | self_reported_ethnicity_ontology_term_id | sex | sex_ontology_term_id | suspension_type | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 154744 | 36c867a7-be10-4e69-9b39-5de12b0af6da | 10x 3' v1 | EFO:0009901 | plasma cell | CL:0000786 | human early adulthood stage | HsapDv:0000088 | Crohn ileitis | MONDO:0000709 | ... | True | unknown | unknown | unknown | unknown | cell | ileum | UBERON:0002116 | small intestine | UBERON:0002108 |
| 1 | 154745 | 36c867a7-be10-4e69-9b39-5de12b0af6da | 10x 3' v1 | EFO:0009901 | native cell | CL:0000003 | human early adulthood stage | HsapDv:0000088 | Crohn ileitis | MONDO:0000709 | ... | True | unknown | unknown | unknown | unknown | cell | ileum | UBERON:0002116 | small intestine | UBERON:0002108 |
| 2 | 154746 | 36c867a7-be10-4e69-9b39-5de12b0af6da | 10x 3' v1 | EFO:0009901 | alpha-beta T cell | CL:0000789 | human early adulthood stage | HsapDv:0000088 | Crohn ileitis | MONDO:0000709 | ... | True | unknown | unknown | unknown | unknown | cell | ileum | UBERON:0002116 | small intestine | UBERON:0002108 |
| 3 | 154747 | 36c867a7-be10-4e69-9b39-5de12b0af6da | 10x 3' v1 | EFO:0009901 | native cell | CL:0000003 | human early adulthood stage | HsapDv:0000088 | Crohn ileitis | MONDO:0000709 | ... | True | unknown | unknown | unknown | unknown | cell | ileum | UBERON:0002116 | small intestine | UBERON:0002108 |
| 4 | 154748 | 36c867a7-be10-4e69-9b39-5de12b0af6da | 10x 3' v1 | EFO:0009901 | plasma cell | CL:0000786 | human early adulthood stage | HsapDv:0000088 | Crohn ileitis | MONDO:0000709 | ... | True | unknown | unknown | unknown | unknown | cell | ileum | UBERON:0002116 | small intestine | UBERON:0002108 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2082502 | 45465783 | e5233a94-9e43-418c-8209-6f1400c31530 | 10x 3' v2 | EFO:0009899 | activated CD8-positive, alpha-beta T cell | CL:0000906 | unknown | unknown | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | cell | body of stomach | UBERON:0001161 | stomach | UBERON:0000945 |
| 2082503 | 45465784 | e5233a94-9e43-418c-8209-6f1400c31530 | 10x 3' v2 | EFO:0009899 | gamma-delta T cell | CL:0000798 | unknown | unknown | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | cell | body of stomach | UBERON:0001161 | stomach | UBERON:0000945 |
| 2082504 | 45465785 | e5233a94-9e43-418c-8209-6f1400c31530 | 10x 3' v2 | EFO:0009899 | activated CD8-positive, alpha-beta T cell | CL:0000906 | unknown | unknown | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | cell | body of stomach | UBERON:0001161 | stomach | UBERON:0000945 |
| 2082505 | 45465786 | e5233a94-9e43-418c-8209-6f1400c31530 | 10x 3' v2 | EFO:0009899 | natural T-regulatory cell | CL:0000903 | unknown | unknown | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | cell | body of stomach | UBERON:0001161 | stomach | UBERON:0000945 |
| 2082506 | 45465787 | e5233a94-9e43-418c-8209-6f1400c31530 | 10x 3' v2 | EFO:0009899 | fibroblast | CL:0000057 | unknown | unknown | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | cell | body of stomach | UBERON:0001161 | stomach | UBERON:0000945 |
2082507 rows × 21 columns
You can use both column_names and value_filter to perform specific queries. For example, let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung".
[5]:
cell_metadata_b_cell = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        value_filter="cell_type == 'B cell' and tissue_general == 'lung'",
        column_names=["disease"],
    )
    .concat()
    .to_pandas()
)
cell_metadata_b_cell.value_counts()
[5]:
disease cell_type tissue_general
lung adenocarcinoma B cell lung 50228
non-small cell lung carcinoma B cell lung 17484
normal B cell lung 16801
squamous cell lung carcinoma B cell lung 11584
chronic obstructive pulmonary disease B cell lung 7147
interstitial lung disease 2 B cell lung 5141
interstitial lung disease B cell lung 1655
pulmonary emphysema B cell lung 1512
COVID-19 B cell lung 1445
small cell lung carcinoma B cell lung 583
non-specific interstitial pneumonia B cell lung 284
hypersensitivity pneumonitis B cell lung 13
sarcoidosis B cell lung 6
Name: count, dtype: int64
Querying gene metadata (var)
The human gene metadata of the Census is located at census["census_data"]["homo_sapiens"].ms["RNA"].var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().
The mouse gene metadata is at census["census_data"]["mus_musculus"].ms["RNA"].var.
Let’s take a look at the metadata available for column selection and row filtering.
[6]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())
keys
[6]:
['soma_joinid', 'feature_id', 'feature_name', 'feature_length']
With the exception of soma_joinid these columns are defined in the Census schema. Similarly to the cell metadata, we can use the same operations to learn and fetch gene metadata.
For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.
[7]:
gene_metadata = (
    census["census_data"]["homo_sapiens"]
    .ms["RNA"]
    .var.read(
        value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        column_names=["feature_name", "feature_length"],
    )
    .concat()
    .to_pandas()
)
gene_metadata
[7]:
| | feature_name | feature_length | feature_id |
|---|---|---|---|
| 0 | TUBB4B | 2037 | ENSG00000188229 |
| 1 | AQP5 | 1884 | ENSG00000161798 |
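The same pattern works when only gene symbols are known, by filtering on feature_name instead of feature_id. This is a sketch under the assumption that census is an open census object as above; fetch_genes_by_name is a hypothetical helper, not part of the cellxgene_census API.

```python
def fetch_genes_by_name(census, gene_names, organism="homo_sapiens"):
    """Return gene metadata rows whose feature_name is in gene_names."""
    # Build a filter such as: feature_name in ['AQP5', 'TUBB4B']
    value_filter = f"feature_name in {list(gene_names)!r}"
    var = census["census_data"][organism].ms["RNA"].var
    return (
        var.read(
            value_filter=value_filter,
            column_names=["feature_id", "feature_name", "feature_length"],
        )
        .concat()
        .to_pandas()
    )
```

For instance, fetch_genes_by_name(census, ["AQP5", "TUBB4B"]) looks up the two genes from the table above by name, which is a way to find the feature_id values needed for later filters.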
Querying expression data
A convenient way to query and fetch expression data is to use the get_anndata method of the cellxgene_census API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.
The method returns an anndata.AnnData object. It takes as input a census object and the string for an organism. For both cell and gene metadata we can specify filters and column selection as described above, using the following arguments:
- column_names: a dictionary with two keys, obs and var, whose values are lists of strings indicating the columns to select for cell and gene metadata respectively.
- obs_value_filter: Python expression with selection conditions to fetch cells meeting a criterion. For full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).
- var_value_filter: Python expression with selection conditions to fetch genes meeting a criterion. Details as above.
For example, if we want to fetch the expression data for:
- Genes "ENSG00000161798" and "ENSG00000188229".
- All "B cells" of "lung" with "COVID-19".
- With all gene metadata and adding sex cell metadata.
[8]:
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19'",
    column_names={"obs": ["sex"]},
)
And now we can take a look at the results.
[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 1445 × 2
obs: 'sex', 'cell_type', 'tissue_general', 'disease'
var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
[10]:
adata.obs
[10]:
| | sex | cell_type | tissue_general | disease |
|---|---|---|---|---|
| 0 | male | B cell | lung | COVID-19 |
| 1 | male | B cell | lung | COVID-19 |
| 2 | male | B cell | lung | COVID-19 |
| 3 | male | B cell | lung | COVID-19 |
| 4 | male | B cell | lung | COVID-19 |
| ... | ... | ... | ... | ... |
| 1440 | male | B cell | lung | COVID-19 |
| 1441 | male | B cell | lung | COVID-19 |
| 1442 | male | B cell | lung | COVID-19 |
| 1443 | male | B cell | lung | COVID-19 |
| 1444 | male | B cell | lung | COVID-19 |
1445 rows × 4 columns
[11]:
adata.var
[11]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 9938 | ENSG00000188229 | TUBB4B | 2037 |
| 1 | 12150 | ENSG00000161798 | AQP5 | 1884 |
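For queries larger than this one, it can help to check how many cells a filter matches before fetching any expression data. The sketch below reads only the soma_joinid column from obs and counts the rows. count_matching_cells is a hypothetical helper, and it uses the "homo_sapiens" key of census["census_data"] rather than the "Homo sapiens" string passed to get_anndata.

```python
def count_matching_cells(census, obs_value_filter, organism="homo_sapiens"):
    """Count the cells matching a filter by reading only soma_joinid."""
    obs = census["census_data"][organism].obs
    # Fetching a single narrow column keeps the read small.
    table = obs.read(
        value_filter=obs_value_filter,
        column_names=["soma_joinid"],
    ).concat()
    # concat() is assumed to return a pyarrow.Table, which exposes num_rows.
    return table.num_rows
```

Passing the same obs_value_filter that is later given to get_anndata gives the expected n_obs of the resulting AnnData object.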
For a full description of get_anndata(), refer to help(cellxgene_census.get_anndata).
Don’t forget to close the census!
[12]:
census.close()