Axis Query Example

Goal: demonstrate basic axis metadata handling using Pandas.

The Census stores obs (cell) metadata in a SOMA DataFrame, which can be queried and read as a Pandas DataFrame. The cellxgene_census package also provides a convenience API that simplifies opening the Census.

A Pandas DataFrame is an in-memory object, so take care that query results are small enough to fit in memory.

Opening the Census

The cellxgene_census Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. The context manager form is preferred because it closes the Census automatically, even if an error is raised.

You can learn more about the cellxgene_census methods by accessing their documentation via help(). For example, help(cellxgene_census.open_soma).

[1]:
import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or, directly open the census (don't forget to close it!)
census = cellxgene_census.open_soma()
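
If the context manager form is not practical, a try/finally block gives the same guarantee that the Census is closed after an error. The following is a minimal sketch, not part of the cellxgene_census API: run_and_close and FakeCensus are hypothetical names, and FakeCensus is only a stand-in so the snippet runs without network access.

```python
def run_and_close(open_fn, work):
    """Open a resource, run work(resource), and close it even if work raises."""
    resource = open_fn()
    try:
        return work(resource)
    finally:
        # Runs on success and on failure, like leaving a `with` block.
        resource.close()


class FakeCensus:
    """Hypothetical stand-in for an opened Census (illustration only)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


fake = FakeCensus()
try:
    # The work step fails on purpose to show that close() still runs.
    run_and_close(lambda: fake, lambda census: 1 / 0)
except ZeroDivisionError:
    pass

print(fake.closed)  # True
```

With the real package, the same helper could be called as run_and_close(cellxgene_census.open_soma, work), where work is any function that takes the opened Census.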

Summarize Census cell metadata

Tips:

  • You can read an entire SOMA DataFrame into a Pandas DataFrame using soma_df.read().concat().to_pandas(), allowing the use of the standard Pandas API.

  • Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., column_names=["cell_type_ontology_term_id"]).

  • You can also further refine query results by using a value_filter, which will filter the census for matching records.

Summarize all cell types

This example reads the cell metadata (obs) into a Pandas DataFrame, and summarizes it in several ways using the Pandas API.

[2]:
human = census["census_data"]["homo_sapiens"]

# Read one obs column, for all cells, into a Pandas DataFrame.
obs_df = human.obs.read(column_names=["cell_type_ontology_term_id"]).concat().to_pandas()

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

# Display only the first 10, as there are a LOT!
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Using Pandas API, count the instances of each cell type term and return the top 10.
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print("\nThe top 10 cell types and their counts are:")
print(top_10)
There are 588 cell types in the Census! The first 10 are: ['CL:0000649', 'CL:0002187', 'CL:0000148', 'CL:0000312', 'CL:0000242', 'CL:0000988', 'CL:2000092', 'CL:0002189', 'CL:0000499', 'CL:0000623']

The top 10 cell types and their counts are:
cell_type_ontology_term_id
CL:0000540    5186967
CL:0000679    1889047
CL:0000128    1744548
CL:0000235    1458460
CL:0000624    1343608
CL:0000625    1285307
CL:0000860    1251739
CL:0000057    1075053
CL:0000623    1040446
CL:0000236     986032
Name: count, dtype: int64
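
When a column is too large to hold comfortably in memory, the same counts can be accumulated chunk by chunk instead of concatenating everything first. The sketch below uses only the standard library and toy data. count_terms is a hypothetical helper, and the commented Census usage assumes that read() can be iterated to yield Arrow tables one at a time.

```python
from collections import Counter


def count_terms(chunks):
    """Tally values across an iterable of chunks, one chunk at a time."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk)
    return counts


# Toy stand-in for per-chunk column values (not real Census data).
toy_chunks = [
    ["CL:0000540", "CL:0000235"],
    ["CL:0000540", "CL:0000624", "CL:0000540"],
]

counts = count_terms(toy_chunks)
print(f"{len(counts)} unique terms; most common: {counts.most_common(1)}")

# With an open Census, the chunks could come from the read iterator
# (assumption: each item yielded by read() is a pyarrow.Table):
#   column = "cell_type_ontology_term_id"
#   chunks = (tbl[column].to_pylist() for tbl in human.obs.read(column_names=[column]))
#   counts = count_terms(chunks)
```

Only the running tally and the current chunk are held at once, at the cost of slower Python-level counting compared with Pandas value_counts().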

Summarize a subset of cell types, selected with a value_filter

This example utilizes a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result using Pandas.

[3]:
# Count cell_type occurrences for cells with tissue == 'lung'
human = census["census_data"]["homo_sapiens"]

# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE = "UBERON:0002048"

obs_df = (
    human.obs.read(
        column_names=["cell_type_ontology_term_id"],
        value_filter=f"tissue_ontology_term_id == '{LUNG_TISSUE}'",
    )
    .concat()
    .to_pandas()
)

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Use Pandas API to count, and grab 10 most common
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print(f"\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}")
print(top_10)
There are 179 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:1000271', 'CL:0000583', 'CL:0002063', 'CL:0000359', 'CL:1000143', 'CL:0000235', 'CL:4023054', 'CL:0000451', 'CL:0000623', 'CL:0000115']

Top 10 cell types where tissue_ontology_term_id == UBERON:0002048
cell_type_ontology_term_id
CL:0000235    524362
CL:0000583    330014
CL:0000624    273098
CL:0000625    251614
CL:0000623    168957
CL:0000003    168203
CL:0002063    163561
CL:0000860    160365
CL:0001064    149067
CL:0002632    126058
Name: count, dtype: int64

You can also define much more complex value filters. For example:

  • combine terms with and and or

  • use the in operator to query on multiple values

[4]:
# You can also do more complex queries, such as testing for inclusion in a list of values
human = census["census_data"]["homo_sapiens"]

VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]

obs_df = (
    human.obs.read(
        column_names=["cell_type_ontology_term_id"],
        value_filter=f"tissue_ontology_term_id in {VENTRICLES}",
    )
    .concat()
    .to_pandas()
)

# Use Pandas API to summarize
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
display(top_10)
cell_type_ontology_term_id
CL:0000746    159096
CL:0008034     84750
CL:0002548     79618
CL:0000115     64190
CL:0002131     61830
CL:0000763     32088
CL:0000669     27515
CL:0000003     22707
CL:0000057     20117
CL:0002144     18593
Name: count, dtype: int64
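
The query above uses only the in operator. As a hedged sketch of combining it with a second term via and, the snippet below builds the filter string with two small helpers. in_clause and eq_clause are hypothetical names, not part of any library, and the sex column is an assumption about the obs schema that does not appear elsewhere in this notebook.

```python
def in_clause(column, values):
    """Build a membership clause such as "col in ['a', 'b']"."""
    # repr() of a list of str gives the quoted list literal used in the filter.
    return f"{column} in {list(values)!r}"


def eq_clause(column, value):
    """Build an equality clause such as "col == 'a'"."""
    return f"{column} == {value!r}"


VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]

# Join simple clauses with `and`; every clause must match for a cell to be returned.
value_filter = " and ".join(
    [
        in_clause("tissue_ontology_term_id", VENTRICLES),
        eq_clause("sex", "female"),  # assumed obs column, for illustration only
    ]
)
print(value_filter)
```

The resulting string can be passed as value_filter= to human.obs.read(), in the same way as the hand-written filters above.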

Full census stats

This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.

[5]:
COLS_TO_QUERY = [
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "tissue_ontology_term_id",
]

obs_df = {
    name: experiment.obs.read(column_names=COLS_TO_QUERY).concat().to_pandas()
    for name, experiment in census["census_data"].items()
}

# Use Pandas API to summarize each organism
print(f"Complete census contains {sum(len(df) for df in obs_df.values())} cells.")
for organism, df in obs_df.items():
    print(organism)
    for col in COLS_TO_QUERY:
        print(f"\tUnique {col} values: {len(df[col].unique())}")
Complete census contains 49423515 cells.
homo_sapiens
        Unique cell_type_ontology_term_id values: 588
        Unique assay_ontology_term_id values: 20
        Unique tissue_ontology_term_id values: 216
mus_musculus
        Unique cell_type_ontology_term_id values: 211
        Unique assay_ontology_term_id values: 9
        Unique tissue_ontology_term_id values: 45

Close the census

[6]:
census.close()