Querying axis metadata¶
This notebook provides examples of basic axis metadata handling using Pandas. The Census stores obs (cell) and var (gene) metadata in SOMADataFrame objects via the TileDB-SOMA API (https://github.com/single-cell-data/TileDB-SOMA), which can be queried and read into a Pandas DataFrame.
Note that a Pandas DataFrame is an in-memory object, so queries should be small enough for their results to fit in memory.
Contents
Opening the Census
Summarizing cell metadata
Example: Summarize all cell types
Example: Summarize a subset of cell types, selected with a value_filter
Full Census metadata stats
Opening the Census¶
The cellxgene_census Python package provides a convenient API for opening the latest version of the Census. If you open the Census, you should also close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. The context manager form is preferred, as it closes the Census automatically even if an error is raised.
You can learn more about the cellxgene_census methods by reading their documentation via help(), for example help(cellxgene_census.open_soma).
[1]:
import cellxgene_census
# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...
# or, directly open the census (don't forget to close it!)
census = cellxgene_census.open_soma()
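When the Census is opened directly rather than in a `with` block, a `try`/`finally` gives the same guarantee that it is closed even if the work in between raises. The following is a minimal sketch: `run_with_census` and its `open_census` parameter are hypothetical names introduced here for illustration, not part of cellxgene_census. The only library calls it assumes are `cellxgene_census.open_soma()` and the `close()` method of the returned object.

```python
def run_with_census(work, open_census=None):
    """Open the Census, run work(census), and always close it afterwards.

    Hypothetical helper for illustration. `open_census` exists only so that
    a stand-in opener can be supplied instead of the real one.
    """
    if open_census is None:
        # Imported lazily so the helper can be defined without the package.
        import cellxgene_census

        open_census = cellxgene_census.open_soma

    census = open_census()
    try:
        return work(census)
    finally:
        # Runs whether work() returned normally or raised.
        census.close()


# Example usage (needs network access to the Census, so it is not run here):
# organisms = run_with_census(lambda census: list(census["census_data"].keys()))
```

In most notebooks the `with` form shown above is simpler. A wrapper like this is mainly useful when the open and close calls cannot sit in the same block.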
Summarizing cell metadata¶
Once the Census is open, you can use TileDB-SOMA methods on it, because it is itself a SOMACollection. This gives you access to the SOMADataFrame objects that hold the cell and gene metadata.
Tips:
* You can read an entire SOMADataFrame into a Pandas DataFrame using soma_df.read().concat().to_pandas(), allowing the use of the standard Pandas API.
* Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., column_names=["cell_type_ontology_term_id"]).
* You can also further refine query results by using a value_filter, which will filter the census for matching records.
Example: Summarize all cell types¶
This example reads the cell metadata (obs) into a Pandas DataFrame and summarizes it in several ways using the Pandas API.
[2]:
human = census["census_data"]["homo_sapiens"]
# Read one column of _obs_ (the cell type term) for every cell into a Pandas DataFrame.
obs_df = human.obs.read(column_names=["cell_type_ontology_term_id"]).concat().to_pandas()
# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()
# Display only the first 10, as there are a LOT!
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)
# Using Pandas API, count the instances of each cell type term and return the top 10.
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print("\nThe top 10 cell types and their counts are:")
print(top_10)
There are 588 cell types in the Census! The first 10 are: ['CL:0000649', 'CL:0002187', 'CL:0000148', 'CL:0000312', 'CL:0000242', 'CL:0000988', 'CL:2000092', 'CL:0002189', 'CL:0000499', 'CL:0000623']
The top 10 cell types and their counts are:
cell_type_ontology_term_id
CL:0000540 5186967
CL:0000679 1889047
CL:0000128 1744548
CL:0000235 1458460
CL:0000624 1343608
CL:0000625 1285307
CL:0000860 1251739
CL:0000057 1075053
CL:0000623 1040446
CL:0000236 986032
Name: count, dtype: int64
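The example above joins every chunk of the query result before converting it to Pandas. When a result may be too large for that, one option is to consume the chunks one at a time and keep only a running tally. The sketch below assumes that `read()` yields Arrow tables (the objects that `.concat()` joins) and that each table supports `column(name).to_pylist()`, as pyarrow tables do. `count_values_chunked` is a hypothetical helper name, not a Census or TileDB-SOMA API.

```python
from collections import Counter


def count_values_chunked(tables, column):
    """Tally the values of `column` across an iterable of Arrow-like tables."""
    counts = Counter()
    for table in tables:
        # Only the current chunk is converted to Python objects.
        counts.update(table.column(column).to_pylist())
    return counts


# Example usage against an open Census (not run here):
# human = census["census_data"]["homo_sapiens"]
# chunks = human.obs.read(column_names=["cell_type_ontology_term_id"])
# counts = count_values_chunked(chunks, "cell_type_ontology_term_id")
# print(counts.most_common(10))
```

The trade-off is that the result is a plain `Counter` rather than a DataFrame, so only the tallies, not the rows, are available afterwards.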
Example: Summarize a subset of cell types, selected with a value_filter¶
This example uses a SOMA “value filter” to read only the cells whose tissue_ontology_term_id equals UBERON:0002048 (lung tissue), and summarizes the query result using Pandas.
[3]:
# Count cell_type occurrences for cells with tissue == 'lung'
human = census["census_data"]["homo_sapiens"]
# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE = "UBERON:0002048"
obs_df = (
    human.obs.read(
        column_names=["cell_type_ontology_term_id"],
        value_filter=f"tissue_ontology_term_id == '{LUNG_TISSUE}'",
    )
    .concat()
    .to_pandas()
)
# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)
# Use Pandas API to count, and grab 10 most common
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print(f"\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}")
print(top_10)
There are 179 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:1000271', 'CL:0000583', 'CL:0002063', 'CL:0000359', 'CL:1000143', 'CL:0000235', 'CL:4023054', 'CL:0000451', 'CL:0000623', 'CL:0000115']
Top 10 cell types where tissue_ontology_term_id == UBERON:0002048
cell_type_ontology_term_id
CL:0000235 524362
CL:0000583 330014
CL:0000624 273098
CL:0000625 251614
CL:0000623 168957
CL:0000003 168203
CL:0002063 163561
CL:0000860 160365
CL:0001064 149067
CL:0002632 126058
Name: count, dtype: int64
You can also define much more complex value filters. For example:
* combine terms with and and or
* use the in operator to query on multiple values
[4]:
# You can also do more complex queries, such as testing for inclusion in a list of values
human = census["census_data"]["homo_sapiens"]
VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]
obs_df = (
    human.obs.read(
        column_names=["cell_type_ontology_term_id"],
        value_filter=f"tissue_ontology_term_id in {VENTRICLES}",
    )
    .concat()
    .to_pandas()
)
# Use Pandas API to summarize
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
display(top_10)
cell_type_ontology_term_id
CL:0000746 159096
CL:0008034 84750
CL:0002548 79618
CL:0000115 64190
CL:0002131 61830
CL:0000763 32088
CL:0000669 27515
CL:0000003 22707
CL:0000057 20117
CL:0002144 18593
Name: count, dtype: int64
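The two patterns mentioned above, combining terms and testing membership in a list, can be composed with a small amount of string formatting. The following is a sketch only: `in_filter` and `all_of` are hypothetical helper names introduced for illustration, and the sketch assumes that the values contain no single quotes and that the filter syntax accepts parenthesized terms.

```python
def in_filter(column, values):
    """Build a membership test such as: col in ['a', 'b']."""
    quoted = ", ".join(f"'{value}'" for value in values)
    return f"{column} in [{quoted}]"


def all_of(*terms):
    """Join filter terms with `and`, parenthesizing each term."""
    return " and ".join(f"({term})" for term in terms)


# Lung cells restricted to two of the cell type terms seen earlier.
value_filter = all_of(
    "tissue_ontology_term_id == 'UBERON:0002048'",
    in_filter("cell_type_ontology_term_id", ["CL:0000235", "CL:0000583"]),
)
print(value_filter)
```

The resulting string can be passed as the value_filter argument of human.obs.read(), in the same way as the hand-written filters above.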
Full Census metadata stats¶
This example queries all organisms in the Census and summarizes the diversity of various metadata labels.
[5]:
COLS_TO_QUERY = [
"cell_type_ontology_term_id",
"assay_ontology_term_id",
"tissue_ontology_term_id",
]
obs_df = {
    name: experiment.obs.read(column_names=COLS_TO_QUERY).concat().to_pandas()
    for name, experiment in census["census_data"].items()
}
# Use Pandas API to summarize each organism
print(f"Complete census contains {sum(len(df) for df in obs_df.values())} cells.")
for organism, df in obs_df.items():
    print(organism)
    for col in COLS_TO_QUERY:
        print(f"\tUnique {col} values: {len(df[col].unique())}")
Complete census contains 49423515 cells.
homo_sapiens
Unique cell_type_ontology_term_id values: 588
Unique assay_ontology_term_id values: 20
Unique tissue_ontology_term_id values: 216
mus_musculus
Unique cell_type_ontology_term_id values: 211
Unique assay_ontology_term_id values: 9
Unique tissue_ontology_term_id values: 45
Close the census
[6]:
census.close()