Learning about the CELLxGENE Census

The Census is a versioned container for the single-cell data hosted at CELLxGENE Discover. The Census utilizes SOMA powered by TileDB for storing, accessing, and efficiently filtering data.

This notebook shows you how to learn about the Census contents and how to query it.

Contents

  • Opening the Census

  • Census organization

  • Census summary content tables

  • Cell metadata

  • Gene metadata

  • Understanding Census contents beyond the summary tables

Opening the Census

The cellxgene_census Python package provides a convenient API to open the latest version of the Census. If you open the Census, you should also close it. open_soma() returns a context, so you can open and close it in several ways, much like a Python file handle. The context manager is preferred, as it closes the Census automatically even if an error is raised.

[1]:
import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or
census = cellxgene_census.open_soma()
...
census.close()
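
To illustrate why the context manager is the safer option, here is a minimal sketch built on a stand-in class. FakeCensus is hypothetical and only mimics the open/close behaviour; it is not part of cellxgene_census. It shows that the with block closes the handle even when the body raises, and that the manual style needs try/finally to give the same guarantee.

```python
class FakeCensus:
    """Hypothetical stand-in for the object returned by open_soma()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    # Context-manager protocol: __exit__ runs on normal exit and on errors.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow the exception


handle = FakeCensus()
try:
    with handle as census_like:
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    pass

print(handle.closed)  # the handle was closed despite the error

# Manual open/close needs try/finally to offer the same guarantee.
manual = FakeCensus()
try:
    pass  # work with the handle here
finally:
    manual.close()
```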

You can learn more about the cellxgene_census methods by accessing their documentation via help(), for example help(cellxgene_census.open_soma).

[2]:
census = cellxgene_census.open_soma()

Census organization

The Census schema defines the structure of the Census. In short, you can think of the Census as a structured collection of items, each storing a different piece of information. All of these items and the parent collection are SOMA objects of various types, and all can be accessed with the SOMA API.

The cellxgene_census package contains some convenient wrappers around the SOMA API. One example is the function we used to open the Census: cellxgene_census.open_soma().

Main Census components

With the command above you created census, which is a SOMACollection. It is analogous to a Python dictionary, and it has two items: census_info and census_data.

Census summary info

  • census["census_info"]: A collection of tables providing information about the Census as a whole.

    • census["census_info"]["summary"]: High-level information of this Census, e.g. build date, total cell count, etc.

    • census["census_info"]["datasets"]: A table with all datasets from CELLxGENE Discover used to create the Census.

    • census["census_info"]["summary_cell_counts"]: Cell counts stratified by relevant cell metadata

Census data

Data for each organism is stored in an independent SOMAExperiment object, which is a specialized form of a SOMACollection. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.

This is how the data is organized for one organism – Homo sapiens:

  • census["census_data"]["homo_sapiens"].obs: Cell metadata

  • census["census_data"]["homo_sapiens"].ms["RNA"].X: Data matrices; currently only raw counts exist, in X["raw"]

  • census["census_data"]["homo_sapiens"].ms["RNA"].var: Gene metadata
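
The layout described in this section can be pictured as nested mappings. The sketch below uses a plain dictionary as a stand-in, not a real SOMA object; the layout dictionary and the list_paths helper are illustrative names only. It simply enumerates every path in the structure. On the real objects, obs, ms, var and X are accessed as attributes, as written in the bullets of this section.

```python
from collections.abc import Mapping

# Plain-dict stand-in mirroring the structure described in this section.
layout = {
    "census_info": {
        "summary": "table",
        "datasets": "table",
        "summary_cell_counts": "table",
    },
    "census_data": {
        "homo_sapiens": {
            "obs": "cell metadata",
            "ms": {"RNA": {"X": {"raw": "counts matrix"}, "var": "gene metadata"}},
        },
    },
}


def list_paths(node, prefix=()):
    """Yield (path, leaf) pairs for every leaf in a nested mapping."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, Mapping):
            yield from list_paths(value, path)
        else:
            yield path, value


paths = dict(list_paths(layout))
for path, leaf in paths.items():
    print("/".join(path), "->", leaf)
```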

Census summary content tables

You can take a quick look at the high-level Census information by reading census["census_info"]["summary"].

[3]:
census_info = census["census_info"]["summary"].read().concat().to_pandas()

census_info
[3]:
   soma_joinid  label                       value
0            0  cell_census_schema_version  0.1.1
1            1  cell_census_build_date      2023-04-04
2            2  total_cell_count            49423515
3            3  unique_cell_count           32267841
4            4  number_donors_homo_sapiens  10361
5            5  number_donors_mus_musculus  1296

Of special interest are the following label-value combinations:

  • total_cell_count is the total number of cells in the Census.

  • unique_cell_count is the number of unique cells, as some cells may be present twice due to meta-analysis or consortia-like data.

  • number_donors_homo_sapiens and number_donors_mus_musculus are the numbers of individuals for human and mouse. These are not guaranteed to be unique, as the same individual ID may be present in more than one dataset.
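
As a quick worked example, the gap between the two cell counts tells you how many entries are duplicates. The sketch below copies two label-value pairs from the output above into a plain dictionary and does the arithmetic. It assumes the values arrive as strings, since the value column also holds a version and a date; in the notebook you would build the dictionary from the census_info DataFrame instead.

```python
# Values copied from the summary table above, kept as strings on purpose.
summary = {
    "total_cell_count": "49423515",
    "unique_cell_count": "32267841",
}

total = int(summary["total_cell_count"])
unique = int(summary["unique_cell_count"])

duplicates = total - unique        # entries that repeat a cell
unique_fraction = unique / total   # share of entries that are unique cells

print(f"duplicated entries: {duplicates:,}")
print(f"unique fraction:    {unique_fraction:.1%}")
```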

Cell counts by cell metadata

By looking at census["census_info"]["summary_cell_counts"] you can get a general idea of cell counts stratified by some relevant cell metadata. Not all cell metadata is included in this table; you can see all available cell and gene metadata in the sections “Cell metadata” and “Gene metadata” below.

The line below retrieves this table and casts it into a pandas.DataFrame.

[4]:
census_counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

census_counts
[4]:
      soma_joinid  organism      category        ontology_term_id  unique_cell_count  total_cell_count  label
0               0  Homo sapiens  all             na                         29461044          45501425  na
1               1  Homo sapiens  assay           EFO:0008722                  206279            260396  Drop-seq
2               2  Homo sapiens  assay           EFO:0008780                   25652             51304  inDrop
3               3  Homo sapiens  assay           EFO:0008913                  133511            133511  single-cell RNA sequencing
4               4  Homo sapiens  assay           EFO:0008919                   44721            161998  Seq-Well
...           ...  ...           ...             ...                             ...               ...  ...
1259         1259  Mus musculus  tissue_general  UBERON:0002113             164881            188361  kidney
1260         1260  Mus musculus  tissue_general  UBERON:0002365              15577             31154  exocrine gland
1261         1261  Mus musculus  tissue_general  UBERON:0002367              37715            130135  prostate gland
1262         1262  Mus musculus  tissue_general  UBERON:0002368              13322             26644  endocrine gland
1263         1263  Mus musculus  tissue_general  UBERON:0002371              90225            144962  bone marrow

1264 rows × 7 columns

For each combination of organism and cell metadata value, total_cell_count and unique_cell_count give the cell counts for that combination.

The values for each category are specified in ontology_term_id and label, which are the value’s IDs and labels, respectively.

Example: cell metadata included in the summary counts table

To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.

[5]:
census_counts[["organism", "category"]].value_counts(sort=False)
[5]:
organism      category
Homo sapiens  all                          1
              assay                       20
              cell_type                  588
              disease                     59
              self_reported_ethnicity     16
              sex                          3
              suspension_type              1
              tissue                     216
              tissue_general              61
Mus musculus  all                          1
              assay                        9
              cell_type                  211
              disease                      5
              self_reported_ethnicity      1
              sex                          3
              suspension_type              1
              tissue                      45
              tissue_general              23
Name: count, dtype: int64

Example: cell counts for each sequencing assay in human data

To get the cell counts for each sequencing assay type in human data, you can perform the following pandas.DataFrame operations:

[6]:
census_human_assays = census_counts.query("organism == 'Homo sapiens' & category == 'assay'")
census_human_assays.sort_values("total_cell_count", ascending=False)
[6]:
    soma_joinid  organism      category  ontology_term_id  unique_cell_count  total_cell_count  label
11           11  Homo sapiens  assay     EFO:0009922                10756095          19002323  10x 3' v3
8             8  Homo sapiens  assay     EFO:0009899                 6941085          11410903  10x 3' v2
14           14  Homo sapiens  assay     EFO:0010550                 4062980           5064268  sci-RNA-seq
15           15  Homo sapiens  assay     EFO:0011025                 3368252           5058383  10x 5' v1
9             9  Homo sapiens  assay     EFO:0009900                 1611873           1740273  10x 5' v2
17           17  Homo sapiens  assay     EFO:0030003                  651223            657933  10x 3' transcription profiling
16           16  Homo sapiens  assay     EFO:0030002                  625175            642559  microwell-seq
18           18  Homo sapiens  assay     EFO:0030004                  600929            600929  10x 5' transcription profiling
1             1  Homo sapiens  assay     EFO:0008722                  206279            260396  Drop-seq
19           19  Homo sapiens  assay     EFO:0700003                  146278            177276  BD Rhapsody Whole Transcriptome Analysis
4             4  Homo sapiens  assay     EFO:0008919                   44721            161998  Seq-Well
5             5  Homo sapiens  assay     EFO:0008931                   78750            161197  Smart-seq2
3             3  Homo sapiens  assay     EFO:0008913                  133511            133511  single-cell RNA sequencing
10           10  Homo sapiens  assay     EFO:0009901                   45317            119191  10x 3' v1
13           13  Homo sapiens  assay     EFO:0010183                   58981            117962  single cell library construction
20           20  Homo sapiens  assay     EFO:0700004                   96145             96145  BD Rhapsody Targeted mRNA
2             2  Homo sapiens  assay     EFO:0008780                   25652             51304  inDrop
7             7  Homo sapiens  assay     EFO:0008995                       0             30244  10x technology
6             6  Homo sapiens  assay     EFO:0008953                    4693              9386  STRT-seq
12           12  Homo sapiens  assay     EFO:0010010                    3105              5244  CEL-seq2
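
A small follow-up calculation: the ratio of unique_cell_count to total_cell_count shows what share of an assay's cells is counted as unique. The sketch below uses three rows copied from the output above as plain tuples, so it runs without the Census; with the DataFrame you could compute the same ratio column-wise.

```python
# (label, unique_cell_count, total_cell_count) copied from the table above.
rows = [
    ("10x 3' v3", 10756095, 19002323),
    ("Smart-seq2", 78750, 161197),
    ("BD Rhapsody Targeted mRNA", 96145, 96145),
]

# Share of each assay's cells that is counted as unique.
ratios = {label: unique / total for label, unique, total in rows}

for label, ratio in ratios.items():
    print(f"{label:<28} {ratio:.2f}")
```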

Example: number of microglial cells in the Census

If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.

[7]:
census_counts.query("label == 'microglial cell'")
[7]:
     soma_joinid  organism      category   ontology_term_id  unique_cell_count  total_cell_count  label
65            65  Homo sapiens  cell_type  CL:0000129                   265305            327293  microglial cell
997          997  Mus musculus  cell_type  CL:0000129                    48998             62617  microglial cell

Cell metadata

You can obtain all cell metadata variables by directly querying the columns of the corresponding SOMADataFrame.

All of these variables can be used for querying the Census in case you want to work with specific cells.

[8]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys
[8]:
['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.

  • dataset_id: the dataset ID as encoded in census["census_info"]["datasets"].

  • tissue_general and tissue_general_ontology_term_id: the high-level tissue mapping.

Gene metadata

Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding SOMADataFrame.

These are the variables you can use for querying the Census in case there are specific genes you are interested in.

[9]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys
[9]:
['soma_joinid', 'feature_id', 'feature_name', 'feature_length']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.

  • feature_length: the length in base pairs of the gene.

Understanding Census contents beyond the summary tables

While using the pre-computed tables in census["census_info"] is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.

For example, you may want to learn more about:

  • What are the cell types available for human liver?

  • What is the total number of cells in all lung datasets, stratified by sequencing technology?

  • What is the sex distribution of all cells from brain in mouse?

  • What are the diseases available for T cells?

All of these questions can be answered by directly querying the cell metadata as shown in the examples below.
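
As a rough sketch, each of these questions maps to a set of cell metadata columns plus a filter expression. The query specifications below are illustrative: the mus_musculus key and the exact tissue labels are assumptions to verify against the Census. The final loop only checks with the standard-library ast module that each filter is a well-formed Python-style expression; it does not confirm that the Census accepts it.

```python
import ast

queries = {
    "cell types in human liver": {
        "organism": "homo_sapiens",
        "column_names": ["cell_type"],
        "value_filter": "is_primary_data == True and tissue_general == 'liver'",
    },
    "lung cells by assay": {
        "organism": "homo_sapiens",
        "column_names": ["assay"],
        "value_filter": "tissue_general == 'lung'",
    },
    "sex distribution in mouse brain": {
        "organism": "mus_musculus",  # assumed key for the mouse experiment
        "column_names": ["sex"],
        "value_filter": "is_primary_data == True and tissue_general == 'brain'",
    },
    "diseases for T cells": {
        "organism": "homo_sapiens",
        "column_names": ["disease"],
        "value_filter": (
            "is_primary_data == True and cell_type in "
            "['CD8-positive, alpha-beta T cell', 'CD4-positive, alpha-beta T cell']"
        ),
    },
}

for question, spec in queries.items():
    # Local syntax check only; raises SyntaxError if the filter is malformed.
    ast.parse(spec["value_filter"], mode="eval")
    print(question, "->", spec["column_names"], "|", spec["value_filter"])
```

Each specification would then be passed to obs.read(column_names=..., value_filter=...) in the same way as the examples in this section.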

Example: all cell types available in human

To exemplify the process of accessing and slicing cell metadata for summary stats, let’s start with a trivial example and take a look at all human cell types available in the Census:

[10]:
human_cell_types = (
    census["census_data"]["homo_sapiens"].obs.read(column_names=["cell_type", "is_primary_data"]).concat().to_pandas()
)
human_cell_types
[10]:
          cell_type                is_primary_data
0         prickle cell             True
1         basal cell of epidermis  True
2         basal cell of epidermis  True
3         prickle cell             True
4         basal cell of epidermis  True
...       ...                      ...
45501420  T cell                   True
45501421  T cell                   True
45501422  T cell                   True
45501423  neuron                   True
45501424  T cell                   True

45501425 rows × 2 columns

The number of rows is the total number of human cells. To get the cell counts per cell type, we can perform some pandas operations on this object.

In addition, we will focus only on cells marked with is_primary_data=True, as this de-duplicates cells that appear more than once in CELLxGENE Discover.

[11]:
human_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True")
    .concat()
    .to_pandas()
)

human_cell_types = human_cell_types[["cell_type"]]
human_cell_types.shape
[11]:
(29461044, 1)

This is the number of unique cells. Now let’s look at the counts per cell type:

[12]:
human_cell_type_counts = human_cell_types.value_counts()
human_cell_type_counts
[12]:
cell_type
neuron                                             2681768
glutamatergic neuron                               1546910
CD4-positive, alpha-beta T cell                    1162203
CD8-positive, alpha-beta T cell                    1132155
classical monocyte                                  980618
                                                    ...
microfold cell of epithelium of small intestine         19
mature conventional dendritic cell                      17
serous cell of epithelium of bronchus                   15
sperm                                                   11
type N enteroendocrine cell                             10
Name: count, Length: 574, dtype: int64

This shows you that the most abundant cell types are “neuron”, “glutamatergic neuron”, and “CD4-positive, alpha-beta T cell”.

Now let’s take a look at the number of unique cell types:

[13]:
human_cell_type_counts.shape
[13]:
(574,)

That is the total number of different cell types for human.
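
The filter-then-count pattern used in this example can be mimicked on a toy list of records with only the standard library. This is a sketch with made-up rows, not Census data; collections.Counter plays the role that value_counts() plays on the DataFrame.

```python
from collections import Counter

# Made-up cell records; the second "T cell" entry is a duplicate copy.
cells = [
    {"cell_type": "neuron", "is_primary_data": True},
    {"cell_type": "T cell", "is_primary_data": True},
    {"cell_type": "T cell", "is_primary_data": False},
    {"cell_type": "neuron", "is_primary_data": True},
    {"cell_type": "hepatocyte", "is_primary_data": True},
]

# Keep primary records only, then tally them by cell type.
primary = [c for c in cells if c["is_primary_data"]]
counts = Counter(c["cell_type"] for c in primary)

print(len(cells), len(primary))  # total entries vs. unique cells
print(counts.most_common())      # counts per cell type, largest first
print(len(counts))               # number of distinct cell types
```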

All the information in this example can be quickly obtained from the summary table at census["census_info"]["summary_cell_counts"].

The examples below are more complex and can only be achieved by accessing the cell metadata.

Example: cell types available in human liver

Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.

To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable tissue_general. This variable contains the high-level tissue label for all cells in the Census:

[14]:
human_liver_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True and tissue_general == 'liver'")
    .concat()
    .to_pandas()
)

human_liver_cell_types["cell_type"].value_counts()
[14]:
cell_type
hepatoblast                          58447
neoplastic cell                      52431
erythroblast                         42328
periportal region hepatocyte         23509
hepatocyte                           19694
                                     ...
group 2 innate lymphoid cell             1
myelocyte                                1
pulmonary artery endothelial cell        1
enteroendocrine cell                     1
type I pneumocyte                        1
Name: count, Length: 123, dtype: int64

These are the cell types and their cell counts in the human liver.

Example: diseased T cells in human tissues

In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on “CD8-positive, alpha-beta T cell” and “CD4-positive, alpha-beta T cell”:

[15]:
t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

t_cells_diseased = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        column_names=["disease", "tissue_general"],
        value_filter=f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'",
    )
    .concat()
    .to_pandas()
)

t_cells_diseased = t_cells_diseased[["disease", "tissue_general"]].value_counts(sort=False)
t_cells_diseased
[15]:
disease                                tissue_general
B-cell non-Hodgkin lymphoma            blood                  62499
COVID-19                               blood                 819428
                                       lung                   12647
                                       nose                      13
                                       respiratory system         4
                                       saliva                    41
Crohn disease                          colon                  17490
                                       small intestine        52029
Down syndrome                          bone marrow              181
chronic obstructive pulmonary disease  lung                    9382
follicular lymphoma                    lymph node              1089
influenza                              blood                   8871
lung adenocarcinoma                    adrenal gland            205
                                       brain                   3274
                                       liver                    507
                                       lung                  191527
                                       lymph node             24969
                                       pleural fluid          11558
non-small cell lung carcinoma          lung                   36573
respiratory system disorder            blood                  34301
squamous cell lung carcinoma           lung                   47966
                                       lymph node               100
systemic lupus erythematosus           blood                 355471
Name: count, dtype: int64

These are the cell counts annotated with the indicated disease across human tissues for “CD8-positive, alpha-beta T cell” or “CD4-positive, alpha-beta T cell”.
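
It can help to inspect the filter string that the f-string produces before sending it to the Census. The sketch below rebuilds the expression from the cell above and checks it with ast.parse. That check is a local sanity test only, and it assumes the filter grammar is Python-like, as the examples in this notebook suggest.

```python
import ast

t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

# The list is interpolated using its repr(), so each label arrives quoted.
value_filter = f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'"
print(value_filter)

# Parse as a single expression; a malformed filter raises SyntaxError here.
tree = ast.parse(value_filter, mode="eval")
print(type(tree.body).__name__, len(tree.body.values))
```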

Finally, don’t forget to close the Census!

[16]:
census.close()
del census