Frequently Asked Questions for CZ CELLxGENE Discover Census¶
Last updated: Apr, 2023.
Why should I use the Census?¶
The Census provides efficient low-latency access via Python and R APIs to most single-cell RNA data from CZ CELLxGENE Discover.
To accelerate your computational research, you should use the Census if you want to:
Easily get slices of data from more than 400 single-cell datasets spanning about 50 M cells and >60 K genes from human or mouse.
Get these data with standardized and harmonized cell and gene metadata.
Easily load multi-dataset slices into Scanpy or Seurat.
Implement out-of-core (a.k.a. online) operations for larger-than-memory processes.
For example, you could easily get “all T-cells from Lung with COVID-19” into an AnnData or Seurat object, or into data chunks that fit in memory via PyArrow or R Arrow.
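As a rough illustration of the “all T-cells from Lung with COVID-19” example, the following Python sketch builds a metadata filter and passes it to `cellxgene_census.get_anndata`. The column names (`cell_type`, `tissue_general`, `disease`) and the labels used here are assumptions that should be checked against the Census schema, and `build_obs_filter` and `load_slice` are hypothetical helper names, not part of the API.

```python
def build_obs_filter(**equals):
    """Join column == 'value' terms with 'and' into a value-filter string.

    Assumes the values contain no single quotes.
    """
    return " and ".join(f"{col} == '{val}'" for col, val in equals.items())


def load_slice(organism, obs_filter):
    """Fetch the matching cells as an AnnData object (needs network access)."""
    # Imported lazily so that build_obs_filter can be used without the package.
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census,
            organism=organism,
            obs_value_filter=obs_filter,
        )


# Assumed column names and labels; verify them against the Census schema.
COVID_LUNG_T_CELLS = build_obs_filter(
    cell_type="T cell",
    tissue_general="lung",
    disease="COVID-19",
)
```

Calling `load_slice("Homo sapiens", COVID_LUNG_T_CELLS)` would then return the slice as an AnnData object. Note that an exact match on `cell_type` would not include more specific T-cell subtypes, so a real query may need a broader filter.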
You should not use the Census if you want to:
Access non-standardized cell metadata and gene metadata available in the original datasets.
Access the author-contributed normalized expression values or embeddings.
Access all data from a single dataset.
Access non-RNA or spatial data present in CZ CELLxGENE Discover as it is not yet supported in the Census.
For all of these cases you should perform web downloads from the CZ CELLxGENE Discover site; you can find instructions to do so here.
What data is contained in the Census?¶
Most non-spatial RNA data from CZ CELLxGENE Discover is included. You can see a general description of these data and their organization in the schema description, or you can use the APIs to explore the data as indicated in this tutorial.
How do I cite the use of the Census for a publication?¶
Please follow the citation guidelines offered by CZ CELLxGENE Discover.
Why does the Census not have a normalized layer or embeddings?¶
The Census does not have normalized counts or embeddings because:
The original normalized values and embeddings are not harmonized or integrated across datasets and are therefore numerically incompatible.
We have not implemented a general-purpose normalization or embedding generation method to be used across all Census data.
If you have any suggestions for methods that our team should explore, please share them with us via a feature request in the GitHub repository.
How does the Census differentiate from other services?¶
The Census differentiates itself from existing single-cell services by providing access to the largest corpus of standardized single-cell data via TileDB-SOMA.
Thus, single-cell data from about 50 M cells across >60 K genes, with 11 standardized cell metadata variables and harmonized GENCODE annotations, is at your fingertips to:
Open and read data at low latency from the cloud.
Query and access data using metadata filters.
Load and create AnnData objects.
Load and create Seurat objects.
From Python create PyArrow objects, SciPy sparse matrices, NumPy arrays, and Pandas data frames.
From R create R Arrow objects, sparse matrices (via the Matrix package), and standard data frames and (dense) matrices.
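To make the filtered, chunked access pattern above more concrete, here is a minimal Python sketch that counts the cells matching a metadata filter without loading them all at once. It assumes that reading the `obs` data frame yields an iterable of Arrow tables; `count_rows_in_chunks` and `count_matching_cells` are hypothetical helper names.

```python
def count_rows_in_chunks(chunks):
    """Sum rows over an iterable of table-like chunks, one chunk at a time."""
    total = 0
    for table in chunks:
        total += table.num_rows  # pyarrow.Table exposes num_rows
    return total


def count_matching_cells(organism_key, obs_filter):
    """Count cells matching obs_filter (needs network access).

    organism_key is assumed to be "homo_sapiens" or "mus_musculus".
    """
    import cellxgene_census

    with cellxgene_census.open_soma() as census:
        obs = census["census_data"][organism_key].obs
        # Request a single column to keep the transferred data small.
        chunks = obs.read(value_filter=obs_filter, column_names=["soma_joinid"])
        return count_rows_in_chunks(chunks)
```

For example, `count_matching_cells("homo_sapiens", "tissue_general == 'lung'")` would stream the matching rows chunk by chunk; the filter shown is only an illustration.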
Can I query human and mouse data in a single query?¶
It is not possible to query both mouse and human data in a single query, because the data from these organisms use different organism-specific gene annotations.
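One possible workaround is to run the same metadata filter once per organism and keep the results separate. The sketch below assumes the organism names accepted by `cellxgene_census.get_anndata` are "Homo sapiens" and "Mus musculus"; `organism_key` and `load_per_organism` are hypothetical helper names.

```python
ORGANISMS = ("Homo sapiens", "Mus musculus")


def organism_key(name):
    """Turn a display name such as 'Homo sapiens' into a key like 'homo_sapiens'."""
    return name.lower().replace(" ", "_")


def load_per_organism(obs_filter, organisms=ORGANISMS):
    """Run one query per organism and return a dict of AnnData objects.

    Needs network access. The results are kept separate because each
    organism has its own gene annotations.
    """
    import cellxgene_census

    results = {}
    with cellxgene_census.open_soma() as census:
        for name in organisms:
            results[organism_key(name)] = cellxgene_census.get_anndata(
                census,
                organism=name,
                obs_value_filter=obs_filter,
            )
    return results
```

Combining the two results afterwards would require a gene mapping between the organisms, which is outside what this sketch covers.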
Where are the Census data hosted?¶
The Census data is publicly hosted free-of-cost in an Amazon Web Services (AWS) S3 bucket in the us-west-2 region.
Can I retrieve the original H5AD datasets from which the Census was built?¶
Yes, you can use the API function download_source_h5ad to do so. For usage see the reference documentation at the doc-site or directly from Python or R:
Python
import cellxgene_census
help(cellxgene_census.download_source_h5ad)
R
library(cellxgene.census)
?download_source_h5ad
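Beyond reading the help text, a call from Python might look like the following sketch. It assumes that `download_source_h5ad` takes a dataset ID and a `to_path` argument; the dataset ID must come from the Census dataset metadata, and `h5ad_destination` and `fetch_source_h5ad` are hypothetical helper names.

```python
from pathlib import Path


def h5ad_destination(dataset_id, directory="."):
    """Build a local file path of the form <directory>/<dataset_id>.h5ad."""
    return str(Path(directory) / f"{dataset_id}.h5ad")


def fetch_source_h5ad(dataset_id, directory="."):
    """Download the source H5AD for one dataset (needs network access)."""
    import cellxgene_census

    to_path = h5ad_destination(dataset_id, directory)
    cellxgene_census.download_source_h5ad(dataset_id, to_path=to_path)
    return to_path
```

The returned path could then be opened with any H5AD reader, such as the one in the AnnData package.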
How can I increase the performance of my queries?¶
Since the data are accessed over the internet, the main limiting factors for data queries are usually bandwidth and client location.
We recommend the following to increase query efficiency:
Utilize a computer connected to high-speed internet.
Utilize an ethernet connection and not a wifi connection.
If possible, utilize cloud computing located on the west coast of the US.
Highly recommended: EC2 AWS instances in the us-west-2 region.
Can I use conda to install the Census Python API?¶
There is no conda package available for cellxgene-census. However, you can use conda in combination with pip to install the package in a conda environment:
conda create -n census_env python=3.10
conda activate census_env
pip install cellxgene-census
How can I ask for support?¶
You can either submit a GitHub issue or post in the Slack channel #cellxgene-census-users at the CZI Slack community.
How can I ask for new features?¶
You can submit a feature request in the GitHub repository.
How can I contribute my data to the Census?¶
To inquire about submitting your data to CZ CELLxGENE Discover, you need to follow these instructions.
If your data request is accepted, upon submission the data will automatically get included in the Census if it meets the biological criteria defined in the Census schema.
Why do I get an ArraySchema error when opening the Census?¶
You may get this error if you are trying to open a Census data build with an old version of the Census API. Please update your Python or R Census package.
If the error persists, please file a GitHub issue.
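Before updating or filing an issue, it can help to know which version of the Python package is installed. The sketch below uses only the Python standard library and assumes the distribution name is `cellxgene-census`, as in the pip command above; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(distribution="cellxgene-census"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None
```

Including the value returned by `installed_version()` in an issue report may make the problem easier to reproduce.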