data_analysis
Low risk
OmicVerse built-in datasets and mock data
OmicVerse built-in datasets: pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome, and more, plus create_mock_dataset() for generating mock data and predefined signature GMT gene sets.
---
name: datasets-loading
title: OmicVerse built-in datasets and mock data
description: "OmicVerse built-in datasets: pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome, plus create_mock_dataset() and predefined_signatures GMT gene sets."
---
# OmicVerse Built-in Datasets
`ov.datasets` provides 30+ ready-to-use datasets with automatic download, caching, and fallback to mock data. Use these instead of manually downloading files or relying on `scanpy.datasets`.
## When to Use This Module
- **Tutorials/demos**: Load standard benchmarks (PBMC3k, Paul15, dentate gyrus) with one function call
- **Testing pipelines**: Use `create_mock_dataset()` to generate synthetic data without downloads
- **Gene set analysis**: Use `predefined_signatures` for curated GMT gene sets (cell cycle, gender, mitochondrial, tissue-specific)
- **Velocity workflows**: Load pre-formatted datasets with spliced/unspliced layers
## Dataset Catalog
### Single-Cell
| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.pbmc3k()` | 2,700 | 32,738 | 10x PBMC3k (raw or processed) |
| `ov.datasets.pbmc8k()` | ~8,000 | — | 10x PBMC 8k |
| `ov.datasets.paul15()` | 2,730 | 3,451 | Myeloid progenitors |
| `ov.datasets.krumsiek11()` | 640 | 11 | Myeloid differentiation simulation |
| `ov.datasets.bone_marrow()` | 5,780 | 27,876 | Bone marrow hematopoietic |
| `ov.datasets.hematopoiesis()` | — | — | Processed hematopoiesis |
| `ov.datasets.hematopoiesis_raw()` | — | — | Raw hematopoiesis |
| `ov.datasets.sc_ref_Lymph_Node()` | ~10,000 | ~15,000 | Lymph node reference |
| `ov.datasets.bhattacherjee()` | ~5,000 | ~2,000 | Mouse PFC cocaine study |
| `ov.datasets.human_tfs()` | — | — | Human TF list (DataFrame) |
### RNA Velocity & Trajectories
| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.dentate_gyrus()` | 18,213 | 27,998 | Dentate gyrus (loom) |
| `ov.datasets.dentate_gyrus_scvelo()` | 2,930 | 13,913 | DG subset from scVelo |
| `ov.datasets.zebrafish()` | 4,181 | 16,940 | Zebrafish developmental |
| `ov.datasets.pancreatic_endocrinogenesis()` | — | — | Pancreatic epithelial |
| `ov.datasets.pancreas_cellrank()` | 2,930 | 13,913 | Pancreas cellrank benchmark |
| `ov.datasets.scnt_seq_neuron_splicing()` | 13,476 | 44,021 | scNT-seq neuron splicing |
| `ov.datasets.scnt_seq_neuron_labeling()` | 3,060 | 24,078 | scNT-seq neuron labeling |
| `ov.datasets.sceu_seq_rpe1()` | ~2,930 | ~13,913 | scEU-seq RPE1 |
| `ov.datasets.sceu_seq_organoid()` | 3,831 | 9,157 | scEU-seq organoid |
| `ov.datasets.haber()` | 7,216 | 27,998 | Intestinal epithelium |
| `ov.datasets.chromaffin()` | — | — | Chromaffin cell lineage |
| `ov.datasets.hg_forebrain_glutamatergic()` | 1,720 | 32,738 | Human forebrain |
| `ov.datasets.toggleswitch()` | 200 | 2 | Two-gene simulation |
### Spatial & Multiome
| Function | Description |
|----------|-------------|
| `ov.datasets.seqfish()` | SeqFISH spatial transcriptomics |
| `ov.datasets.multi_brain_5k()` | 10x E18 mouse brain multiome (MuData) |
### Bulk RNA-seq & Deconvolution
| Function | Description |
|----------|-------------|
| `ov.datasets.burczynski06()` | UC/CD PBMC bulk (127 samples) |
| `ov.datasets.moignard15()` | Embryo hematopoiesis qRT-PCR |
| `ov.datasets.decov_bulk_covid_bulk()` | COVID-19 PBMC bulk |
| `ov.datasets.decov_bulk_covid_single()` | COVID-19 PBMC single-cell ref |
### Synthetic
| Function | Description |
|----------|-------------|
| `ov.datasets.create_mock_dataset()` | Configurable synthetic scRNA-seq |
| `ov.datasets.blobs()` | Gaussian blob clusters |
## Mock Data Generation
Use `create_mock_dataset()` when you need data without network access or for pipeline testing:
```python
import omicverse as ov
# Basic mock dataset
adata = ov.datasets.create_mock_dataset(
    n_cells=2000,
    n_genes=1500,
    n_cell_types=6,
    with_clustering=False,
    random_state=42,
)
# adata.obs: cell_type, sample_id, condition, tissue
# adata.var: gene_symbols, highly_variable

# With full preprocessing (normalized, PCA, UMAP, leiden)
adata = ov.datasets.create_mock_dataset(
    n_cells=5000,
    n_genes=3000,
    n_cell_types=10,
    with_clustering=True,
)
```
**Features:**
- Negative binomial expression distribution
- Cell-type-specific marker genes (2-5x expression multiplier)
- Gene names: `Gene_0001`, `Gene_0002`, ...
- `with_clustering=True` adds: normalization, HVG, scaling, PCA, UMAP, leiden
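To make the features above concrete, here is a minimal standard-library sketch of how negative binomial counts with cell-type-specific marker multipliers could be simulated. It is not OmicVerse's implementation: `make_mock_counts`, `markers_per_type`, the baseline mean of 1.0, and the dispersion of 2.0 are all assumptions chosen for illustration. Only the 2-5x multiplier range and the `Gene_0001` naming scheme come from the list above.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's algorithm (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(rng, mean, dispersion):
    """Negative binomial as a gamma-Poisson mixture with the given mean."""
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rng, rate)


def make_mock_counts(n_cells=60, n_genes=12, n_cell_types=3,
                     markers_per_type=2, seed=42):
    """Hypothetical helper: returns (counts, gene_names, cell_types) as plain lists."""
    if n_cell_types * markers_per_type > n_genes:
        raise ValueError("not enough genes to give every cell type its own markers")
    rng = random.Random(seed)
    gene_names = [f"Gene_{i + 1:04d}" for i in range(n_genes)]
    cell_types = [f"type_{c % n_cell_types}" for c in range(n_cells)]

    # Each cell type gets its own block of marker genes with a 2-5x multiplier.
    multiplier = {}
    for t in range(n_cell_types):
        for g in range(t * markers_per_type, (t + 1) * markers_per_type):
            multiplier[(t, g)] = rng.uniform(2.0, 5.0)

    counts = []
    for c in range(n_cells):
        t = c % n_cell_types
        row = [
            # Assumed baseline mean 1.0 and dispersion 2.0 for every gene.
            sample_negative_binomial(rng, 1.0 * multiplier.get((t, g), 1.0), 2.0)
            for g in range(n_genes)
        ]
        counts.append(row)
    return counts, gene_names, cell_types


counts, gene_names, cell_types = make_mock_counts()
print(len(counts), "cells x", len(gene_names), "genes; first gene:", gene_names[0])
```

Fixing the seed makes the output reproducible within an environment, which is the same role `random_state` plays in the `create_mock_dataset()` call shown earlier.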
## Predefined Gene Set Signatures
Pre-loaded GMT files for common scoring tasks:
```python
from omicverse.datasets import predefined_signatures, load_signatures_from_file
# Available signature keys
print(list(predefined_signatures.keys()))
# ['cell_cycle_human', 'cell_cycle_mouse', 'gender_human', 'gender_mouse',
# 'mitochondrial_genes_human', 'mitochondrial_genes_mouse',
# 'ribosomal_genes_human', 'ribosomal_genes_mouse',
# 'apoptosis_human', 'apoptosis_mouse',
# 'human_lung', 'mouse_lung', 'mouse_brain', 'mouse_liver', 'emt_human']
# Load a signature → dict[str, list[str]]
cell_cycle = load_signatures_from_file(predefined_signatures['cell_cycle_human'])
# {'S_genes': ['MCM5', 'PCNA', ...], 'G2M_genes': ['HMGB2', 'CDK1', ...]}
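# Use with scoring (assumes `adata` is an existing AnnData with human gene
# symbols that has already been normalized and log-transformed)
import scanpy as sc
sc.tl.score_genes_cell_cycle(adata, s_genes=cell_cycle['S_genes'],
                             g2m_genes=cell_cycle['G2M_genes'])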
```
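For readers unfamiliar with GMT, the sketch below shows how a file in the usual tab-separated layout (set name, description, then genes) maps to the `dict[str, list[str]]` shape described above. `parse_gmt` is a hypothetical stand-in, not the code behind `load_signatures_from_file`, and the description strings are placeholders. Only the set names and the four gene symbols are taken from the example output above.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {set_name: [gene, ...]}.

    Assumes the conventional layout: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with no name or no genes are skipped.
    """
    signatures = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 3 or not fields[0]:
            continue
        name, _description, *genes = fields
        signatures[name] = [gene for gene in genes if gene]
    return signatures


# Placeholder GMT content; a real file would list many more genes per set.
example_gmt = (
    "S_genes\tplaceholder description\tMCM5\tPCNA\n"
    "G2M_genes\tplaceholder description\tHMGB2\tCDK1\n"
)
signatures = parse_gmt(example_gmt)
print(signatures["S_genes"])
```

An empty result from a parser like this usually means the path pointed at the wrong file or the file is not tab-delimited, which is worth checking before suspecting the gene sets themselves.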
## Critical API Reference
```python
import omicverse as ov

# CORRECT: use ov.datasets for standard benchmarks
adata = ov.datasets.pbmc3k()
# WRONG: manually downloading what's already built-in
# import urllib.request
# urllib.request.urlretrieve('https://...', 'pbmc3k.h5ad') # unnecessary!
# adata = ov.read('pbmc3k.h5ad')
# CORRECT: pbmc3k(processed=True) for pre-processed version
adata = ov.datasets.pbmc3k(processed=True)
# WRONG: loading raw then manually preprocessing for a demo
# adata = ov.datasets.pbmc3k()
# sc.pp.normalize_total(adata) # unnecessary if you just need a quick demo
# CORRECT: mock data for testing (no network needed)
adata = ov.datasets.create_mock_dataset(n_cells=500, n_genes=200)
# WRONG: creating synthetic data manually with numpy
# X = np.random.poisson(1, (500, 200)) # missing metadata, layers, etc.
```
## Caching Behavior
- **Default cache directory:** `./data/` (relative to working directory)
- **Skip if exists:** All functions check for existing files before downloading
- **Mirror fallback:** Stanford and Figshare mirrors for reliability
- **Mock fallback:** Most functions generate mock data if download fails (network issues)
- **`var_names_make_unique()`** called automatically after loading
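The order of checks described above (existing file first, then mirrors, then mock data) can be sketched with the standard library alone. This is an illustration of the pattern, not OmicVerse's loader: `load_with_cache`, the fetcher callables, and the byte payloads are all hypothetical, and no network access is involved.

```python
import tempfile
from pathlib import Path


def load_with_cache(filename, fetchers, make_mock, cache_dir="./data"):
    """Return (payload, source), where source is 'cache', 'download', or 'mock'."""
    cache_dir = Path(cache_dir)
    target = cache_dir / filename

    # 1. Skip the download entirely if the file is already cached.
    if target.exists():
        return target.read_bytes(), "cache"

    # 2. Try each mirror in order; the first success is written to the cache.
    cache_dir.mkdir(parents=True, exist_ok=True)
    for fetch in fetchers:
        try:
            payload = fetch()
        except OSError:  # ConnectionError and TimeoutError are OSError subclasses
            continue
        target.write_bytes(payload)
        return payload, "download"

    # 3. Every mirror failed: fall back to generated mock data.
    return make_mock(), "mock"


def failing_mirror():
    raise ConnectionError("mirror unreachable")


def working_mirror():
    return b"fake-h5ad-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("demo.h5ad", [failing_mirror, working_mirror],
                            lambda: b"mock", cache_dir=tmp)
    second = load_with_cache("demo.h5ad", [failing_mirror],
                             lambda: b"mock", cache_dir=tmp)
    third = load_with_cache("other.h5ad", [failing_mirror],
                            lambda: b"mock", cache_dir=tmp)

print(first[1], second[1], third[1])
```

Returning the source alongside the payload is a deliberate choice in this sketch: with a mock fallback in play, downstream code benefits from being able to tell real data from synthetic data.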
## Troubleshooting
- **Download timeout / 403 error**: Some datasets use `download_data_requests()` with custom headers. If the error persists, manually download the file to `./data/` under the expected filename, and the function will find it.
- **`ModuleNotFoundError: No module named 'muon'`** when calling `multi_brain_5k()`: Install muon: `pip install muon`. This function returns MuData, not AnnData.
- **Mock dataset has no `.raw` or `layers['counts']`**: Add manually after creation: `ov.utils.store_layers(adata, layers='counts')` and `adata.raw = adata`.
- **`load_signatures_from_file` returns empty dict**: Verify the GMT file path. Use `predefined_signatures['key']` which resolves to the bundled file via `importlib.resources`.
- **Dentate gyrus loom download is slow**: The loom file is large (~200MB). Use `ov.datasets.dentate_gyrus_scvelo()` for the smaller pre-processed subset (2,930 cells).
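For the optional-dependency case above, a small pre-flight check can report what is missing before a loader is called. The sketch uses only `importlib.util.find_spec` from the standard library. `missing_optional_deps` is a hypothetical helper name, and it is meant for top-level package names such as `muon`.

```python
import importlib.util


def missing_optional_deps(required):
    """Return the top-level package names in `required` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


missing = missing_optional_deps(["muon"])
if missing:
    print("Install before calling multi_brain_5k(): pip install " + " ".join(missing))
else:
    print("muon is available; multi_brain_5k() returns MuData, not AnnData")
```

Checking with `find_spec` avoids importing the package just to test for it, which keeps the check fast and free of import side effects.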
## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Downloads: `tqdm`, `requests` (for mirror fallback)
- Multiome: `muon` (only for `multi_brain_5k()`)
- Signatures: `importlib.resources` (stdlib)
## Examples
- "Load the PBMC3k dataset and run the standard preprocessing pipeline."
- "Create a mock dataset with 5000 cells and 8 cell types for testing my clustering workflow."
- "Load cell cycle gene signatures and score my adata for S and G2M phase genes."
## References
- Quick copy/paste commands: [`reference.md`](reference.md)