OmicVerse built-in datasets and mock data

OmicVerse built-in datasets: pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome, and more, plus create_mock_dataset() for generating mock data and predefined signature GMT gene sets.

---
name: datasets-loading
title: OmicVerse built-in datasets and mock data
description: "OmicVerse built-in datasets: pbmc3k, pancreas, dentategyrus, zebrafish, immune, spatial, multiome, plus create_mock_dataset() and predefined_signatures GMT gene sets."
---

# OmicVerse Built-in Datasets

`ov.datasets` provides 30+ ready-to-use datasets with automatic download, caching, and fallback to mock data. Use these instead of manually downloading files or relying on `scanpy.datasets`.

## When to Use This Module

- **Tutorials/demos**: Load standard benchmarks (PBMC3k, Paul15, dentate gyrus) with one function call
- **Testing pipelines**: Use `create_mock_dataset()` to generate synthetic data without downloads
- **Gene set analysis**: Use `predefined_signatures` for curated GMT gene sets (cell cycle, gender, mitochondrial, tissue-specific)
- **Velocity workflows**: Load pre-formatted datasets with spliced/unspliced layers

## Dataset Catalog

### Single-Cell

| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.pbmc3k()` | 2,700 | 32,738 | 10x PBMC3k (raw or processed) |
| `ov.datasets.pbmc8k()` | ~8,000 | — | 10x PBMC 8k |
| `ov.datasets.paul15()` | 2,730 | 3,451 | Myeloid progenitors |
| `ov.datasets.krumsiek11()` | 640 | 11 | Myeloid differentiation simulation |
| `ov.datasets.bone_marrow()` | 5,780 | 27,876 | Bone marrow hematopoietic |
| `ov.datasets.hematopoiesis()` | — | — | Processed hematopoiesis |
| `ov.datasets.hematopoiesis_raw()` | — | — | Raw hematopoiesis |
| `ov.datasets.sc_ref_Lymph_Node()` | ~10,000 | ~15,000 | Lymph node reference |
| `ov.datasets.bhattacherjee()` | ~5,000 | ~2,000 | Mouse PFC cocaine study |
| `ov.datasets.human_tfs()` | — | — | Human TF list (DataFrame) |

### RNA Velocity & Trajectories

| Function | Cells | Genes | Description |
|----------|-------|-------|-------------|
| `ov.datasets.dentate_gyrus()` | 18,213 | 27,998 | Dentate gyrus (loom) |
| `ov.datasets.dentate_gyrus_scvelo()` | 2,930 | 13,913 | DG subset from scVelo |
| `ov.datasets.zebrafish()` | 4,181 | 16,940 | Zebrafish developmental |
| `ov.datasets.pancreatic_endocrinogenesis()` | — | — | Pancreatic epithelial |
| `ov.datasets.pancreas_cellrank()` | 2,930 | 13,913 | Pancreas cellrank benchmark |
| `ov.datasets.scnt_seq_neuron_splicing()` | 13,476 | 44,021 | scNT-seq neuron splicing |
| `ov.datasets.scnt_seq_neuron_labeling()` | 3,060 | 24,078 | scNT-seq neuron labeling |
| `ov.datasets.sceu_seq_rpe1()` | ~2,930 | ~13,913 | scEU-seq RPE1 |
| `ov.datasets.sceu_seq_organoid()` | 3,831 | 9,157 | scEU-seq organoid |
| `ov.datasets.haber()` | 7,216 | 27,998 | Intestinal epithelium |
| `ov.datasets.chromaffin()` | — | — | Chromaffin cell lineage |
| `ov.datasets.hg_forebrain_glutamatergic()` | 1,720 | 32,738 | Human forebrain |
| `ov.datasets.toggleswitch()` | 200 | 2 | Two-gene simulation |

### Spatial & Multiome

| Function | Description |
|----------|-------------|
| `ov.datasets.seqfish()` | SeqFISH spatial transcriptomics |
| `ov.datasets.multi_brain_5k()` | 10x E18 mouse brain multiome (MuData) |

### Bulk RNA-seq & Deconvolution

| Function | Description |
|----------|-------------|
| `ov.datasets.burczynski06()` | UC/CD PBMC bulk (127 samples) |
| `ov.datasets.moignard15()` | Embryo hematopoiesis qRT-PCR |
| `ov.datasets.decov_bulk_covid_bulk()` | COVID-19 PBMC bulk |
| `ov.datasets.decov_bulk_covid_single()` | COVID-19 PBMC single-cell ref |

### Synthetic

| Function | Description |
|----------|-------------|
| `ov.datasets.create_mock_dataset()` | Configurable synthetic scRNA-seq |
| `ov.datasets.blobs()` | Gaussian blob clusters |

## Mock Data Generation

Use `create_mock_dataset()` when you need data without network access or for pipeline testing:

```python
import omicverse as ov

# Basic mock dataset
adata = ov.datasets.create_mock_dataset(
    n_cells=2000,
    n_genes=1500,
    n_cell_types=6,
    with_clustering=False,
    random_state=42,
)
# adata.obs: cell_type, sample_id, condition, tissue
# adata.var: gene_symbols, highly_variable

# With full preprocessing (normalized, PCA, UMAP, leiden)
adata = ov.datasets.create_mock_dataset(
    n_cells=5000,
    n_genes=3000,
    n_cell_types=10,
    with_clustering=True,
)
```

**Features:**
- Negative binomial expression distribution
- Cell-type-specific marker genes (2-5x expression multiplier)
- Gene names: `Gene_0001`, `Gene_0002`, ...
- `with_clustering=True` adds: normalization, HVG, scaling, PCA, UMAP, leiden
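
The features listed above can be illustrated with a small NumPy sketch. This is not OmicVerse's implementation: the function `simulate_counts`, its parameters, the gamma-distributed baseline, and the dispersion value are all assumptions chosen for illustration. It only shows what "negative binomial counts with 2-5x marker multipliers" can look like in practice.

```python
import numpy as np


def simulate_counts(n_cells=300, n_genes=50, n_cell_types=3,
                    markers_per_type=5, dispersion=2.0, seed=42):
    """Draw negative binomial counts with cell-type-specific marker boosts.

    Hypothetical helper for illustration; not part of OmicVerse.
    """
    rng = np.random.default_rng(seed)
    # Assign each cell a type label and each gene a baseline mean expression.
    labels = rng.integers(0, n_cell_types, size=n_cells)
    base_mean = rng.gamma(shape=2.0, scale=1.0, size=n_genes)

    # The per-cell, per-gene mean starts at the gene baseline.
    mean = np.tile(base_mean, (n_cells, 1))
    for t in range(n_cell_types):
        # Each type gets its own block of marker genes, boosted 2-5x.
        markers = np.arange(t * markers_per_type, (t + 1) * markers_per_type)
        fold = rng.uniform(2.0, 5.0, size=markers.size)
        mean[np.ix_(labels == t, markers)] *= fold

    # NumPy parameterizes the distribution by (n, p) with mean n(1-p)/p,
    # so p = n / (n + mean) yields the requested mean.
    p = dispersion / (dispersion + mean)
    counts = rng.negative_binomial(dispersion, p)
    gene_names = [f"Gene_{i + 1:04d}" for i in range(n_genes)]
    return counts, labels, gene_names


counts, labels, gene_names = simulate_counts()
print(counts.shape, gene_names[:2])
```

A matrix like this has no `obs`/`var` metadata or embeddings attached, which is the gap `create_mock_dataset()` is described as filling.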

## Predefined Gene Set Signatures

Pre-loaded GMT files for common scoring tasks:

```python
from omicverse.datasets import predefined_signatures, load_signatures_from_file

# Available signature keys
print(list(predefined_signatures.keys()))
# ['cell_cycle_human', 'cell_cycle_mouse', 'gender_human', 'gender_mouse',
#  'mitochondrial_genes_human', 'mitochondrial_genes_mouse',
#  'ribosomal_genes_human', 'ribosomal_genes_mouse',
#  'apoptosis_human', 'apoptosis_mouse',
#  'human_lung', 'mouse_lung', 'mouse_brain', 'mouse_liver', 'emt_human']

# Load a signature → dict[str, list[str]]
cell_cycle = load_signatures_from_file(predefined_signatures['cell_cycle_human'])
# {'S_genes': ['MCM5', 'PCNA', ...], 'G2M_genes': ['HMGB2', 'CDK1', ...]}

# Use with scoring (adata: a normalized, log-transformed AnnData
# whose var_names are human gene symbols)
import scanpy as sc
sc.tl.score_genes_cell_cycle(adata, s_genes=cell_cycle['S_genes'],
                              g2m_genes=cell_cycle['G2M_genes'])
```
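
For readers unfamiliar with the GMT format, each row is tab-separated: a set name, a free-text description, then the member genes. The sketch below is a standalone parser written for illustration; `parse_gmt` is a hypothetical name and is not claimed to match how `load_signatures_from_file` works internally. The toy file contents are made up for the example.

```python
import tempfile
from pathlib import Path


def parse_gmt(path):
    """Parse a GMT file into {set_name: [gene, ...]} (illustrative helper)."""
    signatures = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # A valid GMT row has a name, a description, and at least one gene.
            if len(fields) < 3:
                continue
            name, _description, *genes = fields
            signatures[name] = [g for g in genes if g]
    return signatures


# Write a tiny two-set GMT file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    gmt_path = Path(tmp) / "toy_cell_cycle.gmt"
    gmt_path.write_text(
        "S_genes\ttoy S-phase set\tMCM5\tPCNA\n"
        "G2M_genes\ttoy G2/M set\tHMGB2\tCDK1\n",
        encoding="utf-8",
    )
    toy = parse_gmt(gmt_path)

print(toy)
# {'S_genes': ['MCM5', 'PCNA'], 'G2M_genes': ['HMGB2', 'CDK1']}
```

Rows with fewer than three fields are skipped, so an empty result usually means the file is not tab-separated or the path points at the wrong file.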

## Critical API Reference

```python
import omicverse as ov

# CORRECT: use ov.datasets for standard benchmarks
adata = ov.datasets.pbmc3k()

# WRONG: manually downloading what's already built-in
# import urllib.request
# urllib.request.urlretrieve('https://...', 'pbmc3k.h5ad')  # unnecessary!
# adata = ov.read('pbmc3k.h5ad')

# CORRECT: pbmc3k(processed=True) for pre-processed version
adata = ov.datasets.pbmc3k(processed=True)

# WRONG: loading raw then manually preprocessing for a demo
# adata = ov.datasets.pbmc3k()
# sc.pp.normalize_total(adata)  # unnecessary if you just need a quick demo

# CORRECT: mock data for testing (no network needed)
adata = ov.datasets.create_mock_dataset(n_cells=500, n_genes=200)

# WRONG: creating synthetic data manually with numpy
# X = np.random.poisson(1, (500, 200))  # missing metadata, layers, etc.
```

## Caching Behavior

- **Default cache directory:** `./data/` (relative to working directory)
- **Skip if exists:** All functions check for existing files before downloading
- **Mirror fallback:** Stanford and Figshare mirrors for reliability
- **Mock fallback:** Most functions generate mock data if the download fails (for example, because of network issues)
- **`var_names_make_unique()`** called automatically after loading
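
The order of checks described above (cached file, then mirrors, then mock data) can be sketched with the standard library alone. Everything here is an assumption for illustration: `fetch_with_cache`, the injected `download` and `make_mock` callables, and the `.invalid` mirror URLs are hypothetical and do not reflect OmicVerse's internal code.

```python
import tempfile
from pathlib import Path


def fetch_with_cache(filename, mirrors, download, make_mock, cache_dir="./data"):
    """Return (result, source), trying cache, then mirrors, then mock data."""
    target = Path(cache_dir) / filename
    # 1. Skip the download entirely when the file is already cached.
    if target.exists():
        return target, "cache"
    target.parent.mkdir(parents=True, exist_ok=True)
    # 2. Try each mirror in turn; the first success wins.
    for url in mirrors:
        try:
            download(url, target)
            return target, url
        except OSError:
            continue  # simulated or real network failure: try the next mirror
    # 3. Every mirror failed, so fall back to synthetic data.
    return make_mock(), "mock"


def flaky_download(url, target):
    # Hypothetical downloader: the first mirror always fails.
    if "mirror-a" in url:
        raise OSError("simulated timeout")
    Path(target).write_text("payload")


with tempfile.TemporaryDirectory() as tmp:
    mirrors = ["https://mirror-a.invalid/toy.h5ad",
               "https://mirror-b.invalid/toy.h5ad"]
    first = fetch_with_cache("toy.h5ad", mirrors, flaky_download, dict, cache_dir=tmp)
    second = fetch_with_cache("toy.h5ad", mirrors, flaky_download, dict, cache_dir=tmp)
    offline = fetch_with_cache("other.h5ad", mirrors[:1], flaky_download, dict, cache_dir=tmp)
    sources = (first[1], second[1], offline[1])

print(sources)
```

Returning the source alongside the result makes it easy to tell whether a run used real or mock data, which matters when a silent fallback would otherwise go unnoticed.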

## Troubleshooting

- **Download timeout / 403 error**: Some datasets use `download_data_requests()` with custom headers. If the error persists, manually download the file to `./data/` under the expected filename; the function will then find it and skip the download.
- **`ModuleNotFoundError: No module named 'muon'`** when calling `multi_brain_5k()`: Install muon: `pip install muon`. This function returns MuData, not AnnData.
- **Mock dataset has no `.raw` or `layers['counts']`**: Add them manually right after creation, while `adata.X` still holds the raw counts: `ov.utils.store_layers(adata, layers='counts')` (or, with plain AnnData, `adata.layers['counts'] = adata.X.copy()`) and `adata.raw = adata`.
- **`load_signatures_from_file` returns empty dict**: Verify the GMT file path. Prefer `predefined_signatures['key']`, which resolves to the bundled file via `importlib.resources`.
- **Dentate gyrus loom download is slow**: The loom file is large (~200MB). Use `ov.datasets.dentate_gyrus_scvelo()` for the smaller pre-processed subset (2,930 cells).

## Dependencies
- Core: `omicverse`, `scanpy`, `anndata`, `numpy`, `pandas`
- Downloads: `tqdm`, `requests` (for mirror fallback)
- Multiome: `muon` (only for `multi_brain_5k()`)
- Signatures: `importlib.resources` (stdlib)
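
Because `muon` is only needed for `multi_brain_5k()`, it can help to check for it before requesting the multiome dataset. The following is a minimal standard-library sketch; `missing_modules` is a hypothetical helper name, not an OmicVerse function.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Check the optional multiome dependency before requesting MuData.
missing = missing_modules(["muon"])
if missing:
    print(f"Install before calling multi_brain_5k(): pip install {' '.join(missing)}")
```

`find_spec` only locates the module without importing it, so the check is cheap and has no side effects.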

## Examples
- "Load the PBMC3k dataset and run the standard preprocessing pipeline."
- "Create a mock dataset with 5000 cells and 8 cell types for testing my clustering workflow."
- "Load cell cycle gene signatures and score my adata for S and G2M phase genes."

## References
- Quick copy/paste commands: [`reference.md`](reference.md)
