The Cell Types Database is one of the basic data products produced by The Allen Institute, which is constructing an atlas of the types of cells found in the brains of mice and humans. Cells are represented in the database in multiple ways: electrophysiology spike train recordings, simulation models (GLIF or perisomatic), etc. Of particular interest for this project is the morphology data: the skeletons in the *.swc files.
The Allen has about 500 SWC files for mouse neurons. Those ~500 are inside the red circle in the following Venn diagram of all mouse neurons in the Cell Types DB.
The main problem from The Allen's perspective is that they would like the red circle to be as big as the main outer circle. Tracing a skeleton manually takes many hours, and The Allen processes hundreds of such cells a year, so this is a serious manual labor bottleneck. It would seem like the sort of task that CNNs and friends could automate, yet this is proving to be nontrivial.
Model training data
From a model training perspective, the skeleton in an SWC file can be seen as the "labels" for "the labeled training data." For training purposes, we're only interested in the subset of cells in the Cell Types Database that have both a skeleton and a microscopy image stack. The image stack is the input to the machine to be built, and the SWC file is the output. Each SWC file represents many hours of manual labor by trained specialists reviewing and editing the SWC file.
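To make the "SWC file as label" idea concrete, here is a minimal sketch of reading a skeleton from SWC text. It assumes the standard seven-column SWC layout (node id, type, x, y, z, radius, parent id, with a parent of -1 marking the root) and integer ids. The names `SwcNode`, `parse_swc`, `skeleton_edges`, and the four-node `EXAMPLE_SWC` are hypothetical and made up for illustration; they are not part of the AllenSDK and the sample is not real Allen data.

```python
from typing import Dict, List, NamedTuple, Tuple


class SwcNode(NamedTuple):
    """One row of an SWC file (hypothetical helper type, not an AllenSDK class)."""
    node_id: int
    node_type: int   # by convention: 1 soma, 2 axon, 3 basal dendrite, 4 apical dendrite
    x: float
    y: float
    z: float
    radius: float
    parent_id: int   # -1 marks the root node


def parse_swc(text: str) -> Dict[int, SwcNode]:
    """Parse SWC text into a dict keyed by node id, skipping blank and '#' comment lines."""
    nodes: Dict[int, SwcNode] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        fields = line.split()
        if len(fields) < 7:
            raise ValueError('expected 7 columns, got: %r' % line)
        n, t, x, y, z, r, p = fields[:7]
        node = SwcNode(int(n), int(t), float(x), float(y), float(z), float(r), int(p))
        nodes[node.node_id] = node
    return nodes


def skeleton_edges(nodes: Dict[int, SwcNode]) -> List[Tuple[int, int]]:
    """Return (parent_id, child_id) pairs; together they form the skeleton tree."""
    return [(n.parent_id, n.node_id) for n in nodes.values() if n.parent_id != -1]


# Hypothetical four-node skeleton, for illustration only (not real Allen data).
EXAMPLE_SWC = """\
# id type x y z radius parent
1 1 0.0 0.0 0.0 5.0 -1
2 3 1.0 0.0 0.0 0.5 1
3 3 2.0 1.0 0.0 0.4 2
4 2 0.0 -1.0 0.0 0.3 1
"""

nodes = parse_swc(EXAMPLE_SWC)
edges = skeleton_edges(nodes)
print('%i nodes, %i edges' % (len(nodes), len(edges)))
```

The node positions and parent links are what a trained model would ultimately need to predict from the image stack, which is why the SWC file plays the role of the label.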
!pip install --quiet allensdk
# Query the Cell Types DB for files with skeletons a.k.a. reconstructions
# via https://allensdk.readthedocs.io/en/latest/cell_types.html#cell-types-cache
from allensdk.core.cell_types_cache import CellTypesCache
ctc = CellTypesCache(manifest_file='cell_types/manifest.json')
# a list of cell metadata for cells with reconstructions, download if necessary
cells = ctc.get_cells(require_reconstruction=True)
print('Number of cells with SWC files: %i' % len(cells))
Some of those are human cells, on top of the roughly 500 mouse cells. Human brains are much bigger than mouse brains, so training should focus on one species. The Allen has many more mouse neurons than human neurons, so train on mouse neurons only.
from allensdk.api.queries.cell_types_api import CellTypesApi
# We want mouse cells that have images and skeletons, both.
# Former is data; latter is training labels a.k.a. gold standards.
cells = ctc.get_cells(require_reconstruction=True, require_morphology=True, species=[CellTypesApi.MOUSE])
print('Number of mouse cells with images and SWC files: %i' % len(cells))
So, The Allen's Cell Types Database can be used as a training dataset consisting of about 500 samples.