Abstract

High-throughput microscopy of many single cells generates high-dimensional data that are far from straightforward to analyze. One important problem is automatically detecting the cellular compartment where a fluorescently tagged protein resides, a task relatively simple for an experienced human but difficult to automate on a computer. Here, we train an 11-layer neural network on data from mapping thousands of yeast proteins, achieving a per-cell localization classification accuracy of 91% and a per-protein accuracy of 99% on held-out images. We confirm that low-level network features correspond to basic image characteristics, while deeper layers separate localization classes. Using this network as a feature calculator, we train standard classifiers that assign proteins to previously unseen compartments after observing only a small number of training examples. Our results are the most accurate subcellular localization classifications to date, and demonstrate the usefulness of deep learning for high-throughput microscopy.

A deep neural network for protein subcellular classification.


Data

We constructed a large-scale labeled dataset based on high-throughput, proteome-scale microscopy images from Chong et al. Each image has two channels: a red fluorescent protein (mCherry) with cytosolic localization, which marks the cell contour, and a green fluorescent protein (GFP) tag fused to the 3' end of an endogenous gene, which reports the abundance and localization of the tagged protein. The data are split into 65,000 examples for training, 12,500 for validation, and 12,500 for testing.
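The split described above can be sketched as follows. This is an illustrative sketch, not the paper's released code: it shuffles 90,000 single-cell example indices and partitions them into the 65,000 / 12,500 / 12,500 train/validation/test sizes; all array names are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the paper's code): reproduce the
# 65,000 / 12,500 / 12,500 train/validation/test split over
# 90,000 single-cell example indices.
rng = np.random.default_rng(0)
n_total = 65_000 + 12_500 + 12_500  # 90,000 examples

perm = rng.permutation(n_total)
train_idx, val_idx, test_idx = np.split(perm, [65_000, 77_500])

# The same index arrays would slice both channels, so each cell's
# mCherry (contour) and GFP (localization) images stay paired.
print(len(train_idx), len(val_idx), len(test_idx))  # -> 65000 12500 12500
```

Splitting by shuffled indices rather than by copying image arrays keeps the channels of each cell paired and avoids duplicating the image data in memory.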

Image examples.

Reference: Chong, Y.T., Koh, J.L., Friesen, H., Duffy, S.K., Cox, M.J., Moses, A., Moffat, J., Boone, C. and Andrews, B.J. (2015). Yeast proteome dynamics from single cell imaging and automated analysis. Cell, 161(6), pp. 1413-1424.


Code

We trained a deep convolutional neural network with 11 layers of learnable weights: 8 convolutional and 3 fully connected.
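A minimal sketch of such an architecture, assuming PyTorch: 8 convolutional layers followed by 3 fully connected layers, taking two-channel (mCherry + GFP) single-cell crops as input. The crop size (64 x 64), channel widths, and class count here are illustrative placeholders, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool=False):
    """A 3x3 convolution + ReLU, optionally followed by 2x2 max pooling."""
    layers = [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

model = nn.Sequential(
    # 8 convolutional layers, pooling after every second one
    *conv_block(2, 64),    *conv_block(64, 64, pool=True),
    *conv_block(64, 128),  *conv_block(128, 128, pool=True),
    *conv_block(128, 256), *conv_block(256, 256, pool=True),
    *conv_block(256, 256), *conv_block(256, 256, pool=True),
    nn.Flatten(),
    # 3 fully connected layers; 17 is a placeholder class count
    nn.Linear(256 * 4 * 4, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 17),
)

# A batch of one 2-channel 64x64 crop maps to one logit per class.
logits = model(torch.zeros(1, 2, 64, 64))
assert logits.shape == (1, 17)
```

The penultimate fully connected layer is also the natural place to read out features when, as in the abstract, the trained network is reused as a feature calculator for standard classifiers.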



Additional information

We explored many features of the data; these analyses are presented in the main text and figures of the paper, as well as in additional analyses.

The code for generating reports, as well as a subset of figures, is also available. Some path names need to be updated for your local machine; please contact us if you have problems doing so.

All the code is available under the MIT license.