Accessing and analyzing protein turnover data on Synapse and ProteomeXchange


Overview

This documentation provides step-by-step instructions on how to retrieve and analyze the protein turnover dataset on ProteomeXchange (PXD002870) and Synapse (syn2289125). The dataset contains the expression and turnover information of over 3,000 proteins from normal and hypertrophic hearts from six common genetic strains of mice. We describe four potential use cases in an accompanying data descriptor manuscript. This document provides instructions for retrieving the datasets from the hosting repositories and getting started on data re-analysis.

Use Case 1: Retrieving the turnover rate of a specific protein from Synapse

Step 1

Access the Synapse project for the dataset at http://dx.doi.org/doi:10.7303/syn2289125. Navigate to the Files tab near the top of the project page to browse the directory structure of the dataset files, then click on the tidy folder (Fig. 1.1).

Fig.1.1 - The tab interface on the Synapse project

Step 2

The tidy folder contains the most cleaned-up versions of the data files that are ready for re-use. Click on the all_protein_k.txt file from inside the tidy folder to access the file on the project (Fig. 1.2a).

Fig.1.2a - Content of the tidy folder on the Synapse project.

On the file’s page, click on the file name next to the file icon to download the file (Fig. 1.2b).

Fig.1.2b - Link to download the all_protein_k.txt file.

Step 3

This all_protein_k.txt file contains the summarized and filtered protein turnover rates with the following five columns:

uniprot: UniProt ID of the protein.
strain: Mouse strain in which the protein in column 1 is quantified (aj: A/J; balbc: BALB/cJ; c57: C57BL/6J; cej: CE/J; dba: DBA/2J; fvb: FVB/NJ)
group: Condition in which the protein is quantified (ctrl: Normal heart; iso: Hypertrophic heart)
med.k: Turnover rate of the protein (1/d), expressed as the median of the optimized turnover rates for all constituent peptides that pass the stringency filter (R2 >= 0.8, SE <= 0.05)
mad.k: Variability of the turnover rate of the protein, expressed as the median absolute deviation of the optimized turnover rates for all constituent peptides that pass the stringency filter (R2 >= 0.8, SE <= 0.05)

Fig.1.3 - How the header of the all_protein_k.txt file looks when opened.

Step 4

Using a software program or programming language of your choice, you can sort the tables by each column or filter the turnover rate of a particular protein in a particular strain/condition. For the following examples, we will look for the turnover rate of the protein NADH dehydrogenase (ubiquinone) complex I, assembly factor 6 (Ndufaf5) in the normal heart of A/J mice using Microsoft Excel or R/RStudio (https://www.rstudio.org).

Protein identities are recorded by their UniProt ID in the file. The UniProt IDs for proteins can be found on the UniProt website (http://uniprot.org).

  • Using Microsoft Excel:

    Open the text file. Go to the data tab on the Ribbon menu > Sort & Filter > Filter (Fig.1.4a). This will create a dropdown menu next to each column header.

    Fig.1.4a - Filtering the data table in Microsoft Excel

    Click on the dropdown button next to “uniprot” and type in A2AIL4. The spreadsheet now shows only results from Ndufaf5 (Fig.1.4b).

    Fig.1.4b - Filtered data

  • Using R and RStudio:

    Set the working directory to the directory where the all_protein_k.txt file is located using the setwd() command, e.g., setwd("~/downloads"). Then install and load the dplyr package, and use the following dplyr commands to read the file and filter its rows.

    # Install (only needed once) and load the dplyr library
    install.packages("dplyr")
    library("dplyr")

    # Open the data file
    data <- read.table("all_protein_k.txt", header = TRUE, fill = TRUE)

    # Filter by the uniprot, strain, and group columns using specific values
    dplyr::filter(data, uniprot == "A2AIL4", strain == "aj", group == "ctrl") %>%

    # Select only the med.k column
    dplyr::select(med.k)

    This will return the median turnover rate of Ndufaf5, 0.066 per day, in the interactive console (Fig.1.4c). Altering the query parameters, e.g., uniprot == "A2AGT5" or group == "iso", will retrieve the corresponding turnover rate for a particular protein (uniprot) in a particular genetic background (strain) under a particular condition (group).

    Fig.1.4c - Inputting the command above into the RStudio console and getting the query result
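
The lookup described above can also be wrapped in a small helper so that different proteins, strains, and groups can be queried without retyping the filter. The following is only a base-R sketch, not part of the original analysis: the function name get_turnover is hypothetical, and the toy data frame holds invented values (only the column names follow the description of all_protein_k.txt in Step 3).

```r
# Toy stand-in for all_protein_k.txt; all numeric values are invented.
toy <- data.frame(
  uniprot = c("A2AIL4", "A2AIL4", "A2AGT5"),
  strain  = c("aj", "aj", "aj"),
  group   = c("ctrl", "iso", "ctrl"),
  med.k   = c(0.10, 0.20, 0.30),
  mad.k   = c(0.01, 0.02, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows matching one protein, strain, and group.
get_turnover <- function(df, id, strain_code, group_code) {
  keep <- df$uniprot == id & df$strain == strain_code & df$group == group_code
  df[keep, c("uniprot", "strain", "group", "med.k", "mad.k"), drop = FALSE]
}

result <- get_turnover(toy, "A2AIL4", "aj", "ctrl")
print(result)
```

To run the helper on the real data, pass the data frame returned by read.table() in place of toy.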

Use Case 2: Analyzing the turnover rate of a protein pathway

These instructions assume that you are already familiar with the steps required to look up the turnover rates of individual proteins above. We will use Golgi proteins as an example:

Step 1

This analysis requires linking the data file to an external annotation, and there are several ways to do so. One option is to download all UniProt proteins that are annotated with the Golgi Gene Ontology term, then filter the data file for UniProt entries that belong to the Golgi list using methods similar to those described in Use Case 1. Alternatively, a web service can be used to query BioMart for all annotations of the proteins in the dataset, keeping only those annotated to the Golgi. A third option is to export the data table to external services such as Reactome, which perform functional analysis.

Step 2

This particular example uses R/RStudio and the biomaRt package from Bioconductor to retrieve the turnover rates of all Golgi proteins in the dataset.

# Install biomaRt from Bioconductor (only needed once)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")

# Load the dplyr and biomaRt packages
library("dplyr")
library("biomaRt")

# Open the data file
data <- read.table("all_protein_k.txt", header = TRUE, fill = TRUE)

# Get the list of unique proteins from the data file
protein_list <- data %>% dplyr::select(uniprot) %>% distinct()

# Specify the BioMart database and dataset to query
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", host = "https://www.ensembl.org")

# Get all GO terms associated with the protein list
# (attribute names can change between Ensembl releases; check listAttributes(ensembl))
protein_go <- getBM(attributes = c("uniprot_swissprot", "external_gene_name", "go_id", "name_1006", "namespace_1003"), filters = "uniprot_swissprot", values = as.character(protein_list$uniprot), mart = ensembl)

This will use the biomaRt interface to BioMart to retrieve all Gene Ontology terms associated with the protein list.

This object can be inspected in the RStudio viewer (Fig.2.2a):

# Viewing the list of all GO terms associated with the protein list
protein_go %>% View()

Fig.2.2a - Viewing retrieved Gene Ontology terms

The RStudio viewer has a built-in search and filter box on the upper right (text box with magnifying glass icon), which can be used to locate terms of interest, e.g., typing in “Golgi apparatus” filters the object to display only the rows of data associated with the Golgi (Fig.2.2b). This also reveals the GO ID associated with “Cellular Component: Golgi apparatus” (GO:0005794), which can be used to retrieve only the proteins (UniProt ID) that are linked to the term.

Fig.2.2b - Filtering the list of retrieved GO terms with the search term “Golgi apparatus”

# Find the uniprot IDs associated with the Golgi GO ID
golgi_protein <- protein_go %>% filter(go_id == "GO:0005794")

# Filter the turnover data to retrieve only the Golgi proteins
golgi_turnover <- data %>% filter(uniprot %in% golgi_protein$uniprot_swissprot)

The Golgi-only data object can now be inspected and summarized as described above and in Use Case 1. An example is given below on getting the five-number summary of the turnover rates associated with Golgi proteins in all six mouse strains.

# Getting the five-number summary (minimum, lower hinge, median, upper hinge, maximum) of Golgi protein turnover rates
fivenum(golgi_turnover$med.k)
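
Because the Golgi-only table covers several strains, it can be useful to summarize each strain separately. The sketch below is a base-R illustration using split() and fivenum(); the data frame golgi_toy and its values are invented stand-ins for the real golgi_turnover object, and only the strain and med.k column names come from the file description in Use Case 1.

```r
# Toy stand-in for the Golgi-only turnover table; all values are invented.
golgi_toy <- data.frame(
  strain = c(rep("aj", 5), rep("c57", 3)),
  med.k  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.2, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Split the turnover rates by strain, then compute Tukey's five-number
# summary (minimum, lower hinge, median, upper hinge, maximum) for each.
by_strain <- lapply(split(golgi_toy$med.k, golgi_toy$strain), fivenum)
print(by_strain)
```

The same two lines should work on the real table by replacing golgi_toy with golgi_turnover; adding the group column to split() would separate normal and hypertrophic hearts as well.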

Step 3 (Optional)

The steps in Use Case 1 and Use Case 2 above use only results filtered according to the criteria described in the manuscript (Sci Data 2016). To retrieve turnover rates using a different set of filtering criteria, e.g., accepting only peptides whose goodness-of-fit in the model is 0.95 or above in the normal A/J heart, follow the example below:

Go to the tidy folder and download all_hl.txt. This file contains all the ProTurn output columns, including the following:

ID: Internal ID for referring to corresponding data in hl.out for a particular sample
UniProt: UniProt ID of the protein the peptide was assigned to in database search
Peptide: Sequence (with modifications)
DP: Number of data points - corresponding to the number of time points in most runs
z: Charge state
mi: Isotopomer index - 0 in most runs
SS: Residual sum of squares of fitting
a: Initial isotopomer fractional abundance prior to labeling based on peptide sequence
pss: Experimental steady-state relative label enrichment level
kp: Experimental rate constant of label enrichment (1/d)
N: Number of accessible labeling sites based on sequence
k: Fitted rate constant of peptide turnover (1/d)
dk: Fitting error
R2: Goodness-of-fit
sample: Sample identifier (concatenation of strain and treatment)
strain: Mouse strain in which the peptide is quantified (aj: A/J; balbc: BALB/cJ; c57: C57BL/6J; cej: CE/J; dba: DBA/2J; fvb: FVB/NJ)
treatment: Sample group in which the peptide is quantified (ctrl: Normal hearts; iso: Hypertrophic hearts)

Example code to re-filter the data using R and RStudio:

Set the working directory to the directory where the all_hl.txt file is located using the setwd() command, e.g., setwd("~/downloads"). Then install and load the dplyr package as in Use Case 1, and use the following dplyr commands to read the file, filter its rows, and summarize the result.

# Open the data file
data <- read.table("all_hl.txt", header = TRUE, fill = TRUE)

# Filter the data by the R2 column, taking only values of 0.95 or greater
# (column names are case-sensitive; check names(data) if a column is not found)
filtered <- data %>% dplyr::filter(R2 >= 0.95) %>%

# Group the data frame by the protein, strain, and treatment variables
dplyr::group_by(Uniprot, strain, treatment) %>%

# Create a summary variable that is the median of k values within each group
dplyr::summarize(med.k = median(k))
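
The re-filtering above can be extended to report a median absolute deviation alongside the median, similar to the med.k and mad.k columns described in Use Case 1. The following base-R sketch rests on several assumptions that should be checked against the manuscript: that the dk column plays the role of the SE cutoff, that the MAD is unscaled (constant = 1, whereas R's mad() defaults to 1.4826), and that the column is spelled Uniprot as in the code above. The peptide table is a toy example with invented values.

```r
# Toy peptide-level table; all values are invented for illustration.
peptides <- data.frame(
  Uniprot   = c("P1", "P1", "P1", "P1", "P2", "P2"),
  strain    = "aj",
  treatment = "ctrl",
  k         = c(0.10, 0.12, 0.14, 0.90, 0.30, 0.34),
  dk        = c(0.01, 0.02, 0.01, 0.20, 0.01, 0.02),
  R2        = c(0.99, 0.97, 0.96, 0.50, 0.98, 0.96),
  stringsAsFactors = FALSE
)

# Keep peptides passing both cutoffs (assumption: dk is the SE being filtered).
passed <- peptides[peptides$R2 >= 0.95 & peptides$dk <= 0.05, ]

# Median k per protein, strain, and treatment.
med <- aggregate(k ~ Uniprot + strain + treatment, data = passed, FUN = median)
names(med)[names(med) == "k"] <- "med.k"

# Unscaled median absolute deviation per group (assumption: constant = 1).
madk <- aggregate(k ~ Uniprot + strain + treatment, data = passed,
                  FUN = function(x) mad(x, constant = 1))
names(madk)[names(madk) == "k"] <- "mad.k"

# Combine both summaries on the shared grouping columns.
summary_k <- merge(med, madk)
print(summary_k)
```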

Step 4

Any analysis results may be uploaded back to Synapse using the Synapse R or Python client. Refer to the latest Synapse documentation on how to upload files to the project: (https://www.synapse.org/#!Help:GettingStarted).

Use Case 3: Re-analyzing mass spectrometry raw files

The instructions below assume that you are familiar with how to download the data files from Synapse (above) and have basic knowledge of proteomics identification workflows. Refer to the documentation of common search engines for step-by-step instructions on how to perform protein identification from raw mass spectrometry files.

Step 1

Access the ProteomeXchange/PRIDE page for the dataset Protein Turnover Rates in Normal and Hypertrophy Mouse Hearts at http://www.ebi.ac.uk/pride/archive/projects/PXD002870. To view a list of all files in your browser, access http://ftp.pride.ebi.ac.uk/pride/data/archive/2015/09/PXD002870/.

To download a list of all contents, enter the command:
curl ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/09/PXD002870/ > pxd002870.txt

To download a particular file, enter the command:
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/09/PXD002870/[[filename]]

To download all files in the dataset, enter the command:
wget -r -nH --cut-dirs=5 --no-parent ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/09/PXD002870/
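
For users who prefer to stay within R, single files can also be fetched with the base function download.file(). The sketch below only builds the URL and leaves the download commented out; the helper name pride_url and the file name example.raw are hypothetical placeholders, so substitute a name taken from the file listing.

```r
# Base URL of the dataset on the PRIDE FTP server (from the commands above).
base_url <- "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/09/PXD002870/"

# Hypothetical helper: build the full URL for one file name.
pride_url <- function(filename) paste0(base_url, filename)

# "example.raw" is a placeholder, not an actual file in the dataset.
url <- pride_url("example.raw")
print(url)

# Uncomment to download; mode = "wb" keeps binary files intact on Windows.
# download.file(url, destfile = basename(url), mode = "wb")
```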

Step 2

The raw mass spectrometry files can be re-searched using the database, search engine, and parameters of your choice.

An example suite that may be used is MaxQuant, the documentation for which can be accessed at http://maxquant.org. Other suites include Mascot (http://matrixscience.com) and ProLuCID (http://integratedproteomics.com).

The MS files were acquired in FT/IT mode with 60,000 resolution on an Orbitrap Elite instrument. The appropriate mass tolerance for searching the dataset should be between 10 and 50 ppm for MS1, and approximately 600 ppm for MS2. Individual strains, experimental groups, and time points need to be searched independently because they provide independent data points and isotope enrichment evidence in a ProTurn analysis. Subcellular fractions and liquid chromatography elution fractions within a sample and time point may be searched together.
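
Some search engines expect tolerances in Da rather than ppm. The conversion is tolerance (Da) = m/z × ppm / 10^6. The sketch below applies it to the tolerances mentioned above; the helper name ppm_to_da and the choice of m/z 500 are arbitrary illustrations, not values from the dataset.

```r
# Hypothetical helper: convert a ppm tolerance to an absolute tolerance at a given m/z.
ppm_to_da <- function(mz, ppm) mz * ppm / 1e6

# Tolerances of 10, 50, and 600 ppm evaluated at an example m/z of 500.
tol <- ppm_to_da(mz = 500, ppm = c(10, 50, 600))
print(tol)
```

At m/z 500 these correspond to 0.005, 0.025, and 0.3 respectively; the absolute window widens in proportion to m/z.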

Step 3

To analyze the turnover rates of newly identified peptides, supply the raw mass spectrometry files as well as the new database search results to ProTurn. ProTurn (v.2.0.5) currently supports the following database search result formats: Mascot, SEQUEST/ProLuCID, MaxQuant, and ProteomeDiscoverer. For detailed instructions, please refer to the ProTurn documentation (http://heartproteome.org/proturn).

Note that ProTurn requires the raw mass spectrometry files to be first converted from the Thermo binary (.raw) format into the community-standard .mzML format. This may be done using a variety of tools including ProteoWizard, the documentation for which may be found at http://proteowizard.sourceforge.net.

Use Case 4: Develop new methods for turnover data analysis

Step 1

Access the raw MS files and ProTurn following the instructions in Use Case 3.

The ID column in hl-data.out corresponds to the ID column in hl.out. The A0 column in hl-data.out gives the relative abundance of the zeroth mass-isotopomer at the time point (in days since the start of the experiment) indicated in the t column, for the particular peptide indicated in the ID column. These integration results may be used directly for other curve-fitting models and algorithms.
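
As an illustration of fitting an alternative model to such A0-versus-t data, the sketch below fits a rate constant by least squares with the base-R optimize() function. It is a deliberately simplified stand-in and not the ProTurn kinetic model: it assumes a single-exponential decay of A0 from a known initial value to a known plateau, and the time course is simulated rather than taken from hl-data.out. All variable names are hypothetical.

```r
# Simulated time course for one peptide (not real data).
t_days     <- c(0, 1, 3, 5, 7, 14)   # time points in days
a0_start   <- 0.5                    # assumed initial A0 before labeling
a0_plateau <- 0.3                    # assumed A0 at full labeling
true_k     <- 0.2                    # rate constant used for the simulation (1/d)
noise      <- c(0.002, -0.003, 0.001, -0.002, 0.003, -0.001)
A0 <- a0_plateau + (a0_start - a0_plateau) * exp(-true_k * t_days) + noise

# Simplified single-exponential model (an assumption, not the ProTurn model).
model_A0 <- function(k, t) a0_plateau + (a0_start - a0_plateau) * exp(-k * t)

# Sum of squared residuals between observed and modelled A0 for a given k.
sse <- function(k) sum((A0 - model_A0(k, t_days))^2)

# One-dimensional least-squares search for k over a plausible interval.
fit <- optimize(sse, interval = c(0.001, 2))
k_hat <- fit$minimum
print(k_hat)
```

With real data, A0 and t_days would come from the rows of hl-data.out that share one ID, and the starting and plateau values would need to be derived from the corresponding hl.out columns or fitted as additional parameters.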

For information on the kinetic model used to calculate turnover rates in ProTurn, see the Supplemental Data of our previous publication: Protein kinetic signatures of the remodeling heart following isoproterenol stimulation.

Refer to the latest ProTurn documentation for further details on inputs and outputs: (http://heartproteome.org/proturn)

Getting Help

Additional descriptions of the dataset can be found in our publication (Sci Data 2016). Please feel free to contact us with additional questions:


Edward Lau
edwardlau@ucla.edu