Mapping the Proteoform Landscape of Five Human Tissues

A functional understanding of the human body requires structure–function studies of proteins at scale. The chemical structure of proteins is controlled at the transcriptional, translational, and post-translational levels, creating a variety of products with modulated functions within the cell. The term “proteoform” encapsulates this complexity at the level of chemical composition. Comprehensive mapping of the proteoform landscape in human tissues necessitates analytical techniques with increased sensitivity and depth of coverage. Here, we took a top-down proteomics approach, combining data generated using capillary zone electrophoresis (CZE) and nanoflow reversed-phase liquid chromatography (RPLC) hyphenated to mass spectrometry to identify and characterize proteoforms from the human lungs, heart, spleen, small intestine, and kidneys. CZE and RPLC provided complementary post-translational modification and proteoform selectivity, thereby enhancing the overall proteome coverage when used in combination. Of the 11,466 proteoforms identified in this study, 7,373 (64%) were not reported previously. Large differences at the protein and proteoform levels were readily quantified, with initial inferences about proteoform biology operative in the analyzed organs. Differential proteoform regulation of defensins, glutathione transferases, and sarcomeric proteins across tissues generates hypotheses about how they function and are regulated in human health and disease.


Introduction
Mapping the human body is critical to improving our understanding by setting definitive reference points for organs, tissues, and cells of diverse types. In proteomics, a complete understanding of proteoform 1 diversity requires measurements that systematically capture protein-level complexity. In projects like the Human Biomolecular Atlas Program (HuBMAP) 2 and Human Cell Atlas, 3 the resolution of mapping can handle single cells in tissues, with several highly multiplexed methods enabled by antibody-based affinity reagents: CODEX, 4 Immuno-SABER, 5 CyTOF, 6 and MIBI, 7, 8 among others. These methods measure the expression of particular epitopes on proteins, though they still fail to capture the full complexity of the proteoforms present.
Proteoform-level measurements are more specific for a particular biological state compared to measurements on the gene or even protein level. 9, 10 While our long-term goal is to develop new technologies that deliver spatial proteoform analysis and build a comprehensive atlas of human proteoforms, 11 our goal here is to identify proteoforms present in primary human tissue and provide an initial assessment of their PTMs across tissue types.
Top-down proteomics (TDP), where intact proteins are isolated and fragmented by mass spectrometry (MS), is well suited for the identification and characterization of tissue-specific proteoforms. For the analysis of complex proteome samples, upfront separation and/or fractionation represents a crucial part in TDP workflows to reduce complexity prior to MS.
Reversed-phase liquid chromatography (RPLC) is traditionally employed as the method of choice in TDP, in part due to its reproducibility, separation capacity, and MS compatibility, though capillary zone electrophoresis (CZE) represents an alternative for online MS. In particular, the separation principle of CZE is based on differences in electrophoretic mobilities (charge-to-size ratio) and is considered largely "orthogonal" to RPLC, where separation is driven by the hydrophobicity of analyte molecules. For this reason, the combination of information generated by both techniques is anticipated to increase the number of identified proteins and proteoforms.
Here, we report results from two workflows for mapping the proteoform landscape of solid tissues and present the first iteration with five commonly studied human tissues (heart, lung, kidney, small intestine, and spleen). Initially, the extracted proteoforms were pre-fractionated using Gel-Eluted Liquid Fraction Entrapment Electrophoresis (GELFrEE), 12 followed by CZE-MS and nano RPLC-MS analysis. This study contributes 7,373 proteoforms to the Human Proteoform Atlas (HPfA), a FAIR 13 knowledgebase that now contains approximately 60,000 unique proteoforms linked to their biological context. 14

Reagents
All reagents were purchased from Thermo Fisher Scientific at the highest available purity unless otherwise specified.

Tissue Lysate Preparation
Fresh-frozen tissue samples of human heart, lung, small intestine, and spleen were obtained from HuBMAP Tissue Mapping Centers (Table S1). Tissue samples were collected under IRB-approved protocols at each institution. Kidney samples were received as 10 µm microtome scrolls embedded in methylcellulose (each ~5 mg). All other tissue types were cut into small pieces (~5 mm) by the specimen preparers at the Mapping Centers. Kidney scrolls were cryopulverized in 2 mL Eppendorf Protein Lo-Bind tubes containing a 5-mm stainless steel ball (Qiagen, cat. no. 69989) with a Cryomill (Retsch, cat. no. 20.749.001) equipped with a tube adaptor. Non-kidney tissue specimens (50-100 mg) were cryopulverized with the Cryomill equipped with a 25 mL grinding jar containing a 1-inch stainless steel ball. Three cycles of precooling with liquid nitrogen at 1 Hz for 3 min and grinding at 30 Hz for 1 min were performed. Pulverized tissue was transferred to a 15 mL conical tube and resuspended in 2 mL cold RIPA lysis buffer (50 mM Tris, 150 mM NaCl, 1% NP-40 (v/v), 0.5% sodium deoxycholate (w/v), 0.1% sodium dodecyl sulfate (w/v), pH 7.4, 1X Halt Protease and Phosphatase Inhibitor Cocktail (Thermo Scientific)). The suspension was further disrupted by sonication on ice (40% power, cycle 2 s on, 3 s off, for 30 s total) with a probe sonicator (FisherBrand Model 120 with 1/8 inch probe) and then clarified by centrifugation (3234 × g, 30 min, 4 °C).

Sample Prefractionation and Preparation for Mass Spectrometry
Kidney lysates were studied with a 5x4x1x2 design: five biospecimens from separate donors were GELFrEE-fractionated into four fractions, analyzed by RPLC-MS/MS, and injected in duplicate.
Lung lysates were studied in a 3x6x1x3 design: three samples from a single donor, six fractions, only RPLC, and three injections. Heart lysates were studied in a 2x6x2x3 design: two donors, six fractions, both CZE and RPLC, and three injections. Small intestine and spleen were studied in a 1x6x2x3 design: one sample, six fractions, both CZE and RPLC, and three injections. Lysates were fractionated and prepared for mass spectrometry as described previously. 15 Briefly, lysates were precipitated by adding four volumes of cold acetone and incubating at -80 °C for 1 hour. The precipitate was collected by centrifugation (20,000 × g, 30 min, 4 °C), and proteins were resolubilized in 1% sodium dodecyl sulfate (w/v). Total protein content was determined by BCA assay (Thermo Scientific). Samples were fractionated with the GELFrEE 8100 Fractionation Station (Expedeon). Protein samples (300 µg in 150 µL) were combined with 30 µL GELFrEE running buffer and 8 µL 1 M DTT. The samples were incubated at 95 °C for 5 minutes, cooled to room temperature, and separated with a 10% GELFrEE cartridge following the manufacturer's protocol. Six (four in the case of kidney samples) 150 µL fractions were collected and stored at -80 °C until immediately prior to analysis. On the day of analysis, fractions were thawed on ice and precipitated with methanol-chloroform-water as described. 16 Table S2. Overnight, the capillary was rinsed with water, alternating between high-flow (100 psi, 2 min) and low-flow (10 psi, 120 min) steps. For long-term storage, the separation and conductive lines were each rinsed (100 psi) with water for 5 min, and the cartridge was stored at 4 °C.
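The four-factor design notation used at the start of this section can be tallied with a short Python sketch. It assumes, as an illustration only, that the product of the four factors (biospecimens × fractions × separation methods × injections) equals the number of MS acquisitions per tissue; the dictionary and function names are hypothetical.

```python
from math import prod

# Designs as stated in the text:
# (biospecimens, GELFrEE fractions, separation methods, injections).
DESIGNS = {
    "kidney": (5, 4, 1, 2),
    "lung": (3, 6, 1, 3),
    "heart": (2, 6, 2, 3),
    "small intestine": (1, 6, 2, 3),
    "spleen": (1, 6, 2, 3),
}


def runs_for(design):
    """Number of acquisitions implied by one design (assumed: simple product)."""
    return prod(design)


def total_runs(designs):
    """Sum of implied acquisitions over all tissues."""
    return sum(runs_for(d) for d in designs.values())


if __name__ == "__main__":
    for tissue, design in DESIGNS.items():
        print(f"{tissue}: {runs_for(design)} runs")
    print("total:", total_runs(DESIGNS))
```

Under that assumption the heart design, for example, implies 72 acquisitions; runs that failed or were excluded would not be reflected in this count.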
Following a valve switch, proteins were separated on the analytical column according to the following gradient: 5% B at 10 min, 15% B at 13 min, 45% B at 70 min, 95% B at 72 min, 95% B at 76 min, 5% B at 80 min, 5% B from 80 to 90 min. For fractions 5 and 6, proteins were separated according to the following gradient: 5% B at 10 min, 15% B at 13 min, 50% B at 70 min, 95% B at 72 min, 95% B at 76 min, 5% B at 80 min, 5% B from 80 to 90 min. Eluted proteins were ionized by nanoelectrospray ionization in positive ion mode using a pulled-tip nanospray emitter (15 µm i.d. x 125 mm, New Objective) packed with 1 mm of PLRP-S (5 µm, 1000-Å pore size) with a custom nano-source.
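The two gradient programs above can be written as time/composition breakpoints. The Python sketch below assumes linear ramps between breakpoints and holds the first and last compositions outside the listed times; both are assumptions for illustration, since the text does not state the composition before 10 min. The constant and function names are hypothetical.

```python
from bisect import bisect_right

# (time in min, % mobile phase B) breakpoints taken from the text.
GRADIENT_DEFAULT = [(10, 5), (13, 15), (70, 45), (72, 95), (76, 95), (80, 5), (90, 5)]
GRADIENT_FRACTIONS_5_6 = [(10, 5), (13, 15), (70, 50), (72, 95), (76, 95), (80, 5), (90, 5)]


def percent_b(t_min, gradient):
    """Return %B at time t_min, interpolating linearly between breakpoints.

    Assumption: composition is held at the first/last breakpoint value
    before the first and after the last listed time.
    """
    times = [t for t, _ in gradient]
    if t_min <= times[0]:
        return float(gradient[0][1])
    if t_min >= times[-1]:
        return float(gradient[-1][1])
    i = bisect_right(times, t_min)          # index of the next breakpoint
    (t0, b0), (t1, b1) = gradient[i - 1], gradient[i]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (10, 13, 41.5, 70, 74, 78, 85):
        print(t, percent_b(t, GRADIENT_DEFAULT), percent_b(t, GRADIENT_FRACTIONS_5_6))
```

The only difference between the two programs is the composition reached at 70 min (45% versus 50% B), which makes the main ramp slightly steeper for fractions 5 and 6.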

Top-down Mass Spectrometry
Mass spectrometry was performed using either a Thermo Scientific Orbitrap Eclipse Tribrid mass spectrometer or a Thermo Scientific Fusion Lumos Orbitrap Tribrid mass spectrometer. For analysis on the Eclipse MS, data were acquired with the following global parameters: spray voltage: 1600 V, sweep gas: 0, ion transfer tube temperature: 320 °C, application mode: Intact Protein, pressure mode: Low Pressure (2 mTorr), Advanced Peak Determination: True, default charge state: 15, S-lens RF: 30%, source collision induced dissociation: 15 eV. Precursor spectra were acquired at 120,000 resolving power, detector type: Orbitrap, scan range: 600-2000 m/z, mass range: normal, AGC target 2E6, normalized AGC target: 500%, max injection time: 50 ms, microscans: 1. The mass spectrometer was operated using a TopN 3 s data-dependent acquisition mode.

Protein and Proteoform Identification
The raw data files were processed with the publicly available workflow on TDPortal (https://portal.nrtdp.northwestern.edu, Code Set 4.0.0) that performed mass inference, searched a database of human proteoforms derived from Swiss-Prot (June 2020) with curated histones, and estimated conservative, context-dependent 1% FDR at the protein, isoform, and proteoform levels. 18 Each tissue type was searched separately with its own FDR context. Aggregated search results were used in further data analysis.

Code and Data Availability
Raw files, mzIdentML, and tdReport files were deposited in MassIVE (Accession MSV000088565). Search results in tdReport format are viewable using TDViewer, freeware from Northwestern University (http://topdownviewer.northwestern.edu). Search results were further analyzed, and figures were generated with custom code written for R 4.1.0. Source code for data analysis is available at https://github.com/bdrown/rplc-cze-tissues.

Results and Discussion
Samples were obtained from HuBMAP Tissue Mapping Centers from ten human donors. Tissue was cryopulverized, lysed, and proteins precipitated (Figure 1). To increase the depth of proteome coverage, proteins were fractionated with GELFrEE prior to MS analysis. Since we intended to analyze each sample by both CZE and RPLC, we set up two Orbitrap tribrid MS instruments configured with either CZE or RPLC, acquired data for a sample on one system, and immediately acquired data for the same sample on the second one. CZE substantially benefits from a higher scan rate due to generally narrower peak widths. Consequently, the CESI 8000 Plus was hyphenated to the Orbitrap Eclipse while a Dionex nanoLC was coupled to the Orbitrap Fusion Lumos. Three tissue types (heart, small intestine, and spleen) were analyzed by this paired analysis while two tissues (lung and kidney) were analyzed solely by RPLC-MS on the Orbitrap Eclipse (Table 1).

Discovery of New Human Proteoforms
By searching the TDP data against a database of human proteoforms using TDPortal and 1% conservative false discovery rate (FDR), a total of 11,466 proteoforms from 740 proteins were identified (Table 1). We also sought to compare the proteoforms identified in this work to those reported in prior studies. The Human Proteoform Atlas (HPfA, http://human-proteoform-atlas.org/) is the most comprehensive collection of characterized proteoforms. 14 The HPfA consists of 48 datasets which include numerous studies on immortalized cell lines, one study on healthy human solid tissue, 19 two studies on human cancer tissue, 20,21 and the Blood Proteoform Atlas. 22 A "bird's-eye" view of the physicochemical properties of proteoforms identified in the five different tissue types, including hydrophobicity, monoisotopic mass, and pI value, can be found in Figure 3A and S3. While kidney, lung, and spleen tissue proteoforms show similar distributions in their violin plots for all three investigated characteristics, distinct differences for heart and especially small intestine tissue were detected. For example, in the case of the small intestine, a high number of proteoforms in the pI range of 10.5 to 12.0 was observed, which can be explained by a relative increase in histone proteoforms compared to the other analyzed tissue types. This is also supported by the negative GRAVY score, with the distribution concentrated at around -0.6. On the other hand, proteoforms observed in heart tissue exhibit a relatively broad distribution of pI values.
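The GRAVY (grand average of hydropathy) score mentioned above is conventionally the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The Python sketch below illustrates that convention; whether the authors' R code computed GRAVY exactly this way (for example, how modified residues were handled) is not stated in the text, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy of a sequence (negative = hydrophilic)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # A KeyError here signals a non-standard residue letter.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Toy sequences, not proteoforms from this study.
    print(gravy("AK"))    # one hydrophobic and one basic residue
    print(gravy("KRKR"))  # strongly basic, strongly negative GRAVY
```

Lysine- and arginine-rich sequences such as histones score strongly negative on this scale, consistent with the histone explanation given for the small intestine data.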

Influence of separation technique
While the performance of CZE and RPLC has been compared in numerous contexts, 23-27 the paired analysis of heart, small intestine, and spleen provides an opportunity to explore how proteoforms behave under these two separation techniques. Despite requiring similarly long acquisition times, the window of separation for CZE was smaller than for RPLC. The difference in separation principle was evident in the relationship between proteoform retention/migration times and mass (Figure 3B) as well as time and hydrophobicity (Figure 3C). While there is a strong correlation between mass and retention time with RPLC, no significant correlation was observed between mass and migration time with CZE (Table S3). Both separation methods demonstrate a correlation between hydrophobicity and time, but RPLC has a stronger correlation. While CZE was performed with an acidic background electrolyte (pH 2.4), we observed a positive correlation between proteoform hydrophobicity and mass-to-charge ratio (Figure S3I), which helps to explain the increase in hydrophobicity with migration time (fewer "ionizable" amino acids available per size).
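The correlations discussed above relate a proteoform property (mass or hydrophobicity) to retention or migration time. As a minimal illustration, the Python sketch below computes a Pearson correlation coefficient; the text does not say which correlation statistic was used for Table S3, so Pearson is an assumption, and the data in the example are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical proteoform masses (kDa) and retention times (min).
    masses = [5.0, 8.2, 11.5, 14.9, 20.3]
    retention = [22.0, 30.5, 37.0, 45.1, 58.4]
    print(round(pearson_r(masses, retention), 3))
```

A rank-based statistic such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.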
In addition to the differing physicochemical properties of proteoforms identified using CZE and RPLC, the distribution of post-translational modifications (PTMs) was similarly asymmetrical.
Two-by-two Chi-squared tests were performed to determine which PTMs had significant deviations in their identification rates (observed PTM / the sum of all other PTMs) as described previously. 28 Monomethylation, half cystines, and monohydroxylation were elevated on CZE-MS/MS, while on RPLC-MS/MS, detection of monoacetylated and trimethylated proteoforms was enhanced. PTM observation frequencies at the proteoform spectral match level followed the same trends in observation biases (Table S4). In summary, these observations substantiate the benefit of combining CZE- and RPLC-derived data for increasing the coverage of the proteoform discovery workflow.
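A two-by-two Chi-squared test of the kind described above can be sketched in Python as follows. The table layout (rows: CZE and RPLC; columns: the PTM of interest and all other PTMs) follows the description in the text, but the counts are invented for illustration, and the sketch applies no continuity correction, which may differ from the procedure in reference 28.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction. For 1 d.o.f. the chi-squared survival
    function equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, erfc(sqrt(stat / 2))


if __name__ == "__main__":
    # Hypothetical counts: [PTM of interest, all other PTMs] per method.
    cze = (20, 10)
    rplc = (10, 20)
    stat, p = chi2_2x2(*cze, *rplc)
    print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

Because one test is run per PTM, a multiple-testing correction would normally be applied across the resulting p-values; the text does not say which one, if any, was used.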

Tissue-Specific Proteoforms and Handling of PTM Ambiguity
Uncertainty in the exact position of a PTM on a proteoform can arise in cases where SwissProt entries have many recorded modifications and amino acid variants and fragmentation data are too incomplete to assert an unambiguous level 1 proteoform. 29 This phenomenon is exemplified by cardiac troponin C (cTnC), which was identified in its canonical form (full length, N-terminal acetylated, PFR55232) as a level 1 proteoform (Figure 4A). Nine additional proteoforms had sufficiently high proteoform-level Q-scores to pass FDR cutoffs due to excellent sequence coverage in regions without modifications, and they were classified as level 3 proteoforms with some PTM site ambiguity (Figure 4A). cTnC is not an isolated example; the majority of proteoforms identified in this study are either chemically modified or bear a sequence variant, as only 33% are unmodified (Figure 4B). While filtering by C-score can help triage level 3 proteoforms for which PTM localization is ambiguous, the C-score does not help in cases where there is only one possible site of modification. 30 To curate a core set of proteoforms uniquely expressed in the five individual tissue types, we implemented a conservative process to select those proteoforms with PTMs with direct fragment ion support (level 1 proteoforms 29). To this end, the number of matching fragment ions that bear a PTM (or amino acid variant) was counted for each proteoform spectral match (PrSM). While many mutated and modified proteoforms have supporting fragment ions (level 1), a disproportionate number of modified proteoforms were level 3 with two or fewer (Figure 4C, D). Consequently, the requirement of having >3 supporting fragment ions for modified proteoforms was added in addition to a C-Score >30. This process culled the set of 8784 unique proteoforms in Table 1 down to 2843 level 1 tissue-specific proteoforms (Figure 4E, Supplementary Data 1).
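The two thresholds described above (C-score >30, and >3 PTM-bearing fragment ions for modified proteoforms) can be expressed as a simple filter. The Python sketch below is illustrative only: the record fields and names are hypothetical, and applying the C-score cutoff to unmodified proteoforms as well is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class Prsm:
    """Hypothetical summary of one proteoform spectral match."""
    accession: str
    c_score: float
    is_modified: bool               # bears a PTM or amino acid variant
    ptm_supporting_fragments: int   # matched fragment ions that carry the PTM/variant


def passes_filter(prsm, min_c_score=30.0, min_fragments=3):
    """Apply the thresholds from the text: C-score > 30 and, for modified
    proteoforms, more than 3 supporting fragment ions."""
    if prsm.c_score <= min_c_score:
        return False
    if prsm.is_modified and prsm.ptm_supporting_fragments <= min_fragments:
        return False
    return True


if __name__ == "__main__":
    # Made-up records, not identifications from this study.
    hits = [
        Prsm("example-1", 120.0, False, 0),   # unmodified, high C-score
        Prsm("example-2", 45.0, True, 2),     # modified, too few PTM fragments
        Prsm("example-3", 60.0, True, 5),     # modified, well supported
        Prsm("example-4", 12.0, True, 8),     # low C-score
    ]
    print([h.accession for h in hits if passes_filter(h)])
```

Both comparisons are strict, matching the ">" signs in the text, so a modified proteoform with exactly three supporting fragment ions is rejected.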
More level 1 tissue-specific proteoforms were identified in a Subsequence search (previously called BioMarker search, which identifies portions of full-length proteoforms 31,32) than in Absolute Mass searches. Specifically, 2,548 proteoforms were identified in Subsequence searching compared to 295 proteoforms identified in Absolute Mass searches. Subsequence searches identify proteolytic fragments that often arise from endogenous proteolytic events and can serve as significant biomarkers. 21 While a portion of these proteoforms may be the product of non-specific proteolysis, the consensus sequence of cleavage sites varied across tissues (Figure S4). Truncated proteoforms from the heart, kidney, and small intestine showed enrichment of F, Y, W, and L at P1, which suggests chymotrypsin activity. Spleen proteoforms demonstrated enrichment of hydrophobic residues but no apparent sequence specificity. This lack of specificity combined with a high proteoform-to-protein ratio agrees well with the role of the spleen in scavenging senescent blood cells. 33 Lung proteoforms had a higher propensity for cysteine at P1, which is not commonly observed for specific proteases. This enrichment was driven by 24 of the 715 lung-specific proteoforms with N-terminal cleavage. Nine of these 24 proteoforms originate from collapsin response mediator protein 2 (CRMP-2, Q16555), with cleavage occurring at C439 (Figure S5).
CRMP-2 has largely been studied in the context of neurological diseases due to its role in microtubule assembly and axon growth. 34 Indeed, C-terminal truncation of CRMP-2 has been linked to neurodegeneration, 35 and the cleavage site was later localized to S517. 36 As the function of CRMP-2 in lung tissue has only recently begun to be characterized, 37 this novel truncation at C439 may assist in elucidating its role.
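The P1 analysis discussed above amounts to looking up the residue immediately N-terminal to each cleavage site and tallying residues per tissue. The Python sketch below shows that bookkeeping for N-terminally truncated proteoforms; the function names, the 1-based start-position convention, and the toy sequence are assumptions for illustration and are not taken from the authors' code.

```python
from collections import Counter


def p1_residue(protein_sequence, proteoform_start):
    """Residue at P1 for an N-terminally truncated proteoform.

    proteoform_start is the 1-based position of the proteoform's first
    residue in the full-length protein, so P1 is the residue just before it.
    Returns None when the proteoform begins at the protein N-terminus.
    """
    if not 1 <= proteoform_start <= len(protein_sequence):
        raise ValueError("start position outside the protein sequence")
    if proteoform_start == 1:
        return None
    return protein_sequence[proteoform_start - 2]


def p1_counts(truncations):
    """Tally P1 residues over (protein_sequence, proteoform_start) pairs."""
    counts = Counter()
    for sequence, start in truncations:
        residue = p1_residue(sequence, start)
        if residue is not None:
            counts[residue] += 1
    return counts


if __name__ == "__main__":
    toy = "MKTAYIAKQRCLW"  # made-up sequence, not a protein from this study
    # Proteoforms starting at positions 5, 6 and 12 (P1 = A, Y, C).
    print(p1_counts([(toy, 5), (toy, 6), (toy, 12)]))
```

Comparing such tallies against the background amino acid frequency of the parent proteins would be needed before calling a residue enriched.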
Subsequence searching also identified a proteolytic cleavage site in CDGSH iron-sulfur domain-containing protein 1 (mitoNEET, Q9NZ45) at L47 (Figure S6). MitoNEET is a mitochondrial outer-membrane protein that was initially discovered as an off-target interactor of the PPAR-γ agonist pioglitazone. 38 With its iron-sulfur cluster oriented toward the cytosol, mitoNEET acts as a redox sensor and regulator of mitochondrial iron. 39-41 Downregulation of mitoNEET has been associated with aging and increased risk of heart failure. 42 The canonical proteoform of mitoNEET was observed in both small intestine and heart tissue, while both proteolytic products were observed solely in heart tissue (Figure S6). Cleavage at L47 does not disrupt the iron-sulfur cluster binding site but does separate this reactive center from the protein's transmembrane domain. Thus, proteolytic cleavage may act as a means of regulating mitoNEET or a mechanism by which full-length mitoNEET abundance declines in aging cardiomyocytes.

Unique Proteoforms Are Reflective of Tissue Central Function
Many of the tissue-specific proteoforms originate from genes involved in the core function of these tissues, as indicated by gene ontology enrichment (Figure 2E, Figure S7). The Subsequence proteoform search identified a series of proteoforms associated with defensins with distinct expression patterns (Figure 4F, Figure S8). Defensins are a family of small cationic host defense proteins characterized by three conserved intramolecular disulfide bonds. 43 Six human alpha-defensins have been identified to date and are subdivided into human neutrophil peptides 1 to 4 (HNP1-4) and human (enteric) defensins (HD5-6). HNPs are stored as mature peptides in granules of neutrophils and released upon activation by exocytosis. 44 HNP1 (PFR69106) was identified in both lung and spleen tissue as expected for tissues with high neutrophil content. HNP2 (PFR69109), HNP3 (PFR69079), HNP4 (PFR65983), and truncation products of HNP2 (PFR165182 and PFR165183) were observed exclusively in spleen tissue. No beta-defensin proteoforms were identified. HD5 and HD6 are produced in Paneth cells at the base of small intestinal crypts. 45 Accordingly, HD5 and HD6 were detected exclusively in small intestinal tissue. Unlike other defensins, HD5 is stored as a propeptide, and the fully mature peptides are thought to be produced by intracellular trypsin. 46 Consequently, the HD5 propeptide (PFR165815) and several truncated products were observed. Several of these truncated proteoforms (PFR5737351, PFR97759, and PFR97755) correspond to trypsin cleavage sites (R25, R55, and R62), while others (PFR5741069, PFR5737454, and PFR5737363) seem to correspond to other mechanisms of cleavage considering the residues at the P1 positions (D41, F46, and A61). Despite reducing samples with DTT prior to analysis, several proteoforms were observed with disulfide bridges intact (PFR4919881, PFR4919882, and PFR5026622).
The disulfide linkages in these proteoforms are inconsistent with the canonical model of alpha-defensins that includes end-to-end disulfides (Figure 4G). Although these non-canonical disulfides might be biologically relevant, spontaneous reformation of disulfides in denatured samples is likely. Defensins are important components of host innate immunity, so observing new proteoforms on mucosal surfaces is important for understanding their regulation and for the design of therapeutic mimetics. 47,48 Furthermore, these findings showcase the capability of the presented setup to address tissue-specific proteoform-related questions.
Glutathione S-transferases are a family of proteins involved in inflammation and the cellular defense against toxic and carcinogenic compounds. 49,50 Proteoforms from this protein family were broadly observed but with distinct tissue distributions (Figure S9). Glutathione S-transferase A1 (P08263) and A2 (P09210) were observed primarily in the small intestine and kidney, respectively.
The polymorphism E210A (rs6577) was observed in a single kidney sample (Biorep 3), which was derived from a 53-year-old African American male (Table S1). This coding SNP occurs with much higher frequency in African Americans (56.5%) compared to the global population (9.9%). 51 Microsomal glutathione S-transferase (MGST) 1, 2, and 3 were observed in the small intestine and lung (1), small intestine and kidney (2), and heart tissue (3), respectively (Figure S9C & D). These glutathione transferases are polytopic membrane proteins located in the endoplasmic reticulum membrane with both glutathione conjugation and peroxidase activity. 52, 53 A novel MGST3 proteoform (PFR5719232) that lacks the C-terminal cysteine necessary for S-palmitoylation was the predominant form observed in heart tissue. 54 Enrichment of functionally relevant genes from the identified proteoforms was particularly notable for heart tissue, with terms associated with ATP synthesis and muscle contraction leading the list (Figure 2E). Six proteoforms of cardiac phospholamban (PLN), a key regulator of cardiac contraction via inhibition of the sarcoplasmic reticulum calcium pump (SERCA), were identified by RPLC-MS/MS (Figure 5A). 55 While unmodified PLN and palmitoylated PLN have both been reported previously, 56 this study is the first report of phosphorylated PLN and combined phosphorylation and palmitoylation. Phosphorylation and palmitoylation of PLN have both been shown to affect localization, complexation, and inhibition of SERCA, so accurate measurement of their combination will help clarify PLN's role in health and disease. 57 We also present evidence for phosphorylation at ~30% stoichiometry of ventricular myosin regulatory light chain (RLCV). Prior reports by the Ge group have established N-terminal trimethylation of RLCV and phosphorylation of swine RLCV, but phosphorylation of human RLCV was unlocalized and observed at <10% stoichiometry. 58,59
The removal of N-terminal methionine and trimethylation was confirmed by tandem HCD fragmentation, and the site of phosphorylation was localized to S15, which is analogous to the site identified on swine RLCV (Figure 5B). On a last analytical note, phosphoproteoforms of cardiac troponin I (cTnI) 60 were not separated by RPLC but were baseline-resolved by CZE (Figure 5C); proteoform quantitation by both techniques showed <10% coefficient of variation between them. The better separation achieved by CZE should translate into better on-the-fly sequence coverage and proteoform characterization with tandem MS scan speeds.
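The two quantities quoted above, phosphorylation stoichiometry and the coefficient of variation between techniques, can be illustrated with a small Python sketch. The formulas are common conventions (phosphorylated signal over total signal; sample standard deviation over the mean) and are assumed here rather than taken from the authors' code; the intensities are invented.

```python
from statistics import mean, stdev


def stoichiometry(modified_intensity, unmodified_intensity):
    """Fraction of the protein carrying the modification (0 to 1)."""
    total = modified_intensity + unmodified_intensity
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return modified_intensity / total


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; CV undefined")
    return 100.0 * stdev(values) / m


if __name__ == "__main__":
    # Hypothetical intensities for phosphorylated and unmodified proteoforms.
    by_rplc = stoichiometry(3.0e6, 7.0e6)
    by_cze = stoichiometry(3.3e6, 6.7e6)
    print(f"RPLC: {by_rplc:.2f}, CZE: {by_cze:.2f}")
    print(f"CV between techniques: {coefficient_of_variation([by_rplc, by_cze]):.1f}%")
```

With only two measurements the coefficient of variation is a coarse summary, so it is best read as a rough agreement check between the two separations.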

Conclusions
We have described the combination of TDP data collected with online separation by RPLC and CZE to expand the depth of human proteome coverage. All proteomics methods face the challenge of measuring low-abundance analytes, so robust approaches that introduce new proteoform selectivity are highly sought after. RPLC and CZE were shown to possess differential proteoform selectivity that manifests as different physicochemical properties and PTM profiles. In a TDP study of five human tissues, we dramatically expanded the number of proteoforms associated with these tissues by combining the two methods.
Confident assignment of proteoforms bearing PTMs or sequence variations becomes more challenging as query proteoforms get larger and the search databases contain more candidate PTM sites. Unambiguous level 1 proteoform assignments are particularly troublesome when seeking proteoforms specific to a particular biological context (e.g., tissue types), but this can be significantly mitigated with the inclusion of fragment-ion data quality standards. Even at current levels of proteoform characterization quality, organ-specific proteoforms achieve robust tissue type identification.
The genes from the tissue-specific proteoforms identified in this study were tied to the core function of the tissues as broadly indicated by gene ontology (GO) analysis. This is further supported by specific examples such as proteins that regulate muscle contractility (PLN, RLCV, cardiac troponins), host-pathogen interaction (defensins), cytoskeletal reorganization (CRMP-2), and metabolic detoxification (family of glutathione transferases). In many cases, these unique proteoforms were detected with only one of the upfront separation methods. Thus, proper exploration of our hypothesis that proteoform-level measurements more fully capture biological context than protein-level measurements requires an increased depth of proteome coverage.
ASSOCIATED CONTENT Supporting Information.
The following files are available free of charge.

Additional experimental results and figures (PDF)
List of tissue-specific proteoforms identified in this study (XLSX)

Table footnotes: b The term 'protein' refers to the SwissProt entry mapping to a single human gene. c Unique identifications refer to proteins or proteoforms that were only identified in the tissue type indicated. d Proteins and proteoforms that were observed in more than one human tissue type are counted once in non-redundant totals.

Figure caption fragment: Sequential filtering of proteoforms to identify high-confidence tissue-specific proteoforms.