Background: The olfactory stimulus-percept problem has been studied for more than a century, yet it is still hard to precisely predict the odor given the large-scale chemoinformatic features of an odorant molecule. A major challenge is that the perceived qualities vary greatly among individuals due to different genetic and cultural backgrounds. Moreover, the combinatorial interactions between multiple odorant receptors and diverse molecules significantly complicate the olfaction prediction. Many attempts have been made to establish structure-odor relationships for intensity and pleasantness, but no models are available to predict the personalized multi-odor attributes of molecules. In this study, we describe our winning algorithm for predicting individual and population perceptual responses to various odorants in the DREAM Olfaction Prediction Challenge.

Results: We find that a random forest model consisting of multiple decision trees is well suited to this prediction problem, given the large feature spaces and high variability of perceptual ratings among individuals. Integrating both population and individual perceptions into our model effectively reduces the influence of noise and outliers. By analyzing the importance of each chemical feature, we find that a small set of low- and nondegenerative features is sufficient for accurate prediction.

Conclusions: Our random forest model successfully predicts personalized odor attributes of structurally diverse molecules. This model together with the top discriminative features has the potential to extend our understanding of olfactory perception mechanisms and provide an alternative for rational odorant design.

Keywords: olfactory perception; structure-odor relationships; random forest; chemoinformatics

Received: 12 April 2017; Revised: 6 November 2017; Accepted: 7 December 2017

The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4750780 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Li et al. Prediction of olfactory perception

Background

Olfactory perception is the sense of smell in the presence of odorants. The odorants bind to and activate olfactory receptors (ORs), which transmit the signal of odor to the brain. The existence of a large family of olfactory receptors enables humans to perceive an enormous variety of odorants with distinct sensory attributes. An olfactory receptor can respond to multiple odor molecules; conversely, an odorant may interact with many olfactory receptors with different affinities [2]. Unlike the well-defined wavelength of light in vision and frequency of sound in hearing, the size and dimensionality of the olfactory perceptual space is still unknown. It is not clear how the numerous physicochemical properties of a molecule relate to its odor, or how mammals process and detect the broad range of the olfactory spectrum. Some structurally similar compounds display distinct odor profiles, whereas some dissimilar molecules exhibit almost the same smell [4–6]. Even for an identical molecule, the perceived quality varies immensely between individuals due to genetic variation. Therefore, accurate prediction of personalized olfactory perception from the chemical features of a molecule is highly challenging.

In the past, many attempts have been made to establish structure-odor relationships and predict the odor from the physicochemical properties of a molecule. An early study showed that volatile and lipophilic molecules fulfill the requirements to be odorants. The correlation of odor intensities with different structural, topological, and electronic descriptors was calculated for 58 different odorants; molecular weight, partial charge on most negative atoms, quantum chemical polarity parameter, average distance sum connectivity, and a measure of the degree of unsaturation were particularly important descriptors. Multidimensional scaling and self-organizing maps were used to produce 2-dimensional maps of the Euclidean approximation of olfactory perception space. A principal component analysis identified the latent variables in a semantic odor profile database of 881 perfume materials with semantic profiles of 82 odor descriptors and classified odors into 17 different classes. Although it is not possible to predict the odor profile of a molecule, some progress has been achieved for predicting the intensity and pleasantness of an odorant. Methods for predicting the perceived pleasantness of an odorant have utilized the most correlated physical features of molecular complexity and molecular size [15, 16]. A major challenge is that different individuals perceive odorants with different sets of odorant receptors [17, 18], and perception is also strongly shaped by learning and experience. Different cultures have different linguistic descriptions of smells [20–22], so generating olfaction datasets is tedious work. Many computational methods have been developed to relate chemical structure to percept [4, 10, 15, 16, 23–26], but most of them are based on single and very old psychophysical datasets. Therefore, a rigorous quantitative structure-activity relationship (QSAR) model [28, 29] of personalized olfactory perception is needed for accurate predictions.

The Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the olfaction prediction challenge. DREAM is a leader in organizing crowdsourcing challenges to evaluate model predictions and algorithms in systems biology and medicine. Here we describe our winning algorithm, which was the best performer in subchallenge 1 (predicting individual responses) and the second-best performer in subchallenge 2 (predicting population responses). As olfactory perception is inherently a complex nonlinear process, decision tree–based algorithms are well suited to this problem. In particular, a random forest (RF) consisting of multiple decision trees addresses the overfitting issue when the feature space is much larger than the sample space. Moreover, random forest is relatively robust to noise and outliers, especially when a large variability of individual perceptual responses is observed. To further reduce the effects of large variability, noise, and outliers, we integrated the average rating of individuals (population response) into our model. Our final model succeeds in predicting olfactory perception using only a small set of chemical features. These features are likely to be low- and nondegenerative molecular descriptors, indicating that traditional simple descriptors like functional groups are less effective in distinguishing the odor profiles of structurally similar molecules. Meanwhile, our model potentially provides useful insights on the basic molecular mechanisms of olfactory perception. Together with new scaffolds of odorants observed in the dataset and top discriminative chemoinformatic features, our model offers an alternative for rational odorant design.

Data Description

Psychophysical dataset

The DREAM organizers provided psychophysical data that were originally collected between February 2013 and July 2014 as part of the Rockefeller University Smell Study. The data were collected from 61 ethnically diverse healthy men and women between the ages of 18 and 50. These subjects volunteered and gave their written informed consent to smell the stimuli used in this study. They were naïve and did not receive any kind of olfaction training. In the DREAM olfaction prediction challenge, the data of only 49 subjects were provided because some subjects did not give permission to use their data. The perceptual ratings of 476 different molecules were assigned by these 49 subjects at 2 different concentrations (high and low); in addition, 20 molecules were tested twice. Each subject rated the perception of 992 stimuli (476 plus 20 replicated molecules at 2 different concentrations). Twenty-one perceptual attributes (intensity, pleasantness, and 19 semantic attributes) were used to describe the odor profile of a molecule. The semantic attributes are bakery, sweet, fruit, fish, garlic, spices, cold, sour, burnt, acid, warm, musky, sweaty, ammonia/urinous, decayed, wood, grass, flower, and chemical. Subjects used a scale from 0 to 100 where 0 is “extremely weak” and 100 is “extremely strong” for intensity; 0 is “extremely unpleasant” and 100 is “extremely pleasant” for pleasantness; and 0 is “not at all” and 100 is “very much” for semantic attributes. This dataset of 476 chemicals was divided into 3 subsets by the organizers: 338 for the training set, 69 for the leaderboard, and 69 for the test set. We combined the 338 training and 69 leaderboard molecules (407 molecules in total) as our final training set.

Chemoinformatic features of molecules

A total of 476 structurally diverse odorant molecules were used in this study, including 249 cyclic molecules, 52 organosulfur molecules, and 165 ester molecules (Supplementary Fig. S1). The participating investigators were encouraged to use any kind of chemical and physical properties of the molecules for developing prediction models. By default, the organizers provided 4884 different chemical features for each of the 476 molecules, calculated by a commercial chemoinformatics software package known as Dragon (version 6). Features were divided into 29 different logical molecular descriptor blocks including constitutional descriptors, topological indices, 2D autocorrelations, etc. These chemoinformatic features are useful in establishing structure-odor relationships and further developing machine learning prediction models. The compound identification number (CID) for each molecule was also provided so participating investigators could obtain more information about the molecules from other resources (e.g., PubChem).

Results

The overall workflow of the olfaction prediction is shown in Fig. 1. The organizers provided an unpublished large psychophysical dataset of 476 structurally and perceptually diverse molecules sensed by 49 different individuals. Twenty-one perceptual attributes were collected, including odor intensity and pleasantness and 19 semantic descriptors. A subset of 407 molecules (338 training and 69 leaderboard molecules) was used as the final training set in our random forest model, and the other 69 held-out molecules formed the test set. The organizers provided the Dragon software–based large-scale molecular descriptors, containing 4884 chemical features for each molecule. Models were evaluated based on the Pearson’s correlation between the observed and predicted perceptions.

Figure 1: The overview of the olfaction prediction. The observed perceptions form a 3-dimensional array, where the 3 dimensions are 476 molecules, 49 individuals, and 21 olfactory attributes. The input chemoinformatic features form a 2-dimensional matrix, where the rows are 476 molecules and columns are 4884 molecular descriptors. Our random forest model is built on the training set (407 molecules), and the individual responses for the test set (69 molecules) are predicted. The final evaluation is based on the Pearson’s correlation between observed and predicted perceptions.

Variability of olfactory perception among individuals

The intensity perceptions of 476 molecules at high and low concentrations vary tremendously among individuals. For example, individuals 10, 29, and 46 exhibit entirely different perceptual profiles for intensity (Fig. 2). Ideally the perceptual rating for intensity should increase as the measuring concentration rises (blue lines in Fig. 2A), while it is commonly observed that the intensity rating of some molecules decreases (red lines in Fig. 2A and Supplementary Fig. S2). In fact, the 49 subjects lacked any kind of professional training, and they were biased in assigning the perceptual rating value between 0 and 100 (Fig. 2B and Supplementary Fig. S3). Except for molecules without odor rated near 0, individual 10 tended to assign ratings uniformly, whereas individual 29 preferred to rate at 100 and individual 46 was inclined to rate around 50.

In addition to intensity, the other perceived attributes were rated differently among individuals (Fig. 2C). Even for the same molecule, cyclopentanethiol (CID: 15510), 16 subjects did not apply the descriptor “garlic,” whereas 9 subjects rated it 100. Similarly, 2-acetylpyridine (CID: 14286) showed great variability in “warm” ratings: about half of the subjects rated it 0, and the other half perceived “warm” from it. The large differences may result from the relative ambiguity of the word “warm” to describe odor. The variability of the 21 attribute ratings across all individuals, measured as the coefficient of variation, is shown in Fig. 2D. Compared with intensity and pleasantness, the 19 semantic qualities display much larger coefficients of variation. Thus, the diversity of perceptual ratings between subjects considerably complicates the prediction challenge.

Strategies for accurate personalized olfaction predictions

Considering the large variability of the perceived ratings, we propose that random forest could be an excellent choice as a base learner because it applies the strategy of training on different parts of the dataset and averaging multiple decision trees to reduce the variance and avoid overfitting. We compared different machine learning algorithms (linear, ridge, support vector regression, random forest) using 5-fold cross-validation and found that random forest outperforms other base learners in predicting individual responses for “intensity,” “pleasantness,” and 19 semantic descriptors (Fig. 3A). Given a small sample size of 407 training molecules, random forest identifies and utilizes the most discriminative features out of 4884 molecular descriptors to make decisions. Clearly, a simple linear regression model fails when the dimension of the feature space is too large. Therefore, random forest was selected as our base-learner and used in the follow-up improvements.

Figure 2: Variability of olfactory perception among individuals. (A) The intensity ratings for all molecules at low and high concentrations from individuals 10, 29, and 46. Blue lines represent the ideal cases, in which the rating values increase as the concentration becomes higher. Conversely, red lines represent decreased rating values at high concentration. (B) The density distributions of the intensity ratings from these 3 individuals. Blue lines are the fitting curves of the density distribution.
The intensity ratings and density distributions from all individuals are shown in Supplementary Figs S2 and S3, respectively. (C) The “garlic” and “warm” rating distributions among 49 individuals for cyclopentanethiol and 2-acetylpyridine, respectively. (D) The coefficients of variation of 21 perceptual attributes in increasing order.

Recognizing that the population responses (the average perceptions of all individuals) are more stable compared with individual responses, we overcome the variability of individual ratings by introducing a weighting factor α. This parameter serves as a balance between individual and population ratings. When α equals 0, only population ratings are considered. Conversely, when α equals 1, only individual ratings are used (see the “Methods”). Surprisingly, a small α = 0.2 achieves the largest Pearson’s correlation coefficient (Fig. 3B). Without population information (α = 1.0), the correlation of predicting the 19 semantic descriptors is the lowest. This reveals that population perceptions play a crucial role when individual responses display large fluctuations.

To further improve the performance, we applied a sliding window of 4-letter size to each molecule name, generating a total of 11 786 binary name features (see the “Methods”). These name features are very similar to the molecular fingerprints, providing extra information about the similarities of molecules. The 4-letter window is selected to efficiently capture the chemical similarity. Larger sliding window sizes greatly increase computational cost, as the size of the feature space is an exponential function of window size. Although the random forest model using only the name features has relatively low correlations, the ensemble model aggregating both molecular and name features performs the best, ranking first in predicting the 21 personalized perceptual attributes (Fig. 3C and Supplementary Fig. S4).

Figure 3: The performance of different models and strategies. From left to right, the Pearson’s correlation coefficients of intensity and pleasantness and 19 semantic descriptors from 5-fold cross-validations are shown as a boxplot. The red base-learners or strategies are used in our final model. (A) The performance of 4 different base-learners: linear, ridge, SVM, and random forest. (B) The performance of using different values of weighting factor α. (C) The performance of using molecular features alone, name features alone, and both molecular and name features.

Discriminative chemoinformatic features for olfaction prediction

Our random forest model evaluates the importance of each molecular descriptor in prediction. It is well known that sulfur-containing organic molecules tend to have a “garlic” odor, whereas esters often smell “fruity.” Many models have been built to correlate molecular size and complexity with the “pleasantness” of a compound. However, it remains unclear what chemical features of a molecule decide its multiple odor attributes. Random forest enables us to estimate the importance of each chemical feature by permuting the values of a feature across samples and computing the increase in prediction error. We calculate the increased delta error of each chemical feature for all 21 olfactory qualities. Interestingly, top-ranking features used by random forest do not necessarily have high linear Pearson’s correlations with observed ratings. For example, the correlation coefficients of the top 5 features for “decayed” prediction are listed in Supplementary Table S1. The molecular feature P_VSA_m_4 (interpreted as the presence of sulfur atoms) ranks first and has the largest correlation. However, the second and third features in random forest have nearly no correlations with observed perceptions. Upon inspecting the top 5 features that have the largest correlation values (yellow columns in Supplementary Table S1), we notice that they are all related to sulfur atoms, leading to high redundancy and intercorrelation.

Figure 4: Top discriminative features used in random forest. (A) The word cloud of top 5 features used in predicting 21 perceptual attributes. (B) The pie chart of molecular descriptor categories in the top 5 features. (C) Projection of all molecules onto selected discriminative feature spaces. The color of each spot represents the relative strength of the perceived rating, averaged among 49 individuals. The dashed lines display the possible decision boundaries created by random forest.

By analyzing all the top 5 features ranked by the delta error, we find that discriminative molecular features are more likely to be low- or nondegenerative. The complete lists of top 20 features ranked by delta error or Pearson’s correlation are shown in Supplementary Tables S2 and S3, respectively. Interestingly, simple chemical features (molecular weight, number of sulfur atoms, presence of a functional group, etc.) are not very powerful in prediction because they display high degeneracy: different molecules may have identical or similar values [36–38]. Features with low- or nondegeneracy more likely play an essential role in our random forest model. The frequency of all top 5 molecular features is represented as the size of words in Fig. 4A. We find that autocorrelation of a topological structure (ATS) and 3D-MoRSE descriptors occur 24 and 23 times, respectively, whereas simple descriptors such as N% (percentage of N atoms), nRCOOR (number of aliphatic esters), and NssO (number of ssO atoms) are used only once (Fig. 4B).

To understand how the random forest model works, we projected all molecules onto selected important feature spaces (Fig. 4C). The color of each molecule represents the strength of the perceived rating. For example, ATS1s and ATS2s are the most important features in predicting the “intensity” of a molecule. They can be interpreted as the combined information of molecule size and the intrinsic state of all atoms. Molecules with large ATS1s and ATS2s values tend to have low intensity (top right green spots in Fig. 4C, left panel). Another example is the “pleasantness” rating of a molecule, for which SssO (presence of ester or ether) and P_VSA_i_1 (presence of sulfur or iodine atom) are crucial. Clearly, molecules containing sulfur or iodine atoms have lower “pleasantness” values (green spots above the dashed line in Fig. 4C, middle panel). It is also widely known that esters have a characteristic pleasant odor and lower ethers can act as anesthetics, whereas the presence of a sulfur atom leads to unpleasant “garlic” and “decayed” odors. Accordingly, key features of “garlic” odor include MAXDN (presence of ketone or ester) and R3p+ (presence of sulfur atom). Molecules containing sulfur atoms are more likely to be “garlicky,” whereas ketones and esters seldom have such smells (red spots above the dashed line in Fig. 4C, right panel).

Rebuilding the random forest model with the top 5, 10, 15, or 20 key features, we find that a small set of chemical features is sufficient for accurate prediction. These top features selected by random forest may have very low linear Pearson’s correlations with perceived qualities, yet they are powerful in discriminating different odorants. This is because the relationship between molecular features and olfactory perception is inherently nonlinear. Intriguingly, random forest with only the top 5 features achieves performance similar to that of random forest with all 4884 features for almost all olfactory qualities (Fig. 5A and Supplementary Figure S5). The only exception is “intensity,” for which the top 15 features are adequate. This result indicates that a small set of chemical features is often sufficient to predict the odor of a molecule. We also test the performance of random forest using features ranked by Pearson’s correlation (Fig. 5B). The predicting power of these features is lower due to collinearity and redundancy. For example, the top 10 features for “garlic” quality are all related to the number of sulfur atoms, although they display very high correlation values (Supplementary Table S3).

Figure 5: The performance of random forest using top features. From left to right, the Pearson’s correlation coefficients of intensity and pleasantness and 19 semantic descriptors from 5-fold cross-validations are shown as boxplots. The red model is the random forest using all chemoinformatic features. (A) The performance of random forest using the top 5, 10, 15, and 20 features ranked by delta error. (B) The performance of random forest using the top 5, 10, 15, and 20 features ranked by Pearson’s correlation.

Deciphering the divergent multi-odor profiles of structural analogs

Structurally similar compounds with distinct odor profiles were observed in triads and a tetrad of molecules. The first example is 3 furoate esters (Fig. 6A). If we compare the functional groups of these 3 molecules, methyl 2-furoate (1) and ethyl 2-furoate (2) are more similar, while allyl 2-furoate (3) has a unique alkenyl group. Intriguingly, the pairwise correlations between them across 21 perceived olfactory qualities reveal that 2 is the odd one in terms of odor. This is clearly shown in the radar chart of selected odor qualities. Compound 2 has intense “sweet,” “acid,” and “urinous” characters, whereas 1 and 3 display more “decayed” odor. The second triad of molecules comprises common L-amino acids: alanine (4), leucine (5), and valine (6) (Fig. 6B). Their odor profiles differ considerably, especially between 4 and 5; 4 has “fruit,” “sweet,” and “flower” odors, whereas 5 is characterized by “sour,” “decayed,” “sweaty,” and “intensity”; 6 has an odor profile relatively similar to that of 4. The odd one in structural terms is 4 as it has the smallest side chain, whereas the odd one in terms of odor is 5. The last group of molecules includes thiazole (7) and its derivatives (8–10) (Fig. 6C); 9 stands out because of its signature “grass” odor, whereas both 9 and 10 have a very high “chemical” odor.

Our random forest model distinguishes the multi-odor profiles of structural analogs using complex molecular features. Although these analogs are extremely similar in terms of chemical structure and functional group, the values of their 2- and 3-dimensional molecular descriptors are distinct. The average rating of each molecule is represented by its color, and the structural analogs mentioned above are shown in a larger size (Fig. 6, right panels). The top features used by our random forest model clearly separate the structurally similar molecules with dissimilar odor attributes. For example, 10, with a strong “grass” odor (top right orange diamond in Fig. 6C, the 3rd panel), has large SaaS and L3m values, whereas 7 and 8, with a weak “grass” odor (bottom-right green circle and triangle), have relatively small values; 9 (middle right yellow square), with medium “grass” odor, has around average values among the tetrad.

Figure 6: Distinguishing different odor profiles of structurally similar molecules by random forest. The odor profiles of (A) 3 furoate esters, (B) 3 amino acids, and (C) 4 thiazole derivatives. The left panel shows the pairwise correlations between structurally similar molecules. The color of each edge represents the correlation value across 21 perceptual attributes. The middle panel shows the radar charts of selected odor attributes. The symbol and color correspond to the molecule on the left. The right panel displays the projections of all molecules onto selected discriminative feature spaces. The color of each spot represents the relative strength of the perceived rating, averaged among 49 individuals. The larger symbols correspond to the molecules on the left.

Discussion

The complex and sophisticated signaling of diverse odorants has fascinated scientists for many decades, yet the molecular mechanisms of olfactory perception are still not fully understood. One odorant interacts with a broad range of olfactory receptors, and each olfactory receptor recognizes multiple odorants, leading to the complicated tuning of olfactory perception [39, 40]. In addition, neuron firing is intrinsically nonlinear in nature, requiring the membrane potential to be raised above threshold. Therefore, a nonlinear random forest model is well suited to the olfactory prediction and avoids overfitting, given a comparatively small sample size and much larger feature spaces. Moreover, random forest is relatively robust to label noise and outliers, considering the vast variability of odor ratings among individuals.

The linguistic descriptions of smells vary among individuals, especially when they lack experience and training. This finding suggests that using a low-variant dataset of odorants rated by professional perfumers may further improve the performance of predictive models. Besides, using semantic descriptors itself introduces biases, and alternative approaches such as perceptual similarity rating of odorants should be considered. Recognizing that extra Morgan-NSPDK features created by matching target molecules against reference odorants increase the predicting performance, a larger training set of diverse molecules, including natural odorant products, will be helpful to build more accurate models.

Our random forest model potentially provides an alternative for rational odorant design [42, 43]. In addition to modifications of a natural odorant product, the perceptual dataset used in this study consists of many untested molecules, providing new odorant scaffolds of different semantic qualities. Moreover, a small set of top-ranking features estimated by the random forest model is sufficient to accurately predict human olfactory perception, largely reducing the input feature spaces. This model is potentially useful for evaluation of new molecules, and modification of these discriminative features provides an alternative for rational odorant design. Like the association of functional groups with certain odors, this study may link complex chemoinformatic features to a broader range of odors, providing a useful perspective for understanding olfactory perception mechanisms.

Methods

Random forest model

Random forest is an ensemble learning algorithm for regression and classification [32]. In a random forest, each decision tree is built from a bootstrap sample (random sampling with replacement). Furthermore, a random set of features is used to determine the best split at each node during the construction of a tree. As a result of averaging many trees (100 are used in our model), overfitting is avoided and the effects of outliers and noise are reduced.

We use perceptual ratings as targets and chemical descriptors as features to train random forest models. For the combination of 49 individuals and 21 perceptual attributes, we have 1029 (49 × 21) models in total. We further consider the average ratings among 49 individuals as a population rating and combine it with the individual rating as prediction targets to train our models (see the “Integrating individual and population ratings” section below).

Preprocessing of the dataset

There were many cases in which subjects indicated that they smelled nothing, so the intensity rating was automatically set to “0” and the ratings for other perceptual attributes were left blank (NaN); therefore, we removed all the “NaN” entries. For the intensity attribute, we used the target values at “1/1000” dilution. For pleasantness and 19 semantic attributes, we used target values at “high” concentration as a set of examples, and the average value at both “high” and “low” concentrations as another set of examples. As the original number of samples (407) is relatively small, combining high and low concentrations doubles the sample size, and this step is crucial to achieve high performance. There are 20 replicated molecules, which were selected in the original Rockefeller University Smell Study. In general, the ratings were consistent between the 2 replicates. In our study, we treat replicates as separate examples as the fraction is small (20/407 = 0.049), which does not affect the results. The input molecular features were scaled to values between 0 and 1. The scaling formula is given as:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

where x is the original value and x' is the scaled value.

Selection of base-learner

To address the large variability of perceived odor qualities among individuals, we tried a range of different machine learning algorithms (linear, ridge, SVM with rbf kernel, and random forest with 100 trees) to find the best-performing base-learner. The regularization alpha in the ridge is 10. The penalty parameter C and coefficient gamma of the SVM rbf kernel are 1000 and 0.01, respectively. All other parameters are the default ones. We applied a 5-fold cross-validation to the training data (407 molecules) and evaluated the performance based on the correlations of the 21 perceptual attributes between the predicted and observed ratings. Random forest outperformed other base-learners and was used in the follow-up improvement of our model.

Integrating individual and population ratings

The perceptual rating of attributes varies greatly. To reduce the effects of noise and outliers, we introduce a weighting factor, α, as the weight for individual ratings and (1 − α) as the weight for population ratings. The reweighted target value y is given as:

$$ y = \alpha \times y_{\text{individual}} + (1 - \alpha) \times y_{\text{population}} $$

where $y_{\text{individual}}$ is the rating from an individual and $y_{\text{population}}$ is the average rating from 49 individuals. Different values of α were tested and evaluated by the correlation of 21 perceptual attributes. α = 0.2 had the best performance and was used in our final model.

Creating name features of molecules

In the past, sliding window-based (overlapping patterns) strategies were applied successfully to develop residue-level predictions [44, 45]. We used a sliding window of 4-letter size to extract features from the molecule names. For example, 4-letter indexing generated a total of 7 sliding windows from “acetic acid” (ACET, CETI, ETIC, “TIC ”, “IC A”, “ ACI”, ACID). We created 11 786 binary name features from all molecule names using this sliding window approach. If a window pattern is present in the molecule name, “1” was assigned to that feature; otherwise, “0” was used while creating input name features.

Evaluation of the importance of each feature by random forest

The importance of each feature was evaluated by permuting the values across observations and computing the increase in prediction error by random forest. The increased delta error of each chemical feature for all 21 olfactory attributes was calculated and ranked. Larger delta error implies that the feature is more important and discriminative in prediction.

Availability of supporting data

The DREAM olfaction challenge dataset, model details, and source code are available at https://github.com/Hongyang449/olfaction_prediction_manuscript.

Snapshots of the code, molecular descriptors, and the olfactory perception data and chemoinformatic features of odorant molecules are also available from the GigaScience database, GigaDB.

Additional files

Table S1. The top 5 features ranked by random forest delta error or Pearson’s correlation.
Table S2. The top 20 features of 21 perceptual attributes ranked by random forest delta error.
Table S3. The top 20 features of 21 perceptual attributes ranked by Pearson’s correlation.
Table S4. The Pearson’s and Spearman’s correlations of top 500 features ranked by Pearson’s correlation.
Figure S1. The 2D chemical structures of 476 odorant molecules.
Figure S2. The intensity ratings for all molecules at low and high concentrations from 49 individuals.
Figure S3. The density distributions of the intensity ratings for all molecules from 49 individuals.
Figure S4. The performance of random forest using chemical and name features.
Figure S5. The performance of random forest using top features ranked by delta error.

Competing interests

The authors declare that they have no competing interests.

Funding

This work is supported by National Science Foundation 1452656 and Alzheimer’s Association BAND-15-367116

14. Kermen F, Chakirian A, Sezille C et al. Molecular complexity determines the number of olfactory notes and the pleasantness of smells. Sci Rep 2011;1(1):206.
15. Zarzo M. Hedonic judgments of chemical compounds are correlated with molecular size. Sensors 2011;11(12):3667–86.
16. Khan RM, Luk C-H, Flinker A et al. Predicting odor pleasantness from odorant structure: pleasantness as a reflection of the physical world. J Neurosci 2007;27(37):10015–23.
17. Menashe I, Man O, Lancet D et al. Different noses for different people. Nat Genet 2003;34(2):143–4.
18. Keydar I, Ben-Asher E, Feldmesser E et al. General olfactory sensitivity database (GOSdb): candidate genes and their genomic variations. Hum Mutat 2013;34(1):32–41.
19. Perez M, Nowotny T, d’Ettorre P et al. Olfactory experience shapes the evaluation of odour similarity in ants: a behavioural and computational analysis. Proc R Soc B 2016;283(1837):20160551.
(Biomarkers Across Neurodegenerative Diseases Grant 2016). 20. Chrea C, Valentin D, Sulmont-Rosse´ C et al. Culture and odor categorization: agreement between cultures depends upon Author Contributions the odors. Food Qual Preference 2004;15(7–8):669–79. 21. Ayabe-Kanamura S, Schicker I, Laska M et al. Differences Y.G. conceived and designed the prediction algorithm. Y.G. and in perception of everyday odors: a Japanese-German cross- H.L. performed computational analysis of the observed and pre- cultural study. Chem Senses 1998;23(1):31–38. dicted data. H.L. analyzed the discriminative chemoinformatic 22. Levitan CA, Ren J, Woods AT et al. Cross-cultural color-odor features and prepared figures. H.L., B.P., G.O., and Y.G. con- associations. PLoS One 2014;9(7):e101651. tributed to the writing of the manuscript. All authors read and 23. Haddad R, Khan R, Takahashi YK et al. A metric for odorant approved the final manuscript. comparison. Nat Methods 2008;5(5):425–9. 24. Koulakov AA, Kolterman BE, Enikolopov AG et al. In search of the structure of human olfactory space. Front Syst Neurosci References 2011;5:65. 1. Gaillard I, Rouquier S, Giorgi D. Olfactory receptors. Cell Mol 25. Castro JB, Ramanathan A, Chennubhotla CS. Categori- Life Sci 2004;61(4):456–69. cal dimensions of human odor descriptor space revealed 2. Buck LB. Olfactory receptors and odor coding in mammals. by non-negative matrix factorization. PLoS One 2013;8(9): Nutr Rev 2004;62(11):184–188. e73289. 3. Read JCA. The place of human psychophysics in modern 26. Snitz K, Yablonka A, Weiss T et al. Predicting odor per- neuroscience. Neuroscience 2015;296:116–29. ceptual similarity from odor structure. PLoS Comput Biol 4. Sell CS. On the unpredictability of odor. Angew Chem Int Ed 2013;9(9):e1003184. 2006;45(38):6254–61. 27. Dravnieks A. Odor quality: semantically generated multidi- 5. Laska M, Teubner P. Olfactory discrimination ability for ho- mensional profiles are stable. 
Science 1982; 218(4574):799– mologous series of aliphatic alcohols and aldehydes. Chem Senses 1999;24(3):263–70. 28. Dudek AZ, Arodz T, Galv ´ ez J. Computational methods 6. Boesveldt S, Olsson MJ, Lundstrom ¨ JN. Carbon chain length in developing quantitative structure-activity relationships and the stimulus problem in olfaction. Behav Brain Res (QSAR): a review. CCHTS 2006;9(3):213–28. 2010;215(1):110–3. 29. Nantasenamat C, Isarankura-Na-Ayudhya C, Prachayasit- 7. Keller A, Zhuang H, Chi Q et al. Genetic variation in a tikul V. Advances in computational methods to predict the human odorant receptor alters odour perception. Nature biological activity of compounds. Expert Opin Drug Discov- 2007;449(7161):468–72. ery 2010;5(7):633–54. 8. Chastrette M. Trends in structure-odor relationship. SAR 30. Keller A, Gerkin RC, Guan Y et al. Predicting human olfac- QSAR Environ Res 1997;6(3–4):215–54. tory perception from chemical features of odor molecules. 9. Boelens H. Structure–activity relationships in chemorecep- Science 2017;355(6327):820–6. tion by human olfaction. Trends Pharmacol Sci 1983;4:421–6. 31. Saez-Rodriguez J, Costello JC, Friend SH et al. Crowdsourcing 10. Edwards PA, Jurs PC. Correlation of odor intensities biomedical research: leveraging communities as innovation with structural properties of odorants. Chem Senses engines. Nat Rev Genet 2016;17(8):470–86. 1989;14(2):281–91. 32. Breiman L. Machine Learning. 2001;45:5, https://doi.org/ 11. Mamlouk AM, Chee-Ruiter C, Hofmann UG et al. Quantifying 10.1023/A:1010933404324, Accessed 7 April, 2017. olfactory perception: mapping olfactory perception space by 33. Keller A, Vosshall LB. Olfactory perception of chemically di- using multidimensional scaling and self-organizing maps. verse molecules. BMC Neurosci 2016;17(1):55. Neurocomputing 2003;52:591–7. 34. Todeschini R, Consonni V, eds. Molecular Descriptors for 12. Zarzo M, Stanton DT. Identification of latent variables in a Chemoinformatics. 
Weinheim, Germany: Wiley-VCH Ver- semantic odor profile database using principal component lag GmbH & Co. KGaA; 2009. http://doi.wiley.com/10.1002/ analysis. Chem Senses 2006;31(8):713–24. 9783527628766, Accessed 22 March, 2017. 13. Mainland JD, Lundstrom ¨ JN, Reisert J et al. From molecule to 35. The complete list of molecular descriptors calculated mind: an integrative perspective on odor intensity. Trends by Dragon 6, http://www.talete.mi.it/products/dragon Neurosci 2014;37(8):443–54. molecular descriptor list.pdf, Accessed 7 April, 2017. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4750780 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Prediction of olfactory perception 11 36. Godden JW, Bajorath J. Shannon entropy–a novel concept 11–25, https://www.researchgate.net/profile/Fumiko Yoshii/ in molecular descriptor and diversity analysis. J Mol Graph publication/241197841 Structure-odor relations A modern Model 2000;18(1):73–6. perspective/links/02e7e52c762d501279000000.pdf, Accessed 37. Godden JW, Bajorath J. Chemical descriptors with distinct 19 January, 2018. levels of information content and varying sensitivity to dif- 43. Turin L. Chemistry and Technology of Flavors and Fra- ferences between selected compound databases identified grances. Rowe DJ, ed. Oxford, UK: Blackwell Publishing Ltd.; by SE-DSE analysis. J Chem Inf Comput Sci 2002;42(1):87–93. 2004. http://doi.wiley.com/10.1002/9781444305517, Accessed 38. Godden JW, Bajorath J. An information-theoretic approach to 22 March, 2017. descriptor selection for database profiling and QSAR model- 44. Panwar B, Gupta S, Raghava GP. Prediction of vitamin ing. QSAR Comb Sci 2003;22(5):487–97. interacting residues in a vitamin binding protein us- 39. Zhao H. Functional expression of a mammalian odorant re- ing evolutionary information. BMC Bioinformatics 2013; ceptor. Science 1998;279(5348):237–42. 14:44. 40. Malnic B, Hirono J, Sato T et al. Combinatorial receptor codes 45. 
Panwar B, Raghava GP. Prediction of uridine modifications in for odors. Cell 1999;96(5):713–23. tRNA sequences. BMC Bioinformatics 2014;15:326. 41. Livermore A, Laing DG. Influence of training and experience 46. Li H, Panwar B, Omenn GS et al. Supporting data for on the perception of multicomponent odor mixtures. J Exp “Accurate prediction of personalized olfactory perception Psychol Hum Percept Perform 1996;22(2):267–77. from large-scale chemoinformatic features.” GigaScience 42. Turin L, Yoshii F Structure-odor relations: a modern per- Database 2017. http://dx.doi.org/10.5524/100384, Accessed spective. Handbook of Olfaction and Gustation 2003; 15 January, 2018. Downloaded from https://academic.oup.com/gigascience/article-abstract/7/2/1/4750780 by Ed 'DeepDyve' Gillespie user on 16 March 2018
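Illustrative sketch of the preprocessing steps

The three simple transformations described in the Methods (scaling features to values between 0 and 1, reweighting targets with α, and extracting 4-letter name windows) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the handling of a constant feature column is our own choice, and the window extraction is a plain overlapping scan of the upper-cased name. Under that plain scan, "acetic acid" yields 8 windows, one more than the 7 listed in the text, so the exact treatment of windows that span the space may differ in the original implementation.

```python
def min_max_scale(values):
    """Scale one feature column to [0, 1] via x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0 to avoid dividing by zero.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def reweight_targets(individual_ratings, alpha=0.2):
    """Blend each individual's rating with the population average.

    `individual_ratings` holds one rating per subject for a single molecule
    and attribute; the population rating is taken as their mean.
    """
    population = sum(individual_ratings) / len(individual_ratings)
    return [alpha * y + (1 - alpha) * population for y in individual_ratings]


def name_windows(name, size=4):
    """Return all overlapping windows of `size` characters (upper-cased)."""
    text = name.upper()
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def binary_name_features(names, size=4):
    """Build a 0/1 matrix with one row per name and one column per window."""
    vocabulary = sorted({w for name in names for w in name_windows(name, size)})
    matrix = []
    for name in names:
        present = set(name_windows(name, size))
        matrix.append([1 if w in present else 0 for w in vocabulary])
    return vocabulary, matrix
```

With α = 0.2, as in the final model, each target keeps 20% of the individual rating and takes 80% from the population average, which pulls outlying individual ratings toward the group consensus.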
GigaScience – Oxford University Press
Published: Feb 1, 2018