Investigating transcription factor synergism in humans

Investigating transcription factor synergism in humans Proteins are the core and the engine of every process in cells thus the study of mechanisms that drive the regulation of protein expression, is essential. Transcription factors play a central role in this extremely complex task and they synergically co-operate in order to provide a fine tuning of protein expressions. In the present study, we designed a mathematically well-founded procedure to investigate the mutual positioning of transcription factors binding sites related to a given cou- ple of transcription factors in order to evaluate the possible association between them. We obtained a list of highly related transcription factors couples, whose binding site occurrences significantly group together for a given set of gene promoters, identifying the biological contexts in which the couples are involved in and the processes they should contribute to regulate. Key words: transcription factors, gene regulation, biological process, computational biology 1. Introduction nucleotides for each position of the experimentally found binding sites. The regulation of gene expression is a fundamental mechanism driving Several databases based on experimental evidences were designed col- biological processes. Transcriptional regulation rules the access of pol- lecting information about TFs and the corresponding TFBSs, 6,7 8 ymerase complex to the gene, activating or repressing the transcription TRANSFAC and JASPAR are often used as a reference for process. Transcription Factors (TF) play a central role in this context; Eukariotes. In the review different algorithmic approaches to predict they are proteins able to bind specific DNA regions recognizing a short TFBSs, starting from PWMs, are presented and compared. Despite the sequence of nucleotides called Transcription Factor Binding Sites large number of genes in higher animals (30,000 genes) the number (TFBS). There is a vast literature concerning TFs, starting from seminal of known TFs is one order of magnitude smaller, indicating a neces- 1,2 and general papers to the most specific ranging from the study of sary interplay between different TFs in order to regulate the activity of the mechanism that allows TFs to efficiently and rapidly find the target all genes. TFBSs are often clustered in peculiar families and patterns 3,4 along the DNA helix, to the study of the roles that specific TFs play made of several TFBSs can be found in the promoter region, i.e. thou- in given biological tasks and how they influence the regulation of the sands bases upstream the Transcription Starting Site (TSS), or also transcription of their target genes. TFBS occurrences along a DNA very far from the TSS in the Cis-regulatory modules (CRM). The inter- sequence can be statistically determined via a Position-specific Weight ested reader can find in Ref. 10 (see also references therein) a compre- Matrix (PWM), that is essentially a matrix reporting the frequency of hensive review of the update knowledge about CRM. In this work, we V C The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 103 Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 104 Transcription factor synergism will focus on the synergism of different TFs in the promoter region, trying to identify those TFs couples able to co-operate, and to discover the context in which those couples are involved in. The synergism between TFs in human DNA has been investigated by several studies focused on the analysis of mutual positioning of TFBSs along the DNA regions. Some studies, taking into consideration cross-species sequence comparison trying to identify possible regulatory modules in 11–13 14 Figure 1. Workflow for the extraction of human TFBSs. human DNA. Other works focus on a specific class of TFs or on a single TF trying to identify the regulatory modules to which the we select, in the promoter P,all theTFBSs for TF and TF (see the pre- i j considered TF belongs to. Finally in Ref. 16, the whole human genome vious Subsection for details) and we collect them in two sets, {TFBS } has been investigated in order to reveal the couples of TFs that synergi- and {TFBS }, respectively. The idea is to compare the number of couples cally act in the transcription regulation. In the present work, we devel- of elements from {TFBS } and {TFBS } whose distance is closer than ‘— i j oped a new computational method able to predict the possible where we set ‘ ¼ 80 bp—with the expected number of couples interaction between couples of TFs analysing the statistical properties obtained by setting at random the occurrences of {TFBS } and {TFBS }. i j of the distribution of couples of TFBSs in the promoter region of As first we consider the case of random distribution of elements in human DNA, and comparing them with a random distribution of {TFBS } and {TFBS } and we compute the expected number of joined i j TFBSs. We obtained a list of highly related TFs couples, whose TFBSs occurrences within a distance ‘. If there is just one binding site for fac- occurrences significantly group together for a given set of promoter tors TF and TF (i.e. there is just an element x 2fTFBS g and i j i genes, identifying the biological context in which the couples are y 2fTFBS g) put at random in a promoter of length L, the probability involved in and the process they should contribute to regulate. that their distance is exactly ‘ is L  ‘ 2. Materials and methods PxðÞ  y ¼ ‘¼ 2 (1) jj LðL  1Þ 2.1. Extracting TFBSs of human genes A simple computation gives the probability that their distance is The list of all human genes was extracted from the Genome Browser less than or equal to ‘ (UCSC) and all the associated 104,178 transcript sequences (genome version hg38) were retrieved by a proper R library querying the UCSC ðÞ 2L  1 ‘  ‘ data repository. For each transcript a promoter region, made of a PxðÞ jj  y  ‘¼ (2) LðL  1Þ sequence of 2,000 bp upstream the TSS, was extracted. In order to remove redundant information, the set of all promoter sequences was Going further, the probability that one element in {TFBS } and sev- processed filtering out redundant overlapped transcripts, reducing their eral element {TFBS }, say n , put at random in a promoter of length L, j j number (and the related promoter sequences) to 36,830. In this study, have a distance less than or equal to ‘ is well-approximated by consid- we used a set of 194 most studied and experimentally validated Human 19 ering the probability of a single matching PxðÞ jj  y  ‘¼ p as a Transcription Factors according to MAPPER. We associated to each Bernoulli event, and the probability of having k elements of {TFBS }in TF the corresponding Position-specific Weight Matrix (PWM), i.e. the an interval of size ‘ around the element x 2fTFBS g is matrix reporting the frequency of nucleotides for each position of the experimentally validated binding sites. It is worth noting that a TF can n k k j be associated to different PWMs and, on the other hand, a PWM can be PkðÞ ;jj x  y  ‘¼ p ð1  pÞ (3) associated to more than one TF. A total number of 192 PWMs was con- sidered in this work and each of them was associated in a biunivocal Finally, if there is more than one element in {TFBS }, say n , the i i relation to one TF with only a few exceptions. For the sake of simplicity expected number of occurrences of couple of elements in {TFBS } and and readability, we refer to PWM models as TFs and vice versa. The {TFBS } within a distance ‘ can be approximated by PWMs matrices were retrieved from an open access repository available j at the following address http://cistrome.org/jian/motif_collection/data bases/Transfac/pwm/ (12 October 2017, date last accessed). We applied n TF ; TF ¼ n kP k; TFBS  TFBS  ‘ ¼ i j i i j the matchPWM() function, a sequence-based transcription factor bind- k¼0 ing site search algorithm, integrated into the Biostrings R library, to ðÞ 2L  1 ‘  ‘ extract and collect the positions list of all binding sites for both all con- ¼ n n p ¼ n n i j i j LðL  1Þ sidered transcripts and PWMs. For our purposes, we have considered (4) an acceptance threshold probability of 0.9. The described procedure is clearly summarized in the workflow in Fig. 1. while its variance is "# 2.2. Transcription factor co-localization and deviation 2 2 r TF ; TF ¼ n k Pk; TFBS  TFBS  ‘  n TF ; TF i j i i j i j from randomness k¼0 2 2 In this Subsection, we present the indicator we have developed in order ðÞ 2L  1 ‘  ‘ ½ðÞ L  ‘ ðÞ L  ‘ ¼ n n pðÞ 1  p¼ n n i j i j to determine whether two different TFs are linked with respect to a L ðL  1Þ given promoter. Let P ¼fj x x ... x x 2 fA; C; G; Tg be the pro- 1 2 L i (5) moter region of a given transcript, i.e. the set of L ¼ 2,000 bases upstream the TSS of that transcript. Using PWM and PWM ,i.e. the The Bernoulli approximations in Equations (4) and (5) are valid i j matrices associated to two different transcription factors TF and TF , when n and n are small with respect to ‘ and ‘ is small with respect i j i j Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 F. Cumbo et al. 105 to L. By computing the actual number of couples (x, y), say ðTF ; TF Þ, with x 2fTFBS g and y 2fTFBS g such that x and y are HXðÞ¼ x logðx Þ i j i j k k k¼1 closer than ‘, and using Equations (4) and (5) we build an indicator evaluating the deviation from randomness of the actual number of Given a couple of models, PWM and PWM , we indicate with 1 2 join occurrences j P and Q the frequency of the nucleotide k (the index k ranges from k k 1 to 4 indicating the four nucleotides A, C, G, T) in the position i of TF ; TF  nðTF ; TF Þ i j i j the model PWM (the index i ranges from 1 to I, where I is the length zTF ; TF ¼ (6) 1 i j rðTF ; TF Þ i j of the model PWM ) and position j of the model PWM (the index j 1 2 ranges from 1 to J, where J is the length of the model PWM ), respec- The above z-score function, measuring how many standard devia- tively. For example, P is the frequency of the nucleotide T in posi- tions the actual number of couples is far from the expected number tion 1 of the first model while Q is the frequency of the nucleotide of couples in the case of a random distribution, provides a reason- G in position 2 of the second model. The Jensen-Shannon divergence able value of the association between TFBS and TFBS . Significantly, i j of the nucleotide distributions at position i of the first model and at high values of z TF ; TF suggest that the two considered TFs are i j position j of the second model is defined as associated, i.e. their TFBSs have a probability to occur as neighbours j j 4 i i higher than the one expected for a random distribution. On the con- P þ Q P þ Q k k k k JSðÞ P; Q; i; j¼ log trary when z TF ; TF is negative, with a high absolute value, ele- i j 2 2 k¼1 ments in the two TFBSs occur as neighbours less frequently than 4 4 X X expected in a random distribution, so that in this case the TFs are 1 j j i i þ P logP þ Q logQ (8) k k k k disassociated. In the following, in order to identify the synergy k¼1 k¼1 between transcription factors, we will focus on the case of high posi- tive values of z TF ; TF . We measure the similarity of two models with respect to a given i j alignment by summing up the Jensen-Shannon divergences as in Equation (8)) of each couple of distributions corresponding to the aligned positions (empty boxes in the overlapping area, see Fig. 2)(i, 2.3. Computing similarity score of PWMs i þ a), where a is the offset between the starting positions of model 1 The prediction algorithm of TFBSs is based on the PWMs obtained and 2 in that alignment, plus a penalty score for the non overlapped by experimental analysis. Since we are interested in identifying cou- positions (“p” marked boxes in the penalty area, see Fig. 2): ples of TFs whose TFBSs occur together in a window of a given size, we have to be sure that the two corresponding models are enough JSðÞ P; Q; a¼ JSðÞ P; Q; i; i þ aþ Kn (9) different since if two models are very similar they are likely to share i2I binding sites or a significant portion of them. We used two different where I is the set of overlapped positions in the given alignment and approaches to evaluate models similarity. The former is a direct n is the number of non overlapped positions of the shortest model. approach aiming to provide a similarity pairwise score of two models The penalty K for each of non overlapped positions is determined by based on the comparison of nucleotide frequencies of the PWMs. computing the probability distribution of the Jensen-Shannon diver- The latter approach can be defined as ‘a posteriori’, since we evaluate gence in the case of two random nucleotide distributions. The value the closeness of models by computing the number of overlapped K ’ 0:23 is obtained by summing up 2 S.D. to the mean value of the binding sites with respect to the total number of couples whose dis- random-nucleotide Jensen-Shannon divergence distributions. Finally, tance is closer than ‘. Using a linear combination of the above the model similarity score of two models JSD(P, Q) is defined as the described similarity scores we determine if two models are ‘structur- smallest score JS(P, Q; a) out of all the possible alignments: ally’ far from each other and we can significantly evaluate the associ- ation between two truly different models according to the algorithm JSDðÞ P; Q ¼ minfJSðP; Q; aÞg (10) a2AL described in the previous section. where AL is the set of all possible alignments with a non-empty superposition. The need of a ‘strong’’ non-overlapping penalty comes out in order to select alignment with a significant overlapping 2.4. PWMs distance based on Jensen-Shannon between the two models (i.e. to avoid bias due to a good affinity divergence between too small part of the two models). The Jensen-Shannon divergence (JS) is a symmetrized and smoothed version of the Kullback-Leibler divergence and it is often used to tell 21 2.5. PWMs distance based on TFBS overlapping how two frequency distributions are close to each other. Here, we Here, we introduce a different similarity function between pairs of use Jensen-Shannon divergence to identify similar PWMs using the TFs based on the percentage of overlapped binding sites they share. nucleotide frequencies as probability distribution. The Jensen- First of all, we consider two TFBSs to be overlapped if their distance Shannon divergence of two frequency distributions, P and Q, can be is smaller than the half of the size of the smaller PWM, computed using the Shannon entropy according to the following minfPWM ; PWM g 1 2 i.e. TFBS  TFBS < . If the TFBSs of two TSs i j equation: are often overlapped it is likely that they result to be associated, according to our algorithm, showing a high Z-score. They are indeed 1 1 1 1 JSðÞ P; Q ¼ H P þ Q  HPðÞ HQðÞ (7) often close because the TFs recognize the same binding sites (or a 2 2 2 2 large portion of them). Given a couple of TFs, TF and TF , the per- i j where H(X) is the Shannon entropy of the distribution X centage of all overlapping occurrences of TFBSs with respect to all Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 106 Transcription factor synergism Figure 2. Scheme of the algorithm for computing similarity score of models. the joined occurrences in a radius ‘ in all the transcript promoter 0.5 sequences, namely OVD(TF ,TF ), is defined as follows: i j 0.4 Selected l ðTF ; TF Þ i j Tot OVD TF ; TF ¼ (11) i j ðTF ; TF Þ i j Tot 0.3 Discarded where  ðTF ; TF Þ is the total number of the TFBSs couples (one Tot i j from TF and the other from TF ) falling together in the considered i j radius of size ‘ and l ðTF ; TF Þ is the total number of TFBSs over- 0.2 i j Tot lapped couples (one from TF and the other from TF ), for all the i j transcript promoter sequences. In the next section (as reported in 0.1 Fig. 3), OVD scores are plotted against JSD scores in order to select significant couples. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 2.6. Protein-protein interaction network analysis OVD In order to validate the synergy of TFs couples identified by the algo- Figure 3. Each point represents a TF couple whose Z-scores are significant rithm here presented we analysed Protein-Protein Interaction net- (Z > 5) for at least 20 transcripts. The plane is divided into two regions: the work (PPI) to test the interaction of this couples in the network by upper-left sector, where Overlapping Score is higher than Similarity Score, using shortest path. provides selected couples. Lower-right sector, where Overlapping Score is Protein-Protein Interaction network was downloaded from lower than Similarity Score, provides Discarded couples. STRING database. We considered all the possible interaction sour- ces between proteins giving a score threshold of 0.7 (high confidence value of zðTF ; TF ; TR Þ, the higher the association between TF and i j k i interaction). The obtained network is made of 719,288 edges (inter- TF in the transcript promoter TR sequence will be. Since for each j k actions) and 14,932 nodes (proteins). The iGraph R package (https:// transcript, we have to test a considerably large number of couples cran.r-project.org/web/packages/igraph/index.html (12 October (TF ,TF ) and we test the association between couples in a large i j 2017, date last accessed)) was used to compute the shortest path number of transcripts, we select only those couples for which the from the adjacency matrix of the graph. value of zðTF ; TF ; TR Þ is higher than 5 (meaning that the number i j of occurrence of couples TF TF in a radius ‘ is far from that i j expected by chance more than 5 S.D.) and we indicate with NZ ðTF 5 i 3. Results and discussion ; TF Þ the number of transcripts TR for which the couple (TF ,TF ) j k i j 3.1. Selecting significant TF couples has a z-score greater than 5. Given a couple (TF ,TF ) it is reasonable i j In the previous Section, Equation (6) has been derived in order to to expect that a large value of NZ ðTF ; TF Þ indicates that the two 5 i j provide a measure of association between each couple of transcrip- models TF ,TF are functionally associated. However, a high value i j tion factors (TF ,TF ) with respect to a given transcript TR . Our of NZ ðTF ; TF Þ could refer to a couple of models with a high simi- i j k 5 i j goal is to find out couples of TFs that are significantly related in larity score rather than a functional association between TF and TF . i j gene-regulation mechanism. As previously mentioned, the higher the Therefore, to correctly identify relevant associated transcription Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 JSD F. Cumbo et al. 107 Table 1. Categories and terms for the enrichment analysis Category Number Number of terms of genes OMIM expanded 187 2,178 Tissue protein expression 207 62,307 from proteomicsDB KEGG 2015 179 3,800 OMIM disease 90 1,759 GO molecular function 1,136 12,753 Figure 4. Workflow for the analysis of TF couples. In panel A, the procedure GO cellular component 641 13,236 for the selection of relevant couples is depicted. In panel B, the analysis of GO biological process 5,192 14,264 significant couples is described both in terms of enrichment tests and sort- Chromosome location 386 32,740 ing of best couples. factors, we will use the overlapping score OVD(TF ,TF ) and the i j similarity score JSD(TF ,TF ). In Fig. 3, OVD(TF ,TF ) and JSD(TF , i j i j i TF ) values are shown for all the TF couples which have NZ TF ; TF > 20. Since the higher the value of JSD the lower the 5 i j similarity between two models is, while the higher the values of OVD the higher the similarity between binding sites is, as selection criterion to identify significant couples we used the inequality JSD > OVD. In such a way, the plane (OVD, JSD) is divided into two regions: the upper-left sector, where models are not similar and the overlapping is low (selected couples), while in the lower-right sec- tor, where models are similar and the overlapping is high (discarded couples). It is worth remarking that even if measures JSD and OVD could appear redundant, Fig. 3 clearly stated that the two measures are quite different: there are couples with similar JSD and very differ- -16 -14 -12 -10 -8 -6 -4 -2 ent OVD and vice versa. In fact JSD measure only depends on the 10 10 10 10 10 10 10 10 P-value intervals related PWM while OVD measure depends on the considered data- set, e.g. two models can bind the same sequence even if they are dif- Figure 5. Number of couples enriched in at least one term as a function of ferent. Considering a combination of both measures we obtain a their smallest P-value. Solid plot shows the enrichment of the couples identi- more robust selection criterion. fied by our algorithm, dashed plots show the enrichment of random couples In Fig. 4 (panel A) is sketched the procedure we use to select the for comparison. 547 couples identifying pairs of transcription factors that are likely to synergistically regulate gene transcription. In the following sec- the black plot, there is only one couple (M00106—M00967) having a tions, we will analyse the set of selected couples in order to associate term [ubiquitinyl hydrolase activity (GO: 0036459)] enriched with a to them a biological task and to find the most relevant associated 18 P-value smaller than 10 (3.63E–19), another couple (M00739— couples according to their statistical features (see Fig. 4 panel B). 14 M00799) for P-value smaller than 10 and three different couples for P-value smaller than 10 andsoon(see Table 2 whereall theenriched 3.2. Enrichment analysis and validation terms are ordered according to their P-values). It is worth noting that some couples have more than one term with a significant P-value, for In order to evaluate the biological significance of the obtained TF example couple (M00106—M00967) shows several terms with a couples, we performed hypergeometric enrichment tests. Given a remarkable enrichment: GO Molecular function—ubiquitinyl hydrolase couple of models, we want to test whether the set of genes potentially activity (GO: 0036459)—P-value 3.63E–19, Go Molecular function— regulated by the synergy of the corresponding two TFs, is enriched in cysteine-type peptidase activity (GO: 0008234)—P-value 4.56E–17, Go a particular category-term. We wondered whether those genes are Biological Process—Ubiquitin-dependent protein catabolic process (GO: linked in a given biological task in order to associate the TF couple 0006511), and so on. The comparison of enrichment P-values, related to a given biological process. We selected a list of categories and related terms, reported in Table 1, that identify different biological to couples identified by our algorithm (black plot), with those related to tasks and contexts. We then performed for each identified TF couple random sets (ten coloured plots) clearly highlights the significance and hypergeometric enrichment tests related to all the considered terms. consistency of our analysis, since none of the random sets shows P-val- For each TF couple we obtained a list of P-values linked to the corre- uessmallerthan10 . Also analysing the plots for larger P-values inter- sponding terms. The enrichment results were then compared with those vals the number of enriched couples is significantly higher than all the of ten random sets of genes. Each set was generated preserving the same corresponding numbers of random sets (taking into account the log- number of sample couples (i.e. 547) and the same number of genes for scale units in y-axis). each couple. The results are reported in Fig. 5 where solid plot represents 3.3. Analysis of relevant transcription factor couples the enrichment of couples identified by our algorithm and dashed plots show enrichment of random sets. Each point of the plots shows in log- In this section, we analyse the most significant couples identified by scale the number of TF couples (y-axis) with at least one term whose P- our algorithm. Among the 547 selected couples, we consider as partic- value falls in the corresponding P-value interval (x-axis). For example in ularly relevant those couples with the smallest enrichment P-value or Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 Number of couples 108 Transcription factor synergism Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 Table 2. The PWM model IDs and the corresponding TF names are reported in columns 1–4 for each couple; the category and the related term the couple resulted enriched in are reported in columns 5–6; column 7 reports the related P-value of the hypergeometric test Model 1 TF 1 Model 2 TF 2 Category Term P-value M00106 CDP CR3þHD M00967 HNF4 COUP GO Molecular function ubiquitinyl hydrolase activity (GO: 0036459) 3, 63E–19 M00106 CDP CR3þHD M00967 HNF4 COUP GO Molecular function cysteine-type peptidase activity (GO: 0008234) 4, 56E–17 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Gonadal mesoderm development (GO: 0007506) 6, 60E–15 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Ubiquitin-dependent protein catabolic process (GO: 0006511) 3, 59E–14 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Gonadal mesoderm development (GO: 0007506) 4, 09E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Modification-dependent protein catabolic process (GO: 0019941) 4, 42E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Modification-dependent macromolecule catabolic process (GO: 0043632) 4, 90E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Proteolysis involved in cellular protein catabolic process (GO: 0051603) 8, 43E–14 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Nucleosome organization (GO: 0034728) 1, 42E–13 M00777 STAT M00980 TBP Chromosome location Chr5q13 1, 54E–13 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Protein-DNA complex subunit organization (GO: 0071824) 4, 01E–13 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Nucleosome assembly (GO: 0006334) 2, 81E–12 M00457 STAT5A M00980 TBP Chromosome location Chr5q13 6, 40E–12 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Protein-DNA complex assembly (GO: 0065004) 8, 69E–12 M00223 STATx M00980 TBP Chromosome location Chr5q13 4, 82E–11 M00799 Myc M00927 AP-4 GO Biological process Gonadal mesoderm development (GO: 0007506) 1, 12E–10 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Mesoderm development (GO: 0007498) 2, 71E–10 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Mesenchyme development (GO: 0060485) 4, 02E–10 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Mesoderm development (GO: 0007498) 1, 66E–09 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Mesenchyme development (GO: 0060485) 2, 46E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Nucleosome assembly (GO: 0006334) 4, 46E–09 M00462 GATA-6 M00921 GR GO Molecular function FK506 binding (GO: 0005528) 7, 13E–09 M00462 GATA-6 M00921 GR GO Molecular function Macrolide binding (GO: 0005527) 7, 13E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Protein-DNA complex assembly (GO: 0065004) 9, 90E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Nucleosome organization (GO: 0034728) 1, 44E–08 M00655 PEA3 M00803 E2F Chromosome location Chr5q13 1, 50E–08 M00799 Myc M00803 E2F GO Biological process Protein autophosphorylation (GO: 0046777) 2, 05E–08 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Protein-DNA complex subunit organization (GO: 0071824) 2, 73E–08 M00059 YY1 M00148 SRY GO Molecular function Sodium ion transmembrane transporter activity (GO: 0015081) 2, 96E–08 M00415 AREB6 M00706 TFII-I KEGG Valine leucine and isoleucine biosynthesis 4, 10E–08 F. Cumbo et al. 109 with the best associated parameters (highest number of transcripts, histone deacetylase (HDAC) enzymatic activity. As reported in Ref. smallest OVD and highest JSD) as reported in Fig. 4 panel B. The list 25, it has been identified a subunit of the complex NuA4 as the prod- of couples with the smallest P-values (smaller than 10 ) is reported uct of TRA1, an ATM-related gene homologous to human TRRAP, in Table 2. The analysis of the couples reported in the table suggests an essential cofactor for c-Myc- and E2F-mediated oncogenic trans- several observations, providing and/or confirming evidence of the syn- formation. A further link between Myc and E2F was studied in Ref. ergism of several TF couples and the related biological contexts they 26 highlighting their relationships with chromatin structure and are involved in. The smallest P-values are reported for the couple stability. It is remarkable and can be source of further investigations CDP (also known as Cutl1) (M00106)–HNF4 (M00967), as already the association of the couple AREB (M00415)–TFII-I (M00706) (also mentioned above. The related enriched terms are mainly associated (4 known as GTF2I) with KEGG pathway Valine leucine and isoleucine out of 6) with catabolic processes while the other two terms are ubiq- biosynthesis (the couple showing a P-value of 4.10E–8). Interestingly, uitinyl hydrolase activity and cysteine-type peptidase activity. transcription factors of STAT family (STAT—M00777, STATx— Concerning the couple Myc (M00799)–E2F (M00803) (and its cluster M00223 and STAT5A–M00457) strongly interact with TBP E2F–1 and E2F–4) it can be observed that it appears several times in (M00980) in the promoters of genes belonging to the region of the table associated to different terms. These terms can be grouped chr5q13, that is known to be associated with various neurological dis- into main activities/contexts regarding: (1) gonadal mesoderm devel- orders and pathologies such as Spinal Muscular Atrophy, Hairy Cell 27–29 opment and mesenchyme development, (2) Nucleosome assembly and Leukemia and it is also connected to Alcohol Dependence. organization, Protein-DNA complex assembly and Protein-DNA Finally, it could deserve attention the couple Areb6 (M00414)–E2F Complex Subunit Organization and (3) Protein (M00803) associated to the term Mental retardation, even if not Autophosphorylation. It is worth noting that in literature there are reported in Table 2, it shows a remarkable P-value (610 ). Areb6 many references associated to the couple Myc–E2F. In Ref. 24, it has belongs to the Zeb transcription factor family that has been shown to been shown how the two TFs are linked via mSin3A, a core compo- be involved in mental retardation syndromes. In Table 3, the best nent of a large multiprotein co-repressor complex with associated couples, in terms of association, are reported ordered by the number Table 3. The PWM model IDs and the corresponding TF names are reported in columns 1–4 for each couple; Column 5 reports the number of transcripts whose Z-scores, related to the couple, are higher than 5; the average Z-score of all the significant (Z > 5) transcripts is reported in column 6; Overlapping and Similarity Scores are reported in columns 7 and 8, respectively Model 1 TF 1 Model 2 TF 2 Number of transcripts Average Z-score Overlapping score Similarity score M00083 MZF1 M00649 MAZ 1,014 7, 77 0, 19 0, 21 M00083 MZF1 M00803 E2F 456 6, 77 0, 09 0, 21 M00803 E2F M00976 AhR Arnt HIF-1 338 6, 63 0, 03 0, 21 M00706 TFII-I M00971 Ets 317 7, 71 0, 16 0, 24 M00799 Myc M00803 E2F 293 7, 16 0, 00 0, 23 M00706 TFII-I M00803 E2F 275 6, 61 0, 00 0, 27 M00148 SRY M00747 IRF-1 254 7, 77 0, 00 0, 20 M00148 SRY M00471 TBP 239 6, 78 0, 02 0, 21 M00148 SRY M00980 TBP 203 6, 45 0, 00 0, 23 M00698 HEB M00803 E2F 196 6, 35 0, 02 0, 29 M00649 MAZ M00658 PU.1 187 7, 49 0, 03 0, 22 M00649 MAZ M00799 Myc 184 6, 85 0, 00 0, 33 M00799 Myc M00933 Sp1 182 7, 06 0, 01 0, 28 M00462 GATA-6 M00471 TBP 182 6, 47 0, 08 0, 21 M00799 Myc M00931 Sp1 167 6, 97 0, 00 0, 30 M00803 E2F M00927 AP-4 160 6, 29 0, 01 0, 24 M00801 CREB M00803 E2F 153 6, 43 0, 00 0, 28 M00706 TFII-I M00931 Sp1 141 6, 58 0, 04 0, 22 M00649 MAZ M00971 Ets 132 7, 32 0, 00 0, 23 M00933 Sp1 M00976 AhR Arnt HIF-1 127 6, 65 0, 02 0, 21 M00803 E2F M00981 CREB ATF 127 6, 72 0, 00 0, 23 M00799 Myc M00932 Sp1 121 7, 17 0, 00 0, 28 M00148 SRY M00706 TFII-I 120 8, 58 0, 00 0, 31 M00931 Sp1 M00976 AhR Arnt HIF-1 117 6, 47 0, 01 0, 20 M00803 E2F M00917 CREB 115 6, 81 0, 00 0, 25 M00008 Sp1 M00706 TFII-I 114 6, 36 0, 02 0, 21 M00791 HNF3 M00975 RFX 107 6, 58 0, 00 0, 20 M00471 TBP M00747 IRF-1 107 6, 32 0, 02 0, 25 M00649 MAZ M00976 AhR Arnt HIF-1 106 6, 28 0, 00 0, 26 M00148 SRY M00962 AR 106 6, 04 0, 00 0, 20 M00148 SRY M00789 GATA 106 6, 08 0, 00 0, 23 M00148 SRY M00975 RFX 104 6, 20 0, 00 0, 21 M00008 Sp1 M00799 Myc 102 6, 77 0, 00 0, 29 M00775 NF-Y M00803 E2F 101 6, 24 0, 03 0, 20 The couple Myc, E2F is reported in bold to highlight it is also included in Table 2. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 110 Transcription factor synergism Table 4. Frequencies of couples as a function of the shortest path (SP) distance for three classes of protein couples: (i) the selected 547 TF couples (namely BEST, first row), (ii) all the couples of TFs (namely ALL, second row) and iii) 10 random sample sets made of 547 randomly picked protein couples from the whole PPI (namely RANDOM, mean and standard deviation in the third and fourth row, respectively) SP 1 SP 2 SP 3 SP 4 SP 5 SP 6 SP 7 BEST 0.117318 0.478585 0.284916 0.10987 0.009311 0 0 ALL 0.056789 0.463694 0.376242 0.084080 0.014167 0.004661 0.000368 RANDOM (mean) 0.003291 0.065601 0.377048 0.385820 0.134512 0.027972 0.005346 RANDOM (stdv) 0.000943 0.006363 0.016408 0.008298 0.009771 0.003329 0.002414 of significant Transcripts. We selected those couples with a number of from String database. We found a significant overall difference significant transcripts (Z-score higher than 5) higher than 100, with between shortest path distribution of the set of couples selected by an Overlapping score (OVD) smaller than 0.2 and a Similarity score our method and the set of all TFs couples. In Table 4, we report the (JSD) higher than 0.2 (it is also reported in the fifth column the aver- percentages of couples as a function of the Shortest Path (SP) distan- age Z-Score). The couple MZF1 (M00083)—MAZ (M00649) shows ces for three classes of protein couples: (i) the selected 547 TF cou- a Z-score higher than 5 in 1,014 transcript promoter sequences (with ples (namely BEST, first row), (ii) all the couples of TFs (namely an average z-score of 7.77—more than 7 S.D. far from the expected ALL, second row) and (iii) 10 random sample sets made of 547 ran- average value). Myc-associated zinc finger protein (MAZ) and domly picked protein couples from the whole PPI (namely Myeloid zinc finger 1 (MZF1) are both transcription factors charac- RANDOM, mean and standard deviation in the third and fourth terized by a zinc finger small protein structural motif. The associated row, respectively). The majority of couples (around 50%) related to PWMs show a similar common sub-motif characterized by a GGGA TFs shows a SP equal to 2 both for couples obtained by our algo- sequence. Nevertheless the PWMs have different length (6 and 8) and rithm and for all TF couples, while only around 6% of protein ran- the correspondent similarity score, JSD, results to be 0.21 meaning dom couples has a SP distance equal to 2. The most relevant result is that the two PWMs are similar but globally not so close to each other. in the difference of the frequency for SP ¼ 1 (indicating a direct inter- Significantly, the OVD overlapping score is quite low 0.188 meaning action) that is 0.12 for our best couples and 0.057 for all TF cou- that, on average, among all the identified TFBSs couple (both in a ples (i.e. the ratio is greater than 2), and also for SP ¼ 3(0.28 our window of 80 bp) of the two TFs, less than 1 out of 5 couples are best couples versus 0.37 for all TF couples). The distribution for overlapped. This is why we included this couple in the selected set of the random protein data shows, not surprisingly, a complete differ- TFs couples. The surprisingly high number of transcripts for which ent pattern (the most part of couples having SP equal to 3 and 4). the two TFs co-occur should deserve deeper investigations. The cou- Even considering the limitations of this kind of analysis—a PPI is a ple MZF1 (M00083)–E2F (M00803) shows 456 transcripts with a Z- global representation of potential interactions between proteins (and score higher than 5 (with an average z-score of 6, 77, OVD ¼ 0.09 consequently TFs) that not always (referring to time and space) is an and JSD ¼ 0.21). Interestingly they are involved in several diseases, in actual interaction and, moreover, the results depend on the threshold particular, as reported in Ref. 31, they are both potential key regula- chosen to select significant interactions—it reveals that TFs couples tors of PKD1 and PKD2 whose mutations are linked with autosomal selected by our method show an overall stronger relationship than dominant polycystic kidney disease (ADPKD). We found one tran- those in the set of all TFs couples. In particular, we chose a quite script of PKD1 uc002cos.1 with a Z-score for the couple equal to 2 strict threshold of 0.7 (as reported in the Material and methods sec- and two transcripts of PKD2 uc003hre.3 and uc011cdg.2 with Z- tion), so the significance of results has to be found in the ratio score equal to 5.80 and 3.42, respectively. It is interesting that a between the number of our selected couples and all TFs couples at SP minor groove binding protein SRY (M00148) is associated to both 1 (ratio equal to 2) rather than in the percentages (12% and 6%, the models of TBP (M00471 and M00980) with a number of signifi- approximately) that could significantly change since they depend on cant transcripts equal to 239 and 203, respectively, a very small OVD the threshold. close to 0 for both and JSD 0.21 and 0.23, respectively. It is worth noting that the couple Myc (M00799)–E2F (M00803) (also included 4. Conclusions in Table 2) shows a number of significant transcripts equal to 293, A better understanding of the mechanisms driving the regulation of pro- meaning that, besides a clear association with given biological con- tein expression is an essential requisite to shed light on the behaviour of texts, as discussed in the previous section, there is a strong synergism cells. Transcription factors play a central role in this extremely complex between the two TFs confirmed by the values reported in the table. task and it has been shown that they synergically co-operate in order to Concerning the couple Maz (M00649)–Pu.1 (M00658) (number of significant transcripts equal to 187) it has been shown that three tran- provide a fine tuning of protein expressions. Among the different meth- scription factors Maz, PU.1 and ARNT show significant recognition ods able to detect and identify TFs interplay, a very important resource elements among similarly up or down-regulated genes involved in is the computational methods to which our work mainly refers. In this hematopoietic differentiation or leukemogenesis. We note that also work, we present a mathematically well-founded procedure able to identify TFs couples that act together, inferring for several of those cou- the couple Maz (M00649)–ARNT (M00976) is included in the table ples the biological context they are involved in. We introduced a new with a number of significant transcripts equal to 106. In order to further validate and provide significance to obtained and robust statistical method based on the use of a good Bernoulli results, we computed the shortest paths between couples of TFs approximation and on a z-score measures able to discriminate between related to the Protein-Protein Interaction (PPI) network downloaded random and non-random co-occurrences of couple of TFs. We Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 F. Cumbo et al. 111 7. Wingender, E. 2008, The TRANSFAC project as an example of frame- extended previous methods based on randomize sequences as in Refs. work technology that supports the analysis of genomic regulation, Brief. 15 and 16. In order to avoid biases due to the structural similarity of Bioinform., 9, 326–32. different models Overlapping and Similarity scores were designed to 8. Sandelin, A., Alkema, W., Engstro ¨ m, P., Wasserman, W.W. and Lenhard, select TF couples that are significant for the analysis. We used such a B. 2004, JASPAR: an open-access database for eukaryotic transcription method to find pairs of associated transcription factors in the set of all factor binding profiles, Nucleic Acids Res., 32, D91–4. human promoter sequences (selected with a careful analysis from all 9. Tompa, M., Li, N., Bailey, L.T., Church, G.M., et al. 2005, Assessing human transcripts) and we also performed enrichment analysis on the computational tools for the discovery of transcription factor binding sites, set of genes regulated by identified couples. This analysis provides con- Nat. Biotechnol., 23, 137–44. sistence to our results but also provides a biological context to be asso- 10. Suryamohan, K. and Halfon, M.S. 2015, Identifying transcriptional cis-regulatory modules in animal genomes, Wires. Dev. Biol., 4, 59–84. ciated to the couples. Moreover, we also performed network analysis 11. Blanchette, M., Bataille, A.R., Chen, X., et al. 2006, Genome-wide showing that TFs couples identified by the algorithm here presented are computational prediction of transcriptional regulatory modules closer than expected in the protein-protein interaction network in terms reveals new insights into human gene expression. Genome. Res., 16, of shortest path. 656–68. To our knowledge this is the only work concerning synergy of 12. Prabhakar, S., Poulin, F., Shoukry, M., et al. 2006, Close sequence com- Transcription Factors taking into account all those features. Some of parisons are sufficient to identify human cis-regulatory elements, Genome the couples emerging in this study are already known to be linked in Res., 16, 855–63. 24–26 several biological contexts (such as Myc–E2F, while in other 13. Hu, Z., Hu, B. and Collins, J.F. 2007, Prediction of synergistic transcrip- cases our results can lead to hypothesize links between TFs couples tion factors by function conservation, Genome Biol., 8, R257. 14. Yu, X., Lin, J., Zack, D.J. and Qian, J. 2006, Computational analysis of and diseases (as for the STAT family that strongly interact with TBM tissue-specific combinatorial gene regulation: predicting interaction in the promoters of genes belonging to the region of chr5q13 27–29 between transcription factors in human tissues, Nucleic Acids Res., 34, involved in several diseases. Finally, some other couples were 4925–36. not previously identified (such as the couple MZF1–MAZ) that 15. Nandi, S., Blais, A. and Ioshikhes, I. 2013, Identification of cis-regulatory would deserves further investigation. modules in promoters of human genes exploiting mutual positioning of According to our algorithm two TFs could be associated in differ- transcription factors, Nucleic Acids Res., 41, 8822–41. ent ways, for example they could be co-operative in a strict sense or 16. Hannenhalli, S. and Levy, S. 2002, Predicting transcription factor syner- concurrent in the sense that the presence of one of the two impedes gism, Nucleic Acids Res., 30, 4278–84. the presence of the other. A small size window, as the one we use, 17. Kent, W.J., Sugnet, C.W., Furey, T.S., et al. 2002, The human genome leads us to hypothesize that two transcription factor proteins that browser at UCSC, Genome Res., 12, 996–1006. 18. Gentleman, R.C., Carey, V.J., Bates, D.M., et al. 2004, Bioconductor: could bind sites within the window, either are sufficiently close to open software development for computational biology and bioinfor- each other to physically interact, or only one of the two is able to matics, Genome Biol., 5,1. bind its TFBS because of the steric hindrance. 19. Marinescu, V.D., Hohane, I.S. and Riva, A. 2005, The MAPPER data- This work represents a step in the direction of designing complex base: a multi-genome catalog of putative transcription factor binding sites, gene regulatory networks, and it provides information on TFs associ- Nucleic Acids Res., 33 (suppl 1), D91–7. [CVOCROSSCVO] ation that could be useful in this context. The identification of signifi- 20. Page ` s, H., Aboyoun, P., Gentleman, R. and DebRoy, S. 2009, String cant TFs couples could be of help in the view of artificially altering objects representing biological sequences, and matching algorithms. R the regulation of genes by inhibiting the interaction between given package version 2.2. TFs couples. 21. Lin, J. 1991, Divergence measures based on the shannon entropy, IEEE Trans. Inform. Theory, 37, 145–51. 22. Szklarczyk, D., Morris, J.H., Cook, H., et al. 2017, The STRING data- base in 2017: quality-controlled protein-protein association networks, Conflict of interest made broadly accessible, Nucleic Acids Res., 45, D362–8. None declared. 23. Lee, J.K., ed. 2010, Statistical Bioinformatics for Biomedical and Life Science Researchers. Wiley-Blackwell, Hoboken, New Jersey. 24. Dannenberg, J.H., David, G., Zhong, S., van der Torre, J., Wong, W.H. References and Depinho, R.A. 2005, mSin3A corepressor regulates diverse transcrip- tional networks governing normal and neoplastic growth and survival, 1. Mitchell, P.J. and Tjian, R. 1989, Transcriptional regulation in mamma- lian cells by sequence-specific DNA binding proteins, Science, 245, 371–8. Genes Dev., 19, 1581–95. 2. Latchman, D.S. 1997, Transcription factors: an overview, Int. J. Biochem. 25. Allard, S., Utley, R.T., Savard, J., et al. 1999, NuA4, an essential tran- Cell Biol., 29, 1305–12. scription adaptor/histone H4 acetyltransferase complex containing Esa1p 3. Zhou, H.X. 2011, Rapid search for specific sites on DNA through confor- and the ATM-related cofactor Tra1p, Embo J., 18, 5108–19. mational switch of nonspecifically bound proteins, Proc. Nat. Acad. Sci. 26. Albert, T., Wells, J., Funk, J.O., et al. 2001, The chromatin structure of U.S.A., 108, 8651–6. the dual c-myc promoter P1/P2 is regulated by separate elements, J. Biol. 4. Berg, O.G., Winter, R.B. and Von Hippel, P.H. 1981, Diffusion-driven Chem., 276, 20482–90. 27. Lin, P., Hartz, S.M., Wang, J.C., et al. 2012, Copy number variations in mechanisms of protein translocation on nucleic acids. 1. Models and theory, Biochemistry, 20, 6929–48. 6q14. 1 and 5q13. 2 are associated with alcohol dependence, Alcohol. 5. Pahl, H.L. 1999, Activators and target genes of Rel/NF-kB transcription Clin. Exp. Res., 36, 1512–8. factors, Oncogene, 18, 6853–66. 28. Also-Rallo, E., Alias, L., Martinez-Hernandez, R., et al. 2011, Treatment 6. Wingender, E., Chen, X., Hehl, R., et al. 2000, TRANSFAC: an inte- of spinal muscular atrophy cells with drugs that upregulate SMN expres- grated system for gene expression regulation, Nucleic Acids Res., 28, sion reveals inter-and intra-patient variability, Eur. J. Hum. Genet., 19, 316–9. 1059–65. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 112 Transcription factor synergism 29. Wu, X., Ivanova, G., Merup, M., et al. 1999, Molecular analysis of the mutations in the zinc finger homeo box 1B, Am. J. Med. Genet., 108, human chromosome 5q13. 3 region in patients with hairy cell leukemia 177–81. and identification of tumor suppressor gene candidates, Genomics, 60, 31. Lantinga-van Leeuwen, I.S., Leonhard, W.N., Dauwerse, H., et al. 2005, 161–71. Common regulatory elements in the polycystic kidney disease 1 and 2 pro- 30. Zweier, C., Albrecht, B., Mitulla, B., et al. 2002, ‘Mowat-Wilson’ syn- moter regions, Eur. J. Hum. Genet., 13, 649–59. drome with and without hirschsprung disease is a distinct, recognizable 32. Bogni, A., Cheng, C., Liu, W., et al. 2006, Genome-wide approach to identify multiple congenital anomalies-mental retardation syndrome caused by risk factors for therapy-related myeloid leukemia, Leukemia, 20, 239–46. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png DNA Research Oxford University Press

Investigating transcription factor synergism in humans

Free
10 pages

Loading next page...
 
/lp/ou_press/investigating-transcription-factor-synergism-in-humans-Rj9tB0a2CJ
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
ISSN
1340-2838
eISSN
1756-1663
D.O.I.
10.1093/dnares/dsx041
Publisher site
See Article on Publisher Site

Abstract

Proteins are the core and the engine of every process in cells thus the study of mechanisms that drive the regulation of protein expression, is essential. Transcription factors play a central role in this extremely complex task and they synergically co-operate in order to provide a fine tuning of protein expressions. In the present study, we designed a mathematically well-founded procedure to investigate the mutual positioning of transcription factors binding sites related to a given cou- ple of transcription factors in order to evaluate the possible association between them. We obtained a list of highly related transcription factors couples, whose binding site occurrences significantly group together for a given set of gene promoters, identifying the biological contexts in which the couples are involved in and the processes they should contribute to regulate. Key words: transcription factors, gene regulation, biological process, computational biology 1. Introduction nucleotides for each position of the experimentally found binding sites. The regulation of gene expression is a fundamental mechanism driving Several databases based on experimental evidences were designed col- biological processes. Transcriptional regulation rules the access of pol- lecting information about TFs and the corresponding TFBSs, 6,7 8 ymerase complex to the gene, activating or repressing the transcription TRANSFAC and JASPAR are often used as a reference for process. Transcription Factors (TF) play a central role in this context; Eukariotes. In the review different algorithmic approaches to predict they are proteins able to bind specific DNA regions recognizing a short TFBSs, starting from PWMs, are presented and compared. Despite the sequence of nucleotides called Transcription Factor Binding Sites large number of genes in higher animals (30,000 genes) the number (TFBS). There is a vast literature concerning TFs, starting from seminal of known TFs is one order of magnitude smaller, indicating a neces- 1,2 and general papers to the most specific ranging from the study of sary interplay between different TFs in order to regulate the activity of the mechanism that allows TFs to efficiently and rapidly find the target all genes. TFBSs are often clustered in peculiar families and patterns 3,4 along the DNA helix, to the study of the roles that specific TFs play made of several TFBSs can be found in the promoter region, i.e. thou- in given biological tasks and how they influence the regulation of the sands bases upstream the Transcription Starting Site (TSS), or also transcription of their target genes. TFBS occurrences along a DNA very far from the TSS in the Cis-regulatory modules (CRM). The inter- sequence can be statistically determined via a Position-specific Weight ested reader can find in Ref. 10 (see also references therein) a compre- Matrix (PWM), that is essentially a matrix reporting the frequency of hensive review of the update knowledge about CRM. In this work, we V C The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 103 Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 104 Transcription factor synergism will focus on the synergism of different TFs in the promoter region, trying to identify those TFs couples able to co-operate, and to discover the context in which those couples are involved in. The synergism between TFs in human DNA has been investigated by several studies focused on the analysis of mutual positioning of TFBSs along the DNA regions. Some studies, taking into consideration cross-species sequence comparison trying to identify possible regulatory modules in 11–13 14 Figure 1. Workflow for the extraction of human TFBSs. human DNA. Other works focus on a specific class of TFs or on a single TF trying to identify the regulatory modules to which the we select, in the promoter P,all theTFBSs for TF and TF (see the pre- i j considered TF belongs to. Finally in Ref. 16, the whole human genome vious Subsection for details) and we collect them in two sets, {TFBS } has been investigated in order to reveal the couples of TFs that synergi- and {TFBS }, respectively. The idea is to compare the number of couples cally act in the transcription regulation. In the present work, we devel- of elements from {TFBS } and {TFBS } whose distance is closer than ‘— i j oped a new computational method able to predict the possible where we set ‘ ¼ 80 bp—with the expected number of couples interaction between couples of TFs analysing the statistical properties obtained by setting at random the occurrences of {TFBS } and {TFBS }. i j of the distribution of couples of TFBSs in the promoter region of As first we consider the case of random distribution of elements in human DNA, and comparing them with a random distribution of {TFBS } and {TFBS } and we compute the expected number of joined i j TFBSs. We obtained a list of highly related TFs couples, whose TFBSs occurrences within a distance ‘. If there is just one binding site for fac- occurrences significantly group together for a given set of promoter tors TF and TF (i.e. there is just an element x 2fTFBS g and i j i genes, identifying the biological context in which the couples are y 2fTFBS g) put at random in a promoter of length L, the probability involved in and the process they should contribute to regulate. that their distance is exactly ‘ is L  ‘ 2. Materials and methods PxðÞ  y ¼ ‘¼ 2 (1) jj LðL  1Þ 2.1. Extracting TFBSs of human genes A simple computation gives the probability that their distance is The list of all human genes was extracted from the Genome Browser less than or equal to ‘ (UCSC) and all the associated 104,178 transcript sequences (genome version hg38) were retrieved by a proper R library querying the UCSC ðÞ 2L  1 ‘  ‘ data repository. For each transcript a promoter region, made of a PxðÞ jj  y  ‘¼ (2) LðL  1Þ sequence of 2,000 bp upstream the TSS, was extracted. In order to remove redundant information, the set of all promoter sequences was Going further, the probability that one element in {TFBS } and sev- processed filtering out redundant overlapped transcripts, reducing their eral element {TFBS }, say n , put at random in a promoter of length L, j j number (and the related promoter sequences) to 36,830. In this study, have a distance less than or equal to ‘ is well-approximated by consid- we used a set of 194 most studied and experimentally validated Human 19 ering the probability of a single matching PxðÞ jj  y  ‘¼ p as a Transcription Factors according to MAPPER. We associated to each Bernoulli event, and the probability of having k elements of {TFBS }in TF the corresponding Position-specific Weight Matrix (PWM), i.e. the an interval of size ‘ around the element x 2fTFBS g is matrix reporting the frequency of nucleotides for each position of the experimentally validated binding sites. It is worth noting that a TF can n k k j be associated to different PWMs and, on the other hand, a PWM can be PkðÞ ;jj x  y  ‘¼ p ð1  pÞ (3) associated to more than one TF. A total number of 192 PWMs was con- sidered in this work and each of them was associated in a biunivocal Finally, if there is more than one element in {TFBS }, say n , the i i relation to one TF with only a few exceptions. For the sake of simplicity expected number of occurrences of couple of elements in {TFBS } and and readability, we refer to PWM models as TFs and vice versa. The {TFBS } within a distance ‘ can be approximated by PWMs matrices were retrieved from an open access repository available j at the following address http://cistrome.org/jian/motif_collection/data bases/Transfac/pwm/ (12 October 2017, date last accessed). We applied n TF ; TF ¼ n kP k; TFBS  TFBS  ‘ ¼ i j i i j the matchPWM() function, a sequence-based transcription factor bind- k¼0 ing site search algorithm, integrated into the Biostrings R library, to ðÞ 2L  1 ‘  ‘ extract and collect the positions list of all binding sites for both all con- ¼ n n p ¼ n n i j i j LðL  1Þ sidered transcripts and PWMs. For our purposes, we have considered (4) an acceptance threshold probability of 0.9. The described procedure is clearly summarized in the workflow in Fig. 1. while its variance is "# 2.2. Transcription factor co-localization and deviation 2 2 r TF ; TF ¼ n k Pk; TFBS  TFBS  ‘  n TF ; TF i j i i j i j from randomness k¼0 2 2 In this Subsection, we present the indicator we have developed in order ðÞ 2L  1 ‘  ‘ ½ðÞ L  ‘ ðÞ L  ‘ ¼ n n pðÞ 1  p¼ n n i j i j to determine whether two different TFs are linked with respect to a L ðL  1Þ given promoter. Let P ¼fj x x ... x x 2 fA; C; G; Tg be the pro- 1 2 L i (5) moter region of a given transcript, i.e. the set of L ¼ 2,000 bases upstream the TSS of that transcript. Using PWM and PWM ,i.e. the The Bernoulli approximations in Equations (4) and (5) are valid i j matrices associated to two different transcription factors TF and TF , when n and n are small with respect to ‘ and ‘ is small with respect i j i j Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 F. Cumbo et al. 105 to L. By computing the actual number of couples (x, y), say ðTF ; TF Þ, with x 2fTFBS g and y 2fTFBS g such that x and y are HXðÞ¼ x logðx Þ i j i j k k k¼1 closer than ‘, and using Equations (4) and (5) we build an indicator evaluating the deviation from randomness of the actual number of Given a couple of models, PWM and PWM , we indicate with 1 2 join occurrences j P and Q the frequency of the nucleotide k (the index k ranges from k k 1 to 4 indicating the four nucleotides A, C, G, T) in the position i of TF ; TF  nðTF ; TF Þ i j i j the model PWM (the index i ranges from 1 to I, where I is the length zTF ; TF ¼ (6) 1 i j rðTF ; TF Þ i j of the model PWM ) and position j of the model PWM (the index j 1 2 ranges from 1 to J, where J is the length of the model PWM ), respec- The above z-score function, measuring how many standard devia- tively. For example, P is the frequency of the nucleotide T in posi- tions the actual number of couples is far from the expected number tion 1 of the first model while Q is the frequency of the nucleotide of couples in the case of a random distribution, provides a reason- G in position 2 of the second model. The Jensen-Shannon divergence able value of the association between TFBS and TFBS . Significantly, i j of the nucleotide distributions at position i of the first model and at high values of z TF ; TF suggest that the two considered TFs are i j position j of the second model is defined as associated, i.e. their TFBSs have a probability to occur as neighbours j j 4 i i higher than the one expected for a random distribution. On the con- P þ Q P þ Q k k k k JSðÞ P; Q; i; j¼ log trary when z TF ; TF is negative, with a high absolute value, ele- i j 2 2 k¼1 ments in the two TFBSs occur as neighbours less frequently than 4 4 X X expected in a random distribution, so that in this case the TFs are 1 j j i i þ P logP þ Q logQ (8) k k k k disassociated. In the following, in order to identify the synergy k¼1 k¼1 between transcription factors, we will focus on the case of high posi- tive values of z TF ; TF . We measure the similarity of two models with respect to a given i j alignment by summing up the Jensen-Shannon divergences as in Equation (8)) of each couple of distributions corresponding to the aligned positions (empty boxes in the overlapping area, see Fig. 2)(i, 2.3. Computing similarity score of PWMs i þ a), where a is the offset between the starting positions of model 1 The prediction algorithm of TFBSs is based on the PWMs obtained and 2 in that alignment, plus a penalty score for the non overlapped by experimental analysis. Since we are interested in identifying cou- positions (“p” marked boxes in the penalty area, see Fig. 2): ples of TFs whose TFBSs occur together in a window of a given size, we have to be sure that the two corresponding models are enough JSðÞ P; Q; a¼ JSðÞ P; Q; i; i þ aþ Kn (9) different since if two models are very similar they are likely to share i2I binding sites or a significant portion of them. We used two different where I is the set of overlapped positions in the given alignment and approaches to evaluate models similarity. The former is a direct n is the number of non overlapped positions of the shortest model. approach aiming to provide a similarity pairwise score of two models The penalty K for each of non overlapped positions is determined by based on the comparison of nucleotide frequencies of the PWMs. computing the probability distribution of the Jensen-Shannon diver- The latter approach can be defined as ‘a posteriori’, since we evaluate gence in the case of two random nucleotide distributions. The value the closeness of models by computing the number of overlapped K ’ 0:23 is obtained by summing up 2 S.D. to the mean value of the binding sites with respect to the total number of couples whose dis- random-nucleotide Jensen-Shannon divergence distributions. Finally, tance is closer than ‘. Using a linear combination of the above the model similarity score of two models JSD(P, Q) is defined as the described similarity scores we determine if two models are ‘structur- smallest score JS(P, Q; a) out of all the possible alignments: ally’ far from each other and we can significantly evaluate the associ- ation between two truly different models according to the algorithm JSDðÞ P; Q ¼ minfJSðP; Q; aÞg (10) a2AL described in the previous section. where AL is the set of all possible alignments with a non-empty superposition. The need of a ‘strong’’ non-overlapping penalty comes out in order to select alignment with a significant overlapping 2.4. PWMs distance based on Jensen-Shannon between the two models (i.e. to avoid bias due to a good affinity divergence between too small part of the two models). The Jensen-Shannon divergence (JS) is a symmetrized and smoothed version of the Kullback-Leibler divergence and it is often used to tell 21 2.5. PWMs distance based on TFBS overlapping how two frequency distributions are close to each other. Here, we Here, we introduce a different similarity function between pairs of use Jensen-Shannon divergence to identify similar PWMs using the TFs based on the percentage of overlapped binding sites they share. nucleotide frequencies as probability distribution. The Jensen- First of all, we consider two TFBSs to be overlapped if their distance Shannon divergence of two frequency distributions, P and Q, can be is smaller than the half of the size of the smaller PWM, computed using the Shannon entropy according to the following minfPWM ; PWM g 1 2 i.e. TFBS  TFBS < . If the TFBSs of two TSs i j equation: are often overlapped it is likely that they result to be associated, according to our algorithm, showing a high Z-score. They are indeed 1 1 1 1 JSðÞ P; Q ¼ H P þ Q  HPðÞ HQðÞ (7) often close because the TFs recognize the same binding sites (or a 2 2 2 2 large portion of them). Given a couple of TFs, TF and TF , the per- i j where H(X) is the Shannon entropy of the distribution X centage of all overlapping occurrences of TFBSs with respect to all Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 106 Transcription factor synergism Figure 2. Scheme of the algorithm for computing similarity score of models. the joined occurrences in a radius ‘ in all the transcript promoter 0.5 sequences, namely OVD(TF ,TF ), is defined as follows: i j 0.4 Selected l ðTF ; TF Þ i j Tot OVD TF ; TF ¼ (11) i j ðTF ; TF Þ i j Tot 0.3 Discarded where  ðTF ; TF Þ is the total number of the TFBSs couples (one Tot i j from TF and the other from TF ) falling together in the considered i j radius of size ‘ and l ðTF ; TF Þ is the total number of TFBSs over- 0.2 i j Tot lapped couples (one from TF and the other from TF ), for all the i j transcript promoter sequences. In the next section (as reported in 0.1 Fig. 3), OVD scores are plotted against JSD scores in order to select significant couples. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 2.6. Protein-protein interaction network analysis OVD In order to validate the synergy of TFs couples identified by the algo- Figure 3. Each point represents a TF couple whose Z-scores are significant rithm here presented we analysed Protein-Protein Interaction net- (Z > 5) for at least 20 transcripts. The plane is divided into two regions: the work (PPI) to test the interaction of this couples in the network by upper-left sector, where Overlapping Score is higher than Similarity Score, using shortest path. provides selected couples. Lower-right sector, where Overlapping Score is Protein-Protein Interaction network was downloaded from lower than Similarity Score, provides Discarded couples. STRING database. We considered all the possible interaction sour- ces between proteins giving a score threshold of 0.7 (high confidence value of zðTF ; TF ; TR Þ, the higher the association between TF and i j k i interaction). The obtained network is made of 719,288 edges (inter- TF in the transcript promoter TR sequence will be. Since for each j k actions) and 14,932 nodes (proteins). The iGraph R package (https:// transcript, we have to test a considerably large number of couples cran.r-project.org/web/packages/igraph/index.html (12 October (TF ,TF ) and we test the association between couples in a large i j 2017, date last accessed)) was used to compute the shortest path number of transcripts, we select only those couples for which the from the adjacency matrix of the graph. value of zðTF ; TF ; TR Þ is higher than 5 (meaning that the number i j of occurrence of couples TF TF in a radius ‘ is far from that i j expected by chance more than 5 S.D.) and we indicate with NZ ðTF 5 i 3. Results and discussion ; TF Þ the number of transcripts TR for which the couple (TF ,TF ) j k i j 3.1. Selecting significant TF couples has a z-score greater than 5. Given a couple (TF ,TF ) it is reasonable i j In the previous Section, Equation (6) has been derived in order to to expect that a large value of NZ ðTF ; TF Þ indicates that the two 5 i j provide a measure of association between each couple of transcrip- models TF ,TF are functionally associated. However, a high value i j tion factors (TF ,TF ) with respect to a given transcript TR . Our of NZ ðTF ; TF Þ could refer to a couple of models with a high simi- i j k 5 i j goal is to find out couples of TFs that are significantly related in larity score rather than a functional association between TF and TF . i j gene-regulation mechanism. As previously mentioned, the higher the Therefore, to correctly identify relevant associated transcription Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 JSD F. Cumbo et al. 107 Table 1. Categories and terms for the enrichment analysis Category Number Number of terms of genes OMIM expanded 187 2,178 Tissue protein expression 207 62,307 from proteomicsDB KEGG 2015 179 3,800 OMIM disease 90 1,759 GO molecular function 1,136 12,753 Figure 4. Workflow for the analysis of TF couples. In panel A, the procedure GO cellular component 641 13,236 for the selection of relevant couples is depicted. In panel B, the analysis of GO biological process 5,192 14,264 significant couples is described both in terms of enrichment tests and sort- Chromosome location 386 32,740 ing of best couples. factors, we will use the overlapping score OVD(TF ,TF ) and the i j similarity score JSD(TF ,TF ). In Fig. 3, OVD(TF ,TF ) and JSD(TF , i j i j i TF ) values are shown for all the TF couples which have NZ TF ; TF > 20. Since the higher the value of JSD the lower the 5 i j similarity between two models is, while the higher the values of OVD the higher the similarity between binding sites is, as selection criterion to identify significant couples we used the inequality JSD > OVD. In such a way, the plane (OVD, JSD) is divided into two regions: the upper-left sector, where models are not similar and the overlapping is low (selected couples), while in the lower-right sec- tor, where models are similar and the overlapping is high (discarded couples). It is worth remarking that even if measures JSD and OVD could appear redundant, Fig. 3 clearly stated that the two measures are quite different: there are couples with similar JSD and very differ- -16 -14 -12 -10 -8 -6 -4 -2 ent OVD and vice versa. In fact JSD measure only depends on the 10 10 10 10 10 10 10 10 P-value intervals related PWM while OVD measure depends on the considered data- set, e.g. two models can bind the same sequence even if they are dif- Figure 5. Number of couples enriched in at least one term as a function of ferent. Considering a combination of both measures we obtain a their smallest P-value. Solid plot shows the enrichment of the couples identi- more robust selection criterion. fied by our algorithm, dashed plots show the enrichment of random couples In Fig. 4 (panel A) is sketched the procedure we use to select the for comparison. 547 couples identifying pairs of transcription factors that are likely to synergistically regulate gene transcription. In the following sec- the black plot, there is only one couple (M00106—M00967) having a tions, we will analyse the set of selected couples in order to associate term [ubiquitinyl hydrolase activity (GO: 0036459)] enriched with a to them a biological task and to find the most relevant associated 18 P-value smaller than 10 (3.63E–19), another couple (M00739— couples according to their statistical features (see Fig. 4 panel B). 14 M00799) for P-value smaller than 10 and three different couples for P-value smaller than 10 andsoon(see Table 2 whereall theenriched 3.2. Enrichment analysis and validation terms are ordered according to their P-values). It is worth noting that some couples have more than one term with a significant P-value, for In order to evaluate the biological significance of the obtained TF example couple (M00106—M00967) shows several terms with a couples, we performed hypergeometric enrichment tests. Given a remarkable enrichment: GO Molecular function—ubiquitinyl hydrolase couple of models, we want to test whether the set of genes potentially activity (GO: 0036459)—P-value 3.63E–19, Go Molecular function— regulated by the synergy of the corresponding two TFs, is enriched in cysteine-type peptidase activity (GO: 0008234)—P-value 4.56E–17, Go a particular category-term. We wondered whether those genes are Biological Process—Ubiquitin-dependent protein catabolic process (GO: linked in a given biological task in order to associate the TF couple 0006511), and so on. The comparison of enrichment P-values, related to a given biological process. We selected a list of categories and related terms, reported in Table 1, that identify different biological to couples identified by our algorithm (black plot), with those related to tasks and contexts. We then performed for each identified TF couple random sets (ten coloured plots) clearly highlights the significance and hypergeometric enrichment tests related to all the considered terms. consistency of our analysis, since none of the random sets shows P-val- For each TF couple we obtained a list of P-values linked to the corre- uessmallerthan10 . Also analysing the plots for larger P-values inter- sponding terms. The enrichment results were then compared with those vals the number of enriched couples is significantly higher than all the of ten random sets of genes. Each set was generated preserving the same corresponding numbers of random sets (taking into account the log- number of sample couples (i.e. 547) and the same number of genes for scale units in y-axis). each couple. The results are reported in Fig. 5 where solid plot represents 3.3. Analysis of relevant transcription factor couples the enrichment of couples identified by our algorithm and dashed plots show enrichment of random sets. Each point of the plots shows in log- In this section, we analyse the most significant couples identified by scale the number of TF couples (y-axis) with at least one term whose P- our algorithm. Among the 547 selected couples, we consider as partic- value falls in the corresponding P-value interval (x-axis). For example in ularly relevant those couples with the smallest enrichment P-value or Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 Number of couples 108 Transcription factor synergism Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 Table 2. The PWM model IDs and the corresponding TF names are reported in columns 1–4 for each couple; the category and the related term the couple resulted enriched in are reported in columns 5–6; column 7 reports the related P-value of the hypergeometric test Model 1 TF 1 Model 2 TF 2 Category Term P-value M00106 CDP CR3þHD M00967 HNF4 COUP GO Molecular function ubiquitinyl hydrolase activity (GO: 0036459) 3, 63E–19 M00106 CDP CR3þHD M00967 HNF4 COUP GO Molecular function cysteine-type peptidase activity (GO: 0008234) 4, 56E–17 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Gonadal mesoderm development (GO: 0007506) 6, 60E–15 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Ubiquitin-dependent protein catabolic process (GO: 0006511) 3, 59E–14 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Gonadal mesoderm development (GO: 0007506) 4, 09E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Modification-dependent protein catabolic process (GO: 0019941) 4, 42E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Modification-dependent macromolecule catabolic process (GO: 0043632) 4, 90E–14 M00106 CDP CR3þHD M00967 HNF4 COUP GO Biological process Proteolysis involved in cellular protein catabolic process (GO: 0051603) 8, 43E–14 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Nucleosome organization (GO: 0034728) 1, 42E–13 M00777 STAT M00980 TBP Chromosome location Chr5q13 1, 54E–13 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Protein-DNA complex subunit organization (GO: 0071824) 4, 01E–13 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Nucleosome assembly (GO: 0006334) 2, 81E–12 M00457 STAT5A M00980 TBP Chromosome location Chr5q13 6, 40E–12 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Protein-DNA complex assembly (GO: 0065004) 8, 69E–12 M00223 STATx M00980 TBP Chromosome location Chr5q13 4, 82E–11 M00799 Myc M00927 AP-4 GO Biological process Gonadal mesoderm development (GO: 0007506) 1, 12E–10 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Mesoderm development (GO: 0007498) 2, 71E–10 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Mesenchyme development (GO: 0060485) 4, 02E–10 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Mesoderm development (GO: 0007498) 1, 66E–09 M00736 E2F-1: DP-1 M00799 Myc GO Biological process Mesenchyme development (GO: 0060485) 2, 46E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Nucleosome assembly (GO: 0006334) 4, 46E–09 M00462 GATA-6 M00921 GR GO Molecular function FK506 binding (GO: 0005528) 7, 13E–09 M00462 GATA-6 M00921 GR GO Molecular function Macrolide binding (GO: 0005527) 7, 13E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Protein-DNA complex assembly (GO: 0065004) 9, 90E–09 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Nucleosome organization (GO: 0034728) 1, 44E–08 M00655 PEA3 M00803 E2F Chromosome location Chr5q13 1, 50E–08 M00799 Myc M00803 E2F GO Biological process Protein autophosphorylation (GO: 0046777) 2, 05E–08 M00739 E2F-4: DP-2 M00799 Myc GO Biological process Protein-DNA complex subunit organization (GO: 0071824) 2, 73E–08 M00059 YY1 M00148 SRY GO Molecular function Sodium ion transmembrane transporter activity (GO: 0015081) 2, 96E–08 M00415 AREB6 M00706 TFII-I KEGG Valine leucine and isoleucine biosynthesis 4, 10E–08 F. Cumbo et al. 109 with the best associated parameters (highest number of transcripts, histone deacetylase (HDAC) enzymatic activity. As reported in Ref. smallest OVD and highest JSD) as reported in Fig. 4 panel B. The list 25, it has been identified a subunit of the complex NuA4 as the prod- of couples with the smallest P-values (smaller than 10 ) is reported uct of TRA1, an ATM-related gene homologous to human TRRAP, in Table 2. The analysis of the couples reported in the table suggests an essential cofactor for c-Myc- and E2F-mediated oncogenic trans- several observations, providing and/or confirming evidence of the syn- formation. A further link between Myc and E2F was studied in Ref. ergism of several TF couples and the related biological contexts they 26 highlighting their relationships with chromatin structure and are involved in. The smallest P-values are reported for the couple stability. It is remarkable and can be source of further investigations CDP (also known as Cutl1) (M00106)–HNF4 (M00967), as already the association of the couple AREB (M00415)–TFII-I (M00706) (also mentioned above. The related enriched terms are mainly associated (4 known as GTF2I) with KEGG pathway Valine leucine and isoleucine out of 6) with catabolic processes while the other two terms are ubiq- biosynthesis (the couple showing a P-value of 4.10E–8). Interestingly, uitinyl hydrolase activity and cysteine-type peptidase activity. transcription factors of STAT family (STAT—M00777, STATx— Concerning the couple Myc (M00799)–E2F (M00803) (and its cluster M00223 and STAT5A–M00457) strongly interact with TBP E2F–1 and E2F–4) it can be observed that it appears several times in (M00980) in the promoters of genes belonging to the region of the table associated to different terms. These terms can be grouped chr5q13, that is known to be associated with various neurological dis- into main activities/contexts regarding: (1) gonadal mesoderm devel- orders and pathologies such as Spinal Muscular Atrophy, Hairy Cell 27–29 opment and mesenchyme development, (2) Nucleosome assembly and Leukemia and it is also connected to Alcohol Dependence. organization, Protein-DNA complex assembly and Protein-DNA Finally, it could deserve attention the couple Areb6 (M00414)–E2F Complex Subunit Organization and (3) Protein (M00803) associated to the term Mental retardation, even if not Autophosphorylation. It is worth noting that in literature there are reported in Table 2, it shows a remarkable P-value (610 ). Areb6 many references associated to the couple Myc–E2F. In Ref. 24, it has belongs to the Zeb transcription factor family that has been shown to been shown how the two TFs are linked via mSin3A, a core compo- be involved in mental retardation syndromes. In Table 3, the best nent of a large multiprotein co-repressor complex with associated couples, in terms of association, are reported ordered by the number Table 3. The PWM model IDs and the corresponding TF names are reported in columns 1–4 for each couple; Column 5 reports the number of transcripts whose Z-scores, related to the couple, are higher than 5; the average Z-score of all the significant (Z > 5) transcripts is reported in column 6; Overlapping and Similarity Scores are reported in columns 7 and 8, respectively Model 1 TF 1 Model 2 TF 2 Number of transcripts Average Z-score Overlapping score Similarity score M00083 MZF1 M00649 MAZ 1,014 7, 77 0, 19 0, 21 M00083 MZF1 M00803 E2F 456 6, 77 0, 09 0, 21 M00803 E2F M00976 AhR Arnt HIF-1 338 6, 63 0, 03 0, 21 M00706 TFII-I M00971 Ets 317 7, 71 0, 16 0, 24 M00799 Myc M00803 E2F 293 7, 16 0, 00 0, 23 M00706 TFII-I M00803 E2F 275 6, 61 0, 00 0, 27 M00148 SRY M00747 IRF-1 254 7, 77 0, 00 0, 20 M00148 SRY M00471 TBP 239 6, 78 0, 02 0, 21 M00148 SRY M00980 TBP 203 6, 45 0, 00 0, 23 M00698 HEB M00803 E2F 196 6, 35 0, 02 0, 29 M00649 MAZ M00658 PU.1 187 7, 49 0, 03 0, 22 M00649 MAZ M00799 Myc 184 6, 85 0, 00 0, 33 M00799 Myc M00933 Sp1 182 7, 06 0, 01 0, 28 M00462 GATA-6 M00471 TBP 182 6, 47 0, 08 0, 21 M00799 Myc M00931 Sp1 167 6, 97 0, 00 0, 30 M00803 E2F M00927 AP-4 160 6, 29 0, 01 0, 24 M00801 CREB M00803 E2F 153 6, 43 0, 00 0, 28 M00706 TFII-I M00931 Sp1 141 6, 58 0, 04 0, 22 M00649 MAZ M00971 Ets 132 7, 32 0, 00 0, 23 M00933 Sp1 M00976 AhR Arnt HIF-1 127 6, 65 0, 02 0, 21 M00803 E2F M00981 CREB ATF 127 6, 72 0, 00 0, 23 M00799 Myc M00932 Sp1 121 7, 17 0, 00 0, 28 M00148 SRY M00706 TFII-I 120 8, 58 0, 00 0, 31 M00931 Sp1 M00976 AhR Arnt HIF-1 117 6, 47 0, 01 0, 20 M00803 E2F M00917 CREB 115 6, 81 0, 00 0, 25 M00008 Sp1 M00706 TFII-I 114 6, 36 0, 02 0, 21 M00791 HNF3 M00975 RFX 107 6, 58 0, 00 0, 20 M00471 TBP M00747 IRF-1 107 6, 32 0, 02 0, 25 M00649 MAZ M00976 AhR Arnt HIF-1 106 6, 28 0, 00 0, 26 M00148 SRY M00962 AR 106 6, 04 0, 00 0, 20 M00148 SRY M00789 GATA 106 6, 08 0, 00 0, 23 M00148 SRY M00975 RFX 104 6, 20 0, 00 0, 21 M00008 Sp1 M00799 Myc 102 6, 77 0, 00 0, 29 M00775 NF-Y M00803 E2F 101 6, 24 0, 03 0, 20 The couple Myc, E2F is reported in bold to highlight it is also included in Table 2. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 110 Transcription factor synergism Table 4. Frequencies of couples as a function of the shortest path (SP) distance for three classes of protein couples: (i) the selected 547 TF couples (namely BEST, first row), (ii) all the couples of TFs (namely ALL, second row) and iii) 10 random sample sets made of 547 randomly picked protein couples from the whole PPI (namely RANDOM, mean and standard deviation in the third and fourth row, respectively) SP 1 SP 2 SP 3 SP 4 SP 5 SP 6 SP 7 BEST 0.117318 0.478585 0.284916 0.10987 0.009311 0 0 ALL 0.056789 0.463694 0.376242 0.084080 0.014167 0.004661 0.000368 RANDOM (mean) 0.003291 0.065601 0.377048 0.385820 0.134512 0.027972 0.005346 RANDOM (stdv) 0.000943 0.006363 0.016408 0.008298 0.009771 0.003329 0.002414 of significant Transcripts. We selected those couples with a number of from String database. We found a significant overall difference significant transcripts (Z-score higher than 5) higher than 100, with between shortest path distribution of the set of couples selected by an Overlapping score (OVD) smaller than 0.2 and a Similarity score our method and the set of all TFs couples. In Table 4, we report the (JSD) higher than 0.2 (it is also reported in the fifth column the aver- percentages of couples as a function of the Shortest Path (SP) distan- age Z-Score). The couple MZF1 (M00083)—MAZ (M00649) shows ces for three classes of protein couples: (i) the selected 547 TF cou- a Z-score higher than 5 in 1,014 transcript promoter sequences (with ples (namely BEST, first row), (ii) all the couples of TFs (namely an average z-score of 7.77—more than 7 S.D. far from the expected ALL, second row) and (iii) 10 random sample sets made of 547 ran- average value). Myc-associated zinc finger protein (MAZ) and domly picked protein couples from the whole PPI (namely Myeloid zinc finger 1 (MZF1) are both transcription factors charac- RANDOM, mean and standard deviation in the third and fourth terized by a zinc finger small protein structural motif. The associated row, respectively). The majority of couples (around 50%) related to PWMs show a similar common sub-motif characterized by a GGGA TFs shows a SP equal to 2 both for couples obtained by our algo- sequence. Nevertheless the PWMs have different length (6 and 8) and rithm and for all TF couples, while only around 6% of protein ran- the correspondent similarity score, JSD, results to be 0.21 meaning dom couples has a SP distance equal to 2. The most relevant result is that the two PWMs are similar but globally not so close to each other. in the difference of the frequency for SP ¼ 1 (indicating a direct inter- Significantly, the OVD overlapping score is quite low 0.188 meaning action) that is 0.12 for our best couples and 0.057 for all TF cou- that, on average, among all the identified TFBSs couple (both in a ples (i.e. the ratio is greater than 2), and also for SP ¼ 3(0.28 our window of 80 bp) of the two TFs, less than 1 out of 5 couples are best couples versus 0.37 for all TF couples). The distribution for overlapped. This is why we included this couple in the selected set of the random protein data shows, not surprisingly, a complete differ- TFs couples. The surprisingly high number of transcripts for which ent pattern (the most part of couples having SP equal to 3 and 4). the two TFs co-occur should deserve deeper investigations. The cou- Even considering the limitations of this kind of analysis—a PPI is a ple MZF1 (M00083)–E2F (M00803) shows 456 transcripts with a Z- global representation of potential interactions between proteins (and score higher than 5 (with an average z-score of 6, 77, OVD ¼ 0.09 consequently TFs) that not always (referring to time and space) is an and JSD ¼ 0.21). Interestingly they are involved in several diseases, in actual interaction and, moreover, the results depend on the threshold particular, as reported in Ref. 31, they are both potential key regula- chosen to select significant interactions—it reveals that TFs couples tors of PKD1 and PKD2 whose mutations are linked with autosomal selected by our method show an overall stronger relationship than dominant polycystic kidney disease (ADPKD). We found one tran- those in the set of all TFs couples. In particular, we chose a quite script of PKD1 uc002cos.1 with a Z-score for the couple equal to 2 strict threshold of 0.7 (as reported in the Material and methods sec- and two transcripts of PKD2 uc003hre.3 and uc011cdg.2 with Z- tion), so the significance of results has to be found in the ratio score equal to 5.80 and 3.42, respectively. It is interesting that a between the number of our selected couples and all TFs couples at SP minor groove binding protein SRY (M00148) is associated to both 1 (ratio equal to 2) rather than in the percentages (12% and 6%, the models of TBP (M00471 and M00980) with a number of signifi- approximately) that could significantly change since they depend on cant transcripts equal to 239 and 203, respectively, a very small OVD the threshold. close to 0 for both and JSD 0.21 and 0.23, respectively. It is worth noting that the couple Myc (M00799)–E2F (M00803) (also included 4. Conclusions in Table 2) shows a number of significant transcripts equal to 293, A better understanding of the mechanisms driving the regulation of pro- meaning that, besides a clear association with given biological con- tein expression is an essential requisite to shed light on the behaviour of texts, as discussed in the previous section, there is a strong synergism cells. Transcription factors play a central role in this extremely complex between the two TFs confirmed by the values reported in the table. task and it has been shown that they synergically co-operate in order to Concerning the couple Maz (M00649)–Pu.1 (M00658) (number of significant transcripts equal to 187) it has been shown that three tran- provide a fine tuning of protein expressions. Among the different meth- scription factors Maz, PU.1 and ARNT show significant recognition ods able to detect and identify TFs interplay, a very important resource elements among similarly up or down-regulated genes involved in is the computational methods to which our work mainly refers. In this hematopoietic differentiation or leukemogenesis. We note that also work, we present a mathematically well-founded procedure able to identify TFs couples that act together, inferring for several of those cou- the couple Maz (M00649)–ARNT (M00976) is included in the table ples the biological context they are involved in. We introduced a new with a number of significant transcripts equal to 106. In order to further validate and provide significance to obtained and robust statistical method based on the use of a good Bernoulli results, we computed the shortest paths between couples of TFs approximation and on a z-score measures able to discriminate between related to the Protein-Protein Interaction (PPI) network downloaded random and non-random co-occurrences of couple of TFs. We Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 F. Cumbo et al. 111 7. Wingender, E. 2008, The TRANSFAC project as an example of frame- extended previous methods based on randomize sequences as in Refs. work technology that supports the analysis of genomic regulation, Brief. 15 and 16. In order to avoid biases due to the structural similarity of Bioinform., 9, 326–32. different models Overlapping and Similarity scores were designed to 8. Sandelin, A., Alkema, W., Engstro ¨ m, P., Wasserman, W.W. and Lenhard, select TF couples that are significant for the analysis. We used such a B. 2004, JASPAR: an open-access database for eukaryotic transcription method to find pairs of associated transcription factors in the set of all factor binding profiles, Nucleic Acids Res., 32, D91–4. human promoter sequences (selected with a careful analysis from all 9. Tompa, M., Li, N., Bailey, L.T., Church, G.M., et al. 2005, Assessing human transcripts) and we also performed enrichment analysis on the computational tools for the discovery of transcription factor binding sites, set of genes regulated by identified couples. This analysis provides con- Nat. Biotechnol., 23, 137–44. sistence to our results but also provides a biological context to be asso- 10. Suryamohan, K. and Halfon, M.S. 2015, Identifying transcriptional cis-regulatory modules in animal genomes, Wires. Dev. Biol., 4, 59–84. ciated to the couples. Moreover, we also performed network analysis 11. Blanchette, M., Bataille, A.R., Chen, X., et al. 2006, Genome-wide showing that TFs couples identified by the algorithm here presented are computational prediction of transcriptional regulatory modules closer than expected in the protein-protein interaction network in terms reveals new insights into human gene expression. Genome. Res., 16, of shortest path. 656–68. To our knowledge this is the only work concerning synergy of 12. Prabhakar, S., Poulin, F., Shoukry, M., et al. 2006, Close sequence com- Transcription Factors taking into account all those features. Some of parisons are sufficient to identify human cis-regulatory elements, Genome the couples emerging in this study are already known to be linked in Res., 16, 855–63. 24–26 several biological contexts (such as Myc–E2F, while in other 13. Hu, Z., Hu, B. and Collins, J.F. 2007, Prediction of synergistic transcrip- cases our results can lead to hypothesize links between TFs couples tion factors by function conservation, Genome Biol., 8, R257. 14. Yu, X., Lin, J., Zack, D.J. and Qian, J. 2006, Computational analysis of and diseases (as for the STAT family that strongly interact with TBM tissue-specific combinatorial gene regulation: predicting interaction in the promoters of genes belonging to the region of chr5q13 27–29 between transcription factors in human tissues, Nucleic Acids Res., 34, involved in several diseases. Finally, some other couples were 4925–36. not previously identified (such as the couple MZF1–MAZ) that 15. Nandi, S., Blais, A. and Ioshikhes, I. 2013, Identification of cis-regulatory would deserves further investigation. modules in promoters of human genes exploiting mutual positioning of According to our algorithm two TFs could be associated in differ- transcription factors, Nucleic Acids Res., 41, 8822–41. ent ways, for example they could be co-operative in a strict sense or 16. Hannenhalli, S. and Levy, S. 2002, Predicting transcription factor syner- concurrent in the sense that the presence of one of the two impedes gism, Nucleic Acids Res., 30, 4278–84. the presence of the other. A small size window, as the one we use, 17. Kent, W.J., Sugnet, C.W., Furey, T.S., et al. 2002, The human genome leads us to hypothesize that two transcription factor proteins that browser at UCSC, Genome Res., 12, 996–1006. 18. Gentleman, R.C., Carey, V.J., Bates, D.M., et al. 2004, Bioconductor: could bind sites within the window, either are sufficiently close to open software development for computational biology and bioinfor- each other to physically interact, or only one of the two is able to matics, Genome Biol., 5,1. bind its TFBS because of the steric hindrance. 19. Marinescu, V.D., Hohane, I.S. and Riva, A. 2005, The MAPPER data- This work represents a step in the direction of designing complex base: a multi-genome catalog of putative transcription factor binding sites, gene regulatory networks, and it provides information on TFs associ- Nucleic Acids Res., 33 (suppl 1), D91–7. [CVOCROSSCVO] ation that could be useful in this context. The identification of signifi- 20. Page ` s, H., Aboyoun, P., Gentleman, R. and DebRoy, S. 2009, String cant TFs couples could be of help in the view of artificially altering objects representing biological sequences, and matching algorithms. R the regulation of genes by inhibiting the interaction between given package version 2.2. TFs couples. 21. Lin, J. 1991, Divergence measures based on the shannon entropy, IEEE Trans. Inform. Theory, 37, 145–51. 22. Szklarczyk, D., Morris, J.H., Cook, H., et al. 2017, The STRING data- base in 2017: quality-controlled protein-protein association networks, Conflict of interest made broadly accessible, Nucleic Acids Res., 45, D362–8. None declared. 23. Lee, J.K., ed. 2010, Statistical Bioinformatics for Biomedical and Life Science Researchers. Wiley-Blackwell, Hoboken, New Jersey. 24. Dannenberg, J.H., David, G., Zhong, S., van der Torre, J., Wong, W.H. References and Depinho, R.A. 2005, mSin3A corepressor regulates diverse transcrip- tional networks governing normal and neoplastic growth and survival, 1. Mitchell, P.J. and Tjian, R. 1989, Transcriptional regulation in mamma- lian cells by sequence-specific DNA binding proteins, Science, 245, 371–8. Genes Dev., 19, 1581–95. 2. Latchman, D.S. 1997, Transcription factors: an overview, Int. J. Biochem. 25. Allard, S., Utley, R.T., Savard, J., et al. 1999, NuA4, an essential tran- Cell Biol., 29, 1305–12. scription adaptor/histone H4 acetyltransferase complex containing Esa1p 3. Zhou, H.X. 2011, Rapid search for specific sites on DNA through confor- and the ATM-related cofactor Tra1p, Embo J., 18, 5108–19. mational switch of nonspecifically bound proteins, Proc. Nat. Acad. Sci. 26. Albert, T., Wells, J., Funk, J.O., et al. 2001, The chromatin structure of U.S.A., 108, 8651–6. the dual c-myc promoter P1/P2 is regulated by separate elements, J. Biol. 4. Berg, O.G., Winter, R.B. and Von Hippel, P.H. 1981, Diffusion-driven Chem., 276, 20482–90. 27. Lin, P., Hartz, S.M., Wang, J.C., et al. 2012, Copy number variations in mechanisms of protein translocation on nucleic acids. 1. Models and theory, Biochemistry, 20, 6929–48. 6q14. 1 and 5q13. 2 are associated with alcohol dependence, Alcohol. 5. Pahl, H.L. 1999, Activators and target genes of Rel/NF-kB transcription Clin. Exp. Res., 36, 1512–8. factors, Oncogene, 18, 6853–66. 28. Also-Rallo, E., Alias, L., Martinez-Hernandez, R., et al. 2011, Treatment 6. Wingender, E., Chen, X., Hehl, R., et al. 2000, TRANSFAC: an inte- of spinal muscular atrophy cells with drugs that upregulate SMN expres- grated system for gene expression regulation, Nucleic Acids Res., 28, sion reveals inter-and intra-patient variability, Eur. J. Hum. Genet., 19, 316–9. 1059–65. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018 112 Transcription factor synergism 29. Wu, X., Ivanova, G., Merup, M., et al. 1999, Molecular analysis of the mutations in the zinc finger homeo box 1B, Am. J. Med. Genet., 108, human chromosome 5q13. 3 region in patients with hairy cell leukemia 177–81. and identification of tumor suppressor gene candidates, Genomics, 60, 31. Lantinga-van Leeuwen, I.S., Leonhard, W.N., Dauwerse, H., et al. 2005, 161–71. Common regulatory elements in the polycystic kidney disease 1 and 2 pro- 30. Zweier, C., Albrecht, B., Mitulla, B., et al. 2002, ‘Mowat-Wilson’ syn- moter regions, Eur. J. Hum. Genet., 13, 649–59. drome with and without hirschsprung disease is a distinct, recognizable 32. Bogni, A., Cheng, C., Liu, W., et al. 2006, Genome-wide approach to identify multiple congenital anomalies-mental retardation syndrome caused by risk factors for therapy-related myeloid leukemia, Leukemia, 20, 239–46. Downloaded from https://academic.oup.com/dnaresearch/article-abstract/25/1/103/4562529 by guest on 16 March 2018

Journal

DNA ResearchOxford University Press

Published: Feb 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 12 million articles from more than
10,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Unlimited reading

Read as many articles as you need. Full articles with original layout, charts and figures. Read online, from anywhere.

Stay up to date

Keep up with your field with Personalized Recommendations and Follow Journals to get automatic updates.

Organize your research

It’s easy to organize your research with our built-in tools.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve Freelancer

DeepDyve Pro

Price
FREE
$49/month

$360/year
Save searches from
Google Scholar,
PubMed
Create lists to
organize your research
Export lists, citations
Read DeepDyve articles
Abstract access only
Unlimited access to over
18 million full-text articles
Print
20 pages/month
PDF Discount
20% off