Survival analysis for user disengagement prediction: question-and-answering communities’ case

* Hassan Abedi Firouzjaei, hassan.abedi@ntnu.no
Department of Computer Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Social Network Analysis and Mining (2022) 12:86

In this work, we used survival analysis to model user disengagement in three distinct question-and-answering communities. We used the complete historical data of the Politics, Data Science, and Computer Science Stack Exchange communities from their inception until May 2021, including information about all users who were members of one of these three communities. Formulating user disengagement prediction as a survival analysis task, we employed two survival analysis techniques (Kaplan–Meier and random survival forests) to model and predict the probabilities of members of each community becoming disengaged. Our main finding is that users with even a few contributions are noticeably more likely to stay active than users who made no contributions; this distinction may widen as time passes. Moreover, the results of our experiments indicate that users with more favourable views toward the content shared on the platform may stay engaged longer. Finally, the observed pattern holds for all three communities, regardless of their themes.

Keywords: Question-and-answering platforms · User disengagement · Survival analysis · Stack Exchange

1 Introduction

Online question-and-answering (QA) social networks like Stack Overflow (https://stackoverflow.com) and Quora (https://www.quora.com) depend on their users' contributions for proper functioning. (In this work, we use the terms question-and-answering platform, social network, and community interchangeably.) Arguably, the main functionality of a QA platform is to connect two types of users (Kuzmeski 2009): on one side, people who seek answers to their questions, and on the other side, people who are willing to share their knowledge and expertise with others (Guan et al. 2018). Nevertheless, a user who joined and made many contributions to the community may become uninterested and disengage after a while. By disengaged, we mean the situation where users who previously made contributions (e.g., answered questions and participated in debates) suddenly stopped their activities (i.e., there is no sign of them even visiting the platform's web pages). It is not known whether these users left the community, but they did not perform any activity on the platform for a relatively long period of time (e.g., more than a year). In this context, disengagement might have happened for various reasons; e.g., disengaged users may have believed that the platform had an elitist or even toxic culture. Another reason could be that user interests changed drastically over time and the platform hosting the QA community could not adapt to the change in an agile way.

At the very least, a high disengagement rate has adverse effects on the overall quality of service of a QA social network and platform. For example, suppose all the experts (i.e., users who post answers perceived as high quality by the community) become disengaged within a few months of joining and being active. In that case, the quality of answers might plummet, which may increase the rate of users' disengagement from the community (Pudipeddi et al. 2014; Dror et al. 2012). In the worst-case scenario, the QA platform loses the bulk of its contributors, which in turn would lead to its demise.

Survival analysis (Cox and Oakes 2018; Wang et al. 2019) is a family of statistical methods and techniques that can help model and predict the time of the occurrence of an event of interest. It emerged out of medical research, where it was used to find the probability of a patient surviving a disease such as cancer (hence the term survival analysis). More recently, survival analysis methods have found widespread use in new areas such as customer churn analysis (Dias et al. 2020; Rothmeier et al. 2021) and credit risk scoring (Stepanova and Thomas 2002), mainly due to their flexibility and power in accurately and reliably modelling the problems posed in these areas.

In this work, we used survival analysis to study user disengagement in three distinct QA social networks, namely, Politics, Data Science, and Computer Science Stack Exchange. Our choice allowed us to pose questions and seek answers based on the data from QA communities with the themes mentioned above. To our knowledge, this is the first work that applies survival analysis to quantify and study user disengagement using the entire historical data of online QA social networks. Figure 1 illustrates how disengagement prediction can be seen and formulated as a survival analysis task.

Fig. 1 User A and user B joined the platform in the past; during a period of observation which started at $t_{start}$ and ended at $t_{end}$, B became disengaged at $t_{disengaged}$. A did not become disengaged during the observation, but it is not known whether A will become disengaged in the future; information about A's disengagement is censored

The main contributions of our work are as follows:

• We study the factors likely to be associated with the probability that users of QA communities will stay active for an extended period. For the first time, we analyze the relationships between attributes related to users' contributions and their engagement time.
• We propose to exploit behavioural (see Table 4) and content-based user attributes (see Table 5) to estimate the engagement time on three comprehensive datasets from distinct QA communities.

The rest of this article is organised as follows. Section 2 discusses the related work. Section 3 presents preliminary concepts related to survival analysis and introduces techniques used to model and evaluate the problem of user disengagement prediction. Section 4 gives an overview of the dataset and the methodology used to represent users and the engagement time. Section 5 presents the results of the experiments and Section 6 discusses the results. Section 7 discusses the limitations of our work and outlines directions for future work. Finally, Section 8 concludes the paper.

2 Related work

QA platforms like Stack Exchange and Quora provide an accessible knowledge-sharing environment. Because this role was earlier played by mailing lists, newsgroups and IRC channels, interest in studying phenomena on these platforms has exploded lately. For example, Joyce and Kraut (2006) studied continued user participation in newsgroups. They used the posts from six public newsgroups to test whether the answers that users receive to their first few questions are crucial for prolonging user participation. Their findings suggest that longer questions are more likely to receive a response. Furthermore, the quality and emotional tone of the answer, and whether the answer was in response to a question from a new user, seem not to influence the likelihood of further participation.

Guan et al. (2018) used the data from the most popular Chinese social QA platform, Zhihu (https://www.zhihu.com), to investigate the factors related to users' motivations to participate in community activities, especially knowledge contribution. Their findings suggest that social exchange is an important factor influencing users' continuous knowledge contribution in social QA communities. Moreover, the findings show that knowledge exchange based on norms of reciprocity is an important factor affecting users' continuous contribution. For example, a user who frequently seeks knowledge is more likely to contribute knowledge to others, indicating that users contribute because they expect to get a response to their own questions in the future. Similarly, Jin et al. (2015) used data from Zhihu to study the elements that influence user knowledge contributions in QA platforms, incorporating social capital theory, social exchange theory, and social cognitive theory in their work.

Furthermore, the use of survival analysis methods is gaining popularity wherever an analogy can be made between the problem and the task of survival analysis. Wang et al. (2019) provided a comprehensive review of two major categories of methods and techniques for survival analysis, namely, conventional methods and various machine learning methods. Their work described and discussed different related topics, including data transformation and early prediction of complex events, along with appropriate evaluation metrics.

Yang et al. (2010) used survival analysis methods to analyze and study user retention in three major QA communities: Baidu Knows, Yahoo! Answers, and Naver Knowledge-iN. Their findings suggest that users who preferred answering tend to have a longer and more active engagement period within the platform. Moreover, garnering enough questions to retain the experts seems essential. Additionally, users who put more effort into their questions, as measured by average question length, tend both to receive more answers and to stay engaged longer. Finally, for answerers, acknowledgement of one's contribution, by having one's answers selected as best or commented on, was tied to a more extended stay on the platform. Although their work is similar to ours, we used the data for the whole lifespan of the communities, whereas their work mainly focused on a limited period.

Arguably, three of the most popular metrics to measure user engagement on a web-based platform are click-through rates, page views, and time spent by the user on the website (Dupret and Lalmas 2013). Dupret and Lalmas (2013) used survival analysis to analyze user engagement in a dataset of questions and answers from Yahoo! Answers in Japan, utilising user absence time (or absence time for short), which is the duration between two consecutive visits by the user, to measure engagement. The intuition is that if a user finds a website more exciting and engaging, they will return to it sooner rather than later. The study's main goal was to identify observable correlations between absence time and user engagement.

Most works in this area that use data from QA communities focus on a few larger communities, such as Stack Overflow (Ortega et al. 2014). For example, Pudipeddi et al. (2014) investigated the factors that correlate with user churn on Stack Overflow, including the time gap between posts, answering speed, number of answers received by the user, and reputation of the users who answered the user's questions. Their findings suggest that the time gap between subsequent posts is the most significant indicator of users' interest in staying engaged. Additionally, Adaji and Vassileva (2015) studied the problem of expert churn on Stack Overflow, formulated as a classification task. To label the users who left the community, the authors used the definition from Karnstedt et al. (2010), where a churner is defined as a user whose average activity over a specific subsequent period has dropped to less than a fraction of their average activity in a previously observed period. In other words, a user whose activity drops noticeably following a period of considerable activity (e.g., answering multiple questions for a period and then stopping) is considered a churner. To that aim, they used four machine learning methods to predict the churn of expert respondents on Stack Overflow. Their results indicated that the random forest had the highest classification accuracy of the four machine learning algorithms and the highest values for the other evaluation metrics.

With the recent success of deep learning methods in tackling problems in domains such as computer vision and natural language processing, interest in the use of artificial neural networks for handling the censored data used in survival analysis has drastically increased. For example, Yao et al. (2017) proposed a deep correlational survival model (or DeepCorrSurv for short), which, in contrast with traditional survival analysis methods, can handle multimodal data. In essence, DeepCorrSurv is able to learn the complex interdependencies in multimodal patient data (e.g., a mixture of images and features). Furthermore, recurrent neural network-based approaches have been successfully combined with survival analysis techniques to predict regularly occurring events, such as the time to check-in by the user to a venue (Yang et al. 2018), and for content recommendation and personalization (Jing and Smola 2017).

Finally, Table 1 summarises the differences between the work described in this paper and the literature. Based on the information presented in Table 1, user disengagement analysis for QA communities has been investigated using two main approaches: as a classification task or as a survival analysis task. Both approaches have three major components: data, model, and disengagement criterion. The two approaches differ substantially in their goals and assumptions. Arguably, the most pronounced difference is that when disengagement is formulated as a classification task, time is not considered. In other words, it is assumed that the probability that a user gets disengaged is constant and independent of time. In contrast, the central notion behind survival analysis is that the probability of a user becoming disengaged is a function of time. Furthermore, the main goal of survival analysis is to find a good estimate of the survival function, which outputs the probability of the event of interest not happening (in our case, the likelihood that the user stays engaged) at a specific time. In this regard, the main benefit of the survival analysis approach is the possibility of taking the censoring of the data into account. These properties make the work presented in this study different from existing works that use a classification-based approach. In addition, the remaining existing works using survival analysis utilised the Cox model. Some recent studies (e.g., Miao et al. 2015) suggest that the Cox model, compared to a random survival forests (RSF) model, may have weaker discriminative power. The main reason may be that the Cox model can only infer linear effects between the target and independent variables, while RSF can handle nonlinearity (Wang et al. 2019). Furthermore, RSF is a nonparametric method; compared to the Cox model, which is a semi-parametric method, it is more versatile because it does not make any assumption about the underlying distribution of the data.

Table 1 Differences with the work in the literature

| References | Data and models | Disengagement/churn criterion |
|---|---|---|
| Yang et al. (2010) | Authors used data from three QA communities for a period of 2 years. The main model used was the Cox model (Cox 1972) | User inactivity over 100 days |
| Dror et al. (2012) | Authors used data from Yahoo! Answers for a period of about nine months. The churn prediction was formulated as a binary classification. Altogether, seven learners were used: the majority, naive Bayes, logistic regression, SVM, decision tree, random forests, and KNN | User inactivity after his first week of joining |
| Dupret and Lalmas (2013) | Authors used data from Yahoo! Answers Japan for a period of two weeks. The Cox model (Cox 1972) was used | User absence time in days |
| Pudipeddi et al. (2014) | Stack Overflow data for a period of 4 years (from 2008 to 2012) were used. User churn prediction was formulated as a binary classification task, and three types of classifiers were used, namely, SVM, decision tree, and logistic regression | No new post by user for six months or more |
| Adaji and Vassileva (2015) | Authors used the data from Stack Overflow for a period of 6 years (from 2008 to 2014). The problem of predicting the expert users' churn was formulated as a binary classification task, and four learners were used, namely, logistic regression, multi-layer perceptron, random forests, and SVM | Decrease in user activity during a follow-up period relative to his activity during a previous observation period |
| The approach in this work | Data from three QA communities were used, namely, Politics, Data Science, and Computer Science Stack Exchange. The data include the entire lifespan of the communities. The prediction of user disengagement was formulated as a survival analysis task, and two methods were used, namely, Kaplan–Meier and random survival forests | User absence for an extended period of time |

3 Preamble

3.1 Survival analysis

Survival analysis or time-to-event analysis (Cox and Oakes 2018) is a set of statistical models and methods for estimating the time it takes for a particular event of interest to happen. In a typical survival analysis task, a group of individuals (e.g., patients) is observed for a period. For each individual, the time when the event of interest happened is recorded. Usually, the event will not occur for all the individuals in the period of observation. The situation when the event of interest did not happen for an individual during the observation is called censoring. The goal of survival analysis is to find the probability of the event of interest happening. In this regard, survival analysis is similar to regression analysis, with one major difference: survival analysis models take into account the information related to individuals for whom the event did not take place, i.e., the censored individuals. This difference allows for obtaining more accurate estimations. Although survival analysis originated in medical research, mainly for estimating the time a patient would live after being diagnosed with a deadly disease such as breast cancer, it has gained much attention in other areas such as customer churn analysis and prediction (Dias et al. 2020) and time to occurrence of a fault in a system (Widodo and Yang 2011).

Formally, $T \ge 0$ is a random variable that models the time for an event of interest to happen; $f(t)$ and $F(t)$ are its probability density function and cumulative distribution function, respectively:

$$F(t) = \int_{-\infty}^{t} f(x)\,dx \tag{1}$$

Furthermore, $S(t)$, called the survival function, is defined as the probability that the event did not happen before time $t$. (Typically, when $S(t)$ is plotted, it is called the survival curve.)

$$S(t) = P[T > t] = 1 - F(t) \tag{2}$$

The hazard function $h(t)$ is the instantaneous occurrence rate of the event of interest, and is defined as:

$$h(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt \mid T \ge t]}{dt} = \frac{f(t)}{S(t)} \tag{3}$$

Survival and hazard functions are connected via the following formula:

$$S(t) = e^{-\int_{0}^{t} h(x)\,dx} \tag{4}$$

Given $n$ individual samples, each sample $i \in [1 \dots n]$ is represented as a triplet $(A^i, E^i, T^i)$ where:

• $A^i \in \mathbb{R}^d$ is a $d$-dimensional real-valued vector of individual features (i.e., user attributes in our context);
• $E^i \in \{0, 1\}$ is the variable indicating whether the event of interest happened ($E^i = 1$) or not (censored, $E^i = 0$) for individual $i$ during the observation;
• $T^i = \min(t^i, t_{end})$ is the time when the event happened for individual $i$ during the observation period; $t_{end}$ is the time when the observation ended. $T^i = t_{end}$ (i.e., the event did not happen) indicates that sample $i$ is censored.

The main task of survival analysis methods is to estimate $h(t)$ and $S(t)$.

3.2 Kaplan–Meier estimator

The Kaplan–Meier estimator (Kaplan and Meier 1958) is a nonparametric model that calculates the survival function $S_{KM}(t)$ of a homogeneous cohort, i.e., the individuals in the same cohort (or group) share the same survival function. Given $N$ individual samples in a cohort, it assumes that there are $J$ distinct actual event times such that $t_1 < t_2 < \dots < t_J$ with $J \le N$; then:

$$S_{KM}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right) \tag{5}$$

where $d_j$ is the number of individuals who experienced the event and $n_j$ is the number of individuals who had not experienced the event (i.e., those still at risk) in the time interval $[t_{j-1}, t_j]$.

The Kaplan–Meier method only uses the information from $E^i$ and $T^i$ to estimate the survival function.

3.3 Random survival forests

Ishwaran et al. (2008) proposed the random survival forests (RSF) model, which is an extension of the random forests ensemble model (Breiman 2001) for working with censored data. The general idea for creating an RSF model for a particular dataset is as follows (Utkin et al. 2019; Ishwaran et al. 2008):

1. Bootstrap $q$ samples from the data, where $q$ is the number of trees. On average, each sample excludes 37% of the original data as out-of-bag (OOB) data.
2. Grow a survival tree for each bootstrap sample. At each node of the tree, select $m$ candidate variables (i.e., the subset of variables used during the node split). Then split the node using the variable that maximises the survival difference between its children nodes.
3. Grow the tree to full size under the constraint that no leaf node should have fewer than $d > 0$ deaths. The value of $d$ is a hyperparameter, similar to $q$, which is chosen to produce the best results.
4. Compute the cumulative hazard function (or the survival function) for each tree.
5. Use the OOB data to calculate the prediction error for the ensemble cumulative hazard function (or the survival function).

Different implementations of RSF mainly differ in their splitting rule. Ideally, the splitting rule should maximise the survival difference across the two dataset partitions. In this paper, we used the implementation from the PySurvival library (Fotso et al. 2019).

3.4 Concordance index

The concordance index (or C-index for short) is a generalisation of the area under the ROC curve (AUC) that supports censored data (Harrell et al. 1982). The C-index is widely used as an evaluation metric of the performance of survival models. It summarises the model's discriminatory power, which is how well a model can rank the survival times of samples. Similar to AUC, the value of the C-index typically ranges from 0.5 (random ranking) to 1, where 1 indicates the best performance.

More formally, let $S(t)$ be the survival function estimated by some survival model, and let $t_1^*, \dots, t_s^*$ be a set of fixed time points, e.g., $t_1, \dots, t_N$, where $N$ is the number of distinct time points. Then the C-index is defined as:

$$C = \frac{1}{M} \sum_{i:\, E^i = 1} \; \sum_{j:\, t_i < t_j} \mathbf{1}\left[S(t_i^*) > S(t_j^*)\right] \tag{6}$$

where $M$ is the total number of comparable pairs and $\mathbf{1}[\cdot]$ is a function that returns 1 if its input argument is true and 0 otherwise. Note that there are slightly different definitions of the C-index in other works. In this work, we used the definition proposed by Utkin et al. (2019).

3.5 Log-rank test

The log-rank test (Mantel 1966) is a nonparametric statistical test for comparing the hazard functions, i.e., $h(t)$, of two cohorts/groups of individuals. The null hypothesis is that the hazard functions of the two groups, e.g., groups 1 and 2, are equal, i.e., $h_1(t) = h_2(t)$. The log-rank test assumes that survival probabilities (i.e., the probabilities of not becoming disengaged in our context) stay the same over time. It is widely used to check whether the underlying survival distributions of two groups are the same or different.

4 Data

4.1 Data description

As mentioned earlier, we used data from three online QA platforms: Politics (Pol), Data Science (DS), and Computer Science (CS) Stack Exchange (SE). Pol SE is an ad-hoc QA community focused on politically-themed content, such as questions related to the nature of democracy and the state of human rights. DS SE covers topics concerning the broad field of data science, and CS SE covers topics related to computer science. We chose these three communities for two reasons. Firstly, although these communities are smaller than some other QA communities hosted on SE, like Stack Overflow, they are thriving in their niche. Secondly, each of these communities is more or less focused on a separate field; although the fields might share some topics, they are different enough to be viewed as distinct. This allows us to search for possible patterns related to disengagement, regardless of the specific topics of a field.

The datasets of the three communities were downloaded from the Stack Exchange data dump available on Archive.org (https://archive.org/download/stackexchange; the data are available under Creative Commons licences). The data included the complete historical information about the questions and answers posted on the three QA communities from their inception until May 2021. Table 2 shows the general information about the datasets.

Table 2 Information about the datasets

| Characteristic | Pol | DS | CS |
|---|---|---|---|
| Number of questions | 12,416 | 28,950 | 40,792 |
| Number of users | 31,242 | 100,582 | 113,434 |
| Number of answers | 25,909 | 32,334 | 46,785 |
| Number of comments | 135,648 | 64,244 | 167,038 |
| Year founded | 2012 | 2014 | 2008 |

4.2 Community characteristics

Table 3 includes the summary statistics for users belonging to the three communities whose data are used in this study. The information shown in the table was extracted from the corresponding Users table for each community from the Stack Exchange data dump.

Table 3 Summary statistics of users in each community

| Community | Attribute | Mean | STD | Median | First quartile | Fourth quartile |
|---|---|---|---|---|---|---|
| Pol | User reputation | 160.50 | 1377.06 | 101 | 1 | 101 |
| Pol | Profile views | 10.01 | 107.82 | 0 | 0 | 1 |
| Pol | Upvotes | 12.38 | 126.12 | 0 | 0 | 2 |
| Pol | Downvotes | 2.35 | 56.70 | 0 | 0 | 0 |
| Pol | Year joined | 2017 | 1.94 | 2018 | 2016 | 2019 |
| DS | User reputation | 50.78 | 195.70 | 1 | 1 | 101 |
| DS | Profile views | 1.58 | 24.44 | 0 | 0 | 0 |
| DS | Upvotes | 1.36 | 26.76 | 0 | 0 | 0 |
| DS | Downvotes | 0.13 | 10.47 | 0 | 0 | 0 |
| DS | Year joined | 2018 | 1.81 | 2018 | 2017 | 2020 |
| CS | User reputation | 67.50 | 962.95 | 3 | 1 | 101 |
| CS | Profile views | 3.60 | 119.03 | 0 | 0 | 1 |
| CS | Upvotes | 2.39 | 94.92 | 0 | 0 | 0 |
| CS | Downvotes | 0.31 | 32.94 | 0 | 0 | 0 |
| CS | Year joined | 2017 | 2.37 | 2017 | 2015 | 2019 |

Based on the information presented in Table 3, for all three communities the distribution of the first four attributes (i.e., user reputation, profile views, upvotes, and downvotes) seems to follow a
heavy-tailed distribution, given the large dispersion (i.e., STD) around the mean. Moreover, relative to users from the other two communities, users in Pol SE show more intense activity on average. This is apparent from the observation that although the number of registered users on Pol SE is smaller than on the other two communities, the average values of the first four user attributes listed in Table 3 are noticeably larger. Additionally, regarding the growth in the number of registered users in each community, DS SE had the fastest growth relative to the two other communities. It took only 4 years for DS SE to reach half of its registered users, while Pol SE and CS SE took 6 and 9 years, respectively, to reach half of theirs. Furthermore, all three communities show a large increase in the number of users lately (i.e., from around 2018 onwards), which we suspect to be due to the tipping point phenomenon (Singh et al. 2020).

4.3 User attributes

The bulk of users in QA platforms do not make any contributions. These users, who are referred to as lurkers in some previous work (e.g., Tagarelli and Interdonato 2018), can be differentiated from the normal users, who include the experts, by their level of contribution to the platform. In this work, two categories of user attributes were used to investigate the relationship between user contributions and the probability of disengagement, namely, behavioural attributes and content-based attributes.

4.3.1 Behavioural attributes

We identified five user attributes in the datasets that directly correspond to the level of user contribution. These attributes are primarily based on information related to user behaviour that seems crucial to the proper functioning of the platform. Table 4 includes the name and description of these attributes.

Table 4 Information about behavioural user attributes

| Description |
|---|
| Number of downvotes cast by user i |
| Number of upvotes cast by user i |
| Number of questions posted by user i |
| Number of answers posted by user i |
| Number of comments written by user i |

4.3.2 Content-based attributes

In addition to behavioural attributes, we selected a set of content-based user attributes. These attributes hint at how favourably the contributions made by each user might have been perceived by the community, i.e., by other users. The primary motivation is that users can indirectly contribute to the platform, e.g., by asking a question that starts a stream of debates over a controversial topic such as the refugee crisis in the context of the Pol SE community. Information about this type of indirect contribution, which is not limited to the behaviour of a particular user, can be extracted from user content (e.g., mainly from the metadata of users' posts). Table 5 includes the name and description of the content-based attributes employed in this work.

Table 5 Information related to the content-based attributes

| Description |
|---|
| Average number of times questions posted by user i were viewed |
| Average number of comments written for questions posted by user i |
| Average score (i.e., sum of upvotes and downvotes given to question) of the questions posted by user i |
| Average score (i.e., sum of upvotes and downvotes given to answer) of the answers posted by user i |
| Average number of comments written for answers posted by user i |
| Average number of upvotes given to the comments posted by user i |

4.4 User representation

Based on the two types of user attributes mentioned earlier, we constructed a numerical vector for each user, called the user representation vector (or URV for short). The URV of user $i$ is defined as:

$$URV^i = (A_1^i, A_2^i, \dots, A_j^i) \tag{7}$$

where $A_j^i$ is the corresponding user attribute from Tables 4 and 5.

Table 6 Sets of users defined by dichotomizing users based on their behavioural attributes corresponding to their contribution

| User set | Definition |
|---|---|
| Q | Users who posted at least one question |
| Complement of Q | Users who posted no question |
| A | Users who answered at least one question |
| Complement of A | Users who did not answer any question |
| C | Users who commented on at least one question/answer |
| Complement of C | Users who did not comment on any question/answer |
| U | Users who upvoted at least one question/answer/comment |
| Complement of U | Users who did not upvote |
| D | Users who downvoted at least one question/answer |
| Complement of D | Users who did not downvote |

Table 7 Sizes of each user set per dataset

| User set | Pol | DS | CS |
|---|---|---|---|
| Q | 3775 (12%) | 16,041 (16%) | 20,841 (18%) |
| Complement of Q | 27,466 (88%) | 84,540 (84%) | 92,592 (82%) |
| A | 3743 (12%) | 7226 (7%) | 7020 (6%) |
| Complement of A | 27,498 (88%) | 93,355 (93%) | 106,413 (94%) |
| C | 6358 (20%) | 12,056 (12%) | 16,788 (15%) |
| Complement of C | 24,883 (80%) | 88,525 (88%) | 96,645 (85%) |
| U | 11,025 (35%) | 16,328 (16%) | 20,767 (18%) |
| Complement of U | 20,216 (65%) | 84,253 (84%) | 92,666 (82%) |
| D | 1596 (5%) | 732 (1%) | 1143 (1%) |
| Complement of D | 29,645 (95%) | 99,849 (99%) | 112,290 (99%) |
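As a rough illustration of the user representation vector of Eq. (7) and the user sets of Table 6, the sketch below flattens a per-user record into a numeric vector and splits users on each behavioural attribute, reporting set sizes in the style of Table 7. The class name, field names, attribute ordering, and toy records are assumptions made for illustration only; they are not taken from the Stack Exchange data dump schema or from the authors' code.

```python
from dataclasses import dataclass, astuple


# Hypothetical per-user record; names and ordering are illustrative assumptions.
@dataclass(frozen=True)
class UserAttributes:
    # Behavioural attributes (cf. Table 4)
    downvotes: int
    upvotes: int
    questions: int
    answers: int
    comments: int
    # Content-based attributes (cf. Table 5)
    avg_question_views: float
    avg_question_comments: float
    avg_question_score: float
    avg_answer_score: float
    avg_answer_comments: float
    avg_comment_upvotes: float


def urv(user):
    """Flatten a record into a numeric vector, in the spirit of Eq. (7)."""
    return tuple(float(value) for value in astuple(user))


# Behavioural attribute that defines each user set of Table 6.
SET_ATTRIBUTE = {"Q": "questions", "A": "answers", "C": "comments",
                 "U": "upvotes", "D": "downvotes"}


def dichotomise(users, set_name):
    """Split user ids into (members, complement) for one user set."""
    attribute = SET_ATTRIBUTE[set_name]
    members = {uid for uid, u in users.items() if getattr(u, attribute) >= 1}
    return members, set(users) - members


def set_sizes(users):
    """Map each set name to (size, share of all users, complement size)."""
    total = len(users)
    sizes = {}
    for name in SET_ATTRIBUTE:
        members, complement = dichotomise(users, name)
        sizes[name] = (len(members), len(members) / total, len(complement))
    return sizes


# Toy data: one asker, one lurker, one answerer, one user who only upvoted.
USERS = {
    "u1": UserAttributes(0, 3, 2, 0, 1, 120.0, 1.5, 2.0, 0.0, 0.0, 0.0),
    "u2": UserAttributes(0, 0, 0, 0, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    "u3": UserAttributes(1, 10, 0, 5, 7, 0.0, 0.0, 0.0, 3.2, 0.8, 0.4),
    "u4": UserAttributes(0, 1, 0, 0, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
}

for name, (size, share, rest) in set_sizes(USERS).items():
    print(f"{name}: {size} users ({share:.0%}), complement: {rest} users")
```

Keeping the vector construction separate from the set membership test mirrors the text: the URV feeds a model that consumes feature vectors, while the dichotomised sets only need the sign of a single behavioural count.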
More precisely, let t be the last time user i been seen visiting the platform, and  be the threshold value, 4.5 Dichotomizing users then the value of E would be set based on the following relation: Moreover, we dichotomised users into pairs of disjoint sets 0, if d(t , t ) ≤ 𝜃 i i (or groups) using each one of behavioural user attributes. l E = f (t )= (8) 1, otherwise The main idea is that users can be partitioned into two groups naturally, where the criterion for the split is whether where t is the last recorded time in the dataset, and d(t , t ) is d d a user has made a particular type of contribution or not. the time difference between t and t in months. In this work, Categorising users in two disjoint sets based on the value of we used two threshold values (i.e.,  ) of 24 and 36 months. a behavioural attribute allowed us to investigate the impor- Subsequently, users who had not visited the platform for tance of one specific user attribute, e.g., by comparing the more than 2 and 3 years were considered disengaged. survival curves of two disjoint sets of users who posted at least one question and who did not. Table  6 includes the information about each group of users and its counterpart. 5 Results 4.6 Disengagement criterion 5.1 Results from Kaplan–Meier What amounts to the event of a user becoming disengaged Kaplan–Meier method was used to estimate (within the is domain-dependent and thus can vary in different settings. confidence interval of 95%) the survival functions (i.e., For example, normally, in medical research, the event usu- S(t)) of sets of users dichotomised based on the definitions ally is the patient’s death (Cox and Oakes 2018). In this shown in Table 6. The implementation from Lifeline library work, suitable to our need, we opted to use the information (Davidson-Pilon et al. 2021) was used to produce the sur- about the last time a user visited the QA platform to detect vival curves. Figures 2, 3, and 4 show the survival curve of disengagement. 
The information about the last time a user each pair sets of users for {Pol, DS, CS} SE datasets, respec- visited is available in the LastAccessDate column from the tively. Table 7 includes the information about the size and Users table in each dataset. The activity time of a user was proportion of each user set dichotomised based on a single calculated based on the difference in the number of months behavioural attribute. As mentioned earlier, for each user, since the user joined the platform until the last recorded the label indicating whether the user is disengaged or not activity time of the user (i.e., LastAccessDate associated (i.e., E ) was censored if the difference between the user’s with the user). For user i, if the number of months since his last activity time and t was less than or equal to 24 and 36 last visit to the platform exceeded a certain threshold value, months, respectively. Log-rank test (with p < 0.005 ) was he would be tagged as a disengaged user (i.e., E = 1 ); other- performed on each pair of curves. 
wise, the user’s state was considered still active (i.e., E = 0 ) which means the information about user’s disengagement 1 3 Social Network Analysis and Mining (2022) 12:86 Page 9 of 13 86 0.9 Users in Q Users in Q 0.8 Users in A Usersin A 0.8 Users in Q Users in Q Users in A Usersin A 0.8 0.7 0.8 0.7 0.6 0.6 0.7 0.7 0.5 0.5 0.4 0.6 0.4 0.6 020406080 100 020406080 100 020406080 100 020406080 100 time (months) time (months) time (months) time (months) (a) θ =24 (b) θ =36 (c) θ =24 (d) θ =36 Users in C Users in C Users in U Users in U 0.9 Users in C 0.9 Users in C Users in U Users in U 0.8 0.8 0.8 0.8 0.6 0.7 0.6 0.7 0.6 0.4 0.6 0.4 0.5 0.2 0.4 0.5 020406080 100 020406080 100 020406080 100 020406080 100 time (months) time (months) time (months) time (months) (e) θ =24 (f) θ =36 (g) θ =24 (h) θ =36 1 1 0.9 0.8 Users in D Users in D 0.8 Users in D Users in D 0.6 0.7 0.4 0.6 020406080 100 020406080 100 time (months) time (months) (i) θ =24 (j) θ =36 Fig. 2 Survival curves for Pol SE dataset estimated using Kaplan–Meier method more significant prediction errors ranked more important. 5.2 Results from RSF We chose permutation importance mainly because of its intuitive definition and subsequent interpretation, which is We used k-fold cross-validation (with k=5) in 30 runs to based on the idea that the importance of a variable is the make the predictions (using RSF models). The values of increase in model error when the variable’s information is model hyperparameters such as the number of trees (i.e., destroyed via value permutation (Molnar 2020). Moreover, q) have been tested in order to choose the ones that lead to it provides a compact global insight into the model’s behav- the best results. We used C-index to evaluate the perfor- iour. Figure 5 shows the per-dataset average permutation mance of the models. Only data for users with contribu- importance of each attribute on the prediction results of the tions were used to train and evaluate the RSF-based models. 
RSF models used in this work. In other words, only the information of users belonging to Q ∪ A ∪ C ∪ U ∪ D (from Table 6) was used. For each user, three URVs were constructed, using the behavioural user 6 Discussion attributes only, content-based attributes only, and finally, a combination of both. Table 8 includes the average C-indexes Based on results from the Kaplan–Meier method, we computed for the RSF models over the runs. observed: (i) the underlying hazard function of each set of users seems to be different; (ii) the probability that users 5.3 Attribute importance with even a few contributions (e.g., the user asked one ques- tion) are noticeably higher than other users who did not con- We used the permutation importance measure (Breiman tribute to the platform. We observed a distinctive difference 2001; Molnar 2020) present in RSF models to rank each user between the survival functions of the users who contributed attribute. The attributes which permuting their values caused 1 3 probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active 86 Page 10 of 13 Social Network Analysis and Mining (2022) 12:86 0.9 Users in Q 0.95 Users in A Users in Q Usersin A 0.95 0.9 Users in Q Users in A Users in Q Usersin A 0.9 0.8 0.9 0.8 0.85 0.85 0.7 0.7 0.8 0.8 0.6 0.6 0.75 0.75 0.5 0.5 0.7 0.7 020406080 020406080 020406080 020406080 timeline (months) timeline (months) timeline(months) timeline (months) (a) θ =24 (b) θ =36 (c) θ =24 (d) θ =36 Usersin C Usersin C Users in U Usersin U 0.9 Usersin C Usersin C Users in U Usersin U 0.9 0.9 0.8 0.8 0.7 0.8 0.8 0.6 0.6 0.7 0.5 0.7 0.4 020406080 020406080 020406080 020406080 timeline (months) timeline (months) timeline(months) timeline(months) (e) θ =24 (f) θ =36 (g) θ =24 (h) θ =36 1 1 0.9 0.9 0.8 Usersin D Users in D Usersin D Users 
in D 0.7 0.8 0.6 0.5 0.7 020406080 020406080 timeline (months) timeline(months) (i) θ =24 (j) θ =36 Fig. 3 Survival curves for DS SE dataset estimated using Kaplan–Meier method to the platform and those who did not contribute. This pat- behavioural user attributes. The number of upvotes (i.e., A ) tern, which is present in all three datasets regardless of the received the highest importance in all three datasets. Sub- community niche, confirms the finding from previous related sequently, with a noticeable difference, the average number studies such as the ones reported in Joyce and Kraut (2006) of the times user questions were viewed (i.e., A ) received and Yang et al. (2010) that suggested that users with even relatively high importance. We suspect that the user’s higher a few initial contributions are more likely to stay loyal than upvotes might show that they hold a favourable view of the users without any contributions. The latter make up the bulk community (or platform) in general. On the other hand, the of the users. Furthermore, the gap between the probability information about the number of downvotes did not contain of disengagement of two groups seems to widen over time. much predictive information. We suspect it could be due to Predictions using RSF models show relatively similar the small number of users with downvoting activity in the patterns on all three datasets. Results (shown in Table 8) datasets. indicate that the inclusion of the information of behavioural attributes leads to better predictions compared to the use of content-based user attributes only. Furthermore, using the 7 Limitations and future work mixture of the information of behavioural and content-based attributes yielded a slight improvement. The value of  does There are a few limitations regarding the work done in this not seem to affect the overall results. paper. 
The datasets were used only from three QA com- Based on the permutation importance of attributes (see munities hosted by (the larger) Stack Exchange platform. Fig.  5), behavioural features play a more salient role in Consequently, this work did not investigate and compare the output of the RSF models. On average, 4 out of 5 top disengagement on other major platforms such as Quora. It attributes with the most permutation importance are from seems interesting to compare our results with the results 1 3 probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active Social Network Analysis and Mining (2022) 12:86 Page 11 of 13 86 0.9 Usersin Q Usersin Q Usersin A Usersin A 0.8 Usersin Q Usersin Q 0.8 Usersin A Usersin A 0.8 0.8 0.7 0.6 0.7 0.6 0.6 0.6 0.4 0.4 0.5 0.5 0.4 020406080 100 020406080 100 020406080 100 020406080 100 timeline (months) timeline (months) timeline (months) timeline (months) (a) θ =24 (b) θ =36 (c) θ =24 (d) θ =36 Usersin C Usersin C Usersin U Usersin U 0.9 Usersin C Usersin C Usersin U Usersin U 0.8 0.8 0.8 0.8 0.7 0.6 0.6 0.6 0.6 0.4 0.4 0.5 0.4 0.2 0.4 0.2 020406080 100 020406080 100 020406080 100 020406080 100 timeline (months) timeline (months) timeline (months) timeline (months) (e) θ =24 (f) θ =36 (g) θ =24 (h) θ =36 1 1 0.8 0.8 Usersin D Usersin D 0.6 Usersin D Usersin D 0.6 0.4 020406080 100 020406080 100 timeline (months) timeline (months) (i) θ =24 (j) θ =36 Fig. 4 Survival curves for CS SE dataset estimated using Kaplan–Meier method obtained with data from other major QA platforms in users are the same can be used, which theoretically should future. Conventional assumptions related to the application lead to better predictions. 
of survival analysis techniques hold over our results, e.g., Temporal context can tentatively play an essential role the assumption that the probabilities of disengagement of in the intensity of user activities in a community and censored and none censored individuals are essentially the subsequently be an informative factor in the level of user same. We used user inactivity for an extended period (e.g., 2 years passed since the user visited the community web Table 8 Average C-index for RSF models using different attribute pages) to distinguish between disengaged and censored sets; higher C-index indicates better prediction users. This required the use of a time threshold in which Dataset Attributes  = 24  = 36 its value is set experimentally, not based on a well-defined rule. Finally, for most users, their behavioural information Mean STD Mean STD does not exist, making it hard to investigate further the Pol Behavioural only 0.75 0.01 0.76 0.01 survival probabilities of users dichotomised based on the Content-based only 0.68 0.01 0.68 0.01 definitions given in Table  6. Behavioural plus content-based 0.75 0.01 0.76 0.01 Including the data from a more numerous and diverse DS Behavioural only 0.66 0.01 0.66 0.01 set of QA platforms could be interesting for future work. Content-based only 0.61 0.01 0.63 0.01 Furthermore, the information related to the body of the Behavioural plus content-based 0.68 0.01 0.70 0.01 posts (e.g., the text of questions and answers) of each CS Behavioural only 0.68 0.00 0.68 0.01 user could be utilised to find the probabilities of disen- Content-based only 0.62 0.01 0.63 0.01 gagement. 
Additionally, methods and models that do not Behavioural plus content-based 0.69 0.01 0.68 0.01 assume the probabilities for the disengaged and censored 1 3 probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active probabilityofbeing active 86 Page 12 of 13 Social Network Analysis and Mining (2022) 12:86 A A A 1 1 1 A A A 2 2 2 A A A 3 3 3 A A A 4 4 4 A A A 5 5 5 A A A 6 6 6 A A A 7 7 7 A A A 8 8 8 A A A 9 9 9 A A A 10 10 10 A11 A11 A11 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Importance Importance Importance (a) ForPol SE dataset with θ =24 (b) ForPol SE dataset with θ =36 (c) ForDSSEdatasetwith θ =24 A A A 1 1 1 A A A 2 2 2 A A A 3 3 3 A A A 4 4 4 A A A 5 5 5 A A A 6 6 6 A A A 7 7 7 A A A 8 8 8 A A A 9 9 9 A A A 10 10 10 A A A 11 11 11 0.00 0.02 0.04 0.06 0.08 0.10 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 Importance Importance Importance (d) ForDSSEdataset with θ =36 (e) ForCSSEdataset with θ =24 (f) ForCSSEdataset with θ =36 Fig. 5 Average permutation importance of each attribute; models are trained and evaluated using behavioural and content-based attributes simul- taneously over the datasets shown in Table 2 Funding Open access funding provided by NTNU Norwegian Univer- engagement. By temporal context, we mean the effects of sity of Science and Technology (incl St. Olavs Hospital - Trondheim real-world events occurring within a specific time period University Hospital). This work was carried out as part of the Trond- on the behaviour of users of a QA community. Examples heim Analytica projecthttps://w ww.n tnu.e du/t rondh eiman alyti ca, sup- of such events include Brexit and the political campaigns ported by NTNU’s Digital Transformation programme. 
during an election in an influential country such as the Data availability The data used in this study are publicly available USA. from Archive.org under Creative Commons licences. Furthermore, for reproducibility, the code and other related artefacts, such as the preprocessed version of the data used in the experiments, are also avail- 8 Conclusion able on GitHub.comhttps://git hub.com/ habedi/ Sur viv alAnal ysisQA Co mmuni ties. We used survival analysis to study user disengagement using the historical data from three distinct QA communities from Declarations their inception to May 2021. We employed two categories Conflict of interest I do not have any conflicts or competing interests of user attributes and investigated the importance of these to declare. attributes. Our results confirm the previous findings that users with some initial contributions (e.g., questions and Open Access This article is licensed under a Creative Commons Attri- answers) are likelier to stay active longer than users who bution 4.0 International License, which permits use, sharing, adapta- tion, distribution and reproduction in any medium or format, as long contributed nothing. Furthermore, based on our results, as you give appropriate credit to the original author(s) and the source, behavioural user attributes can be used to estimate the disen- provide a link to the Creative Commons licence, and indicate if changes gagement probability of each user with reasonable accuracy. were made. The images or other third party material in this article are Moreover, based on the importance of attributes used to included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. 
If material is not included in train and evaluate the models, how favourable users see the the article's Creative Commons licence and your intended use is not content posted on the platform seems to affect the disengage- permitted by statutory regulation or exceeds the permitted use, you will ment time. need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://cr eativ ecommons. or g/licen ses/ b y/4.0/ . Acknowledgements I thank the reviewers and my colleague Yanzhe Bekkemoen for helping me improve the manuscript. 1 3 Attribute Attribute Attribute Attribute Attribute Attribute Social Network Analysis and Mining (2022) 12:86 Page 13 of 13 86 Miao F, Cai YP, Zhang YT et al (2015) Is random survival forest an References alternative to Cox proportional model on predicting cardiovascu- lar disease? In: MBEC, pp 740–743 Adaji I, Vassileva J (2015) Predicting churn of expert respondents in Molnar C (2020) Interpretable machine learning. Lulu.com social networks using data mining techniques: a case study of Ortega F, Convertino G, Zancanaro M et al (2014) Assessing the per- stack overflow. In: ICMLA. IEEE, pp 182–189 formance of question-and-answer communities using survival Breiman L (2001) Random forests. Mach Learn 45(1):5–32 analysis. arXiv preprint arXiv: 1407. 5903 Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B Pudipeddi JS, Akoglu L, Tong H (2014) User churn in focused ques- (Methodol) 34(2):187–202 tion answering sites: characterizations and prediction. In: WWW, Cox DR, Oakes D (2018) Analysis of survival data. Chapman and pp 469–474 Hall, London Rothmeier K, Pflanzl N, Hüllmann JA et al (2021) Prediction of player Davidson-Pilon C, Kalderstam J, Jacobson N et al (2021) CamDavid- churn and disengagement based on user activity data of a free- sonPilon/lifelines: 0.26.0. 10.5281/zenodo.4816284 mium online strategy game. 
IEEE Trans Games 13(1):78–88 Dias J, Godinho P, Torres P (2020) Machine learning for customer Singh A, Dharamshi N, Thimma Govarthanarajan P et al (2020) The churn prediction in retail banking. In: Computational science and tipping point in social networks: investigating the mechanism its applications, pp 576–589 behind viral information spreading. In: BigDataService, pp 54–61 Dror G, Pelleg D, Rokhlenko O et al (2012) Churn prediction in new Stepanova M, Thomas L (2002) Survival analysis methods for personal users of Yahoo! Answers. In: WWW, pp 829–834 loan data. Oper Res 50(2):277–289 Dupret G, Lalmas M (2013) Absence time and user engagement: evalu- Tagarelli A, Interdonato R (2018) Mining lurkers in online social net- ating ranking functions. In: WSDM, pp 173–182 works: principles, models, and computational methods. Springer, Fotso S et al (2019) PySurvival: open source package for survival Berlin analysis modeling. https:// www. pysur vival. io/ Utkin LV, Konstantinov AV, Chukanov VS et al (2019) A weighted Guan T, Wang L, Jin J et al (2018) Knowledge contribution behavior random survival forest. Knowl Based Syst 177:136–144 in online Q &A communities: an empirical investigation. Comput Wang P, Li Y, Reddy CK (2019) Machine learning for survival analy- Hum Behav 81:137–147 sis: a survey. ACM Comput Surv 51(6):1–36 Harrell FE, Califf RM, Pryor DB et al (1982) Evaluating the yield of Widodo A, Yang BS (2011) Machine health prognostics using sur- medical tests. JAMA 247(18):2543–2546 vival probability and support vector machine. Expert Syst Appl Ishwaran H, Kogalur UB, Blackstone EH et al (2008) Random survival 38(7):8430–8437 forests. Ann Appl Stat 2(3):841–860 Yang J, Wei X, Ackerman M et al (2010) Activity lifespan: an analysis Jin J, Li Y, Zhong X et al (2015) Why users contribute knowledge to of user survival patterns in online knowledge sharing communi- online communities: an empirical study of an online social Q &A ties. In: ICWSM community. 
Inf Manag 52(7):840–849 Yang G, Cai Y, Reddy CK (2018) Spatio-temporal check-in time pre- Jing H, Smola AJ (2017) Neural survival recommender. In: WSDM, diction with recurrent neural network based survival analysis. In: pp 515–524 IJCAI, pp 2976–2983 Joyce E, Kraut RE (2006) Predicting continued participation in news- Yao J, Zhu X, Zhu F et al (2017) Deep correlational learning for sur- groups. J Comput Mediat Commun 11(3):723–747 vival prediction from multi-modality data. In: MICCAI Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481 Publisher's Note Springer Nature remains neutral with regard to Karnstedt M, Hennessy T, Chan J et al (2010) Churn in social net- jurisdictional claims in published maps and institutional affiliations. works: a discussion boards case study. In: ICSC, pp 233–240 Kuzmeski M (2009) The connectors: how the world’s most successful businesspeople build relationships and win clients for life. Wiley, Hoboken Mantel N (1966) Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 50(3):163–170 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Social Network Analysis and Mining Springer Journals
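As a companion to Sects. 4.6 and 5.1, the labelling rule of Eq. 8 and a Kaplan–Meier estimate can be illustrated with a minimal sketch in plain Python. This is not the lifelines implementation used in the paper. The helper names (months_between, label_user, kaplan_meier), the snapshot date DATASET_END, the whole-month rounding, and the five toy users are all assumptions made for illustration only.

```python
from datetime import date


def months_between(earlier, later):
    """Whole months elapsed from `earlier` to `later` (hypothetical helper)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not yet complete
    return months


def label_user(joined, last_access, dataset_end, theta):
    """Return (activity time in months, E) following the rule of Eq. 8.

    E = 0 (censored) if the user was seen within `theta` months of the
    end of the dataset, E = 1 (disengaged) otherwise.
    """
    duration = months_between(joined, last_access)
    event = 0 if months_between(last_access, dataset_end) <= theta else 1
    return duration, event


def kaplan_meier(samples):
    """Product-limit estimate as a list of (time, S(time)) at event times."""
    at_risk = len(samples)
    survival = 1.0
    curve = []
    for t in sorted({d for d, _ in samples}):
        events = sum(1 for d, e in samples if d == t and e == 1)
        leaving = sum(1 for d, _ in samples if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored users both leave the risk set
    return curve


DATASET_END = date(2021, 5, 31)  # assumed snapshot date, for illustration

# Toy (joined, last access) pairs; not taken from the paper's datasets.
users = [
    (date(2015, 1, 10), date(2015, 7, 10)),
    (date(2016, 3, 1), date(2017, 3, 1)),
    (date(2017, 6, 15), date(2021, 4, 15)),
    (date(2018, 2, 1), date(2018, 2, 1)),
    (date(2019, 9, 1), date(2020, 9, 1)),
]

samples = [label_user(j, l, DATASET_END, theta=24) for j, l in users]
curve = kaplan_meier(samples)

for t, s in curve:
    print(f"month {t:>3}: S(t) = {s:.2f}")
```

Running the same labelling on two user sets from Table 6 (for example, Q and its complement) and comparing the two resulting curves mirrors the comparison made in Figs. 2, 3, and 4. For real analyses, a maintained library is preferable because it also provides confidence intervals and the log-rank test.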

Survival analysis for user disengagement prediction: question-and-answering communities’ case

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
ISSN
1869-5450
eISSN
1869-5469
DOI
10.1007/s13278-022-00914-8

Abstract

We used survival analysis to model user disengagement in three distinct question-and-answering communities in this work. We used the complete historical data from domains including Politics, Data Science, and Computer Science from Stack Exchange communities from their inception until May 2021, including information about all users who were members of one of these three communities. Furthermore, in formulating the user disengagement prediction as a survival analysis task, we employed two survival analysis techniques (Kaplan–Meier and random survival forests) to model and predict the probabilities of members of each community becoming disengaged. Our main finding is that the likelihood of users with even a few contributions staying active is noticeably higher than that of users who made no contributions; this distinction may widen as time passes. Moreover, the results of our experiments indicate that users with more favourable views toward the content shared on the platform may stay engaged longer. Finally, regardless of their themes, the observed pattern holds for all three communities.

Keywords Question-and-answering platforms · User disengagement · Survival analysis · Stack Exchange

Corresponding author: Hassan Abedi Firouzjaei, hassan.abedi@ntnu.no, Department of Computer Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway

1 Introduction

Online question-and-answering (QA) social networks like Stack Overflow (https://stackoverflow.com) and Quora (https://www.quora.com) are dependent on their users' contributions for proper functioning. (In this work, we use the terms question-and-answering platform, social network, and community interchangeably.) Arguably, the main functionality of a QA platform is to connect two types of users (Kuzmeski 2009): on one side, people who seek answers to their questions and, on the other side, people who are willing to share their knowledge and expertise with others (Guan et al. 2018). Nevertheless, a user who joined and made many contributions to the community may become uninterested and then get disengaged after a while. By disengaged, we mean the situation where users who previously made contributions (e.g., answered questions and participated in debates) suddenly stopped their activities (i.e., there is no sign of them even visiting the platform's web pages). Moreover, it is not known whether these users left the community or not, but they did not perform any activity on the platform for a relatively long period of time (e.g., more than a year). In this context, disengagement might have happened for various reasons; e.g., it might have occurred because disengaged users believed that the platform had an elitist or even toxic culture. Another reason could have been that user interests changed drastically over time, and the platform hosting the QA community could not adapt to the change in an agile way.

At the very least, a high disengagement rate has adverse effects on the overall quality of the service of a QA social network and platform. For example, suppose all the experts (i.e., users who post answers perceived as high quality by the community) become disengaged within a few months of joining and being active. In that case, the quality of answers might plummet, which may increase the rate of users' disengagement from the community (Pudipeddi et al. 2014; Dror et al. 2012). In the worst-case scenario, one could expect the situation where the QA platform loses the bulk of its contributors, which in turn would lead to its demise.

Survival analysis (Cox and Oakes 2018; Wang et al. 2019) is a family of statistical methods and techniques that can help model and predict the time of the occurrence of an event of interest. Initially, it emerged out of medical research to find the probability of a patient surviving a disease such as cancer, hence the term survival analysis. More recently, survival analysis methods have found widespread use in new areas such as customer churn analysis (Dias et al. 2020; Rothmeier et al. 2021) and credit risk scoring (Stepanova and Thomas 2002), mainly due to their flexibility and power in accurately and reliably modelling the problems posed in these areas.

In this work, we used survival analysis to study user disengagement in three distinct QA social networks, namely, Politics, Data Science, and Computer Science Stack Exchange. Our choice allowed us to pose questions and seek answers based on the data from QA communities with the themes mentioned above. To our knowledge, this is the first work that applied survival analysis to quantify and study user disengagement using the entire historical data of online QA social networks. Figure 1 illustrates how disengagement prediction can be seen and formulated as a survival analysis task.

Fig. 1 User A and user B joined the platform in the past; during a period of observation which started at t_start and ended at t_end, B became disengaged at t_disengaged. A did not become disengaged during the observation, but it is not known whether A will become disengaged in future or not; information about A's disengagement is censored

Following are the main contributions of our work:

• We study the factors likely to be associated with the probability that users of QA communities will stay active for an extended period. For the first time, we analyze the relationships between attributes related to users' contributions and their engagement time.
• We propose to exploit behavioural (see Table 4) and content-based user attributes (see Table 5) to estimate the engagement time on three comprehensive datasets from distinct QA communities.

The rest of this article is organised as follows. Section 2 discusses the related work. Section 3 presents preliminary concepts related to survival analysis and introduces techniques used to model and evaluate the problem of user disengagement prediction. Section 4 gives an overview of the dataset and the methodology used to represent users and the engagement time. Section 5 presents the results of the experiments and Sect. 6 discusses the results. Section 7 discusses the limitations of our work and gives an outline for the direction of future work. Finally, Sect. 8 concludes the paper.

2 Related work

QA platforms like Stack Exchange and Quora provide an accessible knowledge-sharing environment. Due to the importance of this role, which earlier was played by mailing lists, newsgroups and IRC channels, the interest in studying phenomena on these platforms has exploded lately. For example, in Joyce and Kraut (2006), the authors studied continued user participation in newsgroups. They used the posts from six public newsgroups to test whether the answers that users receive to their first few questions are crucial for the prolongation of user participation. Their findings suggest that longer questions are more likely to receive a response. Furthermore, the quality and emotional tone of the answer, and whether the answer was in response to a question from a new user, seem not to influence the likelihood of further participation.

The authors in Guan et al. (2018) used the data from the most popular Chinese social QA platform, Zhihu (https://www.zhihu.com), to investigate the factors related to users' motivations to participate in community activities, especially knowledge contribution. Their findings suggest that social exchange is an important factor influencing users' continuous knowledge contribution in social QA communities. Moreover, the findings show that knowledge exchange based on norms of reciprocity is an important factor affecting users' continuous contribution. For example, a user who frequently seeks knowledge is more likely to contribute knowledge to others, indicating that users contribute because they expect that they can get a response to their questions in future. Similarly, Jin et al. (2015) studied the elements, based on the data from Zhihu, that were influencing user knowledge contributions in QA platforms, incorporating three theories (social capital theory, social exchange theory, and social cognitive theory) in their work.

Furthermore, the use of survival analysis methods is also gaining popularity, where an analogy could be made between the problem and the task of survival analysis. In Wang et al. (2019), the authors provided a comprehensive review of two major categories of methods and techniques for survival analysis, namely, conventional and various machine learning methods for survival analysis. Their work described and discussed different related topics, including data transformation and early prediction of complex events, along with appropriate evaluation metrics.

Yang et al. (2010) used survival analysis methods to analyze and study user retention in three major QA communities: Baidu Knows, Yahoo! Answers, and Naver Knowledge-iN. Their findings suggest that users who preferred answering tend to have a longer and more active engagement period within the platform. Moreover, garnering enough questions in order to retain the experts seems essential. Additionally, users who put more effort into the average length of the questions they post tend both to receive more answers and to stay engaged longer. Finally, for answerers, acknowledging one's contribution by having one's answers selected as best or being commented on was tied to a longer stay on the platform. Although their work is similar to ours, we used the data for the whole lifespan of the communities, whereas their work mainly focused on a limited period.

Arguably, three of the most popular metrics to measure user engagement on a web-based platform are click-through rates, page views, and time spent by the user on the website (Dupret and Lalmas 2013). The authors in Dupret and Lalmas (2013) used survival analysis to analyze the user engagement in a dataset of questions and answers from Yahoo! Answers in Japan, utilising user absence time (or absence time for short), which is the duration between two consecutive visits by the user, to measure engagement. The intuition is that if a user finds a website more exciting and engaging, they will return to it sooner rather than later. The study's main goal was to identify observable correlations between absence time and user engagement.

Most works in this area related to data from QA communities are mainly focused on the data from a few larger communities, such as Stack Overflow (Ortega et al. 2014). For example, in Pudipeddi et al. (2014), the authors investigated the factors that correlate with user churn on Stack Overflow, including the time gap between posts, answering speed, number of answers received by the user, and reputation of the users who answered the user's questions. Their findings suggest that the time gap between subsequent posts is the most significant indicator of an increase in their interest in staying engaged. Additionally, in Adaji and Vassileva (2015), the authors studied the problem of expert churn on Stack Overflow, formulated as a classification task. To label the users who left the community, the authors used the definition from Karnstedt et al. (2010), where a churner is defined as a user whose average activity over a specific subsequent period has dropped to less than a fraction of their average activity in a previously observed period. In other words, if a user has a noticeable drop in activity following a period of considerable activity (e.g., answering multiple questions for a period and then stopping), the user is considered a churner. To that aim, they used four machine learning methods to predict the churn of expert respondents on Stack Overflow. Their results indicated that the random forest had the highest classification accuracy of the four machine learning algorithms and the highest values for the other evaluation metrics.

With the recent success of deep learning methods in tackling problems in domains such as computer vision and natural language processing, interest in the use of artificial neural networks for handling the censored data used in survival analysis has drastically increased. For example, in Yao et al. (2017), the authors proposed a deep correlational survival model (or DeepCorrSurv for short), which, in contrast with traditional survival analysis methods, can handle multimodal data. In essence, DeepCorrSurv is able to learn the complex interdependencies in multimodal patient data (e.g., the mixture of images and features). Furthermore, recurrent neural network-based approaches have also been successfully combined with survival analysis techniques to predict regularly occurring events, such as the time to check-in by the user to a venue (Yang et al. 2018), and for content recommendation and personalization (Jing and Smola 2017).

Finally, Table 1 shows the differences between the work described in this paper and the work in the literature. Based on the information presented in Table 1, the topic of user disengagement analysis for QA communities has been investigated using two main approaches: as a classification task or as a survival analysis task. Both approaches have three major components: data, model, and disengagement criterion. Furthermore, the two approaches are sufficiently different in their goals and assumptions. Arguably, the most pronounced difference between them is that when disengagement is formulated as a classification task, time is not considered. In other words, it is assumed that the probability that a user gets disengaged is constant and independent of time. In contrast, the central notion behind survival analysis is that the probability of a user becoming disengaged is a function of time. Furthermore, the main goal of survival analysis is to find a good estimate of the survival function, which outputs the probability of the event of interest not happening (in our case, the likelihood that the user stays engaged) at a specific time. In this regard, the main benefit of the survival analysis approach is the possibility of taking into account the censoring of the data. These properties make the work presented in this study different from existing works that use a classification-based approach. In addition, the remaining existing works using survival analysis utilised the Cox model. Some recent studies (e.g., Miao et al. 2015) suggest that the Cox model, compared to a random survival forests model, may have a weaker discriminative power. The main reason for this may be that the Cox model can only infer the linear effects between the target and independent variables, while RSF can

Table 1 Difference with the work in the literature
References | Data and models | Disengagement/churn criterion
Yang et al.
Their find- the main goal of survival analysis is to find a good estimate ings suggest that the time gap between subsequent posts is of the survival function, which outputs the probability of the most significant indicator of an increase in their inter - the event of interest not happening (in our case, the likeli- est in staying engaged. Additionally, in Adaji and Vassileva hood that the user stays engaged) at a specific time. In this (2015), the authors studied the problem of expert churn regard, the main benefit of the survival analysis approach is on Stack Overflow, formulated as a classification task. To the possibility of taking into account the censoring of the label the users who left the community, the authors used data. These properties make the work presented in this study the definition from Karnstedt et al. (2010), where a churner different from existing works that use a classification-based is defined as a user whose average activity over a specific approach. In addition, the remaining existing works using subsequent period has dropped to less than a fraction of their survival analysis utilised the Cox model. Some recent stud- average activity in a previously observed period. In other ies (e.g., in Miao et al. 2015) suggest that the Cox model, words, if a user has a noticeable drop in his activity fol- compared to a random survival forests model, may have a lowing a period of considerable activity (e.g., answering weaker discriminative power. The main reason for this can multiple questions for a period then stopping) he is consid- be because the Cox model can only infer the linear effects ered a churner. To that aim, they used four machine learning between the target and independent variables, while RFS can 1 3 86 Page 4 of 13 Social Network Analysis and Mining (2022) 12:86 Table 1 Difference with the work in the literature References Data and models Disengament/churn criteriun Yang et al. 
(2010) Authors used data from three QA communities for a period User inactivity over 100 days of 2 years. The main model used was Cox model (Cox 1972) Dror et al. (2012) Authors used data from Yahoo! Answers for a period of User inactivity after his first week of joining about nine months. The churn prediction was formulated as a binary classification. Altogether, seven learners were used: the majority, naive Bayes, logistic regression, SVM, decision tree, random forests, and KNN Dupret and Lalmas (2013) Authors used data from Yahoo! Answers Japan for a period User absence time in days of two weeks. Cox model (Cox 1972) was used Pudipeddi et al. (2014) Stack Overflow data for a period of 4 years (from 2008 to No new post by user for six months or more 2012) were used. User churn prediction was formulated as a binary classification task, and three types of classi- fiers were used. Namely, SVM, decision tree, and logistic regression Adaji and Vassileva (2015) Authors used the data from Stack Overflow from a period Decrease in user activity during a follow-up period of 6 years (from 2008 to 2014). The problem of predict- relative to his activity during a previous observa- ing the expert users’ churn was formulated as a binary tion period classification task, and four learners were used. Namely, logistic regression, multi-layer perceptron, random forests, and SVM The approach in this work Data from three QA communities were used; namely, Poli- User absence for an extended period of time tics, Data Science, Computer Science Stack Exchange. The data include the entire lifespan of communities. The prediction of user disengagement was formulated as a sur- vival analysis task, and two methods were used. Namely, Kaplan–Meier and random survival forests handle nonlinearity (Wang et al. 2019). Furthermore, RFS is to individuals for whom the event did not take place, i.e., a nonparametric method; compared to the Cox model, which the censored individuals. 
This difference allows for obtain- is a semi-parametric method, it is more versatile because it ing more accurate estimations. Although survival analysis does not make any assumption about the underlying distribu- originated from the field of medical research, mainly for esti- tion of the data. mating the time a patient would live after being diagnosed having a deadly disease such as breast cancer, it has gained much attention in other areas such as customer churn analy- 3 Preamble sis and prediction (Dias et al. 2020) and time to occurrence of a fault in a system (Widodo and Yang 2011). 3.1 Survival analysis Formally, T ≥ 0 is a random variable that models the time for an event of interest to happen; f(t) and F(t) are its prob- Survival analysis or time-to-event analysis (Cox and Oakes ability distribution and cumulative probability distribution, 2018) is a set of statistical models and methods for esti- respectively. mating the time it takes for a particular event of interest to happen. In a typical survival analysis task, a group of indi- F(t)= f (x)dx (1) viduals (e.g., patients) are observed for a period. For each −∞ individual, the time when the event of interest happened is Furthermore, S(t), called the survival function, is defined recorded. Usually, the event will not occur for all the indi- as the probability that the event did not happen before time viduals in the period of observation. The situation when the t. (Typically, when S(t) is plotted, it is called the survival event of interest did not happen for an individual during the curve.) observation is called censoring. The goal of survival analy- sis is to find the probability of happening of the event of S(t)= P[T > t]= 1 − F(t) (2) interest. 
In this regard, survival analysis is similar to regres- The hazard function h(t), is the instantaneous occurrence sion analysis but with a major difference, where survival rate of the event of interest, and is defined as: analysis models take into account the information related 1 3 Social Network Analysis and Mining (2022) 12:86 Page 5 of 13 86 1. Bootstrap q samples from the data, where q is the num- P[t ≤ T < T +dtT ≥ t] f (t) h(t)= lim = (3) ber of trees. On average, each sample excludes 37% of dt→0 dt S(t) the original data as out-of-bag (OOB) data. Survival and hazard functions can be connected via the fol- 2. Grow a survival tree for each bootstrap sample. At each lowing formula: node of the tree, select m (i.e., a subset of variables t used during the node split) candidate variables. Then − ∫ h(x)dx (4) S(t)= e split the node using the variable that maximises the sur- vival difference between its children nodes. Given n individual samples, each sample i ∈[1...n] is repre- i i i 3. Furthermore, grow the tree to be full under the constraint sented as triplet (A , E , T ) where: where no leaf node should have less than d > 0 deaths. i d The value of d is a hyperparameter, similar to q, which • A ∈ R is a d-dimensional real-valued vector of indi- is chosen to produce the best results. vidual features (i.e., user attributes in our context); 4. Compute the cumulative hazard function (or the survival • E ∈{0, 1} is the variable indicating the event of inter- i function) for each tree. est happened when E = 1 or not (censored) when i 5. Use the OOB data to calculate the prediction error for E = 0 , for individual i during the observation; the ensemble cumulative hazard function (or the sur- • T = min(t , t ) is the time when the event happened i end vival function). for individual i during the observation period; t is end the time when the observation was ended. 
T = t (i.e., end Different implementations of RSF mainly differ in their event did not happen) indicates sample i is censored. splitting rule. Ideally, the splitting rule should maximise the survival difference across two dataset partitions. In this The main task of the survival analysis methods is to esti- paper, we used the implementation from PySurvival library mate h(t) and S(t). (Fotso et al. 2019). 3.4 Concordance index 3.2 Kaplan–Meier estimator The concordance index (or C-index for short) is a generalisa- Kaplan–Meier estimator (Kaplan and Meier 1958) is a tion of the area under the ROC curve (AUC), which supports nonparametric model that calculates the survival function censored data (Harrell et al. 1982). The C-index widely is S (t) of a homogeneous cohort, i.e., the individuals in KM used as an evaluation metric of the performance of survival the same cohort (or group) share the same survival func- models. It summarises the model’s discriminatory power, tion. Given N individual samples in a cohort, it assumes which is how well a model can rank the survival times of that there are J distinct actual event times such that samples. Similar to AUC, the value of the C-index ranges t < t < ⋯ < t when J ≤ N , t hen: 1 2 J from 0.5 to 1, where 1 indicates the best performance. More formally, given S(t) be the survival function esti- S (t)= 1 − , KM (5) ∗ ∗ mated by some survival model, let t ,… , t be a set of fixed t ≤t j 1 s time points, e.g., t ,… , t where N is a distinct time index. 1 N where d is the individuals who experienced an event and n Then C-index is defined as: j j is the number of individuals that did not experience the event ∗ ∗ in time interval [t , t ]. C = 1[S(t ) > S(t )], j−1 j i j (6) j∶t <t Kaplan–Meier method only uses the information from i∶E =1 i j i i E and T to estimate the survival function. where M is the total number of comparable pairs and 1[.] is a function that will return 1 if its input argument is true or 0 otherwise. 
Note that there are slightly different definitions 3.3 Random survival forests for C-index in other works. In this work, we used the defini- tion proposed by Utkin et al. (2019). Ishwaran et al. (2008) proposed the random survival for- ests (RSF) model, which is an extension to the random 3.5 Log‑rank test forests ensemble model (Breiman 2001) for working with censored data. The general idea for creating an RSF model The log-rank test (Mantel 1966) is a nonparametric statis- for a particular dataset is as follows (Utkin et al. 2019; tical test for comparing the hazard functions, i.e., h(t), of Ishwaran et al. 2008): two cohorts/groups of individuals. The null hypothesis is 1 3 86 Page 6 of 13 Social Network Analysis and Mining (2022) 12:86 Table 2 Information about the datasets and the state of human rights. DS SE covers topics concern- ing the widespread field of data science. And CS SE covers Characteristic Dataset topics related to computer science. We chose these three Pol DS CS communities for two reasons. Firstly, although the sizes of these communities are smaller than the sizes of some other Number of questions 12,416 28,950 40,792 QA communities hosted on SE like Stack Overflow, the cho- Number of users 31,242 100,582 113,434 sen communities are thriving in their niche. Secondly, each Number of answers 25,909 32,334 46,785 of these communities is more or less focused on separate Number of comments 135,648 64,244 167,038 fields that, although they might share some topics, are dif- Year founded 2012 2014 2008 ferent enough to be viewed as distinct. It allows us to search for possible patterns related to disengagement, regardless of that the hazard functions of two groups, e.g., group 1 and 2, the specific topics of a field. are equal, i.e., h (t)= h (t) . The Log-rank test assumes that The datasets of the three communities were downloaded 1 2 survival probabilities (i.e., the probabilities of not becom- from the Stack Exchange data dump available on Archive. 
ing disengaged in our context) stay the same over time. It org. The data included the complete historical information is widely used to check whether the underlying survival about the questions and answers posted on the three QA distributions of two groups are the same or are different, communities from their inception until May 2021. Table 2 essentially. shows the general information about the datasets. 4.2 Community characteristics 4 Data Table 3 includes the summary statistics for users belong- 4.1 Data description ing to three communities whose data are used in this study. The information shown in the table was extracted from As mentioned earlier, we used data from three online QA the corresponding Users table for each community from platforms that are Politics (Pol), Data Science (DS), and the Stack Exchange data dump. Based on the informa- Computer Science (CS) Stack Exchange (SE). Pol SE is an tion presented in Table  3, for all three communities the ad-hoc QA community focused on politically-themed con- distribution of first four attributes (i.e., user reputation, tent, such as questions related to the nature of democracy profile views, upvotes, and downvotes) seems to follow a Table 3 Summary statistics of Community Attribute Statistics users in each community Mean STD Median First quartile Fourth quartile Pol User reputation 160.50 1377.06 101 1 101 Profile views 10.01 107.82 0 0 1 Upvotes 12.38 126.12 0 0 2 Downvotes 2.35 56.70 0 0 0 Year joined 2017 1.94 2018 2016 2019 DS User reputation 50.78 195.70 1 1 101 Profile views 1.58 24.44 0 0 0 Upvotes 1.36 26.76 0 0 0 Downvotes 0.13 10.47 0 0 0 Year joined 2018 1.81 2018 2017 2020 CS User reputation 67.50 962.95 3 1 101 Profile views 3.60 119.03 0 0 1 Upvotes 2.39 94.92 0 0 0 Downvotes 0.31 32.94 0 0 0 Year joined 2017 2.37 2017 2015 2019 https:// archi ve. org/ downl oad/ stack excha nge; the data are available under Creative Commons licences. 
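Statistics of the kind reported in Table 3 can be computed with a few lines of code. The following is a minimal sketch only: the function name `summarise`, the sample values, and the choice of the sample standard deviation and the inclusive quartile method are assumptions made for illustration, not details taken from the study.

```python
import statistics


def summarise(values):
    """Return mean, STD, median, and first/third quartiles of a list of numbers.

    Assumption: sample standard deviation and the 'inclusive' quartile method;
    the source does not state which variants were used.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical reputation scores (not from the datasets): most users stay at
# a low default value while a few very active users dominate the total.
reputation = [1, 1, 1, 1, 1, 1, 101, 101, 250, 5000]
stats = summarise(reputation)

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

In this toy sample the mean lies far above the median and the STD exceeds the mean, which is the same qualitative shape that the reputation rows of Table 3 show.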
Moreover, relative to users from the other two communities, users in Pol SE show more intense activity on average. Although the number of registered users on Pol SE is smaller than on the other two communities, the average values of the first four user attributes listed in Table 3 are noticeably larger. Additionally, regarding the growth in the number of registered users in each community, DS SE had the fastest growth relative to the two other communities. It took only 4 years for DS SE to reach half of its registered users, while it took Pol SE and CS SE 6 and 9 years, respectively. Furthermore, all three communities show a large increase in the number of users lately (i.e., from around 2018 onwards), which we suspect to be due to the tipping point phenomenon (Singh et al. 2020).

4.3 User attributes

The bulk of users in QA platforms do not make any contributions. These users, who are referred to as lurkers in some previous work (e.g., Tagarelli and Interdonato 2018), can be differentiated from the normal users, who include the experts, by their level of contribution to the platform. In this work, two categories of user attributes were used in order to investigate the relationship between user contributions and the probability of disengagement; namely, behavioural attributes and content-based attributes.

4.3.1 Behavioural attributes

We identified five user attributes in the datasets that directly correspond to the level of user contribution. These attributes are primarily based on information related to user behaviour that seems crucial to the proper functioning of the platform. Table 4 includes the name and description of these attributes.

Table 4 Information about behavioural user attributes

User attribute | Description
--- | ---
$A_1$ | Number of downvotes cast by user i
$A_2$ | Number of upvotes cast by user i
$A_3$ | Number of questions posted by user i
$A_4$ | Number of answers posted by user i
$A_5$ | Number of comments written by user i

4.3.2 Content-based attributes

In addition to behavioural attributes, we selected a set of content-based user attributes. These attributes hint at how favourably the contributions made by each user might have been perceived by the community, i.e., other users. The primary motivation is that users can indirectly contribute to the platform, e.g., by asking a question that starts a stream of debates over a controversial topic such as the refugee crisis in the context of the Pol SE community. The information about this type of indirect user contribution, which is not limited to the behaviour of a particular user, can be extracted from user content (e.g., mainly from the metadata of users' posts). Table 5 includes the name and description of the content-based attributes employed in this work.

Table 5 Information related to the content-based attributes

User attribute | Description
--- | ---
$A_6$ | Average number of times questions posted by user i were viewed
$A_7$ | Average number of comments written for questions posted by user i
$A_8$ | Average score (i.e., sum of upvotes and downvotes given to question) of the questions posted by user i
$A_9$ | Average score (i.e., sum of upvotes and downvotes given to answer) of the answers posted by user i
$A_{10}$ | Average number of comments written for answers posted by user i
$A_{11}$ | Average number of upvotes given to the comments posted by user i

4.4 User representation

Based on the two types of user attributes mentioned earlier, we constructed a numerical vector for each user, called the user representation vector (or URV for short). The URV of user i is defined as:

$$URV_i = (A^i_1, A^i_2, \ldots, A^i_j) \tag{7}$$

where $A^i_j$ is the corresponding user attribute from Tables 4 and 5.

Table 6 Sets of users defined by dichotomizing users based on their behavioural attributes corresponding to their contribution

User set | Definition
--- | ---
$Q$ | Users who posted at least one question
$\bar{Q}$ | Users who posted no question
$A$ | Users who answered at least one question
$\bar{A}$ | Users who did not answer any question
$C$ | Users who commented on at least one question/answer
$\bar{C}$ | Users who did not comment on any question/answer
$U$ | Users who upvoted at least one question/answer/comment
$\bar{U}$ | Users who did not upvote
$D$ | Users who downvoted at least one question/answer
$\bar{D}$ | Users who did not downvote

Table 7 Sizes of each user set per dataset

User set | Pol | DS | CS
--- | --- | --- | ---
$|Q|$ | 3775 (12%) | 16,041 (16%) | 20,841 (18%)
$|\bar{Q}|$ | 27,466 (88%) | 84,540 (84%) | 92,592 (82%)
$|A|$ | 3743 (12%) | 7226 (7%) | 7020 (6%)
$|\bar{A}|$ | 27,498 (88%) | 93,355 (93%) | 106,413 (94%)
$|C|$ | 6358 (20%) | 12,056 (12%) | 16,788 (15%)
$|\bar{C}|$ | 24,883 (80%) | 88,525 (88%) | 96,645 (85%)
$|U|$ | 11,025 (35%) | 16,328 (16%) | 20,767 (18%)
$|\bar{U}|$ | 20,216 (65%) | 84,253 (84%) | 92,666 (82%)
$|D|$ | 1596 (5%) | 732 (1%) | 1143 (1%)
$|\bar{D}|$ | 29,645 (95%) | 99,849 (99%) | 112,290 (99%)
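The user representation vector of Eq. (7) and the set definitions of Table 6 can be illustrated with a small data structure. This is a sketch under stated assumptions: the class `UserAttributes`, its field names, and the helper functions are hypothetical names introduced here for illustration, and the sample values are made up.

```python
from dataclasses import dataclass, astuple


@dataclass
class UserAttributes:
    """Per-user attributes; all field names are hypothetical."""
    # Behavioural attributes (Table 4)
    downvotes_cast: int
    upvotes_cast: int
    questions: int
    answers: int
    comments: int
    # Content-based attributes (Table 5)
    avg_question_views: float
    avg_question_comments: float
    avg_question_score: float
    avg_answer_score: float
    avg_answer_comments: float
    avg_comment_upvotes: float


def urv(user):
    """Flatten the attributes into one numerical vector, in the spirit of Eq. (7)."""
    return tuple(float(value) for value in astuple(user))


def user_sets(user):
    """Membership in the sets of Table 6: True means the set, False its complement."""
    return {
        "Q": user.questions > 0,
        "A": user.answers > 0,
        "C": user.comments > 0,
        "U": user.upvotes_cast > 0,
        "D": user.downvotes_cast > 0,
    }


# A made-up user who asked one question, wrote two comments, and cast three upvotes.
example = UserAttributes(0, 3, 1, 0, 2, 10.0, 0.5, 1.0, 0.0, 0.0, 0.2)
vector = urv(example)
membership = user_sets(example)
print(vector)
print(membership)
```

Keeping the behavioural and content-based fields in one record makes it straightforward to build vectors from either group alone or from both together by selecting the relevant slice of the tuple.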
4.5 Dichotomizing users

Moreover, we dichotomised users into pairs of disjoint sets (or groups) using each of the behavioural user attributes. The main idea is that users can be partitioned naturally into two groups, where the criterion for the split is whether a user has made a particular type of contribution or not. Categorising users into two disjoint sets based on the value of a behavioural attribute allowed us to investigate the importance of one specific user attribute, e.g., by comparing the survival curves of the two disjoint sets of users who posted at least one question and who did not. Table 6 includes the information about each group of users and its counterpart.

4.6 Disengagement criterion

What amounts to the event of a user becoming disengaged is domain-dependent and thus can vary in different settings. For example, in medical research, the event usually is the patient's death (Cox and Oakes 2018). In this work, we opted to use the information about the last time a user visited the QA platform to detect disengagement. This information is available in the LastAccessDate column of the Users table in each dataset. The activity time of a user was calculated as the number of months from the time the user joined the platform until the last recorded activity time of the user (i.e., the LastAccessDate associated with the user). For user i, if the number of months since the last visit to the platform exceeded a certain threshold value, the user would be tagged as disengaged (i.e., $E^i = 1$); otherwise, the user's state was considered still active (i.e., $E^i = 0$), which means the information about the user's disengagement was censored. More precisely, let $t^i_l$ be the last time user i was seen visiting the platform, and θ be the threshold value; then the value of $E^i$ is set based on the following relation:

$$E^i = f(t^i_l) = \begin{cases} 0, & \text{if } d(t^i_l, t_d) \le \theta \\ 1, & \text{otherwise} \end{cases} \tag{8}$$

where $t_d$ is the last recorded time in the dataset, and $d(t^i_l, t_d)$ is the time difference between $t^i_l$ and $t_d$ in months. In this work, we used two threshold values (i.e., θ) of 24 and 36 months. Consequently, users who had not visited the platform for more than 2 and 3 years, respectively, were considered disengaged.

5 Results

5.1 Results from Kaplan–Meier

The Kaplan–Meier method was used to estimate (within the confidence interval of 95%) the survival functions (i.e., $S(t)$) of sets of users dichotomised based on the definitions shown in Table 6. The implementation from the lifelines library (Davidson-Pilon et al. 2021) was used to produce the survival curves. Figures 2, 3, and 4 show the survival curves of each pair of user sets for the Pol, DS, and CS SE datasets, respectively. Table 7 includes the information about the size and proportion of each user set dichotomised based on a single behavioural attribute. As mentioned earlier, for each user, the label indicating whether the user is disengaged or not (i.e., $E^i$) was censored if the difference between the user's last activity time and $t_d$ was less than or equal to 24 and 36 months, respectively. The log-rank test (with p < 0.005) was performed on each pair of curves.

Fig. 2 Survival curves for Pol SE dataset estimated using Kaplan–Meier method. Panels: (a, b) users in $Q$ versus $\bar{Q}$; (c, d) users in $A$ versus $\bar{A}$; (e, f) users in $C$ versus $\bar{C}$; (g, h) users in $U$ versus $\bar{U}$; (i, j) users in $D$ versus $\bar{D}$. In each pair, the first panel uses θ = 24 and the second θ = 36. Axes: time (months) versus probability of being active.

Fig. 3 Survival curves for DS SE dataset estimated using Kaplan–Meier method. Panels (a) to (j) as in Fig. 2. Axes: timeline (months) versus probability of being active.

Fig. 4 Survival curves for CS SE dataset estimated using Kaplan–Meier method. Panels (a) to (j) as in Fig. 2. Axes: timeline (months) versus probability of being active.

5.2 Results from RSF

We used k-fold cross-validation (with k = 5) in 30 runs to make the predictions (using RSF models). Different values of model hyperparameters, such as the number of trees (i.e., $q$), were tested in order to choose the ones that lead to the best results. We used the C-index to evaluate the performance of the models. Only data for users with contributions were used to train and evaluate the RSF-based models. In other words, only the information of users belonging to $Q \cup A \cup C \cup U \cup D$ (from Table 6) was used. For each user, three URVs were constructed, using the behavioural user attributes only, the content-based attributes only, and finally, a combination of both. Table 8 includes the average C-indexes computed for the RSF models over the runs.

Table 8 Average C-index for RSF models using different attribute sets; higher C-index indicates better prediction

Dataset | Attributes | θ = 24 Mean | θ = 24 STD | θ = 36 Mean | θ = 36 STD
--- | --- | --- | --- | --- | ---
Pol | Behavioural only | 0.75 | 0.01 | 0.76 | 0.01
Pol | Content-based only | 0.68 | 0.01 | 0.68 | 0.01
Pol | Behavioural plus content-based | 0.75 | 0.01 | 0.76 | 0.01
DS | Behavioural only | 0.66 | 0.01 | 0.66 | 0.01
DS | Content-based only | 0.61 | 0.01 | 0.63 | 0.01
DS | Behavioural plus content-based | 0.68 | 0.01 | 0.70 | 0.01
CS | Behavioural only | 0.68 | 0.00 | 0.68 | 0.01
CS | Content-based only | 0.62 | 0.01 | 0.63 | 0.01
CS | Behavioural plus content-based | 0.69 | 0.01 | 0.68 | 0.01

5.3 Attribute importance

We used the permutation importance measure (Breiman 2001; Molnar 2020) available in RSF models to rank each user attribute. Attributes whose permuted values caused larger prediction errors were ranked as more important. We chose permutation importance mainly because of its intuitive definition and interpretation: the importance of a variable is the increase in model error when the variable's information is destroyed via value permutation (Molnar 2020). Moreover, it provides a compact global insight into the model's behaviour. Figure 5 shows the per-dataset average permutation importance of each attribute on the prediction results of the RSF models used in this work.

Fig. 5 Average permutation importance of each attribute; models are trained and evaluated using behavioural and content-based attributes simultaneously over the datasets shown in Table 2. Panels: (a) Pol SE dataset with θ = 24; (b) Pol SE dataset with θ = 36; (c) DS SE dataset with θ = 24; (d) DS SE dataset with θ = 36; (e) CS SE dataset with θ = 24; (f) CS SE dataset with θ = 36. Axes: attributes $A_1$ to $A_{11}$ versus importance.

6 Discussion

Based on the results from the Kaplan–Meier method, we observed that: (i) the underlying hazard function of each set of users seems to be different; (ii) the probability of staying active is noticeably higher for users with even a few contributions (e.g., a user who asked one question) than for users who did not contribute to the platform. We observed a distinctive difference between the survival functions of the users who contributed to the platform and those who did not contribute. This pattern, which is present in all three datasets regardless of the community niche, confirms the findings of previous related studies, such as Joyce and Kraut (2006) and Yang et al. (2010), which suggested that users with even a few initial contributions are more likely to stay loyal than users without any contributions. The latter make up the bulk of the users. Furthermore, the gap between the probabilities of disengagement of the two groups seems to widen over time.

Predictions using RSF models show relatively similar patterns on all three datasets. The results (shown in Table 8) indicate that including the information of behavioural attributes leads to better predictions compared to using content-based user attributes only. Furthermore, using the mixture of behavioural and content-based attributes yielded at most a slight improvement. The value of θ does not seem to affect the overall results.

Based on the permutation importance of attributes (see Fig. 5), behavioural features play a more salient role in the output of the RSF models. On average, 4 out of the 5 attributes with the highest permutation importance are behavioural user attributes. The number of upvotes (i.e., $A_2$) received the highest importance in all three datasets. It was followed, with a noticeable gap, by the average number of times the user's questions were viewed (i.e., $A_6$), which received relatively high importance. We suspect that a higher number of upvotes cast by a user might show that the user holds a favourable view of the community (or platform) in general. On the other hand, the number of downvotes did not contain much predictive information. We suspect this could be due to the small number of users with downvoting activity in the datasets.

7 Limitations and future work

There are a few limitations regarding the work done in this paper. The datasets came from only three QA communities hosted by the (larger) Stack Exchange platform. Consequently, this work did not investigate and compare disengagement on other major platforms such as Quora. It would be interesting to compare our results with results obtained from other major QA platforms in the future. Conventional assumptions related to the application of survival analysis techniques hold over our results, e.g., the assumption that the probabilities of disengagement of censored and non-censored individuals are essentially the same. We used user inactivity for an extended period (e.g., 2 years passed since the user visited the community web pages) to distinguish between disengaged and censored users. This required a time threshold whose value is set experimentally, not based on a well-defined rule. Finally, for most users, behavioural information does not exist, making it hard to investigate further the survival probabilities of users dichotomised based on the definitions given in Table 6.

Including data from a larger and more diverse set of QA platforms could be interesting for future work. Furthermore, the information related to the body of the posts (e.g., the text of questions and answers) of each user could be utilised to find the probabilities of disengagement. Additionally, methods and models that do not assume that the probabilities for the disengaged and censored users are the same can be used, which theoretically should lead to better predictions.

Temporal context may play an essential role in the intensity of user activities in a community and subsequently be an informative factor in the level of user engagement. By temporal context, we mean the effects of real-world events occurring within a specific time period on the behaviour of users of a QA community. Examples of such events include Brexit and the political campaigns during an election in an influential country such as the USA.

8 Conclusion

We used survival analysis to study user disengagement using the historical data from three distinct QA communities from their inception to May 2021. We employed two categories of user attributes and investigated the importance of these attributes. Our results confirm the previous findings that users with some initial contributions (e.g., questions and answers) are likelier to stay active longer than users who contributed nothing. Furthermore, based on our results, behavioural user attributes can be used to estimate the disengagement probability of each user with reasonable accuracy. Moreover, based on the importance of attributes used to

Funding Open access funding provided by NTNU Norwegian University of Science and Technology (incl St. Olavs Hospital - Trondheim University Hospital). This work was carried out as part of the Trondheim Analytica project (https://www.ntnu.edu/trondheimanalytica), supported by NTNU's Digital Transformation programme.

Data availability The data used in this study are publicly available from Archive.org under Creative Commons licences. Furthermore, for reproducibility, the code and other related artefacts, such as the preprocessed version of the data used in the experiments, are also available on GitHub.com (https://github.com/habedi/SurvivalAnalysisQACommunities).

Declarations

Conflict of interest I do not have any conflicts or competing interests to declare.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in train and evaluate the models, how favourable users see the the article's Creative Commons licence and your intended use is not content posted on the platform seems to affect the disengage- permitted by statutory regulation or exceeds the permitted use, you will ment time. need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://cr eativ ecommons. or g/licen ses/ b y/4.0/ . Acknowledgements I thank the reviewers and my colleague Yanzhe Bekkemoen for helping me improve the manuscript. 1 3 Attribute Attribute Attribute Attribute Attribute Attribute Social Network Analysis and Mining (2022) 12:86 Page 13 of 13 86 Miao F, Cai YP, Zhang YT et al (2015) Is random survival forest an References alternative to Cox proportional model on predicting cardiovascu- lar disease? In: MBEC, pp 740–743 Adaji I, Vassileva J (2015) Predicting churn of expert respondents in Molnar C (2020) Interpretable machine learning. Lulu.com social networks using data mining techniques: a case study of Ortega F, Convertino G, Zancanaro M et al (2014) Assessing the per- stack overflow. In: ICMLA. IEEE, pp 182–189 formance of question-and-answer communities using survival Breiman L (2001) Random forests. Mach Learn 45(1):5–32 analysis. arXiv preprint arXiv: 1407. 5903 Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B Pudipeddi JS, Akoglu L, Tong H (2014) User churn in focused ques- (Methodol) 34(2):187–202 tion answering sites: characterizations and prediction. In: WWW, Cox DR, Oakes D (2018) Analysis of survival data. Chapman and pp 469–474 Hall, London Rothmeier K, Pflanzl N, Hüllmann JA et al (2021) Prediction of player Davidson-Pilon C, Kalderstam J, Jacobson N et al (2021) CamDavid- churn and disengagement based on user activity data of a free- sonPilon/lifelines: 0.26.0. 10.5281/zenodo.4816284 mium online strategy game. 
IEEE Trans Games 13(1):78–88 Dias J, Godinho P, Torres P (2020) Machine learning for customer Singh A, Dharamshi N, Thimma Govarthanarajan P et al (2020) The churn prediction in retail banking. In: Computational science and tipping point in social networks: investigating the mechanism its applications, pp 576–589 behind viral information spreading. In: BigDataService, pp 54–61 Dror G, Pelleg D, Rokhlenko O et al (2012) Churn prediction in new Stepanova M, Thomas L (2002) Survival analysis methods for personal users of Yahoo! Answers. In: WWW, pp 829–834 loan data. Oper Res 50(2):277–289 Dupret G, Lalmas M (2013) Absence time and user engagement: evalu- Tagarelli A, Interdonato R (2018) Mining lurkers in online social net- ating ranking functions. In: WSDM, pp 173–182 works: principles, models, and computational methods. Springer, Fotso S et al (2019) PySurvival: open source package for survival Berlin analysis modeling. https:// www. pysur vival. io/ Utkin LV, Konstantinov AV, Chukanov VS et al (2019) A weighted Guan T, Wang L, Jin J et al (2018) Knowledge contribution behavior random survival forest. Knowl Based Syst 177:136–144 in online Q &A communities: an empirical investigation. Comput Wang P, Li Y, Reddy CK (2019) Machine learning for survival analy- Hum Behav 81:137–147 sis: a survey. ACM Comput Surv 51(6):1–36 Harrell FE, Califf RM, Pryor DB et al (1982) Evaluating the yield of Widodo A, Yang BS (2011) Machine health prognostics using sur- medical tests. JAMA 247(18):2543–2546 vival probability and support vector machine. Expert Syst Appl Ishwaran H, Kogalur UB, Blackstone EH et al (2008) Random survival 38(7):8430–8437 forests. Ann Appl Stat 2(3):841–860 Yang J, Wei X, Ackerman M et al (2010) Activity lifespan: an analysis Jin J, Li Y, Zhong X et al (2015) Why users contribute knowledge to of user survival patterns in online knowledge sharing communi- online communities: an empirical study of an online social Q &A ties. In: ICWSM community. 
Inf Manag 52(7):840–849 Yang G, Cai Y, Reddy CK (2018) Spatio-temporal check-in time pre- Jing H, Smola AJ (2017) Neural survival recommender. In: WSDM, diction with recurrent neural network based survival analysis. In: pp 515–524 IJCAI, pp 2976–2983 Joyce E, Kraut RE (2006) Predicting continued participation in news- Yao J, Zhu X, Zhu F et al (2017) Deep correlational learning for sur- groups. J Comput Mediat Commun 11(3):723–747 vival prediction from multi-modality data. In: MICCAI Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481 Publisher's Note Springer Nature remains neutral with regard to Karnstedt M, Hennessy T, Chan J et al (2010) Churn in social net- jurisdictional claims in published maps and institutional affiliations. works: a discussion boards case study. In: ICSC, pp 233–240 Kuzmeski M (2009) The connectors: how the world’s most successful businesspeople build relationships and win clients for life. Wiley, Hoboken Mantel N (1966) Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 50(3):163–170 1 3
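
Supplementary sketch. The article separates disengaged from censored users with an inactivity threshold θ and then estimates survival curves with the Kaplan–Meier method. The plain-Python sketch below illustrates how those two steps could fit together; it is not the author's code. The function names (label_user, kaplan_meier), the toy user dates, the definition of duration as months between first and last visit, and the choice of "inactive for at least θ months" as the cut-off are all assumptions made for illustration. Only the May 2021 snapshot and the values θ = 24 and θ = 36 come from the article.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start to end (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def label_user(first_seen, last_seen, snapshot, theta_months):
    """Return (duration_months, event) for one user.

    Assumption: a user counts as disengaged (event = 1) when the gap between
    the last visit and the snapshot is at least theta_months; otherwise the
    user is treated as censored (event = 0).
    """
    duration = months_between(first_seen, last_seen)
    inactive = months_between(last_seen, snapshot)
    return duration, 1 if inactive >= theta_months else 0


def kaplan_meier(durations, events):
    """Product-limit estimate: list of (t, S(t)) at each observed event time."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        here = [e for d, e in zip(durations, events) if d == t]
        disengaged = sum(here)              # events observed at time t
        censored = len(here) - disengaged   # users censored at time t
        if disengaged:
            survival *= 1 - disengaged / at_risk
            curve.append((t, survival))
        at_risk -= disengaged + censored
    return curve


# Hypothetical users: (first visit, last visit). Snapshot taken from the
# article's data cut-off (May 2021).
SNAPSHOT = date(2021, 5, 1)
USERS = [
    (date(2015, 1, 1), date(2016, 1, 1)),
    (date(2019, 1, 1), date(2021, 3, 1)),
    (date(2016, 5, 1), date(2018, 5, 1)),
    (date(2017, 1, 1), date(2018, 11, 1)),
]

for theta in (24, 36):
    labelled = [label_user(f, l, SNAPSHOT, theta) for f, l in USERS]
    durations = [d for d, _ in labelled]
    events = [e for _, e in labelled]
    print(theta, kaplan_meier(durations, events))
```

In this toy data the fourth user has been inactive for 30 months, so the user is labelled disengaged under θ = 24 but censored under θ = 36. This is one way to see why the article lists the experimentally chosen threshold as a limitation: the same activity log yields different event labels, and therefore different curves, depending on θ.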

Journal

Social Network Analysis and Mining, Springer Journals

Published: Dec 1, 2022

Keywords: Question-and-answering platforms; User disengagement; Survival analysis; Stack exchange

References