Vietnam Journal of Computer Science (2018) 5:165–175

Recently, with the rapid growth of social networks, users have joined more and more of these networks and live their lives virtually. Consequently, they create a huge amount of data on these social networks: their profiles, interests, and behaviors such as posting, commenting, liking, and joining groups or communities. This brings some new challenges to researchers: do users having the same profile or interests show the same behavior? And vice versa, do users having the same behavior have an interest in the same things? One of the basic issues in these challenges is the problem of estimating the similarity among users on these social networks based on their profiles, interests, and behavior. This paper presents a model for estimating the similarity between users based on their behavior on social networks. The considered behaviors are activities including posting entries, liking these entries, commenting, and liking the comments on these entries. The model is then evaluated with a dataset of users collected from Twitter. The results show that the model correctly estimates the similarity among users in the majority of cases.

Keywords User similarity · Behavior similarity · Entry similarity · Social network

Manh Hung Nguyen: nmh.nguyenmanhhung@gmail.com; mhnguyen@ptit.edu.vn
Thi Hoi Nguyen: hoint2002@gmail.com
Dinh Que Tran: tdque@yahoo.com; quetd@ptit.edu.vn
Gia Manh Dam: damgiamanh@gmail.com
Affiliations: Thuongmai University Hanoi, Hanoi, Vietnam; Posts and Telecommunication Institute of Technology (PTIT), Hanoi, Vietnam; UMI UMMISCO 209 (IRD/UPMC), Hanoi, Vietnam

1 Introduction

Nowadays, with the explosive growth of social networks, more and more people are joining these networks. In these digital worlds, users freely present themselves, share information about their favorites and passions, or share their personal opinions on economic, social, cultural, and other issues through several activities on the social network, such as posting entries, sharing video clips, images, or news they read, and then leaving comments on or liking these entries or the comments of others. Consequently, a huge amount of data is created on the social network. This data attracts many researchers, businesses, and others who want to mine and exploit it. This tendency also brings some new challenges to researchers: do users having the same profile or interests show the same behavior? And vice versa, do users having the same behavior have an interest in the same things? One of the basic issues in these challenges is the problem of estimating the similarity among users on these social networks based on their profiles, interests, and behavior.

The problem of detecting the similarity or the difference between users is based not only on the user profile on the social network, but also on the data about user behavior such as posting entries, commenting, and liking. This problem has been attracting many researchers. For instance, Raad et al. [16] and Peled et al. [15] proposed a model to measure the similarity between user profiles. Anderson et al. [1] calculated the similarity between user characteristics. Liu et al. [8] estimated the similarity among preferences of user behavior. Liu et al. [9] and Chen et al. [4] measured the similarity among user mobility behavior. Xu et al. [23] analyzed the user posting behavior on a popular social media website. Erlandsson et al. [5] proposed association learning to detect relationships between users. Benevenuto et al. [2] presented an analysis of user workloads in online social networks. Singh et al. [18] formulated a metric based on the common words used in social networks to measure the user similarity in textual posting. Zou et al. [26] mined individual behavior patterns and studied user similarity. Sun et al. [19] proposed a mapping method, which integrates text and structure information for similarity computation. Guo et al. [6] developed a model to estimate continuous tie strength between users for friend recommendation with heterogeneous data from a social media community. Nguyen et al. [11] aimed to understand the strategies users employ to make retweet decisions. Liu and Terzi [10] approached the privacy issues raised in online social networks from the individual user's viewpoint: they proposed a framework to compute the privacy score of a user. Tang et al. [20] adopted a "microeconomics" approach to model and predict the individual retweet behavior. Xu et al. [22] introduced several methods to identify online communities with similar sentiments in online social networks. Zhao et al. [25] proposed to separately model users' topical interests that come from various behavioral signals to construct better user profiles. Vedula et al. [21] detected pairwise and global trust relations between users in the context of emergent real-world crisis scenarios. Jamali and Ester [7] explored social rating networks, which record not only social relations but also user ratings for items. Bhattacharyya et al. [3] studied the relationship between the semantic similarity of user profile entries and the social network topology. In the model of Zhao et al. [24], two social factors, interpersonal rating behavior similarity and interpersonal interest similarity, are fused into a consolidated personalized recommendation model based on probabilistic matrix factorization.

Most of these works try to estimate the similarity among users based on the user profile, user interests or favorites, or user relationships on the social network. However, few works estimate the similarity among social network users based on their activities on the social network.

In line with our previous works [13,14], this paper introduces a model for measuring the similarity between users based on their behavior on a social network. In this model, the similarity between users is estimated from the similarity of their behaviors, such as posting an entry or sharing an existing entry, liking an entry or liking a comment, commenting on a post, and joining a group or a community. The similarity of user behavior on these activities is also estimated based on the content of the entries that they post or like, or the content of their comments on these entries. The similarity among entries is estimated based on the content, tags, category, sentiment, and emotion included in these entries [14]. The model is then evaluated with a dataset of users collected from Twitter. The results show that the model correctly estimates the similarity among users in the majority of cases.

The paper is organized as follows: Sect. 2 presents the similarity model. Section 3 describes experiments that evaluate the proposed model with empirical data. Section 4 gives the conclusion and perspectives.

2 A model for estimation of similarity among social network users based on their behaviors

2.1 Notations

Without loss of generality, we assume that:

– A social network is a 4-tuple $N = \langle U, G, E, B \rangle$, in which:
  – $U = \{u_1, u_2, \ldots, u_m\}$ is a set of users,
  – $G = \{g_1, g_2, \ldots, g_n\}$ is a set of communities or groups,
  – $E = \{e_1, e_2, \ldots, e_k\}$ is a set of entries of users on the social network $N$,
  – $B = \{b_1, b_2, \ldots, b_l\}$ is a set of behaviors of each user $u \in U$ on the group $g \in G$ or on the entry $e \in E$ on the social network $N$.
– A user could post a status, an image, or a video clip that we call an entry $e$. An entry $e$ could be viewed by a set of users $U$. Each user, within an entry, could like the entry, comment on the entry, or share that entry on their homepage.
– Each user $u$, within an entry, could like a set of comments of the entry. A user could like a page or join a group. In this case, the user is a member of a community or group of the social network. A user could post an entry in a community or a group, like an entry, comment on an entry, like a set of comments in an entry of a community, or share an entry.

2.1.1 Entry

Generally, an entry on a social network can be a video, an image, a text, or a combination of these. However, in this paper, we only consider entries that contain textual content. Entries that contain no text, such as videos or images alone, are ignored. Therefore, estimating the similarity of users based on their entries reduces to estimating the similarity between texts.

On a social network, there is a set of users $U = \{u_1, u_2, \ldots, u_m\}$. Each user $u_i$ is characterized by a set of posted entries $E_i$ and a set of behaviors $B_i$ on the social network. Each user $u_i \in U$ has a set of entries $E_i = \{e^i_1, e^i_2, \ldots, e^i_k\}$ and each entry $e^i_j \in E_i$ has a set of features: $e^i_j = \{f^j_1, f^j_2, \ldots, f^j_p\}$.

An entry could have several features, including explicit features such as the content, and implicit features such as tag, category, sentiment, and emotion. As the implicit features cannot be directly extracted from an entry, the model needs a step to extract these features before estimating the similarity on entries.
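The notation above can be pictured as plain data structures. The following is only a minimal sketch: the class and field names (`Entry`, `Group`, `Behavior`, `User`, `SocialNetwork`, `kind`, `target`) are hypothetical and chosen for illustration, and representing every implicit feature as a set of strings is an assumption, not something the paper prescribes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """A textual entry e with one explicit and four implicit features."""
    content: str                                        # explicit feature
    tags: set[str] = field(default_factory=set)         # implicit features,
    category: set[str] = field(default_factory=set)     # filled in by a later
    sentiment: set[str] = field(default_factory=set)    # extraction step
    emotion: set[str] = field(default_factory=set)


@dataclass
class Group:
    """A community or group g (hypothetical minimal form: only a name)."""
    name: str


@dataclass
class Behavior:
    """A behavior b of a user on an entry or a group."""
    kind: str               # e.g. "post", "like", "comment", "share", "join"
    target: Entry | Group   # the entry or group the behavior acts on


@dataclass
class User:
    """A user u_i with entries E_i and behaviors B_i."""
    user_id: str
    entries: list[Entry] = field(default_factory=list)
    behaviors: list[Behavior] = field(default_factory=list)


@dataclass
class SocialNetwork:
    """The 4-tuple N = <U, G, E, B>; B is reachable through each user."""
    users: list[User] = field(default_factory=list)
    groups: list[Group] = field(default_factory=list)
    entries: list[Entry] = field(default_factory=list)


# Usage: one user posting one entry whose implicit features are still empty.
entry = Entry(content="a sample status text")
user = User(user_id="u1", entries=[entry])
user.behaviors.append(Behavior(kind="post", target=entry))
network = SocialNetwork(users=[user], entries=[entry])
```

Keeping the implicit features empty by default mirrors the point made in the text: they have to be extracted in a separate step before entries can be compared.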
This model considers five features of an entry:

– Content of entry $e_i$, noted as $f^i_{cont}$: the content is the whole text part of the entry itself. This is an explicit feature.
– Tags of entry $e_i$, noted as $f^i_{tags}$: an entry could be tagged with a set of tags. Each tag is an independent word or expression. In some cases, tags are added directly by the user (explicit). In other cases, they are not explicitly added by the user (implicit).
– Category of entry $e_i$, noted as $f^i_{cate}$: an entry could be assigned to a category. Each category is represented by an independent word or expression.
– Sentiment of entry $e_i$, noted as $f^i_{sent}$: an entry could carry a sentiment of the user. A sentiment may be agreement (positive), disagreement (negative), or a neutral opinion.
– Emotion of entry $e_i$, noted as $f^i_{emot}$: an entry could also carry some emotion of the user. Each emotion is represented by an independent word or expression.

As an entry is considered as a set of features and only its textual content is considered, the problem of estimating the similarity among entries could be considered as the computation of the similarity among texts or among sets of expressions.

2.1.2 Behavior

In this model, only five popular behaviors are considered: post an entry, like an entry, comment on an entry, share an entry, and join a group on the social network. We assume that a social network has a set of users $U = \{u_1, u_2, \ldots, u_m\}$. Each user $u_i \in U$ posts a set of entries $E_i$ and acts with a set of behaviors $B_i = \{b^i_1, b^i_2, \ldots, b^i_l\}$. Each behavior $b^i_l \in B_i$ may have a set of features $b^l = \{f^l_1, f^l_2, \ldots\}$:

– Post of entry, noted as $b_{post}$: the user writes an entry on the user homepage.
– Like an entry or like a comment, noted as $b_{like}$: the user clicks on the like icon of an entry or a comment.
– Comment of entry, noted as $b_{comt}$: the user writes some comments on an entry.
– Share of entry, noted as $b_{shar}$: the user shares an entry on his/her wall. The shared entry could belong to different users of social media or its social network.
– Join a group, noted as $b_{join}$: the user joins a group or community. A group usually has a name, a description, and other characteristics.

2.1.3 Group or community

As a community or a group is described by its meta-data, the similarity between two communities or groups is considered as the similarity between two multi-feature objects (Nguyen and Nguyen [12]). Each item of meta-data of a community or group is considered as a feature of the community or group. In this model, we assume that a social network has a set of users $U = \{u_1, u_2, \ldots, u_m\}$. Each user $u_i \in U$ can join a set of communities or groups $g_v \in G$ with features $g_v = \{g^v_1, g^v_2, \ldots, g^v_w\}$:

– Name of the community, noted as $g^v_{name}$: it could be a title or a short sentence. After eliminating all stop words in the title, this feature becomes a set of words to be compared to that of other communities. So, estimating the similarity on the name of the community is to estimate the similarity between two sets of words.
– Categories of the community, noted as $g^v_{camu}$: on some social networks, each community is always assigned to at least one category. Each category is an independent word (or independent expression). So, estimating the similarity on the categories of the community is also to estimate the similarity between two sets of expressions.
– Description of the community, noted as $g^v_{desc}$: on many social networks, each community is also provided with a short description. A description is normally a short text. After eliminating all stop words in the text, this feature becomes a set of words to be compared to that of other communities. So, estimating the similarity on the description of the community is also to estimate the similarity between two sets of expressions.

As the comparison between two entries is considered as a comparison between their sets of features and only their textual values are considered, the comparison between two behaviors and between two communities or groups is also made by comparing only their textual values. Therefore, the problem of estimating the similarity among users based on behaviors becomes the computation of the similarity among texts or among sets of expressions.

2.2 General model

The general model is as follows:

Input: $u_1, u_2 \in U$ with their two sets of entries $E_1, E_2 \subseteq E$ and two sets of behaviors $B_1, B_2 \subseteq B$.
Output: the estimated similarity between the two users $u_1, u_2 \in U$, called $sim(u_1, u_2)$.

Inside the model, there are four main steps:

– Step 1: modeling entries $E$ and behaviors $B$.
– Step 2: extracting the values of the implicit features of entries.
– Step 3: estimating the similarity on each entry's features and on each user's behaviors.
– Step 4: aggregating the similarity between the two sets of entries $E_1, E_2$ and between the two sets of behaviors $B_1, B_2$ of users $u_1, u_2$.

These steps are described in detail in the next sections.

2.3 Determining the feature values of entries

2.3.1 Evaluating the values of implicit features

Let's consider an example of a status on Twitter: "Thank you @apple for Find My Mac - just located and wiped my stolen Air". When we see this status, only the content is explicitly presented, which is the whole text of the status. However, we can quickly identify some other features of this status, such as category (technology), tags (Apple, Mac), sentiment (neutral: neither agree nor disagree), and emotion (gratitude, joy). The features whose value is not explicitly presented in the entry, but could be extracted from inside the entry, are called implicit features. Our objective in this step is to extract the values of the implicit features of an entry.

In this model, we apply the following method to extract the value of each of the four implicit features (called the method to classify the texts into classes) [14]:

– Step 1: Construct a set of labeled samples (texts), called the training set, in which each text is assigned to a set of labels. The union of all labels of all texts is called the set of labels $L$.
– Step 2: For each label $l_i \in L$, create two sets of text samples:
  – $T_{l_i}$ is the set of all texts which are labeled with $l_i$.
  – $T_{\neg l_i}$ is the set of all texts which are not labeled with $l_i$.
– Step 3: For each text $t_k \in T_{l_i}$ (or $t_k \in T_{\neg l_i}$), calculate the label-oriented features as follows:
  – Split $t_k$ into a set of n-grams or terms (stop words may be removed).
  – Take the union of all terms in all texts in the sets $T_{l_i}$ and $T_{\neg l_i}$.
– Step 4: Calculate the label-oriented term score of each term in the corresponding set for each label $l_i$:

$$s_{LOT}(x, l_i) = \frac{N^x_{l_i}}{N_{l_i}} \times \log\frac{N_{\neg l_i}}{N^x_{\neg l_i}} - \frac{N^x_{\neg l_i}}{N_{\neg l_i}} \times \log\frac{N_{l_i}}{N^x_{l_i}}, \tag{1}$$

  where $N_{l_i}$, $N_{\neg l_i}$ are the numbers of texts in the sets $T_{l_i}$, $T_{\neg l_i}$, respectively, and $N^x_{l_i}$, $N^x_{\neg l_i}$ are the numbers of texts in the sets $T_{l_i}$, $T_{\neg l_i}$, respectively, which contain the term $x$.
– Step 5: For a new text $t$, the label to assign to the text is chosen as follows:
  – Split $t$ into a set of n-grams or terms $X = (x_1, x_2, \ldots, x_n)$.
  – Calculate the term frequency of each term $x$ in the text $t$: $tf(x, t)$.
  – For each label $l_i \in L$, calculate the label-oriented document score:

$$s_{LOD}(t, l_i) = \sum_{x \in t} s_{LOT}(x, l_i) \times tf(x, t). \tag{2}$$

  – If $s_{LOD}(t, l_i) > 0$:
    • In the multi-label problem, where a text could be assigned to several labels, the text $t$ will be labeled with $l_i$.
    • In the single-label problem, where a text could be assigned to only one label, it is necessary to calculate the label-oriented document scores of the text $t$ for all the labels $l_i \in L$. The label whose label-oriented document score is the highest will be assigned to the text $t$.

2.3.2 Evaluating the value of the sentiment feature

To determine the sentiment of a short text, it is necessary to estimate the author's point of view expressed in the text. In this paper, we apply the method to classify the texts into classes in Sect. 2.3.1. Therefore, the value of the sentiment feature of a text is assigned to one of three values (classes): positive, negative, or neutral.

2.3.3 Evaluating the value of the emotion feature

The emotions of the user in the entries are often represented by icons or images, each of which is equivalent to a term describing that emotion. Therefore, estimating the similarity between two emotions of the entry is to estimate the similarity between the two terms. In this paper, we apply the method of classifying the texts into classes in Sect. 2.3.1. Therefore, the emotion value of a text is assigned to one of the values (classes): enjoy; happy for; love; gratitude; admiration; pride; hope; sad; sorry; regret; disappointed; disgust; angry; confused; no emotion.

2.4 Estimating similarity on each feature

In this model, we distinguish two kinds of textual values of a feature:

– First, the feature value is already in the form of a set of expressions, such as the value of the features tags, category, sentiment, and emotion. Their similarity is considered as the similarity among sets of expressions.
– Second, the feature value is in the form of a general text, such as the value of the feature content. Their similarity is considered as the similarity among texts.

2.4.1 Estimating the similarity for expression features

Since the content of the feature is in the form of a set of textual expressions, their similarity is defined as follows: suppose that $A_1 = (a^1_1, a^1_2, \ldots, a^1_m)$ and $A_2 = (a^2_1, a^2_2, \ldots, a^2_n)$ are two sets of expressions or strings, in which $m$, $n$ are the sizes of the sets $A_1$ and $A_2$, respectively. Let $v$ be the size of the intersection of $A_1$ and $A_2$. The similarity between $A_1$ and $A_2$ is defined by the formula:

$$s_{exp}(A_1, A_2) = \frac{2 \times |A_1 \cap A_2|}{|A_1| + |A_2|} = \frac{2 \times v}{m + n}. \tag{3}$$

It is clear that all possible values of $s_{exp}(A_1, A_2)$ are in the interval $[0, 1]$. This formula could be applied to the features whose value is a set of expressions.

Suppose that $e^i = (f^i_1, f^i_2, \ldots, f^i_p)$ and $e^j = (f^j_1, f^j_2, \ldots, f^j_p)$ are two entries represented by their features. Let us consider the feature $k$ whose value is a set of expressions. The similarity between entries $e^i$ and $e^j$ on the feature $k$ is defined by the formula:

$$s_k(e^i, e^j) = s_{exp}(f^i_k, f^j_k), \tag{4}$$

where $f^i_k$, $f^j_k$ are the expression values of the feature $k$ of the two entries $e^i$ and $e^j$.

2.4.2 Estimating similarity for text features

The problem of estimating the similarity among textual values becomes the estimation of the similarity among texts. We can apply the TF–IDF (term frequency–inverse document frequency) technique [17] to characterize the texts as follows:

– Split the texts into sets of n-grams $t^1 = (g^1_1, g^1_2, \ldots, g^1_n)$ and $t^2 = (g^2_1, g^2_2, \ldots, g^2_m)$.
– Calculate the TF–IDF of each n-gram in the text. Then, represent the feature value by a vector in which each element is a pair <n-gram, tf-idf>: $v^1 = (\langle g^1_1, v^1_1 \rangle, \langle g^1_2, v^1_2 \rangle, \ldots, \langle g^1_n, v^1_n \rangle)$ and $v^2 = (\langle g^2_1, v^2_1 \rangle, \langle g^2_2, v^2_2 \rangle, \ldots, \langle g^2_m, v^2_m \rangle)$.
– Calculate the distance between the two vectors:

$$D(v^1, v^2) = \frac{1}{N} \sum_{k=1}^{N} d_k, \tag{5}$$

  where $N$ is the number of different n-grams considered in $t^1 \cup t^2$ and $d_k$ is the distance on each element $\langle g^1_i, v^1_i \rangle$ of $v^1$ (or element $\langle g^2_j, v^2_j \rangle$ of $v^2$, respectively):
  – If there is an element $\langle g^2_l, v^2_l \rangle$ of $v^2$ (or element $\langle g^1_l, v^1_l \rangle$ of $v^1$, respectively) such that $g^1_i = g^2_l$, then:

$$d_k = \frac{|v^1_i - v^2_l|}{\max(v^1_i, v^2_l)}. \tag{6}$$

  – Otherwise, $d_k = 1$.
– It is clear that the value of $D(v^1, v^2)$ is in the interval $[0, 1]$. The similarity between the two features is then:

$$s_{txt}(t^1, t^2) = 1 - D(v^1, v^2). \tag{7}$$

2.4.3 Estimating similarity between two entries

We consider an entry via five features: content, tags, category, sentiment, and emotion. Four of them are expression features: tags, category, sentiment, and emotion. Their similarity is estimated as the similarity on expression features as follows:

$$s_{cate}(e^i, e^j) = s_{exp}(f^i_{cate}, f^j_{cate}), \tag{8}$$
$$s_{tags}(e^i, e^j) = s_{exp}(f^i_{tags}, f^j_{tags}), \tag{9}$$
$$s_{sent}(e^i, e^j) = s_{exp}(f^i_{sent}, f^j_{sent}), \tag{10}$$
$$s_{emot}(e^i, e^j) = s_{exp}(f^i_{emot}, f^j_{emot}). \tag{11}$$

The one text feature of an entry is the content, so it is estimated as the text feature similarity, calculated as follows:

$$s_{cont}(e^i, e^j) = s_{txt}(f^i_{cont}, f^j_{cont}). \tag{12}$$

Let $e^i$ and $e^j$ be two considered entries whose content, tags, categories, sentiment, and emotion are the features $e^i_{cont}$; $e^j_{cont}$; $e^i_{tags}$; $e^j_{tags}$; $e^i_{cate}$; $e^j_{cate}$; $e^i_{sent}$; $e^j_{sent}$; $e^i_{emot}$; $e^j_{emot}$. Based on the approach of multi-attribute similarity of two objects [12], the similarity between the two entries $e^i$ and $e^j$ is estimated as follows:

$$s_{entry}(e^i, e^j) = f_{ent}\big(s_{cont}(e^i, e^j), s_{tags}(e^i, e^j), s_{cate}(e^i, e^j), s_{sent}(e^i, e^j), s_{emot}(e^i, e^j)\big), \tag{13}$$

where $f_{ent} : [0, 1]^5 \to [0, 1]$ is a similarity function between two entries, which satisfies the following conditions:

$$\begin{aligned}
(i)\;& f_{ent}(v_1, w, x, y, z) \le f_{ent}(v_2, w, x, y, z) \text{ if } v_1 \le v_2;\\
(ii)\;& f_{ent}(v, w_1, x, y, z) \le f_{ent}(v, w_2, x, y, z) \text{ if } w_1 \le w_2;\\
(iii)\;& f_{ent}(v, w, x_1, y, z) \le f_{ent}(v, w, x_2, y, z) \text{ if } x_1 \le x_2;\\
(iv)\;& f_{ent}(v, w, x, y_1, z) \le f_{ent}(v, w, x, y_2, z) \text{ if } y_1 \le y_2;\\
(v)\;& f_{ent}(v, w, x, y, z_1) \le f_{ent}(v, w, x, y, z_2) \text{ if } z_1 \le z_2.
\end{aligned} \tag{14}$$

2.4.4 Estimating the similarity between two groups

Once the similarity between two groups on each feature is estimated, the similarity between two groups is then estimated by a weighted average aggregation of the similarity between them on all considered features as follows:

– Let $w_1, w_2, w_3$ be the weights of the features Name, Description, and Category, respectively. They have to satisfy the condition $w_1 + w_2 + w_3 = 1$.
– The similarity between group $g_i$ and group $g_j$ is:

$$s_{group}(g_i, g_j) = w_1 \times s_{exp}(g^i_{name}, g^j_{name}) + w_2 \times s_{exp}(g^i_{desc}, g^j_{desc}) + w_3 \times s_{exp}(g^i_{camu}, g^j_{camu}), \tag{15}$$

  where $w_1, w_2, w_3$ are, respectively, the weights of the features Name, Description, and Category, and $s_{exp}(A, B)$ is the similarity between the two sets of expressions $A$ and $B$.

In the case of two sets of communities, let $G_1 = \{g^1_1, g^1_2, \ldots, g^1_m\}$ and $G_2 = \{g^2_1, g^2_2, \ldots, g^2_n\}$ be the two considered sets of communities. We create a common set of these two sets $G_{12} = G_1 + G_2 = \{g_1, g_2, \ldots, g_{m+n}\}$ and then construct their non-ordered semantic vector $T = (t_1, t_2, \ldots, t_{m+n})$ as:

$$t_i = \min\Big(\max_{k}\, s_{group}(g_i, g^1_k),\; \max_{v}\, s_{group}(g_i, g^2_v)\Big), \quad k = 1 \ldots m;\; v = 1 \ldots n, \tag{16}$$

where $s_{group}(x, y)$ is the similarity between two groups $x$ and $y$. The similarity between two non-ordered sets of communities $G_1$ and $G_2$ is defined by the formula:

$$s_{css}(G_1, G_2) = f_{set}(T) = f_{set}(t_1, t_2, \ldots, t_{m+n}), \tag{17}$$

where $f_{set} : [0, 1]^{m+n} \to [0, 1]$ is a similarity function between two sets.

2.5 Estimating each behavior of the user

In this paper, we consider five behaviors of the user on social networks: post an entry, like, comment, share an entry, and join a group or a community.

2.5.1 The similarity between post or share behavior

In the case of posting or sharing an entry, the similarity is estimated as the similarity of two sets of posted or shared entries as follows:

Let $E_1 = \{e^1_1, e^1_2, \ldots, e^1_p\}$ and $E_2 = \{e^2_1, e^2_2, \ldots, e^2_q\}$ be two considered sets of posted or shared entries. We create a common set of these two sets $E_{12} = E_1 + E_2 = \{e_1, e_2, \ldots, e_{p+q}\}$ and then construct their non-ordered semantic vector $T = (t_1, t_2, \ldots, t_{p+q})$ as:

$$t_i = \min\Big(\max_{k}\, s_{entry}(e_i, e^1_k),\; \max_{v}\, s_{entry}(e_i, e^2_v)\Big), \quad k = 1 \ldots p;\; v = 1 \ldots q, \tag{18}$$

where $s_{entry}(x, y)$ is the similarity between the two entries $x$ and $y$. To measure the similarity between two sets of entries $E_1$ and $E_2$, we make use of the following assumption: the bigger the magnitude of the vector $T$, the higher the similarity between $E_1$ and $E_2$. The similarity between two non-ordered sets of entries $E_1$ and $E_2$ is defined by the formula:

$$s_{ess}(E_1, E_2) = f_{set}(T) = f_{set}(t_1, t_2, \ldots, t_{p+q}), \tag{19}$$

where $f_{set} : [0, 1]^{p+q} \to [0, 1]$ is a similarity function between two sets, which satisfies the following conditions:

$$\begin{aligned}
(i)\;& f_{set}(0, 0, \ldots, 0) = 0;\\
(ii)\;& f_{set}(1, 1, \ldots, 1) = 1;\\
(iii)\;& f_{set}(X_1) \le f_{set}(X_2) \text{ if } X_1 \le X_2.
\end{aligned} \tag{20}$$

For example, the following functions are similarity functions between two sets of entries:

(1) $f(x_1, x_2, \ldots, x_n) =$ [expression lost in extraction],
(2) $f(x_1, x_2, \ldots, x_n) =$ [expression lost in extraction]. (21)
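The set-level similarity of Eqs. (3) and (16)–(19) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the arithmetic mean is used as $f_{set}$ because it satisfies conditions (20) (it is not claimed to be one of the paper's own examples), and returning 0 for empty inputs is an added convention that the text does not define.

```python
def s_exp(a, b):
    """Eq. (3): 2 * |A ∩ B| / (|A| + |B|) for two sets of expressions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: the empty case is not defined in the text
    return 2 * len(a & b) / (len(a) + len(b))


def semantic_vector(set1, set2, sim):
    """Eqs. (16)/(18): for every item of the combined set, take the smaller
    of its best match in set1 and its best match in set2."""
    combined = list(set1) + list(set2)
    return [
        min(max(sim(x, y) for y in set1), max(sim(x, y) for y in set2))
        for x in combined
    ]


def f_set_mean(t):
    """One function satisfying conditions (20): the arithmetic mean
    (an assumption, chosen for illustration)."""
    return sum(t) / len(t)


def s_set(set1, set2, sim, f_set=f_set_mean):
    """Eqs. (17)/(19): similarity between two non-ordered sets."""
    if not set1 or not set2:
        return 0.0  # assumption: nothing to compare
    return f_set(semantic_vector(set1, set2, sim))


# Usage: entries reduced to their tag sets, compared with s_exp.
E1 = [frozenset({"apple", "mac"}), frozenset({"football"})]
E2 = [frozenset({"apple", "mac"}), frozenset({"music"})]
similarity = s_set(E1, E2, s_exp)  # semantic vector is [1.0, 0.0, 1.0, 0.0]
```

Because each item of the combined set matches itself perfectly inside its own set, the `min` in `semantic_vector` effectively keeps the item's best match in the other set; the same helper serves both groups (with `sim` set to a group similarity) and entries (with `sim` set to an entry similarity).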
In the case of similarity between two sets of posted or shared entries, let $E^{u_i}_{post}$ and $E^{u_j}_{post}$ be, respectively, the two sets of posted or shared entries of user $u_i$ and user $u_j$. The posting-based (or sharing-based) behavior similarity of user $u_i$ and user $u_j$ is defined by the formula:

$$s_{post}(u_i, u_j) = s_{ess}(E^{u_i}_{post}, E^{u_j}_{post}), \tag{22}$$

where $s_{ess}(A, B)$ is the similarity between two sets of entries $A$ and $B$.

2.5.2 The similarity on behavior of joining a group

Let $G^{u_i}_{join}$ and $G^{u_j}_{join}$ be, respectively, the two sets of communities or groups which were joined by user $u_i$ and user $u_j$. The group-joining behavior similarity of user $u_i$ and user $u_j$ is defined by the formula:

$$s_{join}(u_i, u_j) = s_{css}(G^{u_i}_{join}, G^{u_j}_{join}), \tag{23}$$

where $s_{css}(A, B)$ is the similarity between two sets of communities $A$ and $B$.

2.5.3 The similarity on behavior of liking an entry

Let $L^{u_i}_{like}$ and $L^{u_j}_{like}$ be, respectively, the two sets of entries that were liked by user $u_i$ and user $u_j$. To measure the like-based behavior similarity of user $u_i$ and user $u_j$, the following is defined: the more similar the two sets $L^{u_i}_{like}$ and $L^{u_j}_{like}$ are, the higher the like-based behavior similarity of user $u_i$ and user $u_j$ is. The like-based behavior similarity of user $u_i$ and user $u_j$ is defined by the formula:

$$s_{like}(u_i, u_j) = s_{ess}(L^{u_i}_{like}, L^{u_j}_{like}), \tag{24}$$

where $s_{ess}(A, B)$ is the similarity between the two sets of entries $A$ and $B$.

2.5.4 The similarity between two comment likes in comment-based behaviors

Although this behavior is usually a confirmation of what the user already liked or disliked, sometimes a user could comment or like a comment without liking or disliking the entry. In these cases, we take it into account to measure the similarity among users on the following principles:

– The value of each comment is detected as positive, negative, or neutral. This determination could be done by applying the method of classifying the texts into classes in Sect. 2.3.1.
– In each entry, only the positive or negative comments are counted. Neutral comments are removed.
– In each entry, if the number of positive comments of a user is greater than that of the negative comments, then the entry is considered as positive for the user. Vice versa, if the number of positive comments of a user is smaller than that of negative comments, then the entry is considered as negative for the user.
– In the case where the numbers of positive comments and negative comments are equal, we consider the comments liked by the user:
  – If the number of positive comments liked by a user is greater than that of negative comments, then the entry is considered as positive for the user.
  – If the number of positive comments liked by a user is smaller than that of negative comments, then the entry is considered as negative for the user.
  – If the numbers of positive comments and negative comments liked by a user are equal, then the entry is considered as neutral for the user and it is removed from the considered set of entries for the user.

Let $C^{u_i}_p$ and $C^{u_i}_n$ be, respectively, the sets of positive and negative entries for user $u_i$. $C^{u_j}_p$ and $C^{u_j}_n$ are, respectively, the sets of positive and negative entries for user $u_j$. To measure the comment/like comment-based behavior similarity of user $u_i$ and user $u_j$, the following is defined:

– The more similar the two sets $C^{u_i}_p$ and $C^{u_j}_p$ are, the higher the comment/like comment-based behavior similarity of user $u_i$ and user $u_j$ is.
– The more similar the two sets $C^{u_i}_n$ and $C^{u_j}_n$ are, the higher the comment/like comment-based behavior similarity of user $u_i$ and user $u_j$ is.
– The less similar the two sets $C^{u_i}_p$ and $C^{u_j}_n$ are, the higher the comment/like comment-based behavior similarity of user $u_i$ and user $u_j$ is.
– The less similar the two sets $C^{u_i}_n$ and $C^{u_j}_p$ are, the higher the comment/like comment-based behavior similarity of user $u_i$ and user $u_j$ is.

The comment/like comment-based behavior similarity of user $u_i$ and user $u_j$ is defined by the formula:

$$s_{comt}(u_i, u_j) = \min\Big(1, \max\big(0,\; s_{ess}(C^{u_i}_p, C^{u_j}_p) + s_{ess}(C^{u_i}_n, C^{u_j}_n) - s_{ess}(C^{u_i}_n, C^{u_j}_p) - s_{ess}(C^{u_i}_p, C^{u_j}_n)\big)\Big), \tag{25}$$

where $s_{ess}(A, B)$ is the similarity between two sets of entries $A$ and $B$.

2.5.5 Estimating the similarity between two users

Once the similarities between two users on each kind of behavior are estimated, the similarity between the two users is estimated by a weighted average aggregation of the similarities on all considered kinds of behavior, as follows:

– Let $w_1, w_2, w_3, w_4$ be the weights of the similarity based on posting/sharing, joining a group, liking entries, and comment/like comment, respectively. They have to satisfy the condition $w_1 + w_2 + w_3 + w_4 = 1$.
– The similarity between user $u_i$ and user $u_j$ is:

$$s(u_i, u_j) = w_1 \times s_{post}(u_i, u_j) + w_2 \times s_{join}(u_i, u_j) + w_3 \times s_{like}(u_i, u_j) + w_4 \times s_{comt}(u_i, u_j), \tag{26}$$

  where $w_1, w_2, w_3, w_4$ are, respectively, the weights of the similarity based on posting/sharing, joining a group, liking entries, and comment/like comment. $s_{post}(u_i, u_j)$, $s_{join}(u_i, u_j)$, $s_{like}(u_i, u_j)$, and $s_{comt}(u_i, u_j)$ are the similarities between the two users $u_i$ and $u_j$ based on posting entries, joining a group, liking entries, and commenting/liking comment behaviors, respectively.

3 Experiments and evaluation

3.1 Method

3.1.1 Collection of data

To evaluate the proposed model, we collected data from Twitter.com (Table 1), so that we could directly apply the model to estimate the similarity among Twitter users. Each tweet is considered in five features: content, tags, category, sentiment, and emotion, as in the model. The considered activities of a Twitter user are: post/share, like, comment, and the list of groups of the user. Note that in Twitter, there are no explicit activities such as like and join a group as in the model. Therefore, we have to map some similar activities in Twitter to these two activities as follows:

– Like: the like activity of a user in Twitter is considered as the favorite tweets list of the user.
– Join a group: in the case of Twitter, a group could be considered as a list that some users subscribed to.

Table 1 Collected data from Twitter.com

| Criteria | Twitter |
| Collected data | User: 1000; Posts: 150000; Activities: 150000 |
| Criteria of entry | Content; Tags; Category; Sentiment; Emotions |
| Behavior of user | Post; Like; Comment; Share; Join a group |

3.1.2 Construction of sample set

Each sample is constructed as follows:

– Each sample contains three users collected from Twitter.com. These users are called user A, B, and C, respectively.
– We ask a number of selected volunteers to answer the question: Which user, B or C, is more similar to user A than the other?
– Then, we compare the number of people who choose B with the number of people who choose C. If the number of answers B is greater than that of C, then the value of this sample is 1. It means that user B is more similar to user A than C. On the contrary, if the number of answers C is greater than that of B, then the value of this sample is 2. It means that user C is more similar to user A than B. If the numbers of answers B and C are not significantly different, this sample is removed from the sample set.

After this step, we have a set of samples, which we save for the experiments. In the experiments, we use the sample set with the size described in Table 2.

Table 2 Sample constructed from Twitter

| Source | Number of samples |
| Twitter | 500 |

3.1.3 Scenario

The experiment is performed as follows:

– For each sample, we use the model proposed in this paper to estimate the similarity between user B and user A, and that between user C and user A.
– If B is more similar to A than C is, then the result of this sample is 1. On the contrary, if C is more similar to A than B is, then the result of this sample is 2.
– We then compare the result and the value of each sample. If they are identical, we increase the number of correct samples by 1.

3.1.4 Output parameters

The correct ratio (CR) of the model over the given sample set is calculated as follows:

$$CR = \frac{\text{number of correct samples}}{\text{total number of samples}} \times 100\%. \tag{27}$$

The closer the CR value is to 100%, the more correct the model is. We expect the obtained value of CR to be as high as possible.

3.2 Results

The results are presented in Table 3. In total, the correct ratio of the model over all samples is 438/500 (87.60%).

Table 3 Correct ratio CR of the sample set

| Sample set | Number of correct samples | Correct ratio CR (%) |
| Twitter | 438 | 87.60 |

For more details, we run experiments with several combinations of weights for the criteria of an entry and weights for the behaviors of a user, with the following detailed scenario:

– At the level of entry, we run the experiment with only 1/5, 2/5, 3/5, 4/5, and 5/5 criteria to detect the similarity among entries: 1/5 and 4/5 criteria have five possible combinations; 2/5 and 3/5 criteria have ten possible combinations; 5/5 criteria have only one combination.
– For each combination, we run the experiment with different weights for each selected criterion. The changing step for each weight is 0.05. Therefore, each criterion weight runs from 0.05 to 1.00 as long as the sum of all criteria weights in the experiment is equal to 1.
– The same principle is applied at the level of behavior: we run the experiment with 1/4, 2/4, 3/4, and 4/4 behaviors. Each combination is also applied in the same manner as at the previous level.

The results are presented in Table 4 (for entry) and Table 5 (for behavior).

Table 4 Best weight of the entry criteria for the sample set. The weight columns of the original table are Cont., Tags, Cate., Sent., and Emot. For rows that use fewer than five criteria, the column of each weight is not recoverable here, so the weights are listed in their original left-to-right order. In the original table, bold values indicate the best case of the experimental results.

| Criteria used | Weights | Accuracy (%) |
| 1/5 criteria | 1.00 | 34.00 |
| 1/5 criteria | 1.00 | 47.20 |
| 1/5 criteria | 1.00 | 55.00 |
| 1/5 criteria | 1.00 | 57.00 |
| 1/5 criteria | 1.00 | 67.60 |
| 2/5 criteria | 0.80, 0.20 | 45.40 |
| 2/5 criteria | 0.35, 0.65 | 52.20 |
| 2/5 criteria | 0.85, 0.15 | 57.00 |
| 2/5 criteria | 0.90, 0.10 | 58.80 |
| 2/5 criteria | 0.80, 0.20 | 65.00 |
| 2/5 criteria | 0.60, 0.40 | 69.20 |
| 2/5 criteria | 0.70, 0.30 | 70.00 |
| 2/5 criteria | 0.85, 0.15 | 74.00 |
| 2/5 criteria | 0.85, 0.15 | 75.00 |
| 2/5 criteria | 0.80, 0.20 | 76.00 |
| 3/5 criteria | 0.60, 0.20, 0.20 | 64.00 |
| 3/5 criteria | 0.35, 0.35, 0.30 | 65.80 |
| 3/5 criteria | 0.40, 0.30, 0.30 | 69.00 |
| 3/5 criteria | 0.70, 0.15, 0.15 | 70.00 |
| 3/5 criteria | 0.45, 0.10, 0.45 | 71.40 |
| 3/5 criteria | 0.20, 0.70, 0.10 | 73.00 |
| 3/5 criteria | 0.20, 0.50, 0.30 | 75.20 |
| 3/5 criteria | 0.65, 0.15, 0.20 | 79.20 |
| 3/5 criteria | 0.65, 0.20, 0.15 | 82.00 |
| 3/5 criteria | 0.40, 0.30, 0.30 | 83.00 |
| 4/5 criteria | 0.45, 0.25, 0.15, 0.15 | 77.80 |
| 4/5 criteria | 0.15, 0.60, 0.10, 0.15 | 78.60 |
| 4/5 criteria | 0.20, 0.55, 0.15, 0.10 | 83.20 |
| 4/5 criteria | 0.35, 0.25, 0.10, 0.30 | 85.20 |
| 4/5 criteria | 0.35, 0.25, 0.10, 0.30 | 86.20 |
| 5/5 criteria | Cont. 0.30, Tags 0.25, Cate. 0.10, Sent. 0.15, Emot. 0.20 | 87.60 |

At the level of entry, the best weight combination […] of entry, so they are most important in the results. Meanwhile, the three remaining criteria, emotion, sentiment, and category, are implicit values of an entry. Their values possibly depend on the classifying method and, therefore, their importance may be reduced in the final results. The criterion category is less important, possibly because it has only a small number of possible values. A single value of this criterion could represent a large number of different entries. Therefore, this criterion does not discriminate between entries as well as the other criteria.

At the level of behavior, the best weight combination is: 0.35 for post, 0.30 for comment, 0.25 for like, and 0.10 for join a group.
As mentioned in the scenario, in Twitter, there tion is that: 0.30 of content,0.25 for tags,0.20 for emotion, is no real data about like and join a group;wehavetomap 0.15 for sentiment, and 0.10 for category. These results are the favorite list in Twitter to the activity like, and map the reasonable: in Twitter, the content and tags are explicit value subscribed to list in Twitter to the activity join a group of 123 174 Vietnam Journal of Computer Science (2018) 5:165–175 Table 5 Best weight of behavior for the sample set References Post Like Comm. Join Accuracy (%) 1. Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J.: Effects of user similarity in social media. In: Proceedings of the Fifth ACM 1/4 behavior 1.00 38.00 International Conference on Web Search and Data Mining, WSDM 1.00 39.80 ’12, pp. 703–712. ACM, New York, NY, USA (2012) 1.00 54.00 2. Benevenuto, F., Rodrigues, T., Cha, M., Almeida, V.: Characteriz- ing user behavior in online social networks. In: Proceedings of the 1.00 65.00 9th ACM SIGCOMM Conference on Internet Measurement, IMC 2/4 behavior 0.60 0.40 57.00 ’09, pp. 49–62. ACM, New York, NY, USA (2009) 0.45 0.55 59.00 3. Bhattacharyya, P., Garg, A., Wu, S.F.: Analysis of user keyword 0.60 0.40 64.20 similarity in online social networks. Soc. Netw. Anal. Min. 1(3), 143–158 (2011) 0.65 0.35 74.80 4. Chen, X., Pang, J., Xue, R.: Constructing and comparing user 0.70 0.30 81.20 mobility profiles for location-based services. In: Proceedings of 0.75 0.25 83.20 the 28th Annual ACM Symposium on Applied Computing, SAC ’13, pp. 261–266. ACM, New York, NY, USA (2013) 3/4 behavior 0.30 0.45 0.25 82.20 5. Erlandsson, F., Bródka, P., Borg, A., Johnson, H.: Finding influ- 0.40 0.30 0.30 84.20 ential users in social media using association rule learning. CoRR 0.40 0.30 0.30 84.80 arXiv:1604.08075 (2016) 6. Guo, C., Tian, X., Mei, T.: User specific friend recommendation in 0.40 0.25 0.35 85.20 social media community. 
In: 2014 IEEE International Conference 4/4 behavior 0.35 0.25 0.30 0.10 87.60 on Multimedia and Expo (ICME), pp. 1–6 (2014) 7. Jamali, M., Ester, M.: Modeling and comparing the influence of Bold values indicate the best case of experiment result neighbors on the behavior of users in social and similarity net- works. In: 2010 IEEE International Conference on Data Mining Workshops, pp. 336–343 (2010) 8. Liu, H., Hu, Z., Mian, A., Tian, H., Zhu, X.: A new user similarity model to improve the accuracy of collaborative filtering. Knowl. the model. This may be the main reason why in this experi- Based Syst. 56, 156–166 (2014) ment these two activities are less important than the two real 9. Liu, H., Schneider, M.: Similarity measurement of moving object activities of post and comment. trajectories. In: Proceedings of the Third ACM SIGSPATIAL Inter- national Workshop on GeoStreaming, IWGS ’12, pp. 19–22. ACM, New York, NY, USA (2012) 10. Liu, K., Terzi, E.: A framework for computing the privacy scores of users in online social networks. ACM Trans. Knowl. Discov. Data 5(1), 6:1–6:30 (2010) 11. Nguyen, D.A., Tan, S., Ramanathan, R., Yan, X.: Analyzing infor- 4 Conclusions mation sharing strategies of users in online social networks. In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 247–254 (2016) In this paper, we present a model for estimating the similarity 12. Nguyen, M.H., Nguyen, T.H.: A general model for similarity mea- between users based on their entries and behavior on social surement between objects. Int. J. Adv. Comput. Sci. Appl. 6(2), network. The considered behaviors are based on the activity 235–239 (2015) of posting an entry or sharing an existing entry, liking an entry 13. Nguyen, T.H., Tran, D.Q., Dam, G.M., Nguyen, M.H.: Multi- feature based similarity among entries on media portals. 
In: or liking a comment on an entry, commenting on an entry, Akagi, M., Nguyen, T.T., Vu, D.T., Phung, T.N., Huynh, V.N. and joining a group or community. The model is applied to (eds.) Advances in Information and Communication Technol- estimate the similarity among users of Twitter. The results ogy. Proceedings of the International Conference on Advances in Information and Communication Technology (ICTA 2016), pp. show that the model could estimate correctly the similarity 373–382. Springer, Thai Nguyen, Viet Nam (2016) among users in the majority of cases. 14. Nguyen, T.H., Tran, D.Q., Dam, G.M., Nguyen, M.H.: Integrated This model could be applied to several applications such sentiment and emotion into estimating the similarity among entries as to predict the behavior of a social network user in com- on social network. In: Chen, Y., Duong, T.Q. (eds.) Industrial Networks and Intelligent Systems, pp. 242–253. Springer, Cham menting or liking some kind of status; to recommend some (2018) new entries which could be appropriate to a given user; to 15. Peled, O., Fire, M., Rokach, L., Elovici, Y.: Entity Matching in cluster the user based on some criteria. Online Social Networks. Social Computing/IEEE International Conference on Privacy, Security, Risk and Trust, 2010 IEEE Inter- national Conference on 0, pp. 339–344 (2013) Open Access This article is distributed under the terms of the Creative 16. Raad, E., Chbeir, R., Dipanda, A.: User profile matching in social Commons Attribution 4.0 International License (http://creativecomm networks. In: Proceedings of the 2010 13th International Con- ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, ference on Network-Based Information Systems, NBIS ’10, pp. and reproduction in any medium, provided you give appropriate credit 297–304. 
IEEE Computer Society, Washington, DC, USA (2010) to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 123 Vietnam Journal of Computer Science (2018) 5:165–175 175 17. Salton, G., McGill, M.J.: Introduction to Modern Information 24. Zhao, G., Qian, X., Feng, H.: Personalized Recommendation by Retrieval. McGraw-Hill Inc, New York (1986) Exploring Social Users’ Behaviors, pp. 181–191. Springer, Cham 18. Singh, K., Shakya, H.K., Biswas, B.: Clustering of people in social (2014) network based on textual similarity. Recent Trends in engineering 25. Zhao, Z., Cheng, Z., Hong, L., Chi, E.H.: Improving user topic and material sciences. Perspect. Sci. 8(Supplement C), 570–573 interest profiles by behavior factorization. In: Proceedings of the (2016) 24th International Conference on World Wide Web, WWW ’15, pp. 19. Sun, S., Li, Q., Yan, P., Zeng, D.D.: Mapping users across social 1406–1416. International World Wide Web Conferences Steering media platforms by integrating text and structure information. In: Committee, Republic and Canton of Geneva, Switzerland (2015) 2017 IEEE International Conference on Intelligence and Security 26. Zou, Z., Xie, X., Sha, C.: Mining user behavior and similarity Informatics (ISI), pp. 113–118 (2017) in location-based social networks. In: 2015 Seventh International 20. Tang, X., Miao, Q., Quan, Y., Tang, J., Deng, K.: Predicting individ- Symposium on Parallel Architectures, Algorithms and Program- ual retweet behavior by user similarity. Know. Based Syst. 89(C), ming (PAAP), pp. 167–171 (2015) 681–688 (2015) 21. Vedula, N., Parthasarathy, S., Shalin, V.L.: Predicting trust relations within a social network: A case study on emergency response. In: Publisher’s Note Springer Nature remains neutral with regard to juris- Proceedings of the 2017 ACM on Web Science Conference, Web- dictional claims in published maps and institutional affiliations. Sci ’17, pp. 53–62. 
ACM, New York, NY, USA (2017) 22. Xu, K., Li, J., Liao, S.S.: Sentiment community detection in social networks. In: Proceedings of the 2011 iConference, iConference ’11, pp. 804–805. ACM, New York, NY, USA (2011) 23. Xu, Z., Zhang, Y., Wu, Y., Yang, Q.: Modeling user posting behav- ior on social media. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 545–554. ACM, New York, NY, USA (2012)
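The evaluation procedure of Sect. 3.1 (the weighted aggregation of Eq. 26, the 1/2 decision of the scenario, and the correct ratio of Eq. 27) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary keys, and the toy similarity values are hypothetical, the per-behavior similarities are assumed to be already computed, and the tie rule (a tie counts as result 2) is an assumption, since the paper does not specify one. Only the weights are taken from the best case reported for the behavior level.

```python
from typing import Dict, List, Tuple

# Best behavior weights reported in the paper (post, join, like, comment);
# they satisfy w1 + w2 + w3 + w4 = 1 as required by Eq. (26).
WEIGHTS: Dict[str, float] = {"post": 0.35, "join": 0.10, "like": 0.25, "comt": 0.30}


def user_similarity(per_behavior: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Eq. (26): weighted average of the per-behavior similarities."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[k] * per_behavior[k] for k in weights)


def predict(sim_ab: float, sim_ac: float) -> int:
    """Scenario step: 1 if B is more similar to A than C is, otherwise 2.

    Assumption: a tie is reported as 2; the paper does not define this case.
    """
    return 1 if sim_ab > sim_ac else 2


# One sample: (similarities A-B, similarities A-C, volunteers' label 1 or 2).
Sample = Tuple[Dict[str, float], Dict[str, float], int]


def correct_ratio(samples: List[Sample]) -> float:
    """Eq. (27): percentage of samples whose result equals the sample value."""
    correct = sum(
        1
        for sims_ab, sims_ac, label in samples
        if predict(user_similarity(sims_ab), user_similarity(sims_ac)) == label
    )
    return 100.0 * correct / len(samples)


if __name__ == "__main__":
    # Hypothetical toy values, not data from the paper.
    toy_samples: List[Sample] = [
        ({"post": 0.8, "join": 0.2, "like": 0.6, "comt": 0.7},
         {"post": 0.3, "join": 0.9, "like": 0.4, "comt": 0.2}, 1),
        ({"post": 0.1, "join": 0.5, "like": 0.2, "comt": 0.3},
         {"post": 0.6, "join": 0.1, "like": 0.5, "comt": 0.4}, 1),
    ]
    print(f"CR = {correct_ratio(toy_samples):.2f}%")
```

Passing a different `weights` dictionary to `user_similarity` is enough to reproduce the kind of weight sweep described in Sect. 3.2, provided the weights still sum to 1.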
Vietnam Journal of Computer Science – Springer Journals
Published: May 19, 2018