Poster: Telephony Network Characterization for Spammer Identification

Hossein Kaffash Bokharaei, Yashar Ganjali
Department of Computer Science, University of Toronto
{hossein, yganjali}@cs.toronto.edu

Ram Keralapura, Antonio Nucci
Narus Inc., California, USA
{rkeralapura, anucci}@narus.com

Voice-over-IP (VoIP) has moved beyond being a mere technological object and has become an integral part of many people's cyber lives. Unfortunately, the very same openness and ubiquity that make IP networks such powerful infrastructures also make them a liability. Risks include Denial of Service (DoS), service theft, spam, call routing manipulation, identity theft, and impersonation, among others.

In this work, we study individual and network-wide characteristics of telephony communications in a large phone network. Our objective is to analyze phone call patterns and find statistical properties that are inherent to such a service. We study several metrics in the network graph defined by phone calls, including node degrees, call duration, neighborhood connectivity, call repetition, call reciprocity, and call density. We aim to identify metrics that help capture the inner structure of a telephony service and its usage, as well as metrics that act as indicators of abnormal service usage. This can be helpful, for instance, in identifying Spam in Internet Telephony (SPIT).

For this study, we collected data from one of the largest phone providers in North America over a period of two months (Oct-Nov 2009). Our dataset contains the call records of more than 14 million phone subscribers and 450 million calls to/from these subscribers. Each call record includes the call time, duration, and anonymized caller and callee IDs. We do not have access to the content of calls.

In order to capture the most salient properties of the behavior of phone users, i.e., the nodes in the call pattern graph, we focus on dynamics such as node degree and neighborhood connectivity.
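As an illustration of the per-user and per-edge metrics listed above, the following minimal Python sketch derives node degrees, call repetition, and call reciprocity from a list of call records. It is not the authors' implementation. The record layout `(caller, callee, duration_seconds)`, the function name `call_graph_metrics`, and the working definitions are assumptions made for this example: a call counts as "repetitive" if its caller-to-callee pair carries more than one call, and as "reciprocal" if at least one call also exists in the opposite direction.

```python
from collections import Counter, defaultdict


def call_graph_metrics(records):
    """Compute basic call-graph metrics.

    records: iterable of (caller, callee, duration_seconds) tuples.
    This layout is an assumption; the poster only states that each record
    holds call time, duration, and anonymized caller and callee IDs.
    """
    # Number of calls carried by each directed edge caller -> callee.
    edge_calls = Counter()
    for caller, callee, _duration in records:
        edge_calls[(caller, callee)] += 1

    # Distinct neighbors in each direction.
    out_neighbors = defaultdict(set)
    in_neighbors = defaultdict(set)
    for caller, callee in edge_calls:
        out_neighbors[caller].add(callee)
        in_neighbors[callee].add(caller)

    nodes = set(out_neighbors) | set(in_neighbors)
    out_degree = {n: len(out_neighbors.get(n, ())) for n in nodes}
    in_degree = {n: len(in_neighbors.get(n, ())) for n in nodes}

    total_calls = sum(edge_calls.values())
    # Assumed definition: calls on an edge that is used more than once.
    repeated_calls = sum(c for c in edge_calls.values() if c > 1)
    # Assumed definition: calls on an edge whose reverse edge also exists.
    reciprocal_calls = sum(
        c for (a, b), c in edge_calls.items() if (b, a) in edge_calls
    )

    return {
        "in_degree": in_degree,
        "out_degree": out_degree,
        "repetitive_fraction": repeated_calls / total_calls if total_calls else 0.0,
        "reciprocal_fraction": reciprocal_calls / total_calls if total_calls else 0.0,
    }
```

Counting distinct neighbors rather than raw calls keeps degree separate from call volume, so a user who repeatedly calls one contact is not mistaken for a high out-degree node.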
We find that almost 80% of nodes in the graph have an in-degree larger than their out-degree, while a very small percentage of users (less than 5%) have an extraordinarily large out-degree, indicative of telemarketers, medium-to-large organizations, and/or SPITters. In terms of neighborhood connectivity, we find that the majority of users exhibit a very small clustering coefficient, implying that, unlike in online social networks, the people a user talks to tend not to talk directly to each other. Interestingly, there is diversity in average talk duration across the neighbors of a user, and users with small out-degree have predictable call durations when compared to users with large out-degree. Focusing on call duration, we find that a user who places fewer than 300 outgoing calls tends to talk for about 300 seconds, whereas users with more than a thousand outgoing calls talk for about 100 seconds.

Repetitive and reciprocal calls represent strong social connections between users. In our dataset, we find that 80% of the calls are repetitive and 50% are reciprocal. In addition, 50% of reciprocal calls account for 15% of all the edges in the network, indicating strong call activity among a small number of users. In fact, the talk duration among these users is ten times more than among other users.

When it comes to identifying SPITters, one cannot simply rely on the basic statistical properties of the call pattern graph in isolation. In other words, even though each individual metric (such as in-degree, out-degree, call duration, call frequency, reciprocity, etc.) can point to a set of suspicious nodes, this set will almost always include a large number of legitimate users. For instance, even though SPITters need to make a large number of calls, there are always legitimate users (say, businesses) who also make a large number of calls.
Or, as mentioned before, neighborhood connectivity cannot be used to distinguish legitimate users from SPITters: unlike in online social networks, even legitimate users seem to have small clustering coefficients in the phone call network.

On the positive side, we show that slightly more complex metrics can be very effective in identifying SPITters. To this end, we consider two properties of the phone network, namely the strong ties property and the weak ties property. The strong ties property is well known in the context of social networks. Simply stated, it says that people normally spend most of their time communicating with only a small number of their friends. In our dataset, we observe that 90% of phone system subscribers spend more than 80% of their time talking to only 5 people. The weak ties property considers the other end of the spectrum: for a legitimate user, we expect a significant fraction of calls to be long. For instance, in our dataset, for 90% of users at least 30% of calls last longer than a minute. We believe a combination of the weak ties and strong ties properties can be very effective in identifying SPITters.

We also measure the global ranking of nodes in our network using SymRank [1], a variation of the well-known PageRank algorithm. Unlike the weak ties and strong ties properties, which are based on local properties of the network, SymRank assigns a ranking score to each node based on global properties of the network. Interestingly, this seemingly orthogonal ranking matches very well with the outliers of the strong ties and weak ties properties.
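The strong ties and weak ties properties described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' method: the record layout `(caller, callee, duration_seconds)`, the function names `tie_profiles` and `flag_suspicious`, the restriction to outgoing calls, and the rule that flags a user only when both properties are violated are all hypothetical choices for this example. The default thresholds (top 5 contacts, 80% of talk time, calls longer than 60 seconds, 30% of calls) are taken from the figures quoted in the text.

```python
from collections import defaultdict


def tie_profiles(records, top_k=5, long_call_seconds=60):
    """Return {caller: (strong_share, long_fraction)}.

    strong_share:  fraction of the caller's outgoing talk time spent with
                   the top_k contacts (strong ties property).
    long_fraction: fraction of the caller's outgoing calls longer than
                   long_call_seconds (weak ties property).
    Only outgoing calls are considered; this is an assumption.
    """
    talk_time = defaultdict(lambda: defaultdict(float))  # caller -> callee -> seconds
    call_count = defaultdict(int)
    long_count = defaultdict(int)

    for caller, callee, duration in records:
        talk_time[caller][callee] += duration
        call_count[caller] += 1
        if duration > long_call_seconds:
            long_count[caller] += 1

    profiles = {}
    for caller, per_callee in talk_time.items():
        total = sum(per_callee.values())
        top = sum(sorted(per_callee.values(), reverse=True)[:top_k])
        strong_share = top / total if total > 0 else 0.0
        long_fraction = long_count[caller] / call_count[caller]
        profiles[caller] = (strong_share, long_fraction)
    return profiles


def flag_suspicious(profiles, min_strong_share=0.8, min_long_fraction=0.3):
    """Flag users who violate BOTH properties (hypothetical combination rule)."""
    return sorted(
        user
        for user, (strong_share, long_fraction) in profiles.items()
        if strong_share < min_strong_share and long_fraction < min_long_fraction
    )
```

Requiring both conditions is one conservative way to combine the two properties: a business with many short calls to a few partners, or a user with many long calls to many contacts, would not be flagged by this rule alone.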