Social-Aware Spatial Top-k and Skyline Queries

Abstract

The widespread proliferation of location-acquisition techniques and GPS-embedded mobile devices has resulted in the generation of geo-tagged data at unprecedented scale and has substantially enhanced the user experience in location-based services associated with social networks. Such location-based social networks allow people to record and share their locations and are a rich source of information that can be exploited to study people’s attributes and characteristics and to provide various Geo-Social (GS) services. In this paper, we propose two new types of queries, the Top-k Famous Places (TkFP) query and the Socio-Spatial Skyline Query (SSSQ), which enrich the semantics of conventional spatial queries by introducing a social relevance component. In addition, three approaches, namely (1) Social-First, (2) Spatial-First and (3) Hybrid, are proposed to efficiently process TkFP and SSSQ queries. Finally, we conduct an extensive evaluation of the proposed schemes using real and synthetic datasets and demonstrate the effectiveness of the proposed approaches.

1. INTRODUCTION

The fusion of social and geographical information has given rise to online social media known as location-based social networks (LBSNs), such as Facebook and Foursquare. An LBSN is usually represented as a complex graph in which nodes represent the entities of the social network (such as users, places or pages) and edges represent the relationships between nodes. These relationships are not limited to friendship; they also include other types such as works-at, born-in and studies-at. In addition, the nodes may contain spatial information such as a user’s check-ins at different locations. Consider the example of a Facebook user Alice who was born in Germany, works at Monash University and checks in at a particular restaurant.
Facebook records this information by linking the Facebook pages for Monash University and Germany with Alice [1], e.g. Alice and Monash University are connected by an edge labeled works-at, and Alice and Germany are connected by an edge labeled born-in. The check-in information records the places the user has visited.

Spatial data and social relationships in LBSNs provide a rich source of information that can be exploited to offer many interesting services. Consider the example of a German tourist visiting Melbourne. She may want to find a nearby pub that is popular (e.g. frequently visited) among people from Germany. This involves utilizing spatial information (i.e. nearby pubs, check-ins) as well as social information (i.e. people who were born in Germany). Similarly, a user may want to find nearby places that are most popular among her friends, e.g. the places most frequently visited by her friends. In disease monitoring, we may be interested in finding the spots frequently visited by people who have a certain disease, e.g. the Ebola virus. These people are connected to each other in a social network through the disease, and by analysing their visits we can find the frequently visited regions from which they could have carried the disease. Further, in public safety and crime prevention, suppose that some users have tweeted about drugs and have also joined pages containing drug-related information on social networks. Law enforcement agencies can find the places these users frequently visit, where they might be involved in drug-related activities, and use this information to plan raids. Such queries can also be utilized by businesses. Consider a chain of gaming stores that is interested in opening a new store in an area that is frequently visited, for shopping or leisure, by young adults who have a keen interest in gaming.
These young adults are connected in social networks through various gaming pages and/or tweets containing information about gaming. By analysing their visit information, the business can discover a suitable place or shopping mall for this purpose.

Although various types of queries have been studied on LBSNs [2–7], to the best of our knowledge this problem has not been studied before, and the existing techniques are either not applicable or cannot be efficiently extended to answer queries like the above, which aim at finding nearby places that are popular among a particular group of users satisfying a social constraint. It is notable that recommendation techniques that exploit the similar interests of friends have gained significant attention, as several works on social network analysis have observed that a user’s behavior often correlates with the behavior of her friends [8–12]. In our daily life, besides our own preferences, we usually turn to our friends for opinions on songs, restaurants or movies. Therefore, spatial keyword queries cannot capture the social influence that usually shapes one’s selection criteria. Motivated by this, in this paper we formalize this problem as the Top-k Famous Places (TkFP) query and the Socio-Spatial Skyline Query (SSSQ), and propose efficient query processing techniques. The two proposed queries are briefly introduced below:

1. A Top-k Famous Places (TkFP) query retrieves the top-k places (points of interest) ranked according to their spatial and social relevance to the query user, where the spatial relevance is based on how close the place is to the query location and the social relevance is based on how frequently it is visited by the one-hop neighbors of the query user in the social graph. We use a scoring function to obtain a final score based on the social and spatial scores (relevance) of a place.
In TkFP queries, a user needs to define a scoring function that combines social and spatial scores to rank the objects, which may not be trivial (e.g. due to incompatible attributes, different distributions of attributes, or the inability of users to choose a good scoring function) [13]. Motivated by this, to complement the TkFP query, we also study skyline queries, which do not need a scoring function to retrieve the desired objects. A formal definition is provided in Section 3.1.1.

2. A Socio-Spatial Skyline Query (SSSQ) returns every place for which there does not exist any other place that has a better social score and a better spatial score. An SSSQ does not need a scoring function; skyline queries are therefore a natural and popular choice for applications involving multi-criteria decision making [14–20]. Let us take the example of a German tourist visiting Melbourne who is looking for restaurants that are close and also popular among German people. A socio-spatial skyline query returns every restaurant p for which there does not exist any other restaurant p′ that is closer to her location and is more popular among German people. A formal definition is provided in Section 5.1.1.

We present three approaches to answer TkFP and SSSQ queries, namely (1) Social-First, (2) Spatial-First and (3) Hybrid. The first two approaches process the social and spatial components of the queries separately and do not require a specialized index. The third approach (Hybrid), however, processes the social and spatial components simultaneously by utilizing a hybrid index specifically designed to handle TkFP and SSSQ queries.

Contributions. We make the following contributions in this paper. To the best of our knowledge, we are the first to study TkFP queries, which retrieve nearby places popular among a particular group of users in the social network.
We extend our work to propose SSSQ queries, which return the places that are not dominated by any other place. To process both queries, we explore three different directions. We conduct an exhaustive evaluation of the proposed schemes using real and synthetic datasets and demonstrate the effectiveness of the proposed approaches.

2. RELATED WORK

2.1. Geo-social queries

Geo-social query processing is an emerging field that is attracting growing attention from the research community [3, 21–23]. In [24], an algebraic model is proposed to process geo-social queries. The model consists of a set of operators for querying geo-social data. It replicates the graphs to represent the spatial and social components, which can be very large for social networks, making query processing cumbersome. Huang et al. [25] studied a geo-social query that retrieves the set of nearby friends of a user who share common interests, without providing concrete query processing algorithms. The authors of [6] defined a new query, the Geo-Social Circle of Friends, to retrieve a group of friends whose members are close to each other both geographically and socially, e.g. for group sports, social gatherings and community services. Yang et al. [7] extended the work presented in [6] and introduced another type of group query, the Social-Spatial Group Query (SSGQ), which is useful for impromptu activity planning. In addition, nearest-neighbor queries have recently been widely applied in location-based social networks [26–30].

2.2. Top-k queries

Top-k queries retrieve the top-k objects based on a user-defined scoring function. The problem has been extensively studied [31–35]. Fagin’s algorithm (FA) [36], the threshold algorithm (TA) (independently proposed in [36–38]) and the no-random-access algorithm (NRA) [36] are among the top-k processing algorithms that combine multiple ranked lists and return the top-k objects. Ilyas et al.
[34] give a comprehensive survey of top-k query processing techniques. Wu et al. [31] proposed the social-aware top-k spatial keyword query (SkSK), which retrieves a list of k objects ranked according to their spatial proximity, textual relevance (e.g. a restaurant’s menu and facilities) and social relevance. The social relevance is defined as a function of the users who are ‘fans’ of an object, considering how close they are to the query user. This definition is fundamentally different from ours because we consider only the friends of the query user who have visited the object. Jiang et al. [39] proposed a method to find the top-k local users in geo-social media data. First, they extract all tweets in the given range and rank each tweet based on the number of replies to and forwards of that tweet. For this, they build a tweet thread tree for each tweet and sum up the replies and forwards at each level. This tweet ranking is taken as the social score of the user who initiated the tweet. Then, they compute the spatial score of each user who has posted tweets in the range. However, their social scoring criterion is not applicable to our problem definition. Therefore, to the best of our knowledge, none of the existing techniques can be applied or trivially extended to solve the Top-k Famous Places (TkFP) query.

2.3. Skyline queries

There has been considerable work in the literature on skyline queries [40–43], including skyline computation over partially ordered domains, in distributed settings, continuous skyline queries [44–46] and many more. The skyline operator was first introduced in [14], together with generic skyline computation algorithms such as Block-Nested Loop (BNL) and Divide and Conquer (DC), proposed by the same authors. Since then, skyline processing has attracted the attention of many researchers in the database community.
Additionally, a Bitmap algorithm was proposed in [20] to improve on the original algorithms for low-cardinality domains, i.e. datasets with a small number of discrete attribute values. A few other solutions exist for processing such datasets, for instance the one proposed in [42]. Similarly, another approach introduced in [14], known as the Index algorithm, divides the dataset into d sorted lists for d optimized measures. An R-Tree [47] based approach known as Nearest-Neighbor (NN) was proposed in [15]. This approach starts by finding the nearest neighbor of the query, so that objects dominated by the nearest neighbor can be pruned. Papadias et al. [16] proposed a Branch-and-Bound Search (BBS) method to overcome the overlapping problem that persists in the NN algorithm. This approach is guaranteed to visit each page of the R-Tree at most once.

Most work on skyline computation does not consider the social aspect. To the best of our knowledge, there exists only one work [4] in the literature that considers both social and spatial aspects for skyline queries. It defines the skyline as the set of users who are not dominated by any other user, where a user u is dominated by another user u′ if u′ is closer to the query location and u′ is socially closer to the query user. To compute social distance, it uses the random walk with restart (RWR) method, which is very expensive; to reduce the cost, it approximates the social similarity and therefore does not return exact results. The problem we study in this paper is fundamentally different, as we focus on returning skyline places (instead of skyline users), where a place p is dominated by another place p′ if p′ is closer to the query user and the number of q’s friends who visited p′ is greater than the number of q’s friends who visited p. Therefore, the problem setting of [4] is completely different from ours, and its techniques cannot be applied to answer SSSQ queries.

3. TOP-K FAMOUS PLACES QUERIES

3.1.
Preliminaries

3.1.1. Problem definition

Location-Based Social Network (LBSN): A location-based social network consists of a set of entities $U$ (e.g. users, pages, groups) and a set of places $P$. The relationship between two entities $u$ and $v$ is indicated by a labeled edge, where the label indicates the type of relationship (e.g. friend, lives-in). An LBSN also records check-ins, where a check-in of a user $u \in U$ at a particular place $p \in P$ indicates an instance in which $u$ visited the place $p$.

Score of a place $p$: Given a query user $q$ and a range $r$, the score of a place $p \in P$ is 0 if $\|q,p\| > r$, where $\|q,p\|$ is the Euclidean distance between the query location and $p$. If $\|q,p\| \le r$, the score of $p$ is a weighted sum of its spatial score (denoted as $p_{spatial}$) and its social score (denoted as $p_{social}$):

$$Score(p) = \alpha \times p_{spatial} + (1-\alpha) \times p_{social} \qquad (1)$$

where $\alpha$ is a parameter used to control the relative importance of the spatial and social scores.

The social score $p_{social}$ is computed as follows. Let $F_q$ denote the one-hop neighbors of the query user $q$ with respect to a particular relationship type, e.g. if the relationship is born-in and the query entity is the Facebook page named Germany, then $F_q$ is the set of users born in Germany. Although our techniques can be used with any type of relationship, for ease of presentation we consider only friendship relationships in the rest of the paper. In this context, $F_q$ contains the friends of the query user $q$. Let $V_p$ denote the set of all users who visited (i.e. checked in at) the place $p$. The social score is then

$$p_{social} = \frac{|F_q \cap V_p|}{|F_q|} \qquad (2)$$

where $|X|$ denotes the cardinality of a set $X$; if $|F_q|$ is zero, we assume that $p_{social}$ is also zero. Intuitively, $p_{social}$ is the proportion of the friends of $q$ who have visited the place $p$.

The spatial score $p_{spatial}$ is based on how close the place is to the query location. Formally, given a range $r$, $p_{spatial} = 0$ if the place does not lie in the range.
Otherwise, $p_{spatial} = r - \|q,p\|$, where $\|q,p\|$ is the Euclidean distance between the query location and $p$. Note that $p_{social}$ is always between 0 and 1, and we normalize $p_{spatial}$ such that it is also within the range 0 to 1, e.g. the data space is normalized such that $\|q,p\| \le 1$ and $r \le 1$.

Top-k Famous Places (TkFP) Query: Given an LBSN, a TkFP query $q$ returns the $k$ places with the highest scores, where the score $Score(p)$ of each place $p$ is computed as described above.

Example 2.1 Figure 1a illustrates the locations of a set of places $P = \{p_1, p_2, p_3, p_4\}$. The query $q$ shown in Fig. 1a, with $k = 2$ and range $r = 0.15$, has a set of friends $F_q = \{u_1, u_2, u_3, u_4, u_5, u_6, u_7, u_8, u_9, u_{10}\}$. The number in brackets next to each place is the number of check-ins made by $q$’s friends. Figure 1b shows the Euclidean distances and the visitors of each place among $q$’s friends. Assuming $\alpha = 0.5$, the score of $p_1$ w.r.t. $q$ is computed as $0.025 + 0 = 0.025$. Similarly, we have $Score(p_2) = 0.185$, $Score(p_3) = 0.205$ and $Score(p_4) = 0.115$. The result of the query $q$ is $(p_2, p_3)$ according to the scoring function in Equation (1). Table 1 lists the notation used in this paper.

Figure 1. Top-k query example.

Table 1. Notations.

Notation        Definition
$q$             Query user, i.e. $q \in U$
$u$             User, i.e. $u \in U$
$p$             A place, i.e. $p \in P$
$P_v$           Set of places visited by friends of $u$
$F_q$           Set of friends of query user $q$
$V_p$           Set of visitors of a place $p$
$r$             Given query range
$k$             Number of requested places
$p' \prec p$    Place $p'$ dominates place $p$
$c_{ij}$        Range Grid cell
$P_c$           Set of places that lie inside a cell
$V_{cell}$      Set of visitors of a cell
$b_{ij}$        Skyline workspace block

3.2. Framework overview

The proposed framework consists of three approaches to answer a TkFP query: (I) Social-First, (II) Spatial-First and (III) Hybrid. The Social-First approach first processes the social component (e.g. friendship relations and their check-ins) and then processes the spatial component (e.g. places in the given range), whereas Spatial-First processes the spatial component first, followed by the social component. In contrast, the Hybrid approach processes the social and spatial components simultaneously. More specifically, it leverages two types of pre-processed information associated with each user $u \in U$: her own check-in information and a summary of her friends’ check-in information. To the best of our knowledge, there is no unanimously accepted social or spatial storage implementation.
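Whatever storage layer is used, the semantics of a TkFP query (Section 3.1.1, Equations (1) and (2)) can be stated as a brute-force baseline. The following Python sketch is only an illustration under stated assumptions: the function names, the in-memory dictionary layout, and the coordinates and visitor sets are hypothetical (they are not the values of Figure 1b) and were chosen so that the four scores quoted in Example 2.1 are reproduced.

```python
import heapq
import math

def social_score(friends, visitors):
    """Equation (2): share of the query user's friends who checked in at the place."""
    if not friends:
        return 0.0  # p_social is defined as 0 when F_q is empty
    return len(friends & visitors) / len(friends)

def score(q_loc, friends, loc, visitors, r, alpha):
    """Equation (1); a place outside the range r scores 0."""
    d = math.dist(q_loc, loc)
    if d > r:
        return 0.0
    return alpha * (r - d) + (1 - alpha) * social_score(friends, visitors)

def tkfp_bruteforce(q_loc, friends, places, k, r, alpha):
    """Score every place and return the k best as (score, name), best first."""
    scored = ((score(q_loc, friends, loc, vis, r, alpha), name)
              for name, (loc, vis) in places.items())
    return heapq.nlargest(k, scored)

# Hypothetical data: name -> (location, set of visitors). Not taken from Figure 1b.
Q = (0.0, 0.0)
FRIENDS = {f"u{i}" for i in range(1, 11)}
PLACES = {
    "p1": ((0.10, 0.0), set()),
    "p2": ((0.08, 0.0), {"u1", "u2", "u3"}),
    "p3": ((0.04, 0.0), {"u4", "u5", "u6"}),
    "p4": ((0.12, 0.0), {"u7", "u8"}),
    "p5": ((0.30, 0.0), {"u1", "u2", "u3", "u4"}),  # outside the range
}

top2 = tkfp_bruteforce(Q, FRIENDS, PLACES, k=2, r=0.15, alpha=0.5)
print([name for _, name in top2])
```

With these inputs the two best places are p3 and p2. Every place is scored, which is exactly the cost that the Social-First, Spatial-First and Hybrid approaches aim to avoid.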
Specifically, Facebook uses adjacency lists stored in Memcached [48], a distributed memory caching system, whereas Twitter leverages the R*-Tree [49] spatial index. Further, Foursquare uses MongoDB [50], a document-oriented database. Similarly, academic research has adopted various approaches: [5] uses adjacency lists stored in Neo4j, a graph database, whereas [24] utilizes relational tables for storing the friendship relations. Similar to the existing work on kNN queries, we tailor the storage implementation to suit the requirements of our techniques. More specifically, we index places and users’ check-in information using the R-Tree [47] spatial index structure. Before presenting our techniques, we define the Facility R-Tree, the Check-in R-Tree and the Friendship Index.

Definition 3.1 (Facility R-Tree). The Facility R-Tree stores all places ($p \in P$) in a given dataset. A node of the Facility R-Tree is represented by a minimum bounding rectangle (MBR) that encloses all places in its sub-tree.

Definition 3.2 (Check-in R-Tree). For the sake of efficiency, we store the check-in information of each user by indexing all places she visited in a separate R-Tree-based index. If a place $p$ is visited by a user multiple times, it is indexed once per visit; hence, the Check-in R-Tree contains duplicate entries for the place $p$, since many applications (e.g. ranking and recommendation of places) require the complete check-in information of users.

Definition 3.3 (Friendship Index). To index each user $u \in U$ and their social relationships, we build a B+-Tree-based index structure following Unicorn [1], which was built to search the Facebook social graph.

4. PROPOSED TECHNIQUES

4.1. Social-first-based approach

This approach first looks at the check-ins of each friend $u \in F_q$ to compute the social score of each visited place $p \in P$ in the given range $r$.
Next, using the given range $r$, it processes the spatial component of $q$, computes the score of the remaining places $p \in P$ that have no check-ins from $q$’s friends, and returns the set of top-k places based on their scores and $q$’s preference parameter $\alpha$. We next describe the technique in detail, with pseudocode given in Algorithm 1. In the first loop of the algorithm (at Line 1), it exploits the Check-in R-Tree of each friend $u \in F_q$ of query $q$ to get the places in range $r$, and then computes the social score and the score of each candidate place $p$ in $r$ (at Line 4). In addition, we maintain the score of the current $k$th place based on the social and spatial scores (at Line 5). Further, in the third loop (at Line 7), the Facility R-Tree is exploited to compute the score of those places $p \in P$ in range $r$ that are not visited by $q$’s friends (at Line 10); their score consists only of the spatial score, since $p_{social} = 0$. Finally, the top-k result set is retrieved using $Score_k$ (at Line 12). Let $Score_k$ denote the score of the current $k$th place. Lemma 4.1, our first pruning rule, shows that if $\|q,p\| \ge r - Score_k/\alpha$, we can prune the place $p$.

Lemma 4.1 Every place $p$ whose distance $\|q,p\|$ from query $q$ is greater than the current $r - (Score_k/\alpha)$ cannot be in the top-k places.

Algorithm 1. Social-First.

Proof Given a query user $q$, a range $r$ and a preference factor $\alpha$, a place $p$ that is not checked in by any user $u \in F_q$ has social component $p_{social} = 0$. Using Equations (1) and (2), we get:

$$Score(p) = \alpha (r - \|q,p\|) + 0 \qquad (3)$$

To be a candidate for the top-k places, the score of $p$ must be at least the current $Score_k$; hence

$$Score_k \le Score(p) \qquad (4)$$

Substituting the value of $Score(p)$ from Equation (3):

$$Score_k \le \alpha (r - \|q,p\|) \;\Longrightarrow\; \|q,p\| \le r - (Score_k/\alpha) \qquad (5)$$

□

4.2.
Spatial-first-based approach

This approach first retrieves all places in the given range $r$ and computes the spatial score of each place $p$, regardless of whether it is checked in by any friend $u \in F_q$. It then computes the social score of each place $p$ by counting the friends who checked in at it, using the visitor information $V_p$ of the place. Finally, it computes the score of each place while updating the current $k$th place’s score, and yields the result set. We next elaborate the technique, with pseudocode given in Algorithm 2. The algorithm starts by issuing a range query on the Facility R-Tree to retrieve all places in range $r$ (at Line 1). Then, in the first loop (at Line 2), for each place $p \in P$ in range $r$, we compute $Score(p)$ in ascending order of the distance of $p$ from $q$. To achieve this, a heap is initialized with the root entry of the Facility R-Tree, with $\|q,e\|$ as the key of an entry $e$, so that the spatial component is processed first. Further, the algorithm computes the social score of $p$ by retrieving the number of friends $u \in F_q$ who visited the place from the visitors’ information of the place (at Line 6). Finally, the score of place $p$ is computed using Equation (1) (at Line 6). Let $Score_k$ denote the score of the current $k$th place. Lemma 4.2, our second pruning rule, shows that if $\|q,p\| \ge r - (Score_k - (1-\alpha))/\alpha$, the process stops (at Line 5), since every subsequent entry in the heap is farther from $q$ than the current entry.

Lemma 4.2 Every place $p$ whose distance $\|q,p\|$ from query $q$ is greater than the current $r - (Score_k - (1-\alpha))/\alpha$ cannot be in the top-k places.

Algorithm 2. Spatial-First.
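The Spatial-First procedure and its stopping rule can be sketched in a few lines of Python. This is a hedged illustration, not the paper's Algorithm 2: it assumes the places are held in memory and simply sorted by distance (instead of a best-first traversal of the Facility R-Tree), that $\alpha > 0$, and that the function name, the dictionary layout and the sample data are hypothetical. The returned counter shows how many places were actually scored before the bound of Lemma 4.2 stopped the scan.

```python
import heapq
import math

def spatial_first_topk(q_loc, friends, places, k, r, alpha):
    """Scan places by ascending distance; stop once no farther place can reach Score_k.

    places: name -> (location, set of visitors). Requires alpha > 0.
    Returns (list of (score, name) best first, number of places scored).
    """
    by_distance = sorted((math.dist(q_loc, loc), name, vis)
                         for name, (loc, vis) in places.items())
    top = []        # min-heap holding the k best (score, name) pairs seen so far
    examined = 0
    for d, name, vis in by_distance:
        if d > r:
            break   # outside the query range
        if len(top) == k:
            score_k = top[0][0]
            # Stopping rule: even with p_social = 1 the place cannot beat Score_k.
            if d >= r - (score_k - (1 - alpha)) / alpha:
                break
        examined += 1
        social = len(friends & vis) / len(friends) if friends else 0.0
        s = alpha * (r - d) + (1 - alpha) * social
        if len(top) < k:
            heapq.heappush(top, (s, name))
        elif s > top[0][0]:
            heapq.heapreplace(top, (s, name))
    return sorted(top, reverse=True), examined

# Hypothetical data: a very popular nearby place lets the scan stop early.
FRIENDS = {"u1", "u2"}
PLACES = {
    "a": ((0.1, 0.0), {"u1", "u2"}),
    "b": ((0.5, 0.0), {"u1"}),
    "c": ((0.8, 0.0), set()),
}
result, examined = spatial_first_topk((0.0, 0.0), FRIENDS, PLACES, k=1, r=1.0, alpha=0.9)
print([name for _, name in result], examined)
```

The bound is only effective when the spatial weight is large or the current $k$th score is high, because it pessimistically assumes a social score of 1 for every unseen place.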
Proof Given a query user $q$, a range $r$ and a preference factor $\alpha$, to be a candidate for the top-k places, the score of a place $p$ must be at least the current $Score_k$:

$$Score_k \le Score(p) \qquad (6)$$

Substituting the value of $Score(p)$ from Equations (1) and (2), we get:

$$Score_k \le \alpha (r - \|q,p\|) + (1-\alpha) \frac{|F_q \cap V_p|}{|F_q|} \qquad (7)$$

Since the maximum possible social score of a place is 1, we get:

$$Score_k \le \alpha (r - \|q,p\|) + (1-\alpha) \times 1 \;\Longrightarrow\; \|q,p\| \le r - \frac{Score_k - (1-\alpha)}{\alpha} \qquad (8)$$

□

4.3. Hybrid approach

4.3.1. Friends Check-ins R-Tree

To optimize the pruning of irrelevant friends and places, we propose a spatial indexing structure, the Friends Check-ins R-Tree (FCR-Tree), that supports simultaneous pruning of friends and places. It is an R-Tree-based structure that is constructed for each user $u \in U$. For a query user $q$, the FCR-Tree stores the check-in information of each friend $u \in F_q$, thus representing the check-in summary of all friends of $q$. The objects of the FCR-Tree are the root MBRs of each friend’s Check-in R-Tree. Updating the index when a friend $u$ makes a new check-in is not costly, since updates are applied in bulk after a certain period of time. Let us assume a query user $q \in U$ (here $u_1$) whose friends are $F_q = \{u_2, u_3, u_4, u_5, u_6\}$. Figure 2 shows the conceptual view of the FCR-Tree of $u_1$. In the following, we describe our proposed technique in detail.

Figure 2. Check-ins example. (a) Check-in information and (b) friends’ check-ins summary.

In this approach, the score of each place $p$ in range $r$ is computed by processing the social and spatial components of query $q$ together. To answer TkFP queries efficiently, this approach leverages the Friends Check-ins R-Tree of query $q$ to prune the friends who did not visit the top-k places.
Similarly, a grid spatial index is used to prune the places in the given range $r$ that cannot be candidates for the top-k result set. More specifically, to compute the social and spatial scores of a place $p$, this approach supports simultaneous pruning of $q$’s friends and of the places $p \in P$ in the given range $r$. We next elaborate the technique, with pseudocode given in Algorithm 3. The algorithm employs grid partitioning to divide the region formed by the given range $r$. For each grid cell $c_{ij}$, the set of places $P_c \subseteq P$ that lie in the cell is obtained using the Facility R-Tree (at Line 4), and the distance from $q$ to the closest place $p$ in the cell is recorded as the cell’s distance from $q$ (at Line 5). In the second loop (at Line 8), the set of friends $V_{cell}$ who might have visited a cell is computed for each cell by exploiting the Friends Check-ins R-Tree (FCR-Tree) of $q$ and counting the number of FCR-Tree objects that overlap the cell (at Line 9). Figure 3 illustrates an example in which four objects overlap the cell; therefore, the overlap count of the cell is 4. This overlap count is taken as the social score of the cell and serves as an upper bound on the social score of all places that lie in the cell.

Figure 3. Cell overlap.

Once the social and spatial scores of each cell $c_{ij}$ are computed, a ranking score of each cell is computed using Equation (9) (at Line 9), which serves as an upper bound on the score of any place in the cell. Further, in the third loop (at Line 10), for each cell in descending order of cell score, the algorithm accesses all the places that lie in the cell (at Line 13) to compute the social score and the score of each place $p$, while maintaining the current $k$th place score (at Lines 14 and 15).
If the current $k$th place score is greater than the next cell’s score, the process stops (at Line 12), since no subsequent cell can contain a place with a higher ranking score than the current $k$th place. The cell score is computed as:

$$Score_{cell} = \alpha (r - cell.distance) + (1-\alpha) \frac{|V_{cell}|}{|F_q|} \qquad (9)$$

5. SOCIO-SPATIAL SKYLINE QUERIES

In this section, we present our algorithms to answer socio-spatial skyline queries (SSSQ). First, we formally define the problem and introduce terms and notations; then we present three algorithms to answer these queries.

5.1. Preliminaries

5.1.1. Problem definition

In this paper, we are interested in retrieving places in a given range $r$ based on their social and spatial scores. The top-k query uses a scoring function that combines social and spatial scores to rank the objects. However, users must have adequate domain knowledge to decide upon a good value of $\alpha$. In particular, it is not easy to define a scoring function (e.g. due to incompatible attributes, different distributions of attributes, or the inability of users to choose a good scoring function) [13]. Therefore, to complement our Top-k Famous Places query, we extend our work to study skyline queries, which return the objects that are within the range $r$ (i.e. $\|q,p\| < r$) and are not dominated by any other object. To answer these queries, we compute the $p_{social}$ and $p_{spatial}$ scores of each place as defined in Section 3.1.1. The intuition behind using a range $r$ is that users are often not interested in places that are too far away. Below we formally define our query.

Dominance: A place $p$ is dominated by a place $p'$ if $p_{social} \le p'_{social}$ and $p_{spatial} \le p'_{spatial}$, and at least one of the following two holds: $p_{social} < p'_{social}$ or $p_{spatial} < p'_{spatial}$. We denote the dominance relationship as $p' \prec p$, which implies that place $p$ is dominated by place $p'$.

Socio-Spatial Skyline Query (SSSQ): Given a query $q$ and a range $r$, an SSSQ returns every place $p$ for which $\|q,p\| < r$ and $p$ is not dominated by any other place $p'$.
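The dominance test and the query definition translate directly into code. The Python sketch below is a naive quadratic baseline that compares every pair of in-range places; it is meant only to make the definition concrete. The function names are hypothetical, and the scores are made up for illustration, except that $p_3$ uses the pair (0.8, 0.01) quoted for it in the running example.

```python
def dominates(a, b):
    """True if place a dominates place b.

    Each argument is a (p_social, p_spatial) pair; larger values are better.
    a must be at least as good in both scores and strictly better in one.
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def skyline(scores):
    """Names of the places that no other place dominates.

    scores: name -> (p_social, p_spatial) for places already known to be in range.
    """
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )

# Hypothetical scores; only p3 = (0.8, 0.01) comes from the text.
SCORES = {
    "p2": (0.4, 0.005),   # dominated by p3
    "p3": (0.8, 0.01),
    "p5": (0.3, 0.06),    # dominated by p7
    "p7": (0.5, 0.12),
}
print(skyline(SCORES))
```

Here p3 and p7 survive: p3 has the better social score, p7 the better spatial score, so neither dominates the other. Two places with identical scores do not dominate each other and are both kept.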
Algorithm 3. Hybrid algorithm.

We use the example in Figs 4 and 5 to illustrate the problem definition. Let us assume that we have a set of places $P = \{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8\}$ inside the range $r = 0.15$ given by query $q$, and a set of friends of $q$, $F_q = \{u_1, u_2, u_3, u_4, u_5, u_6, u_7, u_8, u_9, u_{10}\}$. For each place, we compute its spatial and social scores based on its distance from query $q$ and the number of friends of $q$ who checked in at the place. Next, we map each place to a space in which the x-coordinate is the spatial score and the y-coordinate is the social score, as illustrated in Fig. 5b.

Figure 4. Sample dataset.

Figure 5. Mapping. (a) Places in range and (b) mapping to 2D space.

For example, in Fig. 4 the social and spatial scores of place $p_3$ are 0.8 and 0.01, respectively, and using these scores $p_3$ is mapped to the space as shown in Fig. 5b. To retrieve the query result, we use this space to find the places that are not dominated by any other place. For example, place $p_7$ dominates $p_5$ because $p_7.social > p_5.social$ and $p_7.spatial > p_5.spatial$. Hence, the SSSQ returns $p_7$ and $p_3$, which are not dominated by any other place.

6. PROPOSED TECHNIQUES

6.1. Social-first based algorithm

The Social-First approach accesses only those places that are visited by $q$’s friends, rather than accessing each place in range $r$. It first looks at the check-ins of each friend to compute the social score of each visited place $p \in P$ in the given range $r$. Then only the visited places in the range are used to compute the skyline places, as described in Algorithm 4. In the first loop of the algorithm (at Line 1), it exploits the Check-in R-Tree of each friend $u \in F_q$ of query $q$ to get the places in range $r$, and then computes the social score of each candidate place $p$ in $r$ (at Line 3).
In addition, we also compute the spatial score of each candidate place p. Next, each candidate place p is accessed in descending order of the sum of its two scores (at Line 5), because accessing the places in this order guarantees that a place is a skyline place if and only if it is not dominated by any place in S [16], where S is the set of skyline places obtained so far. Then each candidate place p is examined for dominance (at Line 6). Finally, the nearest neighbor of query q is computed (at Line 8) and is added to the skyline places if none of her friends has checked in there. Lemma 6.1 below shows that the nearest neighbor of query q is always a skyline object.

Algorithm 4 Skyline: Social-First algorithm.

Lemma 6.1 A nearest neighbor (NN) of the query is always a skyline place.

Proof There cannot be any place $p'$ that has a smaller distance than the nearest neighbor p of the query q. Since the spatial score is determined by the distance from q, no place has a higher spatial score than p, so p cannot be dominated by any place that is farther from q. If there is more than one nearest neighbor, then the nearest neighbor with the highest social score is not dominated by any other nearest neighbor and is considered a skyline place.□

6.2. Spatial-first-based algorithm

This approach first gets all places in range r by issuing a range query on the Facility R-Tree (at Line 1) in Algorithm 5. Then, in the first loop of the algorithm (at Line 2), it computes the spatial and social scores of each place in the given range r. Next, each place in the range is accessed in descending order of the sum of its two scores (at Line 5) and is then examined for dominance (at Line 6). If the place is not dominated by the skyline places obtained so far, it is inserted into the set of skyline places S. The Spatial-First approach accesses only one R-tree index (i.e. the Facility R-Tree), while the Social-First approach has to access as many R-tree indices as the number of friends of query q.
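The sum-ordered scan shared by the Social-First and Spatial-First algorithms can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 4 or 5: candidate retrieval through the R-trees and the nearest-neighbor step are omitted, the function name and tuple layout are assumptions, and the scores other than those of p3 are hypothetical.

```python
from typing import List, Tuple

# (name, social score, spatial score); higher is better for both scores.
Candidate = Tuple[str, float, float]


def skyline_by_sum(candidates: List[Candidate]) -> List[Candidate]:
    """Scan candidates in descending order of social + spatial score.

    A dominating place always has a strictly larger sum than the place it
    dominates, so it is seen first. Each candidate therefore only needs to
    be checked against the skyline set S collected so far.
    """
    ordered = sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)
    skyline: List[Candidate] = []
    for name, social, spatial in ordered:
        dominated = any(
            s_soc >= social and s_spa >= spatial
            and (s_soc > social or s_spa > spatial)
            for _, s_soc, s_spa in skyline
        )
        if not dominated:
            skyline.append((name, social, spatial))
    return skyline


# Hypothetical scores; only p3's values (0.8, 0.01) come from the text.
candidates = [("p3", 0.8, 0.01), ("p5", 0.3, 0.05), ("p7", 0.7, 0.12)]
print([name for name, _, _ in skyline_by_sum(candidates)])
```

Because places with equal sums cannot dominate one another, ties in the sort order do not affect the correctness of the scan.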
The downside of the Spatial-First approach, however, is that it retrieves all the places in the given range r and computes their social scores, whereas the Social-First approach computes the social score of only those places in the range r that are visited by q's friends. The next section addresses the weaknesses of both approaches.

6.3. Hybrid algorithm

This section focuses on our third approach (i.e. Hybrid) to process SSSQ, which is capable of processing the social and spatial aspects simultaneously. Before presenting the technique, we first describe our index and record-keeping structures.

6.3.1. Two grids

We first introduce two grid indices that are employed to speed up the retrieval and pruning process.

1. Range Grid: This grid is built upon the region formed by the given range r by splitting it into small cells as shown in Fig. 6a. Each cell has the following information associated with it: (i) the places that lie in the cell and (ii) the number of Friends' Check-In R-tree (FCR-Tree) object rectangles (root MBRs of Check-In R-Trees) that overlap with that particular cell. As stated in Section 4.3, this information is used to compute a bound on the social scores of the places in the cell.

Figure 6. Sample dataset and skyline mapping. (a) Places in range and (b) mapping to skyline workspace.

2. Skyline Workspace Grid: As described earlier, for each place inside range r, we compute the social ($P_{social}$) and spatial ($P_{spatial}$) scores and then map the place to a 2D space where $P_{social}$ is mapped along the y-axis and $P_{spatial}$ is mapped along the x-axis. This 2D space is called the skyline workspace. Similar to the Range Grid, we divide our skyline workspace into a grid as shown in Fig. 6b to index each object based on its social and spatial scores. This aids in retrieving objects, examining dominance and filtering objects efficiently.
Note that, to avoid ambiguity, we refer to a cell of the range grid as a cell and to a cell of the skyline workspace grid as a block in the rest of the paper. Figure 6b shows the mapping of all places in the range to their corresponding skyline workspace grid blocks based on their social and spatial scores. For example, denoting a block of the skyline workspace grid as $b_{i,j}$ (where i is a row number starting with 0 and j is a column number starting with 0), so that the bottom-left block is $b_{0,0}$, place $p_3$ is mapped to block $b_{3,0}$ and place $p_7$ is mapped to block $b_{3,4}$.

Algorithm 5 Skyline: Spatial-First algorithm.

6.3.2. Mapping range grid cell

In addition to the mapping of places to skyline grid blocks, each range grid cell $C_{i,j}$ is mapped to the skyline workspace grid. To achieve this, we first compute the social and spatial scores (i.e. $c_{social}$, $c_{spatial}$) of each cell. To illustrate, consider the range grid cell $C_{1,2}$ with three places (i.e. $p_2, p_6, p_8$) inside it, as shown in Fig. 7a, with their social and spatial scores listed in Fig. 7b. Assume that the cell $C_{1,2}$ overlaps with six objects of the FCR-Tree; this count is used as the social score of the cell ($c_{social} = 0.6$). In addition, the spatial score of place $p_8$, which is the largest among all places that lie in the cell, is taken as the cell's spatial score ($c_{spatial} = 0.10$). These scores serve as an upper bound on the scores of any place inside the cell, and by using them, the cell is mapped to its corresponding skyline workspace grid block $b_{2,3}$ as a point $C_{1,2}(c_{spatial}, c_{social})$, as shown in Fig. 8.

Figure 7. Social and spatial score of a Range Grid Cell. (a) Cell $C_{1,2}$ and (b) places' scores.

Figure 8. Range cell mapping to skyline workspace grid.

Based on this, we can conclude that the cell point $C_{1,2}(0.10, 0.6)$ in the skyline workspace dominates all the places (i.e. $p_2, p_6, p_8$) that lie in the cell. Conversely, if the cell point $C_{1,2}$ in the skyline workspace is dominated by any other object (e.g. $p_7$), the cell is immediately pruned along with all the places inside it, because those places have social and spatial scores no larger than the cell's. The places that lie inside the cell therefore cannot be part of the skyline, and this pruning considerably improves the query processing time.

Algorithm 6 describes the indexing of the range and the skyline workspace along with the mapping of each range grid cell $C_{i,j}$ to the skyline workspace grid. Initially, we construct a grid index over the region formed by range r (at Line 1). Then, in the first loop (at Line 3), we index each place in the range to its corresponding range grid cell C and update the cell's spatial score ($c_{spatial}$) (at Line 4). In addition, the skyline workspace is divided into a grid (at Line 5) and a range query is issued on the FCR-Tree to get all the friends of query q who might have visited any place in the range (at Line 6). Finally, in the second loop (at Line 7), for each range grid cell C, the upper bound ($c_{social}$) on the social score of the places that lie in the cell is computed, and the cell is then mapped to its corresponding skyline workspace block b using the cell's social and spatial scores. Each block $b_{i,j}$ of the skyline workspace grid is associated with two lists: one contains the range grid cell objects that lie inside the block and the other contains the actual places p inside that block, as shown in Fig. 9.

Figure 9. Grid Index and record-keeping structures.

6.3.3. Computation module

Intuition: Let us assume that we have a place p in the skyline workspace grid as illustrated in Fig. 10.
Note that the block $b_{2,2}$ is dominated by place p; therefore, no place or range grid cell in the block can be or contain a skyline place. Similarly, none of the blocks in the shaded area R need to be accessed if we have already seen the place p.

Figure 10. Dominated blocks.

To access blocks efficiently, we need to access them in a particular order, which is determined by maxScore as defined in Definition 6.1. For each block, we en-heap the blocks below it and to the left of it. If a block is dominated, it is pruned.

Definition 6.1 maxScore(b): The maxScore(b) of any given block of the skyline workspace grid is the sum of its top-right corner coordinates (i.e. $P_{social}$, $P_{spatial}$).

For example, in Fig. 11, the maxScore of block $b_{4,4}$ is 2 (the sum of its top-right corner coordinates, i.e. 1 and 1) and the maxScore of $b_{3,3}$ is 0.8 + 0.8 = 1.6. Since the top-right corner of the skyline workspace, P(1,1), has the highest maxScore, the block $b_{4,4}$ is selected first for processing. Now consider two points $p_1$ and $p_2$ at the lower-right and top-left corners of $b_{4,4}$, respectively. Note that points $p_1$ and $p_2$ have a higher maxScore than any object in the shaded region. Precisely, for every unprocessed block $b_{i,j}$, $maxScore(b_{i,j}) \le \max(maxScore(b_{3,4}), maxScore(b_{4,3}))$. Consequently, the block to be processed after $b_{4,4}$ is either $b_{3,4}$ or $b_{4,3}$; assuming that $maxScore(b_{4,3}) \le maxScore(b_{3,4})$, $b_{3,4}$ is the second one to be processed. Further, the block with the third highest maxScore is determined among $b_{4,3}$, $b_{3,3}$ and $b_{2,4}$. We next describe the technique in detail, with pseudocode given in Algorithm 7.

Algorithm 6 Indexing and mapping.

Algorithm 7 Skyline: Hybrid

Figure 11. Block visiting order.
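The block visiting order described above can be sketched with a max-heap keyed on maxScore. The sketch below is a simplification and not the paper's Algorithm 7: it assumes an n × n grid over the unit square, takes the skyline places as a fixed input instead of discovering them during the traversal, and ignores the per-block cell and place lists. The function names are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def max_score(i: int, j: int, n: int) -> float:
    """Sum of the top-right corner coordinates of block b[i][j]
    in an n x n grid over the unit square (row i, column j, 0-based)."""
    return (i + j + 2) / n


def block_visit_order(
    n: int, skyline: Iterable[Tuple[float, float]] = ()
) -> List[Tuple[int, int]]:
    """Visit blocks in descending maxScore order, starting at the top-right
    block. skyline holds (social, spatial) scores of known skyline places;
    a block whose top-right corner is dominated is pruned and not expanded."""
    known = list(skyline)
    start = (n - 1, n - 1)
    # heapq is a min-heap, so the integer key i + j + 2 is negated.
    heap = [(-(2 * n), n - 1, n - 1)]
    seen = {start}
    order: List[Tuple[int, int]] = []
    while heap:
        _, i, j = heapq.heappop(heap)
        corner_social, corner_spatial = (i + 1) / n, (j + 1) / n
        dominated = any(
            soc >= corner_social and spa >= corner_spatial
            and (soc > corner_social or spa > corner_spatial)
            for soc, spa in known
        )
        if dominated:
            continue  # everything below and to the left is dominated as well
        order.append((i, j))
        for ni, nj in ((i - 1, j), (i, j - 1)):  # block below, block to the left
            if ni >= 0 and nj >= 0 and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(ni + nj + 2), ni, nj))
    return order


order = block_visit_order(5)
print(order[:3])
```

For a 5 × 5 grid, the traversal starts at block (4, 4) and then chooses between (3, 4) and (4, 3), which have equal maxScore here; the tuple ordering of the heap entries breaks the tie in favour of the smaller row index.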
Algorithm: Initially, Algorithm 7 invokes the Indexing and Mapping algorithm (Algorithm 6) to compute the social and spatial scores of each range grid cell and to index the cells to their corresponding skyline workspace grid blocks (at Line 1). The algorithm then employs the method described above to process blocks in descending maxScore(b) order, retaining the property of visiting the minimal set of blocks. To this end, a maxHeap is initialized with the top-right block $b_{4,4}$ of the skyline workspace grid, with its maxScore as the sorting key (at Line 2). Then, the algorithm starts de-heaping the blocks iteratively (at Line 3). If a block is dominated by any skyline place p (at Line 5), it is immediately pruned and, consequently, the blocks below it and to the left of it are not en-heaped. Since at this stage only range grid cells are indexed to the skyline workspace, the algorithm first examines each cell object $C_{i,j}$ that lies inside the de-heaped block for dominance (at Line 6); if a cell is dominated by any skyline place found so far (at Line 7), it is pruned. Consequently, all the places that lie in the pruned cell object are also discarded and are not processed further. If a cell object $C_{i,j}$ is not dominated, then for each place that lies in it (loop at Line 8), the algorithm computes its social score ($P_{social}$) (at Line 9). Then, the place is indexed to its corresponding skyline workspace grid block b, provided that the place is not dominated by any skyline place found so far (at Line 10). However, if the corresponding block is dominated, it is marked so that it is not en-heaped in maxHeap. Subsequently, after indexing each place in the range to its corresponding block, the algorithm examines each place p that lies in the current de-heaped block for dominance (at Line 12) in descending order of $P_{social} + P_{spatial}$ and updates the skyline result set.
The algorithm also en-heaps the blocks below and to the left of the current de-heaped block using their maxScores, provided that they have not been en-heaped before and are not dominated by any skyline place found so far (at Line 14). The algorithm terminates when all the blocks in maxHeap have been examined and returns the skyline places (at Line 16).

7. EXPERIMENTS

7.1. Experimental setup

To the best of our knowledge, this problem has not been studied before, and the techniques in existing work are either not applicable or cannot be efficiently extended to solve TkFP and SSSQ queries. Nevertheless, although the paper [31] studies a different problem, we have implemented its algorithm and compared our techniques against it (Table 2).

Table 2. Parameters (default shown in bold).

| Parameters | Values |
| --- | --- |
| Number of queries | 50, 100, 150, 200 |
| Range (km) | 50, 100, 200, 400 |
| Dataset size (places in thousands) | 100, 200, 300, 400, 500, 1300 |
| Grid size | 2, 4, 8, 16, 32, 64 |
| Average friends | 200, 400, 600, 800 |
| k | 5, 10, 15, 20 |

All algorithms are implemented in C++ and the experiments are run on an Intel Core i3 2.4 GHz PC with 8 GB memory running 64-bit Ubuntu Linux.
We use the real Gowalla dataset [51] along with five synthetic datasets whose characteristics are shown in Table 3. Gowalla is a location-based social network which was later acquired by Facebook. It contains 196 591 users, 950 327 friendships, 6 442 890 check-ins and 1 280 956 checked-in places across the world. The page size is set to 4096 bytes for the Facility R-Tree index and to 1024 bytes for the Check-in R-Tree and FCR-Tree indexes. For each experiment, we randomly select 100 users and treat them as query points. The cost in the experiments corresponds to the average cost of 100 queries. The default value of range r is 100 km and the default value of k is 10 unless mentioned otherwise.

Table 3. Datasets characteristics.

| DataSet | Places | Users | Friendships | Check-Ins |
| --- | --- | --- | --- | --- |
| Gowalla | 1 280 956 | 196 591 | 950 327 | 6 442 890 |
| Synthetic 01 | 100 000 | 17 162 | 77 254 | 503 000 |
| Synthetic 02 | 200 000 | 39 223 | 152 957 | 1 100 698 |
| Synthetic 03 | 300 000 | 61 394 | 241 369 | 1 830 235 |
| Synthetic 04 | 400 000 | 86 687 | 301 887 | 2 536 897 |
| Synthetic 05 | 500 000 | 103 856 | 395 745 | 3 258 659 |

7.2. Performance evaluation

7.2.1. Top-k famous places queries (TkFP)

Effect of Range: We analyse the performance of our algorithms for various range values from 100 to 400 km. The size of the area formed by the given range determines the number of places it contains (ranging from 1500 to 94 000). In addition, we compare our techniques with SkSK [31] and find that our Hybrid algorithm is 8–10 times faster, as depicted in Fig. 12. The SkSK and Spatial-First algorithms are most affected at larger range values; their performance deteriorates due to the large number of places. Further, SkSK incurs considerably more I/O cost for higher range values, as shown in Fig. 12b, because the large number of places in the range results in a higher index access rate.

Figure 12. Effect of varying range (number of places). (a) CPU cost and (b) I/O cost.

Effect of Average number of Friends: In Fig. 13, we study the effect of the average number of friends of each query. Note that the size of the FCR-Tree depends on the size of the friends' set of each user in the dataset, which affects the Hybrid algorithm.
Similarly, the data structure proposed in SkSK [31] stores users' information with each node at each level, which is the union of the users' sets of all its child nodes; this considerably degrades the performance of the algorithm for a large number of users. Figure 13a illustrates the processing time of each algorithm, where Hybrid is much faster (8–10 times) than SkSK, typically for a higher number of friends. Similarly, Fig. 13b shows the I/O cost of each method for a varying average number of friends; SkSK incurs a much higher cost due to the very large size of the data structure used. The average number of places in the given range r is 38 319. Note that when the average number of friends increases, the CPU and I/O costs of all four algorithms increase, since each friend's check-in information must be accessed to get the candidate places.

Figure 13. Performance comparison on different numbers of friends. (a) CPU cost and (b) I/O cost.

Effect of concurrent number of Queries: Geo-Social services seek to answer a large number of incoming queries simultaneously due to the enormous number of registered users. Therefore, we analyse all four algorithms for a number of concurrent queries ranging from 50 to 200. In addition, each experiment involves an average number of friends ranging from 200 to 800 and approximately 10 000 places on average in the given range r. The three proposed algorithms need to traverse the Facility R-Tree every time a TkFP query is issued to retrieve candidate places. The Social-First algorithm also needs to traverse the Check-in R-Tree of each friend. On the other hand, the Hybrid algorithm leverages the FCR-Tree, and both Spatial-First and Hybrid rely heavily on the visitors sets of the places.
In addition, we compare the performance of our algorithms with SkSK [31] and observe that our Hybrid algorithm is at least 10 times faster than SkSK, as shown in Fig. 14. We report the CPU and I/O cost of each algorithm on the Gowalla dataset for different numbers of queries. As expected, the I/O cost of the Social-First algorithm is lower than that of the other three due to its low dependency on indexes, and SkSK incurs a much higher I/O cost than any of the proposed techniques.

Figure 14. Effect of number of queries. (a) CPU cost and (b) I/O cost.

Effect of Grid Size: In Fig. 15, we study the effect of the grid partitioning size, ranging from 2 to 64, on the Hybrid algorithm. The size of the grid affects the CPU cost since the size of a cell defines how many places need to be processed or pruned at once. Similarly, it also affects the termination condition of the algorithm. Note that the best CPU performance is achieved by dividing the region into a grid of size 4×4.

Figure 15. Effect of grid size.

Effect of k: In the previous experiments, the value of k is set to 10. Next, we analyse the performance of the four algorithms for various values of k. Note that in Fig. 16a, Hybrid is 8–10 times faster than SkSK and all four algorithms are nearly independent of k. The reason is that we have to update the result set every time we update the score of a place. Therefore, the size of the result set does not impose a great computational load. In terms of I/O cost, Fig. 16b shows that none of the four algorithms is affected by the value of k, since a higher value of k does not incur more disk accesses. In addition, SkSK's disk access is up to 25 times higher than that of the proposed algorithms due to its very large index size.

Figure 16. Effect of varying number of requested places (k). (a) CPU cost and (b) I/O cost.

Effect of Dataset Size: In Fig. 17a and b, we study the effect of dataset size on the performance of the four algorithms. Specifically, we conduct experiments on synthetic datasets of different sizes containing between 100k and 500k places. In Fig. 17a, note that the SkSK algorithm is most affected by the number of places because the more places there are, the more visitors are associated with the nodes of the index structure at each level. As a result, SkSK suffers in both processing time and I/O. Similarly, in Fig. 17b, SkSK has a higher I/O cost due to the intersection performed on the visitors sets of nodes/places and the friends set of query q.

Figure 17. Effect of varying dataset sizes (number of places). (a) CPU cost and (b) I/O cost.

7.2.2. Socio-spatial skyline queries (SSSQ)

Effect of Range: We analyse the performance of our algorithms for various range values from 100 to 400 km. The size of the area formed by the given range determines the number of places it contains (ranging from 1500 to 94 000). Figure 18 shows that the Spatial-First algorithm is most affected at larger range values because more places have to be processed; hence, its performance deteriorates. In contrast, the Social-First approach is not affected much by the range because it only takes into account the places visited by query q's friends. Note that the Hybrid algorithm performs better for larger ranges since it is more likely to find skyline places by processing fewer blocks and by pruning more cells together with the places that lie in them.
Figure 18b shows that the I/O cost increases for larger range values due to the large number of places in the given range and the large number of visitors (in the Spatial-First approach), which results in a higher index access rate.

Figure 18. Effect of varying range (number of places). (a) CPU cost and (b) I/O cost.

Effect of Average number of Friends: In Fig. 19, we study the effect of the average number of friends of each query. Note that the size of the FCR-Tree depends on the number of friends of each user in the dataset, which affects the Hybrid algorithm to some extent. Further, the Spatial-First algorithm is greatly affected because the intersection of two large sets, i.e. the visitors set of each place in the range and the friends set of query q, becomes more expensive. Similarly, the Social-First algorithm is greatly affected by the number of friends since it has to process more Check-in R-trees. Specifically, Fig. 19a shows the CPU cost and Fig. 19b shows the I/O cost of each method for a varying average number of friends. The average number of places in the given range r is 38 319. Note that when the average number of friends increases, the CPU and I/O costs of all three algorithms increase since each friend's check-in information is required to verify the candidate places.

Figure 19. Performance comparison on different number of friends. (a) CPU cost and (b) I/O cost.

Effect of concurrent number of Queries: We analyse all three algorithms for a number of concurrent queries ranging from 50 to 200. In addition, each experiment involves an average number of friends ranging from 200 to 800 and approximately 10 000 places on average in the given range r.
All three algorithms need to traverse the Facility R-Tree every time an SSSQ is issued to retrieve candidate places. In addition, the Social-First algorithm traverses the Check-in R-Tree that belongs to each friend, and as we increase the number of queries, the number of friends to be processed also increases. Therefore, the Social-First algorithm exhibits a higher CPU cost for a large number of queries. On the other hand, the Hybrid algorithm leverages the FCR-Tree, and both Spatial-First and Hybrid rely heavily on the size of the visitors sets of the places. In Fig. 20, we report the CPU and I/O cost of each algorithm on the Gowalla dataset for different numbers of queries. As expected, the I/O cost of the Social-First algorithm is lower than that of the other two due to its low dependency on indexes. Note that Hybrid is up to eight times better than the Social-First and Spatial-First algorithms.

Figure 20. Effect of number of queries. (a) CPU cost and (b) I/O cost.

Effect of Grid Size: In Fig. 21, we study the effect of the grid partitioning size, ranging from 2 to 16, on the Hybrid algorithm. For the range grid, the size of the grid affects the CPU cost since the size of a cell defines how many places will be processed or pruned simultaneously. Similarly, it also affects the termination condition of the algorithm. Note that the best CPU performance is achieved by dividing the area into a grid of size 4×4. In addition, the skyline workspace is also partitioned into a 4×4 grid because the algorithm achieves its best performance at this granularity.

Figure 21. Effect of grid size.

Effect of Dataset Size: In Fig. 22a and b, we study the effect of dataset size on the performance of the three algorithms. Specifically, we conduct experiments on synthetic datasets of different sizes containing between 100k and 500k places.
In Fig. 22a, note that the Spatial-First algorithm is most affected by the number of places. Similarly, in Fig. 22b, Hybrid and Spatial-First have a higher I/O cost due to the intersection performed on the visitors set of each place and the friends set of query q.

Figure 22. Effect of varying dataset sizes. (a) CPU cost and (b) I/O cost.

7.3. Analysis of results quality

Top-k queries and skyline queries have both been extensively studied in the past. The advantage of a top-k query is that the number of objects to be returned is controlled by the user (by giving a value of k). However, the top-k query assumes that the user is able to define a suitable scoring function (e.g. a suitable value of α in this paper). This may be challenging because the user may not be able to choose a suitable scoring function, mainly because of the incompatibility of the attributes involved in top-k queries and their distributions [13]. A skyline query addresses this problem and does not require a scoring function to be defined. However, the user cannot control the number of objects returned by the query and, in the worst case, the number of skyline objects may be equal to the total number of objects in the dataset. Therefore, top-k queries and skyline queries complement each other. In this section, we analyse the result size of socio-spatial skyline queries and compare the results returned by top-k queries and skyline queries.

Size of Skyline: In Fig. 23, we run 100 skyline queries for each setting and report the average size of the skyline. Figure 23a shows that the average size of the skyline is 2–5 as we vary the average number of friends of the query user.
Note that, on average, the total number of places in the query range is more than 7000 and the skyline shortlists up to five places, on average, that dominate all other places in terms of both spatial score and social score. One reason for such a small skyline size is that the data is sparse: there may not be many check-ins in the given range by the query user's friends and, as a result, the social score of most of the places may be zero.

Figure 23. Effect of number of friends. (a) Small number of average friends and (b) large number of average friends.

In Fig. 23b, we evaluate the size of the skyline for a more challenging case where the average number of friends of the query user is varied from 25 000 to 100 000. We remark that this is a realistic setting and many users may have such a large number of friends, e.g. when the query user is a page 'Germany' and its friends represent the people who were born in Germany. Figure 23b shows that the size of the skyline increases with the average number of friends but is still much smaller than the total number of places in the range. This shows that the skyline query studied in this paper is useful and returns only a small number of objects to the user. In the rest of the experiments, we choose 50 000 as the default for the average number of friends of the query user.

Results Returned by Top-k vs. Skyline: In this section, we compare and analyse the results returned by skyline queries and top-k queries. In Fig. 24, we run 100 queries for each setting and report the average number of result objects returned by top-k queries and by skyline queries, and the average number of objects that are returned by both queries (shown as '# Common Places'). Specifically, Fig. 24a studies the effect of k and Fig. 24b compares skyline and top-10 queries for varying α. Figure 24 demonstrates that the results returned by top-k and skyline queries share many objects but, at the same time, each query reports several places that the other query fails to return. This shows that the two queries complement each other.

Figure 24. # common places returned by both queries. (a) Effect of k and (b) effect of α.

In Fig. 25, we further analyse the results returned by the two types of queries. Specifically, the result places are mapped to a 2D space where the x-axis corresponds to their social scores and the y-axis corresponds to their spatial scores. In Fig. 25a, the skyline query returns 15 places. The top-5 query with α = 0.1 (high preference for social score) returns the places shown with small red circles. Three of these top-5 places are skyline points and the other two are not, because they are dominated by other places. For the top-5 queries with α = 0.5 (equal preference for both social and spatial scores) and α = 0.9 (high preference for spatial score), the top-5 places are the places in the top-left of the figure (having high spatial scores but low social scores). Figure 25b shows similar results, except that some of the top-5 places for α = 0.5 (equal preference) are in the bottom-right of the figure and some are in the top-left of the figure.

Figure 25. Analysis of results. (a) Skyline vs. top-k for User 1 and (b) skyline vs. top-k for User 2.
Figure 25 shows that the top- k queries may sometimes fail to capture the users’ requirements, e.g., for example, by choosing α=0.5 ⁠, a user may have wanted to obtain the places that have reasonably high values on both social and spatial scores but the results may contain places with either high social scores but very low spatial scores or high spatial scores but very low social scores (as in Fig. 25). The skyline query addresses this problem to some extent and gives a better coverage of the results. However, it fails to capture the requirements of users who have chosen α to be too high or too low. For example, in Fig. 25b, the skyline contains only one object that has a high social score, therefore, it would fail to capture the requirements of a user who prefers social score much more than the spatial score (e.g. α=0.1 ⁠) and wants to obtain several places with high social scores. In contrast, the top-5 query with α=0.1 returns five objects each having a high social score. Also, as pointed out earlier, the number of skyline objects may be arbitrarily large and the user may not be able to control the number of objects returned. 8. CONCLUSIONS We are the first to formalize a problem namely, Top-k famous places TkFP query and propose efficient query processing techniques. In addition, we extend our work to propose another query that is, Socio-Spacial Skyline Query SSSQ. We present three approaches to process the queries called, (1) Social-First, (2) Spatial-First and (3) Hybrid. The first two approaches separately process the social and spatial components of the query and do not require a specialized index. The third approach (Hybrid) is capable of processing social and spatial components simultaneously by utilizing a hybrid index specifically designed to handle TkFP queries. We analyse the performance of our techniques and found them better than previous techniques. REFERENCES 1 Curtiss , M. et al. ( 2013 ) Unicorn: a system for searching the social graph . 
PVLDB, 6, 1150–1161.
2 Ahuja, R., Armenatzoglou, N., Papadias, D. and Fakas, G.J. (2015) Geo-Social Keyword Search. Advances in Spatial and Temporal Databases - 14th International Symposium, SSTD 2015, Hong Kong, China, August 26–28, 2015. Proceedings, pp. 431–450. Springer, Berlin Heidelberg.
3 Armenatzoglou, N., Ahuja, R. and Papadias, D. (2015) Geo-social ranking: functions and query processing. VLDB J., 24, 783–799.
4 Emrich, T., Franzke, M., Mamoulis, N., Renz, M. and Züfle, A. (2014) Geo-Social Skyline Queries. Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21–24, 2014. Proceedings, Part II, pp. 77–91. Springer, Berlin Heidelberg.
5 Doytsher, Y., Galon, B. and Kanza, Y. (2012) Managing Socio-spatial Data as Large Graphs. 21st Int. World Wide Web Conf.
6 Liu, W., Sun, W., Chen, C., Huang, Y., Jing, Y. and Chen, K. (2012) Circle of Friend Query in Geo-social Networks. Database Systems for Advanced Applications - 17th International Conference, DASFAA 2012, Busan, South Korea, April 15–19, 2012, Proceedings, Part II, pp. 126–137. Springer, Berlin Heidelberg.
7 Yang, D., Shen, C., Lee, W. and Chen, M. (2012) On Socio-spatial Group Query for Location-Based Social Networks. The 18th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12–16, 2012, pp. 949–957. ACM, New York, NY, USA.
8 Fond, T.L. and Neville, J. (2010) Randomization Tests for Distinguishing Social Influence and Homophily Effects. Proc. 19th Int. Conf. World Wide Web, WWW 2010, Raleigh, NC, USA, April 26–30, 2010, pp. 601–610. ACM, New York, NY, USA.
9 Singla, P. and Richardson, M. (2008) Yes, There is a Correlation: From Social Networks to Personal Behavior on the Web. Proc. 17th Int. Conf. World Wide Web, WWW 2008, Beijing, China, April 21–25, 2008, pp. 655–664. ACM, New York, NY, USA.
10 Chua, F.C.T., Lauw, H.W. and Lim, E. (2011) Predicting Item Adoption using Social Correlation. Proc. Eleventh SIAM Int. Conf. on Data Mining, SDM 2011, April 28–30, 2011, Mesa, AZ, USA, pp. 367–378.
11 Ma, H., King, I. and Lyu, M.R. (2009) Learning to Recommend with Social Trust Ensemble. Proc. 32nd Annu. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19–23, 2009, pp. 203–210.
12 Ye, M., Liu, X. and Lee, W. (2012) Exploring Social Influence for Recommendation: A Generative Model Approach. The 35th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12–16, 2012, pp. 671–680. ACM, New York, NY, USA.
13 Fagin, R., Kumar, R. and Sivakumar, D. (2003) Efficient Similarity Search and Classification via Rank Aggregation. Proc. 2003 ACM SIGMOD Int. Conf. Management of Data, San Diego, CA, USA, June 9–12, 2003, pp. 301–312. ACM, New York, NY, USA.
14 Börzsönyi, S., Kossmann, D. and Stocker, K. (2001) The Skyline Operator. Proc. 17th Int. Conf. Data Engineering, April 2–6, 2001, Heidelberg, Germany, pp. 421–430. Springer, Berlin Heidelberg.
15 Kossmann, D., Ramsak, F. and Rost, S. (2002) Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. VLDB 2002, Proc. 28th Int. Conf. Very Large Data Bases, August 20–23, 2002, Hong Kong, China, pp. 275–286. ACM, New York, NY, USA.
16 Papadias, D., Tao, Y., Fu, G. and Seeger, B. (2003) An Optimal and Progressive Algorithm for Skyline Queries. Proc. 2003 ACM SIGMOD Int. Conf. Management of Data, San Diego, CA, USA, June 9–12, 2003, pp. 467–478. ACM, New York, NY, USA.
17 Deng, K., Zhou, X. and Shen, H.T. (2007) Multi-source Skyline Query Processing in Road Networks. Proc. 23rd Int. Conf. Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15–20, 2007, pp. 796–805. IEEE, New York, NY, USA.
18 Sacharidis, D., Bouros, P.
and Sellis, T.K. (2008) Caching Dynamic Skyline Queries. Scientific and Statistical Database Management, 20th Int. Conf., SSDBM 2008, Hong Kong, China, July 9–11, 2008, Proceedings, pp. 455–472. IEEE, New York, NY, USA.
19 Sharifzadeh, M. and Shahabi, C. (2006) The Spatial Skyline Queries. Proc. 32nd Int. Conf. Very Large Data Bases, Seoul, Korea, September 12–15, 2006, pp. 751–762. ACM, New York, NY, USA.
20 Tan, K., Eng, P. and Ooi, B.C. (2001) Efficient Progressive Skyline Computation. VLDB 2001, Proc. 27th Int. Conf. Very Large Data Bases, September 11–14, 2001, Roma, Italy, pp. 301–310. ACM, New York, NY, USA.
21 Armenatzoglou, N., Papadopoulos, S. and Papadias, D. (2013) A general framework for geo-social query processing. PVLDB, 6, 913–924.
22 Ference, G., Ye, M. and Lee, W. (2013) Location Recommendation for Out-of-Town Users in Location-Based Social Networks. 22nd ACM Int. Conf. Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp. 721–726. ACM, New York, NY, USA.
23 Mouratidis, K., Li, J., Tang, Y. and Mamoulis, N. (2015) Joint search by social and spatial proximity. IEEE Trans. Knowl. Data Eng., 27, 781–793.
24 Doytsher, Y., Galon, B. and Kanza, Y. (2010) Querying Geo-social Data by Bridging Spatial Networks and Social Networks. Proc. 2010 Int. Workshop on Location Based Social Networks, LBSN 2010, November 2, 2010, San Jose, CA, USA, Proceedings, pp. 39–46. ACM, New York, NY, USA.
25 Huang, Q. and Liu, Y. (2009) On Geo-social Network Services. Geoinformatics, 2009 17th Int. Conf., pp. 1–6. IEEE, New York, NY, USA.
26 Ye, M., Yin, P. and Lee, W.-C. (2010) Location Recommendation for Location-Based Social Networks. Proc. 18th SIGSPATIAL Int. Conf. Advances in Geographic Information Systems, pp. 458–461. ACM, New York, NY, USA.
27 Sarwat, M., Levandoski, J.J., Eldawy, A. and Mokbel, M.F.
(2014) Lars*: an efficient and scalable location-aware recommender system. IEEE Trans. Knowl. Data Eng., 26, 1384–1399.
28 Gao, H. and Liu, H. (2014) Data Analysis on Location-Based Social Networks. Mobile Social Networking, pp. 165–194. Springer, Berlin, Heidelberg.
29 Li, J. and Cardie, C. (2014) Timeline generation: tracking individuals on twitter. 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7–11, 2014, pp. 643–652. ACM, New York, NY, USA.
30 Li, G., Chen, S., Feng, J., Tan, K. and Li, W. (2014) Efficient Location-Aware Influence Maximization. Int. Conf. Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22–27, 2014, pp. 87–98. ACM, New York, NY, USA.
31 Wu, D., Li, Y., Choi, B. and Xu, J. (2014) Social-Aware Top-k Spatial Keyword Search. IEEE 15th Int. Conf. Mobile Data Management, MDM 2014, Brisbane, Australia, July 14–18, 2014, Volume 1, pp. 235–244. IEEE, New York, NY, USA.
32 Shen, Z., Cheema, M.A., Lin, X., Zhang, W. and Wang, H. (2012) A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng., 26, 1349–1366.
33 Sohail, A., Murtaza, G. and Taniar, D. (2016) Retrieving Top-k Famous Places in Location-Based Social Networks. Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Sydney, NSW, September 28–29, 2016, Proceedings, pp. 17–30. Springer, Berlin, Heidelberg.
34 Ilyas, I.F., Beskales, G. and Soliman, M.A. (2008) A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40, 11:1–11:58.
35 Cheema, M.A., Shen, Z., Lin, X. and Zhang, W. (2014) A Unified Framework for Efficiently Processing Ranking Related Queries. Proc. 17th Int. Conf.
Extending Database Technology, EDBT 2014, Athens, Greece, March 24–28, 2014, pp. 427–438. OpenProceedings, Konstanz, Germany.
36 Fagin, R., Lotem, A. and Naor, M. (2003) Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66, 614–656.
37 Nepal, S. and Ramakrishna, M.V. (1999) Query Processing Issues in Image (Multimedia) Databases. Proc. 15th Int. Conf. Data Engineering, Sydney, Australia, March 23–26, 1999, pp. 22–29. IEEE, New York, NY, USA.
38 Güntzer, U., Balke, W. and Kießling, W. (2000) Optimizing Multi-feature Queries for Image Databases. VLDB 2000, Proc. 26th Int. Conf. Very Large Data Bases, September 10–14, 2000, Cairo, Egypt, pp. 419–428. ACM, New York, NY, USA.
39 Jiang, J., Lu, H., Yang, B. and Cui, B. (2015) Finding Top-k Local Users in Geo-tagged Social Media Data. 31st IEEE Int. Conf. Data Engineering, ICDE 2015, Seoul, South Korea, April 13–17, 2015, pp. 267–278. IEEE, New York, NY, USA.
40 Zhang, W., Lin, X., Zhang, Y., Cheema, M.A. and Zhang, Q. (2012) Stochastic skylines. ACM Trans. Database Syst., 37, 14:1–14:34.
41 Cheema, M.A., Lin, X., Zhang, W. and Zhang, Y. (2013) A Safe Zone Based Approach for Monitoring Moving Skyline Queries. Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18–22, 2013, pp. 275–286. OpenProceedings, Konstanz, Germany.
42 Morse, M.D., Patel, J.M. and Jagadish, H.V. (2007) Efficient Skyline Computation over Low-Cardinality Domains. Proc. 33rd Int. Conf. Very Large Data Bases, University of Vienna, Austria, September 23–27, 2007, pp. 267–278. ACM, New York, NY, USA.
43 Godfrey, P., Shipley, R. and Gryz, J. (2005) Maximal Vector Computation in Large Data Sets. Proc. 31st Int. Conf. Very Large Data Bases, Trondheim, Norway, August 30–September 2, 2005, pp. 229–240. ACM, New York, NY, USA.
44 Balke, W., Güntzer, U. and Zheng, J.X.
(2004) Efficient Distributed Skylining for Web Information Systems. Advances in Database Technology - EDBT 2004, 9th Int. Conf. Extending Database Technology, Heraklion, Crete, Greece, March 14–18, 2004, Proceedings, pp. 256–273. Springer, Berlin, Germany.
45 Chan, C.Y., Eng, P. and Tan, K. (2005) Stratified Computation of Skylines with Partially-Ordered Domains. Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, Maryland, USA, June 14–16, 2005, pp. 203–214. ACM, New York, NY, USA.
46 Tao, Y. and Papadias, D. (2006) Maintaining sliding window skylines on data streams. IEEE Trans. Knowl. Data Eng., 18, 377–391.
47 Guttman, A. (1984) R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD’84, Proc. Annual Meeting, Boston, MA, June 18–21, 1984, pp. 47–57. IEEE, New York, NY, USA.
48 Memcached. http://memcached.org/.
49 Twitter: Real-time Geo. http://slideshare.net/raffikrikorian/rtgeo-where-20-2011.
50 GeoSpatial indexes in MongoDB. http://docs.mongodb.org/manual/core/geospatial-indexes/.
51 Cho, E., Myers, S.A. and Leskovec, J. (2011) Friendship and Mobility: User Movement in Location-Based Social Networks. Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24, 2011, pp. 1082–1090. ACM, New York, NY, USA.

© The British Computer Society 2018. All rights reserved. For permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).
The Computer Journal, Oxford University Press


Publisher
Oxford University Press
ISSN
0010-4620
eISSN
1460-2067
D.O.I.
10.1093/comjnl/bxy019

Abstract The widespread proliferation of location-acquisition techniques and GPS-embedded mobile devices has resulted in the generation of geo-tagged data at an unprecedented scale and has essentially enhanced the user experience in location-based services associated with social networks. Such location-based social networks allow people to record and share their locations and are a rich source of information that can be exploited to study people's attributes and characteristics and to provide various Geo-Social (GS) services. In this paper, we propose two new types of queries, the Top-k famous places (TkFP) query and the Socio-Spatial Skyline Query (SSSQ), which enrich the semantics of conventional spatial queries by introducing a social relevance component. In addition, three approaches, namely (1) Social-First, (2) Spatial-First and (3) Hybrid, are proposed to efficiently process TkFP and SSSQ queries. Finally, we conduct an extensive evaluation of the proposed schemes using real and synthetic datasets and demonstrate the effectiveness of the proposed approaches.

1. INTRODUCTION

The fusion of social and geographical information has given rise to online social media known as location-based social networks (LBSNs), such as Facebook and Foursquare. An LBSN is usually represented as a complex graph where the nodes represent various entities in the social network (such as users, places or pages) and the edges represent the relationships between different nodes. These relationships are not limited to friendship relations but also include other types of relationships such as works-at, born-in and studies-at. In addition, the nodes may also contain spatial information such as a user's check-ins at different locations. Consider the example of a Facebook user Alice who was born in Germany, works at Monash University and checks in at a particular restaurant.
Facebook records this information by linking the Facebook pages for Monash University and Germany with Alice [1], e.g. Alice and Monash University are connected by an edge labeled works-at and Alice and Germany are connected by an edge labeled born-in. The check-in information records the places the user has visited. Spatial data and social relationships in LBSNs provide a rich source of information which can be exploited to offer many interesting services. Consider the example of a German tourist visiting Melbourne. She may want to find a nearby pub which is popular (e.g. frequently visited) among people from Germany. This involves utilizing spatial information (i.e. nearby pub, check-ins) as well as social information (i.e. people who were born-in Germany). Similarly, a user may want to find nearby places that are most popular among her friends, e.g. the places most frequently visited by her friends. In disease monitoring, we may be interested in finding the spots frequently visited by people having a certain disease, e.g. Ebola virus disease. These people are connected to each other in a social network through the disease and, by analysing their visits, the frequently visited regions from which they could have carried the disease can be found. Further, in the field of public safety and crime prevention, suppose that some users have tweeted about drugs and have also joined pages containing drug-related information on social networks. Law enforcement agencies can exploit this information to find the places these users frequently visit, where they might be involved in drug-related activities, and raid them. Such queries can also be utilized by various businesses. Consider a chain of gaming stores that is interested in opening a new store in an area frequently visited, for shopping or leisure, by young adults who have a keen interest in gaming.
These young adults are connected in social networks through various gaming pages and/or tweets containing information about gaming. By analysing their visit information, the business can discover a suitable place or shopping mall for this purpose. Although various types of queries have been studied on LBSNs [2–7], to the best of our knowledge, this problem has not been studied before: existing techniques are either not applicable or cannot be efficiently extended to answer queries like the above, which aim at finding nearby places that are popular among a particular group of users satisfying a social constraint. It is notable that recommendation techniques that exploit the similar interests of friends have gained significant attention, as several works on social network analysis have remarked that a user's behavior often correlates with the behavior of her friends [8–12]. In our daily life, besides our own preferences, we usually turn to our friends for opinions on songs, restaurants or movies. Therefore, as argued above, spatial keyword based queries cannot capture the social influence, which usually plays an important role in one's selection criteria. Motivated by this, in this paper, we formalize this problem as the Top-k famous places (TkFP) query and the Socio-Spatial Skyline Query (SSSQ) and propose efficient query processing techniques. The two proposed queries are briefly introduced below:

1. A Top-k famous places (TkFP) query retrieves the top-k places (points of interest) ranked according to their spatial and social relevance to the query user, where the spatial relevance is based on how close the place is to the query location and the social relevance is based on how frequently it is visited by the one-hop neighbors of the query user in the social graph. We use a scoring function to obtain a final score based on the social and spatial scores (relevance) of a place.
In TkFP queries, a user needs to define a scoring function that combines social and spatial scores to rank the objects, which may not be trivial (e.g. due to incompatible attributes, different distributions of attributes or the inability of users to choose a good scoring function) [13]. Motivated by this, to complement the TkFP query, we also study skyline queries, which do not need a scoring function to retrieve the desired objects. A formal definition of the TkFP query is provided in Section 3.1.1.

2. A Socio-Spatial Skyline Query (SSSQ) returns every place for which there does not exist any other place that has both a better social score and a better spatial score. An SSSQ query does not need a scoring function; skyline queries are therefore a natural and popular choice for applications involving multi-criteria decision making [14–20]. Let us take the example of a German tourist visiting Melbourne who is looking for restaurants that are close and are also popular among German people. An SSSQ query returns every restaurant p for which there does not exist any other restaurant p′ that is closer to her location and is more popular among German people. A formal definition of the SSSQ query is provided in Section 5.1.1.

We present three approaches to answer TkFP and SSSQ queries, namely (1) Social-First, (2) Spatial-First and (3) Hybrid. The first two approaches process the social and spatial components of the queries separately and do not require a specialized index. The third approach (Hybrid), however, is capable of processing the social and spatial components simultaneously by utilizing a hybrid index specifically designed to handle TkFP and SSSQ queries.

Contributions. We make the following contributions in this paper. To the best of our knowledge, we are the first to study TkFP queries, which retrieve nearby places popular among a particular group of users in the social network.
We extend our work to propose SSSQ queries, which return the places that are not dominated by any other place. To process both types of queries, we explore three different directions. We conduct an exhaustive evaluation of the proposed schemes using real and synthetic datasets and demonstrate the effectiveness of the proposed approaches.

2. RELATED WORK

2.1. Geo-social queries

Geo-social query processing is an emerging field that is attracting the attention of the research community [3, 21–23]. In [24], an algebraic model is proposed to process geo-social queries. This model consists of a set of operators to query geo-social data. The approach replicates the graphs that represent the spatial and social components, which can be very large for social networks, thus making query processing more cumbersome. Huang and Liu [25] studied a geo-social query that retrieves the set of nearby friends of a user who share common interests, without providing concrete query processing algorithms. Liu et al. [6] defined a new query, namely Geo-Social Circle of Friends, to retrieve a group of friends in geo-social settings whose members are close to each other both geographically and socially, e.g. for group sports, social gatherings and community services. Yang et al. [7] introduced another type of group query, namely the Social-Spatial Group Query (SSGQ), which is useful for impromptu activity planning, by extending the work presented in [6]. In addition, nearest-neighbor queries have been widely applied in location-based social networks recently [26–30].

2.2. Top-k queries

Top-k queries retrieve the top-k objects based on a user-defined scoring function. The problem has been extensively studied [31–35]. Fagin's algorithm (FA) [36], the threshold algorithm (TA) (independently proposed in [36–38]) and the no-random-access algorithm (NRA) [36] are top-k processing algorithms that combine multiple ranked lists and return the top-k objects. Ilyas et al.
[34] give a comprehensive survey of top-k query processing techniques. Wu et al. [31] proposed the social-aware top-k spatial keyword (SkSK) query, which retrieves a list of k objects ranked according to their spatial proximity, textual relevance (e.g. a restaurant's menu and facilities) and social relevance. The social relevance is defined as a function of the users who are 'fans' of an object, considering how close they are to the query user. This definition is fundamentally different from ours because we consider only the friends of the query user who have visited the object. Jiang et al. [39] proposed a method to find the top-k local users in geo-tagged social media data. First, they extract all tweets in the given range and rank each tweet based on the number of replies to and forwards of that tweet. For this, they build a thread tree for each tweet and sum up the replies and forwards at each level. This tweet ranking is considered the social score of the user who posted the tweet. Then, they compute the spatial score of each user who has posted tweets in the range. However, their social scoring criterion is not applicable to our problem definition. Therefore, to the best of our knowledge, none of the existing techniques can be applied or trivially extended to answer the TkFP query.

2.3. Skyline queries

There has been considerable work in the literature on skyline queries [40–43], including skyline computation over partially ordered domains, in distributed settings and for continuous queries [44–46], among others. The skyline operator was first introduced in [14], whose authors also proposed generic skyline computation algorithms such as Block-Nested Loop (BNL) and Divide and Conquer (DC). Since then, skyline processing has attracted the attention of many researchers in the database community.
Additionally, a Bitmap algorithm was proposed in [20] to improve on the original algorithms for low-cardinality domains, i.e. datasets with a small number of discrete attribute values. A few other solutions exist for such datasets, for instance the one proposed in [42]. Similarly, another approach introduced in [14], known as the Index algorithm, divides the dataset into d sorted lists for d optimized measures. Another approach, based on the R-Tree [47] and known as Nearest-Neighbor (NN), was proposed in [15]. This approach starts by finding the nearest neighbor of the query; the objects dominated by the nearest neighbor can then be pruned. Papadias et al. [16] proposed a Branch-and-Bound Search (BBS) method to overcome the overlapping problem that persists in the NN algorithm. This approach is guaranteed to visit each page of the R-Tree at most once. Most of the work on skyline computation does not consider the social aspect. To the best of our knowledge, there exists only one work [4] in the literature that considers both social and spatial aspects for skyline queries. It defines the skyline as the set of users who are not dominated by any other user, where a user u is said to be dominated by another user u′ if u′ is closer to the query location and u′ is socially closer to the query user. To compute the social distance, it uses the random walk with restart (RWR) method, which is very expensive; to optimize, it approximates the social similarity and therefore does not return exact results. The problem we study in this paper is fundamentally different as we focus on returning skyline places (instead of skyline users), where a place p is dominated by another place p′ if p′ is closer to the query user q and the number of q's friends who visited p′ is greater than the number of q's friends who visited p. Therefore, the problem setting of [4] is completely different and its techniques cannot be applied to answer SSSQ queries.

3. TOP-K FAMOUS PLACES QUERIES

3.1.
Preliminaries

3.1.1. Problem definition

Location-Based Social Network (LBSN): A location-based social network consists of a set of entities $U$ (e.g. users, pages, groups, etc.) and a set of places $P$. The relationship between two entities u and v is indicated by a labeled edge, where the label indicates the type of relationship (e.g. friend, lives-in). An LBSN also records check-ins, where a check-in of a user $u \in U$ at a particular place $p \in P$ indicates an instance at which u visited the place p.

Score of a place p: Given a query user q and a range r, the score of a place $p \in P$ is 0 if $\|q,p\| > r$, where $\|q,p\|$ is the Euclidean distance between the query location and p. If $\|q,p\| \le r$, the score of p is a weighted sum of its spatial score (denoted as $p_{spatial}$) and its social score (denoted as $p_{social}$):

$$Score(p) = \alpha \times p_{spatial} + (1 - \alpha) \times p_{social} \qquad (1)$$

where α is a parameter used to control the relative importance of the spatial and social scores. The social score $p_{social}$ is computed as follows. Let $F_q$ denote the one-hop neighbors of the query user q considering a particular relationship type, e.g. if the relationship is born-in and the query entity is the Facebook Page named Germany, then $F_q$ is the set of users born in Germany. Although our techniques can be used on any type of relationship, for ease of presentation, in the rest of the paper we only consider friendship relationships. In this context, $F_q$ contains the friends of the query user q. Let $V_p$ denote the set of all users who visited (i.e. checked in at) the place p. The social score $p_{social}$ is computed as follows:

$$p_{social} = \frac{|F_q \cap V_p|}{|F_q|} \qquad (2)$$

where $|X|$ denotes the cardinality of a set $X$; if $|F_q|$ is zero, then we assume that $p_{social}$ is also zero. Intuitively, $p_{social}$ is the proportion of the friends of q who have visited the place p. The spatial score $p_{spatial}$ is based on how close the place is to the query location. Formally, given a range r, $p_{spatial} = 0$ if the place does not lie in the range.
Otherwise, $p_{spatial} = r - \|q,p\|$, where $\|q,p\|$ indicates the Euclidean distance between the query location and p. Note that $p_{social}$ is always between 0 and 1 and we normalize $p_{spatial}$ such that it is also within the range 0 to 1, e.g. the data space is normalized such that $\|q,p\| \le 1$ and $r \le 1$.

Top-k Famous Places (TkFP) Query: Given an LBSN, a TkFP query q returns the k places with the highest scores, where the score $Score(p)$ of each place p is computed as described above.

Example 2.1 Figure 1a illustrates the locations of a set of places $P = \{p_1, p_2, p_3, p_4\}$. The query q shown in Fig. 1a, with k=2 and range r=0.15, has a set of friends $F_q = \{u_1, u_2, u_3, u_4, u_5, u_6, u_7, u_8, u_9, u_{10}\}$. The number in brackets next to each place is the number of check-ins made by q's friends. Figure 1b shows the Euclidean distances and the visitors of each place among q's friends. Assuming α=0.5, the score of $p_1$ w.r.t. q is computed as 0.025+0=0.025. Similarly, we have $Score(p_2) = 0.185$, $Score(p_3) = 0.205$ and $Score(p_4) = 0.115$. The result of the query q is $(p_2, p_3)$ according to the scoring function in Equation (1). The notations used in the paper are summarized in Table 1.

Figure 1. Top-k Query Example.

Table 1. Notations.
Notation    Definition
q           Query user, i.e. q ∈ U
u           User, i.e. u ∈ U
p           A place, i.e. p ∈ P
Pv          Set of places visited by friends of u
Fq          Set of friends of query user q
Vp          Set of visitors of a place p
r           Given query range
k           Number of requested places
p′ ≺ p      Place p′ dominates place p
cij         Range Grid's cell
Pc          Set of places that lie inside a cell
Vcell       Set of visitors of a cell
bij         Skyline workspace block

3.2. Framework overview

The proposed framework consists of three approaches to answer a TkFP query: (I) Social-First, (II) Spatial-First and (III) Hybrid. The Social-First approach first processes the social component (e.g. friendship relations and their check-ins) and then processes the spatial component (e.g. places in the given range), whereas Spatial-First initially processes the spatial component and then processes the social component. In contrast, the Hybrid approach is capable of processing both the social and spatial components simultaneously to answer such queries. More specifically, it leverages two types of pre-processed information associated with each user $u \in U$: her own check-in information (check-ins) and a summary of her friends' check-in information. To the best of our knowledge, there is no unanimously accepted social or spatial storage implementation.
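To make the scoring model of Section 3.1.1 concrete, the following minimal Python sketch evaluates Equations (1) and (2) for a single place. The function names, the coordinate tuples and the user identifiers are assumptions introduced for illustration. The first call in the usage example assumes a place at distance 0.10 with no visiting friend, which is consistent with the value 0.025+0 reported for p1 in Example 2.1 (α=0.5, r=0.15); the remaining inputs are hypothetical.

```python
import math
from typing import Set, Tuple

Point = Tuple[float, float]


def social_score(friends: Set[str], visitors: Set[str]) -> float:
    """Equation (2): fraction of q's friends who visited the place (0 if no friends)."""
    if not friends:
        return 0.0
    return len(friends & visitors) / len(friends)


def spatial_score(q: Point, p: Point, r: float) -> float:
    """r - ||q,p|| for places inside the range, 0 otherwise."""
    d = math.dist(q, p)
    return r - d if d <= r else 0.0


def score(q: Point, p: Point, r: float, alpha: float,
          friends: Set[str], visitors: Set[str]) -> float:
    """Equation (1); places outside the range r score 0."""
    if math.dist(q, p) > r:
        return 0.0
    return alpha * spatial_score(q, p, r) + (1 - alpha) * social_score(friends, visitors)


if __name__ == "__main__":
    friends = {f"u{i}" for i in range(1, 11)}  # ten friends, as in Example 2.1
    q = (0.0, 0.0)
    # Place at distance 0.10 that no friend has visited.
    print(score(q, (0.10, 0.0), 0.15, 0.5, friends, set()))
    # Hypothetical place at distance 0.05 visited by three friends and one stranger.
    print(score(q, (0.05, 0.0), 0.15, 0.5, friends, {"u1", "u2", "u3", "u99"}))
```

Note that only visitors who are also friends of q contribute to the social score, so the stranger "u99" in the second call has no effect.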
Specifically, Facebook uses adjacency lists stored in Memcached [48], which is a distributed memory caching system, whereas Twitter leverages the R*-Tree [49] spatial index. Further, Foursquare uses MongoDB [50], a document-oriented database. Similarly, academic research has adopted various approaches: [5] uses adjacency lists stored in Neo4j, which is a graph database, whereas [24] utilizes relational tables for storing the friendship relations. Similar to the existing work on KNN queries, we tailored the storage implementation for our technique in a way that suits our requirements. More specifically, we index places and users' check-in information by adopting the R-Tree [47] spatial index structure. Before presenting our techniques, we define the Facility R-Tree, the Check-in R-Tree and the Friendship Index.

Definition 3.1 (Facility R-Tree). Stores all places ($p \in P$) in a given dataset. A node of the Facility R-Tree is represented by a minimum bounding rectangle (MBR) that encloses all places in its sub-tree.

Definition 3.2 (Check-in R-Tree). For the sake of efficiency, we store the check-in information of each user by indexing all places visited by her in a separate R-Tree based index. If a place p is visited by a user multiple times, it is indexed as many times as it was visited; hence, the Check-in R-Tree contains duplicate entries for the place p, since many applications (e.g. ranking and recommendation of places) require the complete check-in information of users.

Definition 3.3 (Friendship Index). To index each user $u \in U$ and their social relationships, we build an index structure based on the B+-Tree, following Unicorn [1], a system built to search the Facebook social graph.

4. PROPOSED TECHNIQUES

4.1. Social-first-based approach

This approach first looks at the check-ins of each friend $u \in F_q$ to compute the social score of each visited place $p \in P$ in the given range r.
Next, using the given range r, it processes the spatial component of q, computes the score of the remaining places p∈P that have no check-ins from q’s friends, and returns the top-k places based on their scores and q’s preference parameter α. We next describe the technique in detail with pseudocode given in Algorithm 1. In the first loop of the algorithm (at Line 1), it uses the Check-in R-Tree of each friend u∈Fq of query q to get the places in range r and then computes the social score and the score of each candidate place p in r (at Line 4). In addition, we maintain the score of the current kth place based on social and spatial scores (at Line 5). Further, in the third loop (at Line 7), the Facility R-Tree is used to compute the score of those places p∈P in range r that are not visited by q’s friends (at Line 10); their score consists of the spatial score only, since psocial=0. Finally, the top-k result set is retrieved using Scorek (at Line 12). Let Scorek denote the score of the current kth place. Lemma 4.1 below, our first pruning rule, shows that such an unvisited place p can be pruned if $\lVert q,p\rVert > r - Score_k/\alpha$.

Lemma 4.1. Every place p that is not visited by any friend of q and whose distance ∣∣q,p∣∣ from query q is greater than $r - Score_k/\alpha$ cannot be in the top-k places.

Algorithm 1 Social-First.

Proof. Given a query user q, a range r and a preference factor α, a place p that has no check-in from any user u∈Fq has social component psocial=0. Using Equations (1) and (2), we get:

$$Score(p) = \alpha\,(r - \lVert q,p\rVert) + 0 \tag{3}$$

To be a candidate for the top-k places, the score of p must be at least the current Scorek, hence

$$Score_k \le Score(p) \tag{4}$$

Substituting the value of Score(p) from Equation (3):

$$Score_k \le \alpha\,(r - \lVert q,p\rVert) \;\Longrightarrow\; \lVert q,p\rVert \le r - \frac{Score_k}{\alpha} \tag{5}$$

□

4.2. Spatial-first-based approach

Initially, this approach retrieves all places in the given range r and computes the spatial score of each place p, regardless of whether any friend u∈Fq has checked in there. It then computes the social score of each place p by counting the friends who checked in at it, using the visitor information Vp of the place. Finally, it computes the score of each place while updating the current kth place’s score and yielding the result set. We next elaborate the technique with pseudocode given in Algorithm 2. The algorithm starts by issuing a range query on the Facility R-Tree to retrieve all places in range r (at Line 1). Then, in the first loop (at Line 2), for each place p∈P in range r, we compute Score(p) in ascending order of the distance of p from q. To achieve this, a heap is initialized with the root entry of the Facility R-Tree, using ∣∣q,e∣∣ as the key, so that the spatial component is processed first. Further, the algorithm computes the social score of p by retrieving, from the visitors’ information of the place, the number of friends u∈Fq who visited it (at Line 6). Finally, the score of place p is computed using Equation (1) (at Line 6). Let Scorek denote the score of the current kth place. Lemma 4.2 below, our second pruning rule, shows that once $\lVert q,p\rVert > r - (Score_k - (1-\alpha))/\alpha$, the process can stop (at Line 5), since every subsequent entry in the heap is farther from q than the current place p.

Lemma 4.2. Every place p whose distance ∣∣q,p∣∣ from query q is greater than $r - (Score_k - (1-\alpha))/\alpha$ cannot be in the top-k places.

Algorithm 2 Spatial-First.
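The Spatial-First procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 2: the Facility R-Tree traversal is replaced by a sorted linear scan, and the names `spatial_first_topk`, `places`, `visitors` and `friends` are hypothetical. The scoring follows the form Score(p) = α(r − ∣∣q,p∣∣) + (1 − α)∣Fq∩Vp∣/∣Fq∣, and the stopping test uses the distance bound of Lemma 4.2; the sketch assumes 0 < α ≤ 1 and a non-empty friend set.

```python
import heapq
import math


def spatial_first_topk(q, places, visitors, friends, r, k, alpha):
    """Return (top-k list of (score, place_id), number of places scored).

    places:   dict place_id -> (x, y)
    visitors: dict place_id -> set of user ids (V_p)
    friends:  set of user ids (F_q), assumed non-empty
    """
    # Stand-in for the range query on the Facility R-Tree: a linear scan,
    # ordered by ascending distance from q.
    dists = ((math.dist(q, xy), pid) for pid, xy in places.items())
    in_range = sorted((d, pid) for d, pid in dists if d < r)

    top = []        # min-heap holding the best k (score, place_id) pairs
    examined = 0
    for d, pid in in_range:
        if len(top) == k:
            score_k = top[0][0]
            # Lemma 4.2 bound: even a social score of 1 cannot lift this
            # place (or any farther one) above the current kth score.
            if d > r - (score_k - (1 - alpha)) / alpha:
                break
        examined += 1
        social = len(friends & visitors.get(pid, set())) / len(friends)
        score = alpha * (r - d) + (1 - alpha) * social
        if len(top) < k:
            heapq.heappush(top, (score, pid))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, pid))
    return sorted(top, reverse=True), examined
```

Because places are visited in ascending distance, the first place that violates the bound ends the scan; with a real R-Tree the same test would be applied to heap entries instead of a pre-sorted list.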
Proof. Given a query user q, a range r and a preference factor α, to be a candidate for the top-k places, the score of a place p must be at least the current Scorek. Using Equations (1) and (2), we get:

$$Score_k \le Score(p) \tag{6}$$

Substituting the value of Score(p), we get:

$$Score_k \le \alpha\,(r - \lVert q,p\rVert) + (1-\alpha)\,\frac{\lvert F_q \cap V_p\rvert}{\lvert F_q\rvert} \tag{7}$$

Since the social score of a place is at most 1, we get:

$$Score_k \le \alpha\,(r - \lVert q,p\rVert) + (1-\alpha)\cdot 1 \;\Longrightarrow\; \lVert q,p\rVert \le r - \frac{Score_k - (1-\alpha)}{\alpha} \tag{8}$$

□

4.3. Hybrid approach

4.3.1. Friends Check-ins R-Tree

To optimize the pruning of irrelevant friends and places, we propose a spatial indexing structure, the Friends Check-ins R-Tree (FCR-Tree), that supports simultaneous pruning of friends and places. It is an R-Tree-based structure constructed for each user u∈U. The FCR-Tree of a query user q stores the check-in information of each friend u∈Fq, thus representing the check-in summary of all friends of q. The objects of the FCR-Tree are the root MBRs of each friend’s Check-in R-Tree. Updating the index when a friend adds a new check-in is not costly, since updates are applied in bulk after a certain period of time. Let us assume a query user q=u1∈U whose friends are Fq={u2,u3,u4,u5,u6}. Figure 2 shows the conceptual view of the FCR-Tree of u1. In the following section, we describe our proposed technique in detail.

Figure 2. Check-ins example. (a) Check-in information and (b) friends’ check-ins summary.

In this approach, the score of each place p in range r is computed by processing the social and spatial components of query q together. To answer TkFP queries efficiently, this approach leverages the FCR-Tree of query q to prune the friends who did not visit the top-k places.
Similarly, a grid spatial index is used to prune the places in range r that cannot be part of the top-k result set. More specifically, to compute the social and spatial scores of a place p, this approach supports simultaneous pruning of q’s friends and of places p∈P in range r. We next elaborate the technique with pseudocode given in Algorithm 3. The algorithm employs grid partitioning to divide the region formed by range r. For each grid cell cij, the set of places Pc⊆P that lie in the cell is obtained from the Facility R-Tree (at Line 4), and the distance from q to the closest place p in the cell is recorded as the cell’s distance from q (at Line 5). In the second loop (at Line 8), the set of friends Vcell who might have visited a cell is computed for each cell by using the FCR-Tree of q and counting the FCR-Tree objects that overlap the cell (at Line 9). Figure 3 illustrates a cell that overlaps four objects, so the overlap count of the cell is 4. This overlap count determines the social score of the cell and serves as an upper bound on the social score of all places that lie in the cell.

Figure 3. Cell overlap.

Once the social and spatial scores of each cell cij are computed, a ranking score of each cell is computed using Equation (9) (at Line 9), which serves as an upper bound on the score of any place in the cell. Further, in the third loop (at Line 10), for each cell in descending order of cell score, the algorithm accesses all the places that lie in the cell (at Line 13) and computes the social score and score of each place p while maintaining the current kth place score (at Lines 14 and 15).
If the current kth place score is greater than the next cell’s score, the process stops, since no subsequent cell can contain a place with a higher ranking score than the current kth place (at Line 12):

$$Score_{cell} = \alpha\,(r - cell.distance) + (1-\alpha)\,\frac{\lvert V_{cell}\rvert}{\lvert F_q\rvert} \tag{9}$$

5. SOCIO-SPATIAL SKYLINE QUERIES

In this section, we present our algorithms to answer socio-spatial skyline queries (SSSQ). First, we formally define the problem and introduce terms and notation; then we present three algorithms to answer these queries.

5.1. Preliminaries

5.1.1. Problem definition

In this paper, we are interested in retrieving places in a given range r based on their social and spatial scores. The top-k query uses a scoring function that combines social and spatial scores to rank the objects. However, users need adequate domain knowledge to choose a good value of α. In particular, it is not easy to define a scoring function (e.g. due to incompatible attributes, different distributions of attributes, or the inability of users to choose a good scoring function) [13]. Therefore, to complement our Top-k famous places query, we extend our work to skyline queries, which return the objects that are within the range r (i.e. ∣∣q,p∣∣<r) and are not dominated by any other object. To answer these queries, we compute the psocial and pspatial scores of each place as defined in Section 3.1.1. The intuition behind using a range r is that users are sometimes not interested in places that are too far away. Below we formally define our query.

Dominance: A place p is dominated by a place p′ if psocial≤p′social and pspatial≤p′spatial, and at least one of the following holds: psocial<p′social or pspatial<p′spatial. We denote the dominance relationship as p′≺p, which means that place p is dominated by place p′.

Socio-Spatial Skyline Query (SSSQ): Given a query q and a range r, an SSSQ returns every place p for which ∣∣q,p∣∣<r and p is not dominated by any other place p′.
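The dominance test and the SSSQ result set can be illustrated with a short Python sketch. It reads the dominance condition as "no worse in both scores and strictly better in at least one", and it uses a naive pairwise comparison rather than any of the algorithms proposed in this paper. The function names and the score values for p5 and p7 are hypothetical; only the scores of p3 (0.8 social, 0.01 spatial) come from the running example. The input is assumed to contain only places already within range r.

```python
def dominates(a, b):
    """True if score pair a = (social, spatial) dominates score pair b."""
    no_worse = a[0] >= b[0] and a[1] >= b[1]
    strictly_better = a[0] > b[0] or a[1] > b[1]
    return no_worse and strictly_better


def skyline(scores):
    """Return the ids of places not dominated by any other place.

    scores: dict place_id -> (social, spatial), for places within range r.
    Naive O(n^2) comparison, for illustration only.
    """
    return {
        p
        for p, s in scores.items()
        if not any(dominates(t, s) for o, t in scores.items() if o != p)
    }


# p3 uses the scores from the running example; p5 and p7 are made-up values.
example = {"p3": (0.8, 0.01), "p5": (0.3, 0.08), "p7": (0.7, 0.14)}
result = skyline(example)
```

With these values, p7 dominates p5 in both dimensions, while p3 and p7 do not dominate each other, so the sketch returns p3 and p7.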
Algorithm 3 Hybrid algorithm.

We use an example in Figs 4 and 5 to illustrate the problem definition. Let us assume that we have a set of places P={p1,p2,p3,p4,p5,p6,p7,p8} inside the range r=0.15 given by query q, and a set of friends of q, Fq={u1,u2,u3,u4,u5,u6,u7,u8,u9,u10}. For each place, we compute its spatial and social scores based on its distance from query q and the number of friends of q who checked in at this place. Next, we map each place to a space where the x-coordinate is the spatial score and the y-coordinate is the social score, as illustrated in Fig. 5b.

Figure 4. Sample dataset.

Figure 5. Mapping. (a) Places in range and (b) mapping to 2D-space.

For example, in Fig. 4, the social and spatial scores of place p3 are 0.8 and 0.01, respectively, and using these scores, p3 is mapped to the space as shown in Fig. 5b. To retrieve the query result, we use this space to find the places that are not dominated by any other place. For example, place p7 dominates p5 because p7.social>p5.social and p7.spatial>p5.spatial. Hence, the SSSQ returns p7 and p3, which are not dominated by any other place.

6. PROPOSED TECHNIQUES

6.1. Social-first based algorithm

The Social-First-based approach accesses only those places that are visited by q’s friends rather than accessing every place in range r. This approach first looks at the check-ins of each friend to compute the social score of each visited place p∈P in the given range r. We then use only the visited places in the range to compute the skyline places, as described in Algorithm 4. In the first loop of the algorithm (at Line 1), it uses the Check-in R-Tree of each friend u∈Fq of query q to get the places in range r and then computes the social score of each candidate place p in r (at Line 3).
In addition, we compute the spatial score of each candidate place p. Next, each candidate place p is accessed in descending order of the sum of its two scores (at Line 5), because accessing the places in this order guarantees that a place is a skyline place if and only if it is not dominated by any place in S [16], where S is the set of skyline places obtained so far. Each candidate place p is then examined for dominance (at Line 6). Finally, the nearest neighbor of query q is computed (at Line 8) and is added to the skyline places if none of her friends has checked in there. Lemma 6.1 below shows that the nearest neighbor of query q is always a skyline object.

Lemma 6.1. A nearest neighbor (NN) of the query is always a skyline place.

Algorithm 4 Skyline: Social-First algorithm.

Proof. No place p′ has a smaller distance to the query q than the nearest neighbor p, so no place has a higher spatial score than p. A place that dominates p would therefore need the same spatial score and a strictly higher social score, i.e. it would itself be a nearest neighbor. If there is more than one nearest neighbor, the nearest neighbor with the highest social score is not dominated by any other nearest neighbor and is a skyline place. □

6.2. Spatial-first-based algorithm

This approach first gets all places in range r by issuing a range query on the Facility R-Tree (at Line 1 of Algorithm 5). Then, in the first loop of the algorithm (at Line 2), it computes the spatial and social scores of each place in range r. Next, each place in range is accessed in descending order of the sum of its two scores (at Line 5) and is examined for dominance (at Line 6). If the place is not dominated by the skyline places obtained so far, it is inserted into the skyline set S. The Spatial-First approach accesses only one R-Tree index (i.e. the Facility R-Tree), while the Social-First approach has to access as many R-Tree indices as the number of friends of query q.
Conversely, the downside of the Spatial-First approach is that it retrieves all the places in range r and computes their social scores, while the Social-First approach computes the social score of only those places in range r that are visited by q’s friends. We address the weaknesses of both approaches in the next section.

6.3. Hybrid algorithm

This section focuses on our third approach (i.e. Hybrid) to process SSSQ, which processes the social and spatial aspects simultaneously. Before presenting the technique, we describe our index and record-keeping structures.

6.3.1. Two grids

We first introduce two grid indices that are employed to speed up retrieval and pruning.

1. Range Grid: This grid is built upon the region formed by the given range r by splitting it into small cells, as shown in Fig. 6a. Each cell has the following information associated with it: (i) the places that lie in the cell and (ii) the number of Friends’ Check-In R-Tree (FCR-Tree) object rectangles (root MBRs of Check-in R-Trees) that overlap the cell. As stated in Section 4.3, this information is used to compute a bound on the social scores of the places in the cell.

Figure 6. Sample dataset and skyline mapping. (a) Places in range and (b) mapping to skyline workspace.

2. Skyline Workspace Grid: As described earlier, for each place inside range r, we compute the social (Psocial) and spatial (Pspatial) scores and then map the place to a 2D space where Psocial is mapped along the y-axis and Pspatial along the x-axis. This 2D space is called the skyline workspace. Similar to the Range Grid, we divide the skyline workspace into a grid, as shown in Fig. 6b, to index each object based on its social and spatial scores. This aids in retrieving objects, examining dominance and filtering objects efficiently.
Note that, to avoid ambiguity, we refer to a cell of the range grid as a cell and to a cell of the skyline workspace grid as a block in the rest of the paper. Figure 6b shows the mapping of all places in the range to their corresponding skyline workspace grid blocks based on their social and spatial scores. For example, denoting the bottom-left block of the skyline workspace grid bij (where i is a row number starting with 0 and j is a column number starting with 0) as b0,0, place p3 is mapped to block b3,0 and place p7 is mapped to block b3,4.

Algorithm 5 Skyline: Spatial-First algorithm.

6.3.2. Mapping range grid cell

In addition to the mapping of places to skyline grid blocks, each range grid cell Cij is mapped to the skyline workspace grid. To achieve this, we first compute the social and spatial scores (i.e. csocial, cspatial) of each cell. As an example, consider the range grid cell C1,2 with three places (i.e. p2, p6, p8) inside it, as shown in Fig. 7a, with their social and spatial scores listed in Fig. 7b. Assume that the cell C1,2 overlaps six objects of the FCR-Tree; this count determines the social score of the cell (csocial=0.6). In addition, the spatial score of place p8 is the largest among the places in the cell and is taken as the cell’s spatial score (cspatial=0.10). These scores serve as upper bounds on the scores of any place inside the cell, and using them, the cell is mapped to its corresponding skyline workspace grid block b2,3 as a point C1,2(cspatial, csocial), as shown in Fig. 8.

Figure 7. Social and spatial score of a Range Grid Cell. (a) Cell C1,2 and (b) places’ scores.

Figure 8. Range cell mapping to skyline workspace grid.

Based on this, the cell point C1,2(0.10,0.6) in the skyline workspace dominates all the places (i.e. p2, p6, p8) that lie in the cell. Consequently, if the cell point C1,2 is dominated by any other object in the skyline workspace (e.g. p7), the cell is immediately pruned along with all the places inside it, because their social and spatial scores are no larger than the cell’s. The places that lie inside the cell therefore cannot be skyline places, and this pruning considerably improves the query processing time. Algorithm 6 describes the indexing of the range and skyline workspace along with the mapping of each range grid cell Cij to the skyline workspace grid. We start by constructing a grid index over the region formed by range r (at Line 1). Then, in the first loop (at Line 3), we index each place in range to its corresponding range grid cell C and update the cell’s spatial score (cspatial) (at Line 4). In addition, the skyline workspace is divided into a grid (at Line 5), and a range query is issued on the FCR-Tree to get all the friends of query q who might have visited any place in the range (at Line 6). Finally, in the second loop (at Line 7), for each range grid cell C, the upper bound (csocial) on the social score of the places that lie in the cell is computed, and the cell is mapped to its corresponding skyline workspace block b using the cell’s social and spatial scores. Each block bij of the skyline workspace grid is associated with two lists: one contains the range grid cell objects that lie inside the block and the other contains the actual places p inside the block, as shown in Fig. 9.

Figure 9. Grid Index and record-keeping structures.

6.3.3. Computation module

Intuition: Let us assume that we have a place p in the skyline workspace grid, as illustrated in Fig. 10.
Note that the block b2,2 is dominated by place p; therefore, no place or range grid cell in the block can be, or contain, a skyline place. Similarly, none of the blocks in the shaded area R need to be accessed once we have seen the place p.

Figure 10. Dominated blocks.

To access blocks efficiently, we visit them in a particular order determined by maxScore, which is defined in Definition 6.1. For each block, we en-heap the blocks below it and to the left of it. If a block is dominated, it is pruned.

Definition 6.1 (maxScore(b)). The maxScore(b) of a block b of the skyline workspace grid is the sum of the coordinates (Pspatial, Psocial) of its top-right corner.

For example, in Fig. 11, the maxScore of block b4,4 is 2 (the sum of the top-right corner coordinates, i.e. 1 and 1) and the maxScore of b3,3 is 0.8 + 0.8 = 1.6. Since the top-right corner P(1,1) of the skyline workspace has the highest maxScore, the block b4,4 is selected first for processing. Now consider two points p1 and p2 at the lower-right and top-left corners of b4,4, respectively. Note that points p1 and p2 have a higher maxScore than any object in the shaded region. Precisely, for every unprocessed block bij, maxScore(bij)≤max(maxScore(b3,4),maxScore(b4,3)). Consequently, the block to be processed after b4,4 is either b3,4 or b4,3; assuming that maxScore(b4,3)≤maxScore(b3,4), b3,4 is the second block to be processed. Further, the block with the third highest maxScore is determined among b4,3, b3,3 and b2,4. We next describe the technique in detail with pseudocode given in Algorithm 7.

Algorithm 6 Indexing and mapping.

Algorithm 7 Skyline: Hybrid.

Figure 11. Block visiting order.
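The block visiting order described above can be sketched in Python with a max-heap keyed on maxScore. This is a simplified illustration, not the authors' Algorithm 7: it omits dominance pruning and the per-block cell and place lists, and it assumes an n×n workspace whose two axes are both normalized to [0, 1], so that block b(i,j) has its top-right corner at ((j+1)/n, (i+1)/n). The names `max_score` and `block_order` are hypothetical.

```python
import heapq


def max_score(i, j, n):
    """Sum of the top-right corner coordinates of block b(i, j).

    Assumes an n x n grid over [0, 1] x [0, 1]; row i is the social axis
    and column j is the spatial axis.
    """
    return ((i + 1) + (j + 1)) / n


def block_order(n):
    """Yield ((i, j), maxScore) in non-increasing maxScore order."""
    start = (n - 1, n - 1)                      # top-right block
    heap = [(-max_score(*start, n), start)]     # negate keys: heapq is a min-heap
    seen = {start}
    while heap:
        neg, (i, j) = heapq.heappop(heap)
        yield (i, j), -neg
        # En-heap the block below and the block to the left, once each.
        for nb in ((i - 1, j), (i, j - 1)):
            if nb[0] >= 0 and nb[1] >= 0 and nb not in seen:
                seen.add(nb)
                heapq.heappush(heap, (-max_score(*nb, n), nb))


order = list(block_order(5))
```

For a 5×5 grid the sketch starts at b4,4 with maxScore 2 and then visits b3,4 and b4,3, matching the order discussed for Fig. 11. In the full algorithm a dominated block would be dropped at de-heap time instead of being expanded.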
Algorithm: Initially, Algorithm 7 invokes the Indexing and Mapping algorithm (Algorithm 6) to compute the social and spatial scores of each range grid cell and to index the cells to their corresponding skyline workspace grid blocks (at Line 1). The algorithm then employs the method described above to process blocks in descending maxScore(b) order, retaining the property of visiting the minimal set of blocks. To this end, a maxHeap is initialized with the top-right block b4,4 of the skyline workspace grid, using its maxScore as the sorting key (at Line 2). The algorithm then de-heaps blocks iteratively (at Line 3); if a block is dominated by any skyline place p (at Line 5), it is immediately pruned and, consequently, the blocks below it and to the left of it are not en-heaped. Since at this stage only range grid cells are indexed in the skyline workspace, the algorithm first examines each cell object Cij that lies inside the de-heaped block for dominance (at Line 6); if a cell is dominated by any skyline place found so far (at Line 7), it is pruned. Consequently, all the places that lie in the pruned cell object are also discarded and are not processed further. If a cell object Cij is not dominated, then for each place that lies in it (loop at Line 8), the algorithm computes its social score (Psocial) (at Line 9). The place is then indexed to its corresponding skyline workspace grid block b, provided that it is not dominated by any skyline place found so far (at Line 10). However, if the corresponding block is dominated, it is marked so that it is not en-heaped in the maxHeap. After indexing each place in the range to its corresponding block, the algorithm examines each place p that lies in the current de-heaped block for dominance (at Line 12), in descending order of Psocial+Pspatial, and updates the skyline result set.
The algorithm also en-heaps the blocks below and to the left of the current de-heaped block using their maxScores, provided that they have not been en-heaped before and are not dominated by any skyline place found so far (at Line 14). The algorithm terminates when all the blocks in the maxHeap have been examined and returns the skyline places (at Line 16).

7. EXPERIMENTS

7.1. Experimental setup

To the best of our knowledge, this problem has not been studied before, and the techniques in existing work are either not applicable or cannot be efficiently extended to answer TkFP and SSSQ queries. Nevertheless, although [31] studies a different problem, we have implemented its algorithm and compared our techniques against it (Table 2).

Table 2. Parameters (default shown in bold).

| Parameters | Values |
| --- | --- |
| Number of queries | 50, 100, 150, 200 |
| Range (km) | 50, 100, 200, 400 |
| Dataset size (places in thousands) | 100, 200, 300, 400, 500, 1300 |
| Grid size | 2, 4, 8, 16, 32, 64 |
| Average friends | 200, 400, 600, 800 |
| k | 5, 10, 15, 20 |

All algorithms are implemented in C++, and experiments are run on an Intel Core i3 2.4 GHz PC with 8 GB of memory running 64-bit Ubuntu Linux.
We use the real dataset of Gowalla [51] along with five synthetic datasets whose characteristics are shown in Table 3. Gowalla is a location-based social network that was later acquired by Facebook. The dataset contains 196 591 users, 950 327 friendships, 6 442 890 check-ins and 1 280 956 checked-in places across the world. The page size is set to 4096 bytes for the Facility R-Tree index and to 1024 bytes for the Check-in R-Tree and FCR-Tree indexes. For each experiment, we randomly select 100 users and treat them as query points. The cost reported in the experiments is the average cost over 100 queries. The default value of range r is 100 km and the default value of k is 10 unless mentioned otherwise.

Table 3. Datasets characteristics.

| DataSet | Places | Users | Friendships | Check-Ins |
| --- | --- | --- | --- | --- |
| Gowalla | 1 280 956 | 196 591 | 950 327 | 6 442 890 |
| Synthetic 01 | 100 000 | 17 162 | 77 254 | 503 000 |
| Synthetic 02 | 200 000 | 39 223 | 152 957 | 1 100 698 |
| Synthetic 03 | 300 000 | 61 394 | 241 369 | 1 830 235 |
| Synthetic 04 | 400 000 | 86 687 | 301 887 | 2 536 897 |
| Synthetic 05 | 500 000 | 103 856 | 395 745 | 3 258 659 |

7.2. Performance evaluation

7.2.1. Top-k famous places queries (TkFP)

Effect of Range: We analyse the performance of our algorithms for range values from 100 to 400 km. The size of the area formed by the given range determines the number of places it contains (ranging from 1500 to 94 000). In addition, we compare our techniques with SkSK [31] and find that our Hybrid algorithm is at least 8–10 times faster than SkSK, as depicted in Fig. 12. The SkSK and Spatial-First algorithms are most affected at larger range values, where their performance deteriorates due to the large number of places. Further, SkSK incurs considerably more I/O cost at larger range values, as shown in Fig. 12b, because the large number of places in range results in a higher index access rate.

Figure 12. Effect of varying range (number of places). (a) CPU cost and (b) I/O cost.

Effect of Average number of Friends: In Fig. 13, we study the effect of the average number of friends of each query. Note that the size of the FCR-Tree depends on the size of the friend set of each user in the dataset, which directly affects the Hybrid algorithm.
Similarly, the data structure proposed in SkSK [31] stores users’ information with each node at each level, namely the union of the user sets of all child nodes, which considerably degrades the performance of the algorithm for a large number of users. Figure 13a illustrates the processing time of each algorithm; Hybrid is much faster (8–10 times) than SkSK, typically for higher numbers of friends. Similarly, Fig. 13b shows the I/O cost of each method for a varying average number of friends; SkSK incurs a much higher cost due to the very large size of the data structure it uses. The average number of places in the given range r is 38 319. Note that when the average number of friends increases, the CPU and I/O costs of all four algorithms increase, since each friend’s check-in information must be accessed to get candidate places.

Figure 13. Performance comparison on different numbers of friends. (a) CPU cost and (b) I/O cost.

Effect of concurrent number of Queries: Geo-Social services seek to answer a large number of incoming queries simultaneously due to the enormous number of registered users. Therefore, we analyse all four algorithms for a number of concurrent queries ranging from 50 to 200. In addition, each experiment involves an average number of friends ranging from 200 to 800 and approximately 10 000 places on average in the given range r. The three proposed algorithms need to traverse the Facility R-Tree every time a TkFP query is issued to retrieve candidate places. The Social-First algorithm also needs to traverse the Check-in R-Tree of each friend. On the other hand, the Hybrid algorithm leverages the FCR-Tree, and both Spatial-First and Hybrid rely heavily on the visitor sets of the places.
In addition, we compare the performance of our algorithms with SkSK [31] and observe that our Hybrid algorithm is at least 10 times faster than SkSK, as shown in Fig. 14. We report the CPU and I/O cost of each algorithm on the Gowalla dataset for different numbers of queries. As expected, the I/O cost of the Social-First algorithm is lower than that of the other three due to its low dependency on indexes, and SkSK incurs a much higher I/O cost than any of the proposed techniques.

Figure 14. Effect of number of queries. (a) CPU cost and (b) I/O cost.

Effect of Grid Size: In Fig. 15, we study the effect of grid partitioning sizes from 2 to 64 on the Hybrid algorithm. The grid size affects the CPU cost, since the size of a cell defines how many places are processed or pruned at once. Similarly, it also affects the termination condition of the algorithm. Note that the best CPU performance is achieved by dividing the region into a 4×4 grid.

Figure 15. Effect of grid size.

Effect of k: In the previous experiments, the value of k is set to 10. Next, we analyse the performance of the four algorithms for various values of k. Note that in Fig. 16a, Hybrid is 8–10 times faster than SkSK and all four algorithms are nearly independent of k. The reason is that we have to update the result set every time we update the score of a place; therefore, the size of the result set does not impose a significant computational load. In terms of I/O cost, Fig. 16b shows that none of the four algorithms is affected by the value of k, since a higher value of k does not incur more disk accesses. In addition, SkSK performs up to 25 times more disk accesses than the proposed algorithms due to its very large index size.

Figure 16. Effect of varying number of requested places (k). (a) CPU cost and (b) I/O cost.

Effect of Dataset Size: In Fig. 17a and b, we study the effect of dataset size on the performance of the four algorithms. Specifically, we conduct experiments on synthetic datasets of different sizes containing from 100k to 500k places. In Fig. 17a, note that the SkSK algorithm is most affected by the number of places, because the more places there are, the more visitors are associated with the nodes of the index structure at each level. As a result, SkSK suffers from higher processing time and I/O cost. Similarly, in Fig. 17b, SkSK has a higher I/O cost due to the intersection performed between the visitor sets of nodes/places and the friend set of query q.

Figure 17. Effect of varying dataset sizes (number of places). (a) CPU cost and (b) I/O cost.

7.2.2. Socio-spatial skyline queries (SSSQ)

Effect of Range: We analyse the performance of our algorithms for range values from 100 to 400 km. The size of the area formed by the given range determines the number of places it contains (ranging from 1500 to 94 000). Figure 18 shows that the Spatial-First algorithm is most affected at larger range values, where its performance deteriorates because more places have to be processed. In contrast, the Social-First approach is not affected much by the range, because it only takes into account the places visited by query q’s friends. Note that the Hybrid algorithm performs better for larger ranges, since it is more likely to find skyline places by processing fewer blocks and by pruning more cells together with the places that lie in them.
Figure 18b shows that the I/O cost increases for larger range values due to the large number of places in the given range and the large number of visitors (in the Spatial-First approach), which results in a higher index access rate.

Figure 18. Effect of varying range (number of places). (a) CPU cost and (b) I/O cost.

Effect of Average Number of Friends: In Fig. 19, we study the effect of the average number of friends of each query. Note that the size of the FCR-Tree depends on the number of friends of each user in the dataset, which affects the Hybrid algorithm to some extent. Further, the Spatial-First algorithm is greatly affected because intersecting two large sets, i.e. the visitor set of each place in range and the friend set of query q, becomes more expensive. Similarly, the Social-First algorithm is greatly affected by the number of friends since it has to process more Check-in R-trees. Specifically, Fig. 19a shows the CPU cost and Fig. 19b shows the I/O cost of each method for a varying average number of friends. The average number of places in the given range r is 38 319. Note that when the average number of friends increases, the CPU and I/O costs of all three algorithms increase since each friend’s check-in information is required to verify the candidate places.

Figure 19. Performance comparison on different number of friends. (a) CPU cost and (b) I/O cost.

Effect of Number of Concurrent Queries: We analyse all three algorithms for 50 to 200 concurrent queries. In addition, each experiment involves an average number of friends ranging from 200 to 800 and approximately 10 000 places on average in the given range r.
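The visitor-set and friend-set intersection discussed above (under the effect of the average number of friends) can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the social score of a place is the fraction of the query user’s friends who have checked in there; the function name `social_score`, the normalization and the toy user ids are hypothetical and are not taken from the definitions in this paper.

```python
from typing import Dict, Set


def social_score(visitors: Set[int], friends: Set[int]) -> float:
    """Fraction of the query user's friends who visited a place.

    Assumption for illustration: score = |visitors ∩ friends| / |friends|.
    """
    if not friends:
        return 0.0
    # Probe the larger set with members of the smaller one, so the work is
    # proportional to min(|visitors|, |friends|) hash lookups.
    if len(visitors) <= len(friends):
        small, large = visitors, friends
    else:
        small, large = friends, visitors
    common = sum(1 for user in small if user in large)
    return common / len(friends)


# Hypothetical toy data: place id -> ids of users who checked in there.
VISITORS: Dict[str, Set[int]] = {
    "p1": {1, 2, 3},
    "p2": {7, 8},
    "p3": set(),
}
FRIENDS: Set[int] = {2, 3, 4, 5}

# One intersection per candidate place in range.
SCORES = {place: social_score(v, FRIENDS) for place, v in VISITORS.items()}
print(SCORES)
```

Even with hash-based sets, one intersection is needed for every candidate place, which is consistent with the observation that the cost grows with both the number of places in range and the number of friends.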
All three algorithms need to traverse the Facility-RTree every time an SSSQ is issued to retrieve candidate places. In addition, the Social-First algorithm also traverses the Check-in R-Tree of each friend, and as we increase the number of queries, the number of friends to be processed also increases. Therefore, the Social-First algorithm incurs a higher CPU cost for a large number of queries. On the other hand, the Hybrid algorithm leverages the FCR-Tree, and both Spatial-First and Hybrid depend heavily on the size of the visitor sets of the places. In Fig. 20, we report the CPU and I/O cost of each algorithm on the Gowalla dataset for different numbers of queries. As expected, the I/O cost of the Social-First algorithm is lower than that of the other two because it depends less on indexes. Note that Hybrid is up to eight times better than the Social-First and Spatial-First algorithms.

Figure 20. Effect of number of queries. (a) CPU cost and (b) I/O cost.

Effect of Grid Size: In Fig. 21, we study the effect of the grid partitioning size, varied from 2 to 16, on the Hybrid algorithm. For the region grid, the grid size affects the CPU cost because the size of a cell determines how many places are processed or pruned simultaneously. It also affects the termination condition of the algorithm. Note that the best CPU performance is achieved by dividing the area into a 4×4 grid. In addition, the skyline workspace is also partitioned into a 4×4 grid because the algorithm achieves its best performance at this granularity.

Figure 21. Effect of grid size.

Effect of Dataset Size: In Fig. 22a and b, we study the effect of dataset size on the performance of the three algorithms. Specifically, we conduct experiments on synthetic datasets of different sizes, with the number of places ranging from 100k to 500k.
In Fig. 22a, note that the Spatial-First algorithm is the most affected by the number of places. Similarly, in Fig. 22b, Hybrid and Spatial-First have a higher I/O cost due to the intersection performed between the visitor set of each place and the friend set of query q.

Figure 22. Effect of varying dataset sizes. (a) CPU cost and (b) I/O cost.

7.3. Analysis of results quality

Top-k queries and skyline queries have both been extensively studied in the past. The advantage of a top-k query is that the number of objects to be returned is controlled by the user (by giving a value of k). However, the top-k query assumes that the user is able to define a suitable scoring function (e.g. a suitable value of α in this paper). This may be challenging, mainly because of the incompatibility of the attributes involved in top-k queries and their distributions [13]. A skyline query addresses this problem and does not require a scoring function to be defined. However, the user cannot control the number of objects returned by the query and, in the worst case, the number of skyline objects may be equal to the total number of objects in the dataset. Therefore, top-k queries and skyline queries complement each other. In this section, we analyse the result size of socio-spatial skyline queries and compare the results returned by top-k queries and skyline queries.

Size of Skyline: In Fig. 23, we run 100 skyline queries for each setting and report the average size of the skyline. Figure 23a shows that the average size of the skyline is 2–5 as we vary the average number of friends of the query user.
Note that, on average, the total number of places in the query range is more than 7000 and the skyline shortlists up to five places, on average, that are not dominated by any other place in terms of both spatial score and social score. One reason for such a small skyline size is that the data is sparse: the query user’s friends may have few check-ins in the given range and, as a result, the social score of most places may be zero.

Figure 23. Effect of number of friends. (a) Small number of average friends and (b) large number of average friends.

In Fig. 23b, we evaluate the size of the skyline for a more challenging case where the average number of friends of the query user is varied from 25 000 to 100 000. We remark that this is a realistic setting and many users may have such a large number of friends, e.g. the query user is a page ‘Germany’ and its friends represent the people who were born in Germany. Figure 23b shows that the size of the skyline increases with the average number of friends but is still much smaller than the total number of places in the range. This shows that the skyline query studied in this paper is useful and returns only a small number of objects to the user. In the rest of the experiments, we choose 50 000 as the default average number of friends of the query user.

Results Returned by Top-k vs. Skyline: In this section, we compare and analyse the results returned by skyline queries and top-k queries. In Fig. 24, we run 100 queries for each setting and report the average number of result objects returned by top-k queries and by skyline queries, and the average number of objects that are returned by both queries (shown as ‘# Common Places’). Specifically, Fig. 24a studies the effect of k and Fig.
24b compares skyline and top-10 queries for varying α. Figure 24 demonstrates that the results returned by top-k and skyline queries share many objects but, at the same time, each query reports several places that the other query fails to return. This shows that the two queries complement each other.

Figure 24. # common places returned by both queries. (a) Effect of k and (b) effect of α.

In Fig. 25, we further analyse the results returned by the two types of queries. Specifically, the result places are mapped to a 2D space where the x-axis corresponds to their social scores and the y-axis corresponds to their spatial scores. In Fig. 25a, the skyline query returns 15 places. The top-5 query with α=0.1 (high preference for social score) returns the places shown with small red circles. Three of these top-5 places are skyline points; the other two are not because they are dominated by other places. For the top-5 queries with α=0.5 (equal preference for social and spatial scores) and α=0.9 (high preference for spatial score), the top-5 places are those in the top-left of the figure (having high spatial scores but low social scores). Figure 25b shows similar results except that some of the top-5 places for α=0.5 (equal preference) are in the bottom-right of the figure and some are in the top-left.

Figure 25. Analysis of results. (a) Skyline vs. top-k for User 1 and (b) skyline vs. top-k for User 2.
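The relationship between the two result sets can be reproduced on toy data. The following Python sketch assumes that a place is ranked by the weighted sum α·spatial + (1−α)·social, with both scores normalized to [0, 1] and higher values being better. This form is inferred from the description of α above (α=0.1 favours the social score, α=0.9 the spatial score) and is not copied from the scoring function defined in this paper; the place names and scores are made up, and the skyline is computed naively rather than with any of the proposed algorithms.

```python
from typing import List, NamedTuple


class Place(NamedTuple):
    name: str
    social: float   # assumed normalized to [0, 1], higher is better
    spatial: float  # assumed normalized to [0, 1], higher is better


def top_k(places: List[Place], k: int, alpha: float) -> List[Place]:
    """Return the k places with the highest weighted score (assumed form)."""
    def score(p: Place) -> float:
        return alpha * p.spatial + (1 - alpha) * p.social
    return sorted(places, key=score, reverse=True)[:k]


def dominates(a: Place, b: Place) -> bool:
    """a dominates b if it is no worse on both scores and better on one."""
    return (a.social >= b.social and a.spatial >= b.spatial
            and (a.social > b.social or a.spatial > b.spatial))


def skyline(places: List[Place]) -> List[Place]:
    """Naive O(n^2) skyline: keep the places that no other place dominates."""
    return [p for p in places if not any(dominates(q, p) for q in places)]


# Hypothetical places: (name, social score, spatial score).
PLACES = [
    Place("A", 0.90, 0.10),
    Place("B", 0.80, 0.30),
    Place("C", 0.50, 0.50),
    Place("D", 0.20, 0.90),
    Place("E", 0.10, 0.95),
    Place("F", 0.70, 0.20),  # dominated by B
    Place("G", 0.15, 0.85),  # dominated by D
]

SKYLINE_NAMES = {p.name for p in skyline(PLACES)}
print("skyline:", sorted(SKYLINE_NAMES))
for alpha in (0.1, 0.5, 0.9):
    names = [p.name for p in top_k(PLACES, 3, alpha)]
    common = [n for n in names if n in SKYLINE_NAMES]
    print(f"alpha={alpha}: top-3={names}, common with skyline={common}")
```

On this toy data, the top-3 result for α=0.1 includes F, which is dominated by B and therefore absent from the skyline, while the balanced place C is on the skyline but does not appear in the top-3 for α=0.5. Each query thus reports places that the other misses.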
Figure 25 shows that top-k queries may sometimes fail to capture the users’ requirements. For example, by choosing α=0.5, a user may have wanted to obtain places that have reasonably high values for both social and spatial scores, but the results may contain places with either high social scores but very low spatial scores or high spatial scores but very low social scores (as in Fig. 25). The skyline query addresses this problem to some extent and gives better coverage of the results. However, it fails to capture the requirements of users who have chosen α to be very high or very low. For example, in Fig. 25b, the skyline contains only one object that has a high social score; therefore, it would fail to capture the requirements of a user who prefers the social score much more than the spatial score (e.g. α=0.1) and wants to obtain several places with high social scores. In contrast, the top-5 query with α=0.1 returns five objects, each having a high social score. Also, as pointed out earlier, the number of skyline objects may be arbitrarily large and the user may not be able to control the number of objects returned.

8. CONCLUSIONS

We are the first to formalize the Top-k famous places (TkFP) query and to propose efficient query processing techniques for it. In addition, we extend our work by proposing another query, the Socio-Spatial Skyline Query (SSSQ). We present three approaches to process the queries: (1) Social-First, (2) Spatial-First and (3) Hybrid. The first two approaches process the social and spatial components of the query separately and do not require a specialized index. The third approach (Hybrid) processes the social and spatial components simultaneously by utilizing a hybrid index specifically designed to handle TkFP queries. Our experimental evaluation shows that the proposed techniques outperform the previous technique, SkSK [31].

REFERENCES

1 Curtiss, M. et al. (2013) Unicorn: a system for searching the social graph.
PVLDB, 6, 1150–1161.
2 Ahuja, R., Armenatzoglou, N., Papadias, D. and Fakas, G.J. (2015) Geo-Social Keyword Search. Advances in Spatial and Temporal Databases - 14th International Symposium, SSTD 2015, Hong Kong, China, August 26–28, 2015. Proceedings, pp. 431–450. Springer, Berlin Heidelberg.
3 Armenatzoglou, N., Ahuja, R. and Papadias, D. (2015) Geo-social ranking: functions and query processing. VLDB J., 24, 783–799.
4 Emrich, T., Franzke, M., Mamoulis, N., Renz, M. and Züfle, A. (2014) Geo-Social Skyline Queries. Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21–24, 2014. Proceedings, Part II, pp. 77–91. Springer, Berlin Heidelberg.
5 Doytsher, Y., Galon, B. and Kanza, Y. (2012) Managing Socio-spatial Data as Large Graphs. 21st Int. World Wide Web Conf.
6 Liu, W., Sun, W., Chen, C., Huang, Y., Jing, Y. and Chen, K. (2012) Circle of Friend Query in Geo-social Networks. Database Systems for Advanced Applications - 17th International Conference, DASFAA 2012, Busan, South Korea, April 15–19, 2012, Proceedings, Part II, pp. 126–137. Springer, Berlin Heidelberg.
7 Yang, D., Shen, C., Lee, W. and Chen, M. (2012) On Socio-spatial Group Query for Location-Based Social Networks. The 18th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, August 12–16, 2012, pp. 949–957. ACM, New York, NY, USA.
8 Fond, T.L. and Neville, J. (2010) Randomization Tests for Distinguishing Social Influence and Homophily Effects. Proc. 19th Int. Conf. World Wide Web, WWW 2010, Raleigh, NC, USA, April 26–30, 2010, pp. 601–610. ACM, New York, NY, USA.
9 Singla, P. and Richardson, M. (2008) Yes, There is a Correlation: From Social Networks to Personal Behavior on the Web. Proc. 17th Int. Conf. World Wide Web, WWW 2008, Beijing, China, April 21–25, 2008, pp. 655–664. ACM, New York, NY, USA.
10 Chua, F.C.T., Lauw, H.W. and Lim, E. (2011) Predicting Item Adoption using Social Correlation. Proc. Eleventh SIAM Int. Conf. on Data Mining, SDM 2011, April 28–30, 2011, Mesa, AZ, USA, pp. 367–378.
11 Ma, H., King, I. and Lyu, M.R. (2009) Learning to Recommend with Social Trust Ensemble. Proc. 32nd Annu. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19–23, 2009, pp. 203–210.
12 Ye, M., Liu, X. and Lee, W. (2012) Exploring Social Influence for Recommendation: A Generative Model Approach. The 35th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12–16, 2012, pp. 671–680. ACM, New York, NY, USA.
13 Fagin, R., Kumar, R. and Sivakumar, D. (2003) Efficient Similarity Search and Classification via Rank Aggregation. Proc. 2003 ACM SIGMOD Int. Conf. Management of Data, San Diego, CA, USA, June 9–12, 2003, pp. 301–312. ACM, New York, NY, USA.
14 Börzsönyi, S., Kossmann, D. and Stocker, K. (2001) The Skyline Operator. Proc. 17th Int. Conf. Data Engineering, April 2–6, 2001, Heidelberg, Germany, pp. 421–430. Springer, Berlin Heidelberg.
15 Kossmann, D., Ramsak, F. and Rost, S. (2002) Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. VLDB 2002, Proc. 28th Int. Conf. Very Large Data Bases, August 20–23, 2002, Hong Kong, China, pp. 275–286. ACM, New York, NY, USA.
16 Papadias, D., Tao, Y., Fu, G. and Seeger, B. (2003) An Optimal and Progressive Algorithm for Skyline Queries. Proc. 2003 ACM SIGMOD Int. Conf. Management of Data, San Diego, CA, USA, June 9–12, 2003, pp. 467–478. ACM, New York, NY, USA.
17 Deng, K., Zhou, X. and Shen, H.T. (2007) Multi-source Skyline Query Processing in Road Networks. Proc. 23rd Int. Conf. Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15–20, 2007, pp. 796–805. IEEE, New York, NY, USA.
18 Sacharidis, D., Bouros, P.
and Sellis, T.K. (2008) Caching Dynamic Skyline Queries. Scientific and Statistical Database Management, 20th Int. Conf., SSDBM 2008, Hong Kong, China, July 9–11, 2008, Proceedings, pp. 455–472. IEEE, New York, NY, USA.
19 Sharifzadeh, M. and Shahabi, C. (2006) The Spatial Skyline Queries. Proc. 32nd Int. Conf. Very Large Data Bases, Seoul, Korea, September 12–15, 2006, pp. 751–762. ACM, New York, NY, USA.
20 Tan, K., Eng, P. and Ooi, B.C. (2001) Efficient Progressive Skyline Computation. VLDB 2001, Proc. 27th Int. Conf. Very Large Data Bases, September 11–14, 2001, Roma, Italy, pp. 301–310. ACM, New York, NY, USA.
21 Armenatzoglou, N., Papadopoulos, S. and Papadias, D. (2013) A general framework for geo-social query processing. PVLDB, 6, 913–924.
22 Ference, G., Ye, M. and Lee, W. (2013) Location Recommendation for Out-of-Town Users in Location-Based Social Networks. 22nd ACM Int. Conf. Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27–November 1, 2013, pp. 721–726. ACM, New York, NY, USA.
23 Mouratidis, K., Li, J., Tang, Y. and Mamoulis, N. (2015) Joint search by social and spatial proximity. IEEE Trans. Knowl. Data Eng., 27, 781–793.
24 Doytsher, Y., Galon, B. and Kanza, Y. (2010) Querying Geo-social Data by Bridging Spatial Networks and Social Networks. Proc. 2010 Int. Workshop on Location Based Social Networks, LBSN 2010, November 2, 2010, San Jose, CA, USA, Proceedings, pp. 39–46. ACM, New York, NY, USA.
25 Huang, Q. and Liu, Y. (2009) On Geo-social Network Services. Geoinformatics, 2009 17th Int. Conf., pp. 1–6. IEEE, New York, NY, USA.
26 Ye, M., Yin, P. and Lee, W.-C. (2010) Location Recommendation for Location-Based Social Networks. Proc. 18th SIGSPATIAL Int. Conf. Advances in Geographic Information Systems, pp. 458–461. ACM, New York, NY, USA.
27 Sarwat, M., Levandoski, J.J., Eldawy, A. and Mokbel, M.F.
(2014) Lars*: an efficient and scalable location-aware recommender system. IEEE Trans. Knowl. Data Eng., 26, 1384–1399.
28 Gao, H. and Liu, H. (2014) Data Analysis on Location-Based Social Networks. Mobile Social Networking, pp. 165–194. Springer, Berlin, Heidelberg.
29 Li, J. and Cardie, C. (2014) Timeline generation: tracking individuals on twitter. 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7–11, 2014, pp. 643–652. ACM, New York, NY, USA.
30 Li, G., Chen, S., Feng, J., Tan, K. and Li, W. (2014) Efficient Location-Aware Influence Maximization. Int. Conf. Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22–27, 2014, pp. 87–98. ACM, New York, NY, USA.
31 Wu, D., Li, Y., Choi, B. and Xu, J. (2014) Social-Aware Top-k Spatial Keyword Search. IEEE 15th Int. Conf. Mobile Data Management, MDM 2014, Brisbane, Australia, July 14–18, 2014, Volume 1, pp. 235–244. IEEE, New York, NY, USA.
32 Shen, Z., Cheema, M.A., Lin, X., Zhang, W. and Wang, H. (2012) A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng., 26, 1349–1366.
33 Sohail, A., Murtaza, G. and Taniar, D. (2016) Retrieving Top-k Famous Places in Location-Based Social Networks. Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Sydney, NSW, September 28–29, 2016, Proceedings, pp. 17–30. Springer, Berlin, Heidelberg.
34 Ilyas, I.F., Beskales, G. and Soliman, M.A. (2008) A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40, 11:1–11:58.
35 Cheema, M.A., Shen, Z., Lin, X. and Zhang, W. (2014) A Unified Framework for Efficiently Processing Ranking Related Queries. Proc. 17th Int. Conf.
Extending Database Technology, EDBT 2014, Athens, Greece, March 24–28, 2014, pp. 427–438. OpenProceedings, Konstanz, Germany.
36 Fagin, R., Lotem, A. and Naor, M. (2003) Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66, 614–656.
37 Nepal, S. and Ramakrishna, M.V. (1999) Query Processing Issues in Image (Multimedia) Databases. Proc. 15th Int. Conf. Data Engineering, Sydney, Australia, March 23–26, 1999, pp. 22–29. IEEE, New York, NY, USA.
38 Güntzer, U., Balke, W. and Kießling, W. (2000) Optimizing Multi-feature Queries for Image Databases. VLDB 2000, Proc. 26th Int. Conf. Very Large Data Bases, September 10–14, 2000, Cairo, Egypt, pp. 419–428. ACM, New York, NY, USA.
39 Jiang, J., Lu, H., Yang, B. and Cui, B. (2015) Finding Top-k Local Users in Geo-tagged Social Media Data. 31st IEEE Int. Conf. Data Engineering, ICDE 2015, Seoul, South Korea, April 13–17, 2015, pp. 267–278. IEEE, New York, NY, USA.
40 Zhang, W., Lin, X., Zhang, Y., Cheema, M.A. and Zhang, Q. (2012) Stochastic skylines. ACM Trans. Database Syst., 37, 14:1–14:34.
41 Cheema, M.A., Lin, X., Zhang, W. and Zhang, Y. (2013) A Safe Zone Based Approach for Monitoring Moving Skyline Queries. Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18–22, 2013, pp. 275–286. OpenProceedings, Konstanz, Germany.
42 Morse, M.D., Patel, J.M. and Jagadish, H.V. (2007) Efficient Skyline Computation over Low-Cardinality Domains. Proc. 33rd Int. Conf. Very Large Data Bases, University of Vienna, Austria, September 23–27, 2007, pp. 267–278. ACM, New York, NY, USA.
43 Godfrey, P., Shipley, R. and Gryz, J. (2005) Maximal Vector Computation in Large Data Sets. Proc. 31st Int. Conf. Very Large Data Bases, Trondheim, Norway, August 30–September 2, 2005, pp. 229–240. ACM, New York, NY, USA.
44 Balke, W., Güntzer, U. and Zheng, J.X.
(2004) Efficient Distributed Skylining for Web Information Systems. Advances in Database Technology - EDBT 2004, 9th Int. Conf. Extending Database Technology, Heraklion, Crete, Greece, March 14–18, 2004, Proceedings, pp. 256–273. Springer, Berlin, Germany.
45 Chan, C.Y., Eng, P. and Tan, K. (2005) Stratified Computation of Skylines with Partially-Ordered Domains. Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, Maryland, USA, June 14–16, 2005, pp. 203–214. ACM, New York, NY, USA.
46 Tao, Y. and Papadias, D. (2006) Maintaining sliding window skylines on data streams. IEEE Trans. Knowl. Data Eng., 18, 377–391.
47 Guttman, A. (1984) R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD’84, Proc. Annual Meeting, Boston, MA, June 18–21, 1984, pp. 47–57. IEEE, New York, NY, USA.
48 Memcached. http://memcached.org/.
49 Twitter: Real-time Geo. http://slideshare.net/raffikrikorian/rtgeo-where-20-2011.
50 GeoSpatial indexes in MongoDB. http://docs.mongodb.org/manual/core/geospatial-indexes/.
51 Cho, E., Myers, S.A. and Leskovec, J. (2011) Friendship and Mobility: User Movement in Location-Based Social Networks. Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24, 2011, pp. 1082–1090. ACM, New York, NY, USA.

© The British Computer Society 2018. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

Journal

The Computer Journal, Oxford University Press

Published: Nov 1, 2018
