The VLDB Journal (2016) 25:243–268
Processing SPARQL queries over distributed RDF graphs
· Lei Zou
· M. Tamer Özsu
· Lei Chen
· Dongyan Zhao
Received: 30 March 2015 / Revised: 10 September 2015 / Accepted: 17 November 2015 / Published online: 4 January 2016
© Springer-Verlag Berlin Heidelberg 2016
Abstract We propose techniques for processing SPARQL
queries over a large RDF graph in a distributed environment.
We adopt a “partial evaluation and assembly” framework.
Answering a SPARQL query Q is equivalent to finding
subgraph matches of the query graph Q over RDF graph
G. Based on properties of subgraph matching over a distributed
graph, we introduce local partial matches as partial
answers in each fragment of RDF graph G. For assembly,
we propose two methods: centralized and distributed assembly.
We analyze our algorithms both theoretically and
experimentally. Extensive experiments over both real and
benchmark RDF repositories of billions of triples confirm
that our method is superior to the state-of-the-art methods in
both performance and scalability.
Electronic supplementary material The online version of this
article (doi:10.1007/s00778-015-0415-0) contains supplementary
material, which is available to authorized users.
M. Tamer Özsu
Institute of Computer Science and Technology,
Peking University, Beijing, China
David R. Cheriton School of Computer Science,
University of Waterloo, Waterloo, Canada
Department of Computer Science and Engineering,
Hong Kong University of Science and Technology,
Clear Water Bay, Hong Kong, China
Keywords RDF · SPARQL · RDF graph ·
The semantic Web data model, called the “Resource Description
Framework,” or RDF, represents data as a collection of
triples of the form ⟨subject, property, object⟩. A triple can be
naturally seen as a pair of entities connected by a named relationship
or an entity associated with a named attribute value.
Hence, an RDF dataset can be represented as a graph where
subjects and objects are vertices, and triples are edges with
property names as edge labels. With the increasing amount
of RDF data published on the Web, system performance and
scalability issues have become increasingly pressing. For
example, the Linking Open Data (LOD) project builds an RDF
data cloud by linking more than 3000 datasets, which currently
contain more than 84 billion triples. Recent work
shows that the number of data sources doubled within 3
years (2011–2014). The computational and storage
requirements of these rapidly growing datasets have
stressed the limits of single-machine processing.
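To make the graph view of RDF concrete, the following minimal sketch stores a few triples as a labeled directed graph and answers a small SPARQL-style pattern by enumerating variable bindings. It is a naive, single-machine illustration only, not the partial evaluation and assembly method proposed in this paper; the toy triples, the `?`-prefix convention for variables, and the function names `build_graph` and `match` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy dataset: (subject, property, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
]


def build_graph(triples):
    """View triples as a graph: subject -> list of (edge label, object)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def is_var(term):
    """Query terms starting with '?' are variables (assumed convention)."""
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every binding that maps all triple patterns onto data triples.

    Naive backtracking: try each data triple for the first pattern,
    extend the binding if consistent, then recurse on the rest.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for q_term, d_term in zip(first, triple):
            if is_var(q_term):
                # Bind the variable, or check it against its earlier value.
                if candidate.setdefault(q_term, d_term) != d_term:
                    consistent = False
                    break
            elif q_term != d_term:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


# Query graph: ?x knows ?y, and ?y works at ?z.
QUERY = [("?x", "knows", "?y"), ("?y", "worksAt", "?z")]
graph = build_graph(TRIPLES)
results = list(match(QUERY, TRIPLES))
```

Each triple pattern is scanned against the whole triple list, so the cost grows quickly with data size, which is one reason a dataset of billions of triples calls for distributed evaluation.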
There have been a number of recent efforts in distributed
evaluation of SPARQL queries over large RDF
datasets. We broadly classify these solutions into
three categories: cloud-based, partition-based and federated
approaches. These are discussed in detail in Sect. 2; the high-
lights are as follows.
Cloud-based approaches (e.g., [23,27,33,34,37,48,49])
maintain a large RDF graph using existing cloud computing
platforms, such as Hadoop (http://hadoop.apache.org) or
Cassandra (http://cassandra.apache.org), and employ triple
The statistic is reported in http://stats.lod2.eu/.