Rya: A Scalable RDF Triple Store for the Clouds
US Naval Academy
Laboratory for Telecommunication
Resource Description Framework (RDF) was designed with the ini-
tial goal of developing metadata for the Internet. While the Internet
is a conglomeration of many interconnected networks and comput-
ers, most of today’s best RDF storage solutions are conﬁned to a
single node. Working on a single node has significant scalability
issues, especially considering the magnitude of modern-day data.
In this paper we introduce a scalable RDF data management system
that uses Accumulo, a Google Bigtable variant. We introduce stor-
age methods, indexing schemes, and query processing techniques
that scale to billions of triples across multiple nodes, while pro-
viding fast and easy access to the data through conventional query
mechanisms such as SPARQL. Our performance evaluation shows
that in most cases, our system outperforms existing distributed RDF
solutions, even systems much more complex than ours.
Categories and Subject Descriptors: H.3.2 Information Storage,
H.3.3 Information Search and Retrieval, H.3.4 Systems and Software -
Distributed Systems, H.2.4 Systems - Distributed Databases.
General Terms: Algorithms, Management, Performance.
Keywords: RDF triple store, distributed, scalable.
The Resource Description Framework (RDF) is a family of
W3C specifications traditionally used as a metadata data model,
a way to describe and model information, typically of the World
Wide Web. In its most fundamental form, RDF is based on the
idea of making statements about resources in the form of <subject,
predicate, object> expressions called triples. To specify the title of
the main US Naval Academy web page, one could write the triple
<USNA Home, :titleOf, http://www.usna.edu/homepage.php>. Because
RDF is meant to be a standard for describing Web resources, a
large and ever-expanding set of data, methods must be devised to
store and retrieve such a large data set.
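As an illustration only, and not a description of Rya's storage format, a triple and a simple pattern lookup can be sketched in Python. The Triple type, the match helper, and the second triple with its :maintainedBy predicate are assumptions introduced for this example; only the first triple comes from the text above.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF statement: <subject, predicate, object>."""
    subject: str
    predicate: str
    object: str


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every bound position.

    A position left as None acts as a wildcard, much like a variable
    in a SPARQL triple pattern.
    """
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


triples = [
    # Example triple taken from the text.
    Triple("USNA Home", ":titleOf", "http://www.usna.edu/homepage.php"),
    # Hypothetical triple, added only so the lookup has something to filter.
    Triple("http://www.usna.edu/homepage.php", ":maintainedBy", "USNA Webmaster"),
]

titles = match(triples, predicate=":titleOf")
```

This linear scan touches every triple on every lookup, which is the cost that the indexing methods introduced in this paper are meant to avoid at scale.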
While very efﬁcient, most existing RDF stores [19, 5, 11, 8, 2]
rely on a centralized approach, with one server running very spe-
cialized hardware. With the tremendous increase in data size, such
solutions will likely not be able to scale up.
(c) 2012 Association for Computing Machinery. ACM acknowledges that
this contribution was authored or co-authored by an employee, contractor or
affiliate of the United States government. As such, the United States
Government retains a nonexclusive, royalty-free right to publish or reproduce
this article, or to allow others to do so, for Government purposes only.
Cloud-I ’12, August 31 2012, Istanbul, Turkey.
Copyright 2012 ACM 978-1-4503-1596-8/12/08 ...$15.00.
With improvements in parallel computing, new methods can be
devised to allow storage and retrieval of RDF across large compute
clusters; this allows handling data of unprecedented magnitude.
In this paper, we propose Rya, a new scalable system for storing
and retrieving RDF data in a cluster of nodes. We introduce a new
serialization format for storing the RDF data, an indexing method
to provide fast access to data, and query processing techniques for
speeding up the evaluation of SPARQL queries. Our methods take
advantage of the storing, sorting, and grouping of data that Ac-
cumulo [1, 13] provides. We show through experiments that our
system scales RDF storage to billions of records and provides mil-
lisecond query times.
Accumulo [1, 13] is an open-source, distributed, column-oriented
store modeled after Google’s Bigtable. Accumulo provides random,
real-time read/write access to large datasets atop clusters of
commodity hardware. Accumulo leverages the Apache Hadoop
Distributed File System, the open-source implementation of the
Google File System. In addition to Google Bigtable features,
Accumulo features automatic load balancing and partitioning, data
compression, and fine-grained security labels.
Accumulo is essentially a distributed key-value store that pro-
vides sorting of keys in lexicographical ascending order. Each key
is composed of (Row ID, Column, Timestamp) as shown in Table 1.
Rows in the table are stored in contiguous ranges (sorted by key)
called tablets, so reads of short ranges are very fast. Tablets are
managed by tablet servers, with a tablet server running on each
node in a cluster. Section 3 describes how the locality properties
provided by sorting on the Row ID are used to provide efficient
lookups of triples in Rya.
| Key                                                  | Value |
| Row ID | Column                          | Timestamp |       |
|        | Family | Qualifier | Visibility |           |       |
Table 1: Accumulo Key-Value
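The benefit of sorted keys for short range reads can be sketched with a toy in-memory store. This is an illustrative Python model, not Accumulo's API: the SortedStore class, its methods, and the sample keys are hypothetical; timestamps and visibility labels are stored but not interpreted; and Python's string ordering stands in for Accumulo's ordering of keys.

```python
import bisect
from typing import List, Tuple

# Simplified key following Table 1:
# (row_id, family, qualifier, visibility, timestamp)
Key = Tuple[str, str, str, str, int]


class SortedStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self) -> None:
        self._keys: List[Key] = []
        self._values: List[str] = []

    def put(self, key: Key, value: str) -> None:
        # Binary search for the insertion point so the key list stays sorted.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # overwrite an existing key
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def scan_row_prefix(self, prefix: str) -> List[Tuple[Key, str]]:
        """Return all entries whose row ID starts with `prefix`.

        Because keys are sorted, matching rows are contiguous: seek to the
        first candidate with a binary search, then read until the prefix
        no longer matches.
        """
        start = bisect.bisect_left(self._keys, (prefix,))
        result = []
        for i in range(start, len(self._keys)):
            if not self._keys[i][0].startswith(prefix):
                break
            result.append((self._keys[i], self._values[i]))
        return result


store = SortedStore()
store.put(("person:alice", "cf", "name", "", 1), "Alice")
store.put(("person:bob", "cf", "name", "", 1), "Bob")
store.put(("place:usna", "cf", "name", "", 1), "USNA")
store.put(("person:alice", "cf", "age", "", 1), "30")

people = store.scan_row_prefix("person:")
```

The scan performs one binary search followed by a sequential read of adjacent entries, so its cost depends on the size of the matching range rather than on the size of the whole table.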
We chose Accumulo as the backend persistence layer for Rya
over other notable Google Bigtable variants such as HBase
because it provides a few important extra features. First, Accumulo
provides a server-side iterator model that helps increase performance
by performing large computing tasks directly on the servers
rather than on the client machine, thus avoiding the need to send large
amounts of data across the network. Second, Accumulo provides a
simple cell-level security mechanism for clients that are interested
in such fine-grained security. Triples loaded with specific security
labels can also be queried with the same labels. Third, Accumulo