# I/O efficient: computing SCCs in massive graphs


The VLDB Journal, Volume 24 (2) – Apr 1, 2015
26 pages

Publisher: Springer Berlin Heidelberg
Subject: Computer Science; Database Management
ISSN: 1066-8888
eISSN: 0949-877X
DOI: 10.1007/s00778-014-0372-z

### Abstract

A strongly connected component (SCC) is a maximal subgraph of a directed graph G in which every pair of nodes is mutually reachable within the SCC. With this property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every SCC of G to a single node of the DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all SCCs in a directed graph G is a critical operation. The existing in-memory algorithms based on depth-first search (DFS) can find all SCCs in time linear in the size of the graph. However, when a graph cannot reside entirely in main memory, the existing external and semi-external algorithms for finding all SCCs are limited in the I/O efficiency they can achieve. In this paper, we study new I/O-efficient semi-external algorithms to find all SCCs of a massive directed graph G that cannot reside entirely in main memory. To overcome the deficiency of the existing DFS-based semi-external algorithm, which relies heavily on a total order, we explore a weak order and investigate new algorithms based on it. We propose a new two-phase algorithm consisting of tree construction and tree search. In the tree construction phase, a spanning tree of G is constructed in a bounded number of sequential scans of G. In the tree search phase, the graph is scanned sequentially once to find all SCCs. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques: early acceptance, early rejection, and batch processing. With the single-phase algorithm and these optimizations, we can significantly reduce the number of I/Os and the CPU cost. We prove the correctness of the algorithms. We conduct extensive experimental studies using four real datasets, including a massive real dataset, and several synthetic datasets to confirm the I/O efficiency of our approaches.
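
The in-memory baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes Tarjan's algorithm as the DFS-based linear-time method (the abstract does not name a specific one) and then contracts each SCC to a DAG node. This is not the semi-external algorithm proposed in the paper; the function names `tarjan_scc` and `condense` and the edge-list input format are hypothetical choices made for illustration.

```python
def tarjan_scc(n, edges):
    """Label the SCCs of a directed graph with nodes 0..n-1.

    Returns (comp, k): comp[v] is the SCC id of node v, k is the SCC count.
    The DFS is iterative so that deep graphs do not hit the recursion limit.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    index = [-1] * n        # DFS discovery number, -1 means unvisited
    low = [0] * n           # smallest discovery number reachable so far
    on_stack = [False] * n
    comp = [-1] * n
    stack = []              # nodes not yet assigned to an SCC
    counter = 0
    k = 0

    for root in range(n):
        if index[root] != -1:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack[root] = True
        work = [(root, 0)]  # explicit call stack: (node, next edge position)
        while work:
            v, i = work[-1]
            if i < len(adj[v]):
                work[-1] = (v, i + 1)
                w = adj[v][i]
                if index[w] == -1:
                    index[w] = low[w] = counter
                    counter += 1
                    stack.append(w)
                    on_stack[w] = True
                    work.append((w, 0))
                elif on_stack[w]:
                    low[v] = min(low[v], index[w])
            else:
                work.pop()
                if work:
                    parent = work[-1][0]
                    low[parent] = min(low[parent], low[v])
                if low[v] == index[v]:
                    # v is the root of an SCC: pop its members off the stack
                    while True:
                        w = stack.pop()
                        on_stack[w] = False
                        comp[w] = k
                        if w == v:
                            break
                    k += 1
    return comp, k


def condense(n, edges):
    """Contract every SCC to one node; return (comp, k, sorted DAG edges)."""
    comp, k = tarjan_scc(n, edges)
    dag = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, k, sorted(dag)


# Example: a 3-cycle {0, 1, 2} feeding into a 2-cycle {3, 4}.
example_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
comp, k, dag = condense(5, example_edges)
```

In the example, nodes 0, 1, and 2 receive one SCC id, nodes 3 and 4 receive another, and the contracted graph has a single edge between the two. The sketch keeps the whole adjacency structure in memory, which is exactly the assumption that fails for the massive graphs the paper targets.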

### Journal

The VLDB Journal, Springer Journals

Published: Apr 1, 2015
