Technical Perspective
Cormode, Graham
doi: 10.1145/3471485.3471487
Over the past two decades the data management community has devoted particular attention to handling data that arrives as a stream of updates. This captures a number of "big data" scenarios, ranging from monitoring networks to processing high volumes of transactions in commerce and finance. This has led to data streams becoming a mainstream data management topic, with many systems offering explicit support for handling such inputs. Within these systems, streaming algorithms are used to approximate various statistical and modeling queries, which would traditionally require random access to the full data to compute exactly.
A Framework for Adversarially Robust Streaming Algorithms
Ben-Eliezer, Omri; Jayaram, Rajesh; Woodruff, David P.; Yogev, Eylon
doi: 10.1145/3471485.3471488
We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems.
Chiller
Fekete, Alan D.
doi: 10.1145/3471485.3471489
Many computing researchers and practitioners may be surprised to find a "research highlight" that innovates on the way database transactions are processed. Work in the early 1970s, by Turing Award winner Jim Gray and others, established a standard set of techniques for transaction management. These remain the basis of most commercial and open-source platforms [1], and they are still taught in university database classes. So why is important research still needed on this topic? The technology environment keeps evolving, and new performance characteristics mean that new algorithms and system designs become appropriate. This perspective will summarise the early work and point to how the field has continued to progress.
Chiller
Zamanian, Erfan; Shun, Julian; Binnig, Carsten; Kraska, Tim
doi: 10.1145/3471485.3471490
Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive. In fact, the primary goal of existing partitioning schemes is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions.
Technical Perspective DIAMetrics
Boncz, Peter
doi: 10.1145/3471485.3471491
Benchmarking database systems has a long and successful history in making industrial database systems comparable, and is also a cornerstone of quantifiable experimental data systems research. Creating good benchmarks has been described as something of an art [3]. One can draw inspiration for dataset and workload design from "representative" use cases and queries, typically informed by domain experts, but one can also exploit technical insights from database architects into which features, operations, and data distributions should come together to create a particularly challenging task.
DIAMetrics
Deep, Shaleen; Gruenheid, Anja; Nagaraj, Kruthi; Naito, Hiro; Naughton, Jeff; Viglas, Stratis
doi: 10.1145/3471485.3471492
This paper introduces DIAMetrics: a novel framework for end-to-end benchmarking and performance monitoring of query engines. DIAMetrics consists of a number of components supporting tasks such as automated workload summarization, data anonymization, benchmark execution, monitoring, regression identification, and alerting. The architecture of DIAMetrics is highly modular and supports multiple systems by abstracting their implementation details and relying on common canonical formats and pluggable software drivers. The end result is a powerful unified framework that is capable of supporting every aspect of benchmarking production systems and workloads. DIAMetrics has been developed at Google and is being used to benchmark various internal query engines. In this paper, we give an overview of DIAMetrics and discuss its design and implementation. Furthermore, we provide details about its deployment and example use cases. Given the variety of supported systems and use cases within Google, we argue that its core concepts can be used more widely to enable comparative end-to-end benchmarking in other industrial environments.
Technical Perspective of Efficient Directed Densest Subgraph Discovery
Tao, Yufei
doi: 10.1145/3471485.3471493
The directed densest subgraph (DDS) problem is useful in graph mining because dense subgraphs often represent patterns deserving special attention. They could indicate, for example, an authoritative community in a social network, a building block of more complex biological structures, or even a type of malicious behavior such as spamming. See [1, 3] and the references therein for an extensive discussion of the applications of DDS.
Efficient Directed Densest Subgraph Discovery
Ma, Chenhao; Fang, Yixiang; Cheng, Reynold; Lakshmanan, Laks V.S.; Zhang, Wenjie; Lin, Xuemin
doi: 10.1145/3471485.3471494
Given a directed graph G, the directed densest subgraph (DDS) problem is to find a subgraph of G whose density is the highest among all subgraphs of G. The DDS problem is fundamental to a wide range of applications, such as fraud detection, community mining, and graph compression. However, existing DDS solutions suffer from efficiency and scalability problems: on a three-thousand-edge graph, it takes three days for one of the best exact algorithms to complete. In this paper, we develop an efficient and scalable DDS solution. We introduce the notion of [x, y]-core, which is a dense subgraph of G, and show that the densest subgraph can be accurately located through the [x, y]-core with theoretical guarantees. Based on the [x, y]-core, we develop both exact and approximation algorithms. We have performed an extensive evaluation of our approaches on eight large real datasets. The results show that our proposed solutions are up to six orders of magnitude faster than the state of the art.
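To make the problem statement concrete, here is a minimal brute-force sketch in Python. The abstract does not spell out the density measure, so the sketch assumes the definition commonly used in the DDS literature: for vertex sets S and T, the density is |E(S, T)| / sqrt(|S| * |T|), where E(S, T) is the set of edges from S to T. The function names are hypothetical, and the exhaustive search is only an illustration for tiny graphs; it is not the [x, y]-core method described in the paper.

```python
from itertools import combinations
from math import sqrt


def directed_density(edges, S, T):
    """Assumed density: rho(S, T) = |E(S, T)| / sqrt(|S| * |T|)."""
    S, T = set(S), set(T)
    if not S or not T:
        return 0.0
    # Count edges that start in S and end in T.
    crossing = sum(1 for u, v in edges if u in S and v in T)
    return crossing / sqrt(len(S) * len(T))


def nonempty_subsets(nodes):
    """Yield every non-empty subset of nodes as a tuple."""
    for k in range(1, len(nodes) + 1):
        yield from combinations(nodes, k)


def brute_force_dds(edges):
    """Try every (S, T) pair; exponential time, for illustration only."""
    edges = set(edges)
    nodes = sorted({x for e in edges for x in e})
    best = (0.0, (), ())
    for S in nonempty_subsets(nodes):
        for T in nonempty_subsets(nodes):
            rho = directed_density(edges, S, T)
            if rho > best[0] + 1e-12:  # keep the first strict improvement
                best = (rho, S, T)
    return best


# Toy graph: 0 and 1 both point to 2 and 3; one extra edge 3 -> 4.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (3, 4)]
rho, S, T = brute_force_dds(edges)
print(rho, S, T)
```

The search enumerates roughly 4^n pairs of vertex sets for n vertices, which illustrates why exact DDS computation needs far more careful algorithms on graphs of realistic size.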
Technical Perspective
Zhang, Qin
doi: 10.1145/3471485.3471495
One of the most important functionalities of a database system is to answer queries. We are interested in the following question: If there exists more than one answer to the given query, which one should the database report? There are two apparent choices: to return all the valid answers or to return one of them. The problem with the former choice is that it is often time-prohibitive to search for all valid answers. With the latter choice, fairness may become an issue, since the index built for fast search may introduce bias into the query result. For example, the index may favor a certain portion of the input data (e.g., nodes near the root of a tree index) and be more likely to output an answer related to that portion than to other portions. Such bias can sometimes lead to undesirable consequences.
Fair near neighbor search via sampling
Aumuller, Martin; Har-Peled, Sariel; Mahabadi, Sepideh; Pagh, Rasmus; Silvestri, Francesco
doi: 10.1145/3471485.3471496
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and equal opportunity: all points that are within distance r from the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee.
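The fairness requirement can be illustrated with a simple baseline sketch in Python: scan all points, collect those within distance r of the query, and return one uniformly at random. This is an assumption-laden illustration only. The function names are hypothetical, Euclidean distance is assumed, and the linear scan is exactly the cost that the data structures studied in the paper are meant to avoid.

```python
import math
import random


def near_neighbors(points, q, r):
    """Return all points within Euclidean distance r of q (linear scan)."""
    return [p for p in points if math.dist(p, q) <= r]


def fair_rnn_query(points, q, r, rng=random):
    """Return one r-near neighbor of q chosen uniformly at random.

    Every point within distance r has the same probability of being
    returned; None is returned if no such point exists.
    """
    candidates = near_neighbors(points, q, r)
    if not candidates:
        return None
    return rng.choice(candidates)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
rng = random.Random(0)  # seeded only to make the demonstration repeatable
counts = {}
for _ in range(3000):
    p = fair_rnn_query(points, (0.0, 0.0), 1.0, rng)
    counts[p] = counts.get(p, 0) + 1
print(counts)
```

Over many queries, the three points within distance 1 of the origin should each be returned about a third of the time, while the far point (5.0, 5.0) is never returned.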