TY  - JOUR
AU  - Li, Kun
AU  - Grant, Christan
AU  - Wang, Daisy Zhe
AU  - Khatri, Sunny
AU  - Chitouras, George
AD  - University of Florida, CSE Building (E403, E456, E457), Gainesville, FL 32611, USA (kli@cise.ufl.edu, cgrant@cise.ufl.edu, daisyw@cise.ufl.edu)
AD  - Greenplum, 1900 S Norfolk St #125, San Mateo, CA 94403, USA (Sunny.Khatri@emc.com, George.Chitouras@emc.com)
TI  - GPText: Greenplum parallel statistical text analysis framework
KW  - RDBMS
KW  - Massive parallel processing
KW  - Text analytics
AB  - Many companies keep large amounts of text data inside of relational databases. Several challenges exist in using state-of-the-art systems to perform analysis on such datasets. First, an expensive big data transfer cost must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, the Greenplum parallel statistical text analysis framework, which addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib, an open source library for scalable in-database
DA  - 2013-06-23
UR  - https://www.deepdyve.com/lp/association-for-computing-machinery/gptext-greenplum-parallel-statistical-text-analysis-framework-9ce0iHz8WC
DP  - DeepDyve
ER  - 