Abstract
Data wrangling is the process whereby data are cleaned and integrated for analysis. Even with tool support, data wrangling is typically a labour-intensive process. One aspect of data wrangling involves carrying out format transformations on attribute values, for example so that names or phone numbers are represented consistently. Recent research has developed techniques for synthesizing format transformation programs from examples of the source and target representations. This is valuable, but still requires a user to provide suitable examples, something that may be challenging in applications with huge datasets or numerous data sources. In this paper, we investigate the automatic discovery of examples that can be used to synthesize format transformation programs. In particular, we propose two approaches to identifying candidate data examples and validating the transformations that are synthesized from them. The approaches are evaluated empirically using datasets from open government data.

1. INTRODUCTION
Data wrangling is the process of data collation and transformation that is required to produce a dataset that is suitable for analysis. Although data wrangling may be considered to include a range of activities, from source selection, through data extraction, to data integration and cleaning [1], here the focus is on format transformations. Format transformations change the representation of textual information, with a view to reducing inconsistencies. As an example, consider a scenario in which information about issued building permits is aggregated from different data sources. There can be different conventions for most of the fields in a record (e.g. the format of the date when the permit was issued: 2013-05-03 vs. 05/03/2013; the cost of the building: 83319 vs. $83319.00; or the address of the building: 730 Grand Ave vs. Grand Ave, Nr. 730). Such representational inconsistencies are rife within and across data sources, and can usefully be reduced during data wrangling. Data wrangling is typically carried out manually (e.g. by data scientists) with tool support; indeed, data wrangling is often cited as taking a significant portion of the time of data scientists.1 An example of a tool that supports the authoring of format transformations is Wrangler [2], commercialized by Trifacta, in which data scientists author transformation rules with support from an interactive tool that can both suggest, and illustrate the effect of, the rules. Such an approach to data wrangling should lead to good quality results, but is labour-intensive where there are multiple data sources that manifest numerous inconsistencies. In this paper, we address the question: can the production of such format transformations be automated? Automatic solutions are unlikely to match the reach or quality of transformations produced by data scientists, but any level of automation provides the possibility of added value for minimal cost. We build on recent work on the synthesis of transformation programs from examples, originally developed for use in spreadsheets (e.g. FlashFill [3], BlinkFill [4]). In the commercial realization of FlashFill as a plugin to Excel, the user provides example pairs of values that represent source and target representations, from which a program is synthesized that can carry out the transformations.
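As a concrete illustration (our own minimal Python sketch, not FlashFill output), the kind of program synthesized from a single example pair for the date field above behaves like this:

```python
# A hand-written stand-in for a synthesized format transformation.
# FlashFill-style systems would learn an equivalent program from the
# example pair below rather than having it coded by hand.

example = ("2013-05-03", "05/03/2013")  # (source, target) example pair

def transform_date(value: str) -> str:
    """Rearrange YYYY-MM-DD into MM/DD/YYYY."""
    year, month, day = value.split("-")
    return f"{month}/{day}/{year}"

assert transform_date(example[0]) == example[1]
print(transform_date("2014-06-05"))  # -> 06/05/2014
```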
The published evaluations have shown that effective transformations can often be produced from small numbers of examples. An issue with this is that there is a need for examples to be provided by users. While this is potentially fine for spreadsheets, where there is typically a single source and target, and the source is of manageable size, identifying suitable examples seems more problematic if there are large datasets or many sources. How do we scale program synthesis from examples to work with numerous sources? The approach investigated here is to identify examples automatically. To this end, we propose two techniques for identifying pairs of values from two different data sources that represent the same real-world entity. The first proposal makes use of matching candidates and hypothesized functional dependencies to identify pairs of equivalent values. Specifically, given two datasets S and T in which we identify two functional dependency candidates, S.a→S.b and T.c→T.d, and two matching candidates (S.a, T.c) and (S.b, T.d), we pair together values from the right-hand sides of the functional dependencies, S.b and T.d, where their corresponding left-hand sides, S.a and T.c, have equal values. For example, in Fig. 2a and b from Section 3, in order to pair together the values from S.Date and T.Date, we need the functional dependency relationships S.Permit_Nr.→S.Date and T.Permit_Nr.→T.Date, and the matching instances (S.Permit_Nr., T.Permit_Nr.) and (S.Date, T.Date). Notice that the Permit_Nr. columns have equal values. We argue that the resulting pairs can be used as examples for synthesis algorithms to transform the format(s) of the values from S.Date to the format represented in T.Date.

FIGURE 2. Building permits example. (a) Source, (b) target and (c) partial intermediate and final results from the algorithm.

While the previous solution proves to be effective when certain conditions are met, functional dependency candidates are often hard to obtain using specialized tools, due to data inconsistencies or simply because such relationships do not exist. To address this possibility, we propose a second, less restrictive technique, based on string similarities and candidate matching relationships, which proves to have comparable effectiveness with better scalability. Specifically, for each matching pair candidate (S.a, T.b), we do a string similarity-based fuzzy pairing of values from S.a and T.b. Then we assign a confidence measure to each pair of values and use it to select a subset of pairs as examples for synthesis algorithms. For instance, in Fig. 2 from Section 3, we pair the values of S.Date and T.Date that represent the same date using their similarity, without requiring a common tuple identifier such as Permit_Nr. In summary, the main contributions of this paper are: (i) the identification of the opportunity to deploy format transformation by program synthesis more widely through automatic discovery of examples; (ii) the description of two approaches that support (i): an effective but conditional algorithm, and a less effective, more scalable technique; and (iii) the evaluation of the above approaches with real-world datasets.
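As a minimal illustration of the second technique (our sketch; Section 5 describes the actual token-based scheme, for which difflib's character-level ratio is only a stand-in), values from two matched columns can be paired by string similarity, keeping the score as a confidence weight:

```python
from difflib import SequenceMatcher

source_dates = ["2013-05-03", "2014-06-05"]
target_dates = ["06/05/2014", "05/03/2013"]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters; a stand-in for the paper's
    # token-based confidence measure.
    return SequenceMatcher(None, a, b).ratio()

# Pair each source value with its most similar target value.
pairs = [
    max(((s, t, similarity(s, t)) for t in target_dates), key=lambda p: p[2])
    for s in source_dates
]
for s, t, w in pairs:
    print(f"{s} -> {t} (weight {w:.2f})")
```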
Although this paper focuses on a fully automated approach, in which the user is not in the loop, the automated approach could be used to support the bootstrapping phase of pay-as-you-go approaches, in which users subsequently provide feedback on the results of the automatically synthesized transformations. The rest of the paper is organized as follows. Section 2 reviews the work on program synthesis on which this paper builds. Section 3 describes the first proposal for generating examples, based on functional dependency candidates. In Section 4, this technique is evaluated with real-world datasets. Section 5 describes an alternative to functional dependency-based example generation in the form of a string similarity-based technique. We evaluate this method in Section 6, discuss related work in Section 7 and conclude in Section 8. This paper is an invited extended version of [5], the main additional contributions being the inclusion of an additional method for generating examples in Section 5, and its evaluation in Section 6.

2. TECHNICAL CONTEXT
In this section, we briefly review the work on the synthesis of programs for data transformation on which we build; full details are provided with the individual proposals (e.g. [3, 4]). In essence, the approach has the following elements: (i) a domain-specific language within which transformations are expressed; (ii) a data structure that can be used to represent a collection of candidate transformation programs succinctly; (iii) an algorithm that can generate candidate transformation programs that correspond to examples; and (iv) a ranking scheme that supports the selection of the more general programs. Starting from a set of user-provided examples, where an example represents a pair (e.in, e.out), with e.in the value to be transformed and e.out the expected transformation result, this language can express a range of operations for performing syntactic manipulations of strings, such as concatenation (Concatenate, which links together the results of two or more atomic expressions, e.g. SubStr and Pos), as well as more complex expressions that involve conditionals (e.g. Switch) and loops (e.g. Loop). A brief description of the available language constructs is provided in Table 1; more details are provided in [3].

Table 1. FlashFill expressions.
  Expression              Description
  Concatenate(e1,...,en)  Concatenates the results of multiple expressions
  SubStr(v,p1,p2)         The sub-string of v at positions p1 to p2-1
  Pos(r1,r2,c)            The index t of a given string s such that r1 matches some prefix s[0:t-1], r2 matches some suffix s[t:length(s)-1] and t is the cth such match
  SubStr2(v,r,c)          The cth occurrence of regular expression r in v
  ConstStr(s)             The constant string s
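To give a feel for how these constructs compose, the following Python sketch (ours) implements a small fragment of the language; SubStrFrom is a simplified stand-in for a SubStr/Pos combination. It builds the address program used as a running example in this section:

```python
import re

# Constructors returning functions from an input string to an output
# string, mirroring a fragment of the expression language in Table 1.

def ConstStr(s):
    return lambda v: s

def SubStr2(r, c):
    # The c-th occurrence of regular expression r in v (1-based).
    return lambda v: re.findall(r, v)[c - 1]

def SubStrFrom(r):
    # Sub-string from the first match of r to the end of the string;
    # a simplified stand-in for a SubStr(v, Pos(...), ...) expression.
    return lambda v: v[re.search(r, v).start():]

def Concatenate(*exprs):
    return lambda v: "".join(e(v) for e in exprs)

# Street name, the constant ", Nr. ", then the street number.
program = Concatenate(SubStrFrom(r"[A-Za-z]"), ConstStr(", Nr. "),
                      SubStr2(r"[0-9]+", 1))

print(program("730 Grand Ave"))   # -> Grand Ave, Nr. 730
print(program("1885 Maud Ave"))   # -> Maud Ave, Nr. 1885
```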
To illustrate the approach in practice, in Fig. 1, the user provided the first two rows as examples for transforming the Address column, and the synthesizer will try to learn one or more programs consistent with the examples. To this end, for each example e_i provided, the algorithm will start by tokenizing the input value, e_i.in, and the output value, e_i.out, using the character classes depicted in Table 2. Then, it will generate all possible programs, expressed using the domain-specific language, that are consistent with the provided examples. Given that the number of such expressions can be huge, the algorithm chooses the most suitable expressions according to the ranking scheme. For instance, for Row 1, in FlashFill, the following expression is a possible inferred program: Concatenate(t1, ConstStr(', Nr. '), t2).

FIGURE 1. Transformation scenario.

Table 2. Token primitives.
  Primitive  Regex          Description
  Alph       [a-zA-Z]+      One or more letters
  LowAlph    [a-z]+         One or more lowercase letters
  UppAlph    [A-Z]+         One or more uppercase letters
  Num        [0-9]+         One or more digits
  AlphNum    [a-zA-Z0-9]+   One or more letters or digits
  Punct      [\p{Punct}]+   One or more punctuation signs
  Space      [\s]+          One or more spaces

The logic here is to extract the street name, as t1, using the SubStr function, i.e. the sub-string that starts at the index specified by the first occurrence of an alphabet token and ends at the end of the string; then to extract the street number, as t2, using the SubStr2 function, i.e. the first occurring number; and to concatenate these two sub-strings, separated by the constant string ', Nr. ', using the Concatenate function. Although for the scenario in Fig. 1 the above program is consistent with both examples, it is possible for the algorithm to synthesize more than one expression for a set of provided examples. This happens, especially, when the examples describe different formats according to the primitives from Table 2. In such cases, the final program will include all the synthesized expressions, joined by a conditional directive, e.g. Switch. This enables the interpretation of data that is in multiple formats.
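For instance (our illustration, with hypothetical date formats), a Switch-style program amounts to a conditional over format-specific branches, each guarded by a test expressed with the Table 2 primitives:

```python
import re

# Sketch of a Switch-style program: each branch pairs a format test
# with the transformation synthesized for that format.
branches = [
    (re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$"),     # e.g. 2014-06-05
     lambda v: "{1}/{2}/{0}".format(*v.split("-"))),
    (re.compile(r"^[0-9]{2}\.[0-9]{2}\.[0-9]{4}$"),   # e.g. 05.06.2014
     lambda v: v.replace(".", "/")),
]

def switch(value: str) -> str:
    for pattern, transform in branches:
        if pattern.match(value):
            return transform(value)
    raise ValueError(f"no branch matches {value!r}")

print(switch("2014-06-05"))  # -> 06/05/2014
print(switch("05.06.2014"))  # -> 05/06/2014
```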
For the rest of the paper, we describe the formats of the examples using the notion of a format descriptor:

Definition 1. A format descriptor is a unique sequence of regular expression primitives, i.e. from Table 2, that describes one or more values from the column to be transformed, i.e. the source column.

For example, the format descriptor for values such as 730 Grand Ave would describe a number, followed by a space, followed by two words separated by another space. While the above example illustrates a case in which the learned program is able to correctly transform all the values, this is not always the case. In general, the generated transformations are more likely to succeed to the extent that the following hold: (i) the correct transformation program is expressible in the underlying transformation language; (ii) both elements in the example pairs denote values in the domain of the same real-world property; and (iii) taken together, the example pairs cover all or most of the formats used for the column.
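Definition 1 can be made concrete with a short sketch (ours): a value's format descriptor is obtained by scanning it with the Table 2 primitives. Python's re module lacks \p{Punct}, so a stand-in character class is used, and only a subset of the primitives is included:

```python
import re

# A subset of the Table 2 primitives; the finer classes (LowAlph,
# UppAlph, AlphNum) are omitted for brevity.
PRIMITIVES = [
    ("Num", r"[0-9]+"),
    ("Alph", r"[a-zA-Z]+"),
    ("Space", r"\s+"),
    ("Punct", r"[^\w\s]+"),   # stand-in for \p{Punct}
]

def format_descriptor(value: str) -> tuple:
    """Return the sequence of primitive names matching `value`."""
    descriptor, i = [], 0
    while i < len(value):
        for name, regex in PRIMITIVES:
            m = re.match(regex, value[i:])
            if m:
                descriptor.append(name)
                i += m.end()
                break
        else:
            i += 1  # skip characters outside all classes
    return tuple(descriptor)

print(format_descriptor("730 Grand Ave"))
# -> ('Num', 'Space', 'Alph', 'Space', 'Alph')
```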
3. DISCOVERING EXAMPLES—FD-BASED SCHEME
The approach described in Section 2 synthesizes transformation programs from examples, where an example consists of a pair of source and target values, ⟨s, t⟩, where s is a literal from the source and t is a literal from the target. In our running example, s = 730 Grand Ave and t = Grand Ave, Nr. 730. In the spreadsheet setting, given a column of source values, the user provides target examples in adjacent columns, and FlashFill synthesizes a program to transform the remaining values in ways that are consistent with the transformations in the examples. Of course, it may take several examples to enable a suitable transformation (or suitable transformations) to be synthesized. In fact, the user needs to provide enough examples to cover all the relevant patterns existing among the values to be transformed.

3.1. Examples generation
In this section, we propose an approach to the automatic identification of examples, drawing on data from existing datasets. Our aim is to use transformation program synthesis in more complex scenarios than spreadsheets. For example, consider a scenario in which we would like to integrate information about issued building permits from several different sources, and for the result to be standardized to a single format per column. Specifically, we want to represent the columns of the resulting dataset using the formatting conventions used in one of the original datasets, which acts as the target. Providing manual examples to synthesize the transformations needed can be a tedious task that requires knowledge about the formats of the values existing in the entire dataset. In our first approach, the basic idea is to identify examples from the different sources, where the source and target values for an attribute can be expected to represent the same information. To continue with our running example, assume we have two descriptions of an issued building permit, as depicted in Fig. 2a and b. To generate a transformation that applies to the Address columns, we need to know which (different) values in the source and target Address columns are likely to be equivalent. There are different types of evidence that could be used to reach such a conclusion. In the approach described here, we would draw the conclusion that 730 Grand Ave and Grand Ave, Nr. 730 are equivalent from the following observations: (i) the names of the first columns in the two tables match (because of the identical sub-string Permit_Nr. they share); (ii) there is a functional dependency, on the instances given, Permit_Nr.→Address in each of the tables; (iii) the names of the fifth columns (Address) of the two tables match; and (iv) the values for the first columns in the two tuples are the same. Note that this is a heuristic that does not guarantee a correct outcome; it is possible for the given conditions to hold and for the values not to be equivalent, for example if the Address attributes of the source and target tables had different semantics. More formally, assume we have two datasets, source S and target T. S has the attributes (sa1, …, san) and T has the attributes (ta1, …, tam). We want values from S to be formatted as in T. Further, assume that we know instances for S and T. Then we can run a functional dependency discovery algorithm (e.g. [6]) to hypothesize the functional dependencies that exist among the attributes of S and T. This gives rise to collections of candidate functional dependencies for S and T, S.FD = {sai→saj, …} and T.FD = {tau→tav, …}. Note that, in general, sai and tau can be lists of attributes. In addition, assume we have a function, Matches, that given S and T, returns a set of pairwise matches between the attribute names in S and T, Matches(S, T) = {⟨sai, taj⟩, …}. Matches can be implemented using a schema matching algorithm, most likely in our context making use of both schema- and instance-level matchers [7]. In the case where the left-hand sides of the two functional dependencies sai→saj and tau→tav are lists of attributes, we say that sai matches tau if both lists have the same number of elements and there are pairwise matches between the attributes of the two lists. Then Algorithm 1 can be used to compute a set of examples for transformations between S and T. If S.sa and T.ta are lists of attributes, then the join condition used in the SQL query at lines 8–10 is a conjunction of comparisons between the matching elements. While this may seem restrictive, it is a necessary condition for finding suitable example candidates. Figure 2a–c illustrates the intermediate results used by or created in Algorithm 1.

Algorithm 1 Example discovery using functional dependencies.
 1: function FdEgsGen(S, T)
 2:  Egs ← {}
 3:  for all sa ∈ S and ta ∈ T do
 4:   if ⟨sa, ta⟩ ∈ Matches(S, T) then
 5:    for all (sa→sad) ∈ S.FD and
 6:      (ta→tad) ∈ T.FD do
 7:     if ⟨sad, tad⟩ ∈ Matches(S, T) then
 8:      EgVals ← select distinct
 9:        S.sad, T.tad from S, T where
10:        S.sa = T.ta
11:      EgPairs ← (sad, tad, ⟨EgVals⟩)
12:      Egs ← Egs ∪ {EgPairs}
13:     end if
14:    end for
15:   end if
16:  end for
17:  return Egs
18: end function
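For concreteness, a compact Python sketch of Algorithm 1 (ours), restricted to single-attribute left-hand sides, with the Matches result and FD sets supplied as inputs and datasets given as lists of row dicts:

```python
def fd_egs_gen(S, T, S_FD, T_FD, matches):
    """Algorithm 1, assuming single-attribute FD left-hand sides.

    S, T        -- lists of row dicts keyed by attribute name
    S_FD, T_FD  -- sets of (lhs, rhs) attribute pairs
    matches     -- set of (source_attr, target_attr) pairs
    """
    egs = []
    for sa, ta in matches:
        for (s_lhs, sad) in S_FD:
            for (t_lhs, tad) in T_FD:
                if s_lhs != sa or t_lhs != ta:
                    continue
                if (sad, tad) not in matches:
                    continue
                # select distinct S.sad, T.tad from S, T where S.sa = T.ta;
                # the FD ta -> tad makes this index functional.
                t_index = {row[ta]: row[tad] for row in T}
                eg_vals = {(row[sad], t_index[row[sa]])
                           for row in S if row[sa] in t_index}
                egs.append((sad, tad, eg_vals))
    return egs

S = [{"Permit_Nr.": "100", "Address": "730 Grand Ave"}]
T = [{"Permit_Nr.": "100", "Address": "Grand Ave, Nr. 730"}]
print(fd_egs_gen(S, T,
                 {("Permit_Nr.", "Address")}, {("Permit_Nr.", "Address")},
                 {("Permit_Nr.", "Permit_Nr."), ("Address", "Address")}))
```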
3.2. Examples validation
Although Algorithm 1 returns sets of examples that can be used for synthesizing transformations (for example, using FlashFill [3] or BlinkFill [4]), there is no guarantee that the transformation generation algorithm will produce effective transformations. The synthesized transformations may be unsuitable for various reasons, e.g. (i) the required transformation cannot be expressed using the available transformation language, (ii) the data is not amenable to homogenization (e.g. because there is no regular structure in the data), or (iii) there are errors in the data. As a result, there is a need for an additional validation step that seeks to determine (again automatically) whether or not a suitable transformation can be synthesized. In our approach, the set of examples returned by Algorithm 1 is discarded unless a k-fold cross-validation process is successful. In this process, the set of examples is randomly partitioned into k equally sized subsets. Then, transformations are synthesized, in k rounds, using the examples from the other k−1 partitions, and the synthesized transformation is tested on the remaining partition. Although k-fold cross-validation can be used with different thresholds on the fraction of correct results produced, in the experiments we retain a set of examples only if the synthesized transformations behave correctly throughout the k-fold validation process. In our experiments, we used k = 10.
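The validation loop itself is straightforward; a sketch (ours), parameterized by a synthesize function standing in for FlashFill:

```python
import random

def validate(examples, synthesize, k=10):
    """Accept `examples` only if, in each of k rounds, a program
    synthesized from k-1 folds transforms the held-out fold correctly."""
    examples = list(examples)
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [eg for j, f in enumerate(folds) if j != i for eg in f]
        program = synthesize(training)
        if any(program(src) != tgt for src, tgt in held_out):
            return False  # discard the whole example set
    return True
```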
4. EVALUATION OF FD-BASED SCHEME
In this paper, we hypothesize that the process of transforming data from one format into another using the recent work on program synthesis can be automated by replacing the user-provided examples with those identified by Algorithm 1. In a scenario in which information from multiple, heterogeneous datasets is to be integrated, in Wrangler [2] or FlashFill [3] it falls on the data scientists either to identify values that need to be transformed in one dataset and their corresponding versions in a second dataset, or to provide the target versions for some of the source values. Our approach removes the user from the process by automating the identification of examples for use by program synthesis. To illustrate and evaluate this method, we use open government web data, available as CSV files, and we build a processing pipeline that makes use of several existing pieces of work. Specifically, the process starts by using an off-the-shelf matcher to identify column-level matches between a source dataset and a target one (Match); then, for each dataset, an off-the-shelf profiling tool is used to identify functional dependencies, which are then used together with the previously discovered matching column pairs to generate input-output examples for the desired transformations using Algorithm 1. The resulting sets of examples are then validated using a k-fold cross-validation process, as described in Section 3. Note that, in practice, the first two steps of our pipeline are necessary only if the input data do not explicitly contain pre-discovered matching correspondences or functional dependency candidates, e.g. a relational database that contains explicit FDs for its relations. Finally, the validated pairs are fed to a synthesis algorithm. Note that only the generation of examples is claimed as our main contribution and, hence, evaluated in this section. Although they are relevant to the use of the approach in practice, we do not seek to directly evaluate the off-the-shelf components we use: (i) the effectiveness of the implementation of Match; (ii) the effectiveness of the functional dependency detector; or (iii) the effectiveness of the format transformation synthesizer. In all cases, these works have been evaluated directly by their original authors and/or in comparative studies, and we do not seek to replicate such studies here. Rather, we carry out experiments that investigate example discovery, validation and the quality of the synthesized programs, for the approach described in Section 3.

4.1. Experimental setup
In the experiments, we use data from the open government datasets listed in Table 3. For each domain, the URLs represent the location of the source and target datasets, respectively. The last two columns give the size (cardinality and arity) of each dataset. Recall from Section 3 that the FD-based technique relies on the information overlap existing between source and target datasets. Consequently, for the evaluation we picked data on similar domains from different sources that have different levels of overlap. For example, the source dataset for the Food Hygiene domain contains details about food hygiene ratings from 2015 across the whole of the U.K., while the target dataset contains information from a certain area, e.g. a single council or county. Therefore, the overlap between source and target datasets might not be very large. In other words, the size of the source/target is not an accurate estimator of the expected overlap.

Table 3. Data sources.
  Domain             URLs                                       Card.   Arity
  Food Hygiene       ratings.food.gov.uk,                       12323   7
                     data.gov.uk                                61      10
  Lobby              data.cityofchicago.org,                    7542    12
                     data.cityofchicago.org (different years)   210     19
  Doctors/Addresses  www.nhs.uk,                                7696    19
                     data.gov.uk                                17867   3
  Building Permits   app.enigma.io,                             10000   13
                     data.cityofchicago.org                     1245    13
  Citations          dl.acm.org/results.cfm,                    1072    16
                     ieeexplore.ieee.org                        2000    11
For Matching, we used COMA 3.0 Community Edition,2 with schema- and instance-level matchers, and the following parameter values: MaxN = 0 and Threshold = 0.2. The first parameter specifies that attributes can only be in 1:1 matching relationships with each other. The second specifies the minimal confidence a correspondence between two elements must reach in order to be accepted as a match. For functional dependency discovery, we used HyFD3 [6] with the Null Semantics setting set to null ≠ null. This setting is needed because real-world data often contain null values. For a schema S(A, B) with two tuples r1 = (null, 1) and r2 = (null, 2), if null = null then A→B is not a functional dependency. In order to avoid discarding functional dependencies in such situations, we assume null ≠ null.
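The effect of the null semantics can be seen in a small check (our sketch, with None modelling SQL null): under null = null the two tuples above share a left-hand side value but disagree on the right-hand side, so A→B is rejected; treating each null as distinct retains the dependency:

```python
def fd_holds(rows, lhs, rhs, null_equals_null=False):
    """Check lhs -> rhs on row dicts; None models SQL null."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key is None and not null_equals_null:
            continue  # each null is distinct: it can never clash
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

rows = [{"A": None, "B": 1}, {"A": None, "B": 2}]
print(fd_holds(rows, "A", "B", null_equals_null=True))   # False
print(fd_holds(rows, "A", "B", null_equals_null=False))  # True
```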
For synthesizing and running transformations, we used a Java implementation of FlashFill described in [8]. All experiments were run on Ubuntu 16.04.1 LTS on a machine with a 3.40 GHz Intel Core i7-6700 CPU and 16 GB RAM.

4.2. Experiment 1
Can Algorithm 1 identify candidate column pairs? To evaluate this, we report the precision (TP/(TP+FP)) and recall (TP/(TP+FN)) of the algorithm over datasets where there are known suitable candidates. For this experiment, true positive (TP), false positive (FP) and false negative (FN) are defined as follows: TP—the two columns represent the same concept; FP—the two columns do not represent the same concept; and FN—a correct result that is not returned. The ground truth for computing precision and recall in this experiment was manually created for the domains in Table 3—a column from the source and a column from the target create a pair in the ground truth if they represent the same concept, e.g. the Business Name columns in Food Hygiene, the Author Names columns in Citations, etc. Examples of candidate column pairs that are identified by Algorithm 1 are given in Table 4. The last column gives the number of unique example pairs generated by Algorithm 1 for each case.

Table 4. Transformation candidates.
  #   Domain          Source semantics  Source example             Target semantics  Target example       # egs
  1   Food Hyg.       Rating date       2015-12-01                 Rating date       01/12/2015           57
  2   Food Hyg.       Building name     Royal Free Hospital        Street            Pond Street          59
  3   Food Hyg.       Business name     Waitrose                   Business name     Waitrose             60
  4   Lobby           Person name       Mr. Neil G Bluhm           Person name       Bluhm Neil G         206
  5   Lobby           Phone nr.         (312) 463–1000             Phone nr.         (312) 463–1000       191
  6   Docs./Addrs.    Address           55 Swain Street, Watchet   Street            Swain Street         41
  7   Docs./Addrs.    City              Salford                    City              Manchester           28
  8   Build. Permits  Address           1885 Maud Ave              Address           Maud Ave, Nr. 1885   26
  9   Build. Permits  Cost              6048                       Cost              $6048.00             22
  10  Build. Permits  Issue date        2014-06-05                 Issue date        06/05/2014           26
  11  Citations       Author names      Sven Apel and Dirk Beyer   Author names      S. Apel; D. Beyer    56
  12  Citations       Conf. date        ” ”                        Conf. date        2–8 May 2010         32
  13  Citations       Pub. year         2011                       Pub. year         2011                 56
  14  Citations       Nr. of pages      10                         Start page        401                  56

The results for this experiment are presented in Table 5, for the datasets in Table 3. In general, both precision and recall are high. In the case of the Building Permits domain, there are three attributes in each dataset representing a cost (see Row 9 in Table 4 for an example). Although there are nine possible source column-target column alignments, Algorithm 1 was able to identify the correct ones and returned no false positives. Food Hygiene precision is 0.83 due to a FP match identified by COMA.
An example of this match is on the second row of Table 4. The two columns have the same name (AddressLine1), but different semantics. The precision and recall for Citations are reduced by one FP match identified by COMA (shown in the last row of Table 4) and two FNs (i.e. two pairs of columns that were not reported as matches by COMA).

Table 5. Experiments 1 and 2.
  Domain        Candidates  Precision  Recall  Valid
  Food Hyg.     6           0.83       1.00    5/6
  Lobby         9           1.00       1.00    9/9
  Docs./Addrs.  2           1.00       1.00    1/2
  Build. Perm.  12          1.00       1.00    12/12
  Citations     7           0.86       0.75    1/7

4.3. Experiment 2
Is the validation process successful at identifying problematic column pairs? A column pair is problematic if we cannot synthesize a suitable transformation. Note that in Experiment 1, we measured the effectiveness of Algorithm 1 at identifying correct column pairs. Here we check whether the alignment of values, i.e. the result of the select at lines 8–10 in Algorithm 1, denotes suitable examples for synthesizing a correct transformation. To evaluate this, we run a k-fold cross-validation task, with k = 10, on each pair of columns from Experiment 1. We investigate here the candidate pairs for which validation has failed; the cases for which validation has passed are discussed in Experiment 3. A pair of columns is considered to pass the validation step if all transformations are correct for each of the 10 iterations. The fraction of the candidate column pairs that pass validation is reported in the last column of Table 5. Column pairs failed validation for the following reasons: (i) The matched columns have different semantics, and thus incompatible values, for which no transformation can be produced. This is the case for 2 of the 8 column pairs that fail validation (for example, see Rows 2 and 14 in Table 4). (ii) The matched columns have the same semantics, but FlashFill has failed to synthesize a suitable transformation. This is the case for 2 of the column pairs that fail validation (for example, consider the lists of author names in Row 11 of Table 4). (iii) There are issues with the specific values in candidate columns. This is the case for 4 of the column pairs that fail validation; examples include missing information (Row 12 in Table 4) and inconsistent values (Row 7 in Table 4). It is important to note that, in practice, the effectiveness of the off-the-shelf tools that we used here can be affected by characteristics of the data such as the ones exemplified above. This evaluation shows that the validation method we employ is able to filter out example sets that would otherwise produce invalid transformations or no transformations at all.

4.4. Experiment 3
Do the synthesized transformations work on the complete datasets (and not just the training data)?
To evaluate this, we report the precision and recall of the validated transformations in Table 6; the missing row numbers are the examples from Table 4 for which the transformations failed validation. In computing the precision and recall, we use the following definitions: TP—the transformation produces the correct output; FP—the transformation produces an incorrect output; FN—the transformation results in an empty string.

Table 6. Experiment 3.
  #   Source semantics  Source value               Target semantics  Target value         Precision  Recall
  1   Rating date       2015-12-01                 Rating date       01/12/2015           1.0        1.0
  3   Business name     Waitrose                   Business name     Waitrose             1.0        1.0
  4   Person name       Mr. Neil G Bluhm           Person name       Bluhm Neil G         0.98       0.84
  5   Phone nr.         (312) 463–1000             Phone nr.         (312) 463–1000       1.0        1.0
  6   Address           55 Swain Street, Watchet   Street            Swain Street         0.68       1.0
  8   Address           1885 Maud Ave              Address           Maud Ave, Nr. 1885   1.0        1.0
  9   Cost              6048                       Cost              $6048.00             0.97       1.0
  10  Issue date        2014-06-05                 Issue date        06/05/2014           1.0        1.0
  13  Pub. year         2011                       Pub. year         2011                 1.0        1.0

Of the 28 validated transformations from Table 5, all but 6 are identity transformations, i.e. the source and target data values are the same (e.g. Rows 3, 5 and 13 in Table 6). This can often happen in practice. For instance, Row 13 represents the year of a publication, which is most commonly represented as a four-digit number. In such cases, FlashFill proved able to identify that the transformation is only a copying operation. Of the 6 cases where the values are modified, the precision and recall are both 1.0 in three cases (Rows 1, 8 and 10 in Table 6). For Rows 1 and 10, the transformations rearrange the date components and replace the separator. Our experiments confirmed the results of [3], according to which FlashFill is effective in such simple cases.
For Row 8, the transformation is more complicated, given that the street name does not always have a fixed number of words and the street number can have several digits. In this case, Algorithm 1 was able to identify enough examples to cover all the existing formats. There were problems with the transformations generated in three cases. For Row 4, a few source values do not conform to the usual pattern (e.g. the full stop is missing after the salutation). For Row 9, not all source values are represented as integers, giving rise to incorrect transformations. For Row 6, similarly to Row 9, the examples do not cover all the relevant address formats, i.e. 41 examples are used to synthesize a program to transform a rather large number of values (approx. 7700).

4.5. Discussion
The technique evaluated above proves to be effective in scenarios where certain conditions are met. The most important of these is that the source and target contain overlapping information for some of the tuples, i.e. on the left-hand sides of the functional dependencies used in Algorithm 1, and format diversity on the corresponding right-hand sides. Another important condition is that FlashFill is able to synthesize transformations from the pairs of generated examples. The evaluation shows that, as long as these conditions are met, we can delay the need for user intervention in the cleaning process by synthesizing and applying some transformations automatically.

5. DISCOVERING EXAMPLES—WEIGHTED SCHEME
The technique described in the previous section uses a set of hypothesized functional dependencies to pair values from a source dataset S:(sa1, …, san) and a target dataset T:(ta1, …, tam) that represent the same real-world entity. While the evaluation in Section 4 showed that Algorithm 1 can be effective in some scenarios, there are cases where the conditions required by the algorithm are not satisfied. For instance, candidate functional dependencies may be missing, or state-of-the-art algorithms may be unable to find them due to inconsistencies in values. Furthermore, the number of example pairs generated by Algorithm 1 can be very large while the number of cases covered by the examples is small. For instance, for Row 1 in Table 4, the algorithm generated 57 example pairs, all describing a single date format. Since the complexity of synthesis algorithms such as FlashFill is known to be exponential in the number of examples and high-degree polynomial in the size of each example [9], not only are many such examples useless, but they increase the runtime of the synthesis process. To address these scenarios, in this section we propose an approach to the automatic identification of examples using syntactic similarities existing between the values of matching column pairs, as returned by the Matches function introduced in the previous section. We also propose an incremental example selection algorithm which, once the example pairs are generated, selects a subset from which there is evidence that FlashFill can synthesize effective transformations. The objective of the example discovery technique, described in Section 5.1, is, given two columns from different tables, to heuristically identify pairs of values that are equivalent. To this end, we start by tokenizing each column value, where each token is a substring of the original value delimited by punctuation or spaces. The tokens enable the grouping of values into blocks, where each block contains values from both columns with common tokens. An intra-block, pairwise comparison of values is then conducted to determine potentially equivalent instances. Given that the syntactic comparison does not guarantee the optimal pairing of equivalent values, each candidate pair obtained has an associated weight. This measure is used in the second phase, described in Section 5.2, to determine the minimal subset of candidate value pairs that can be used as examples to synthesize an effective transformation program.
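A sketch (ours) of the tokenization and blocking just described; values from either column that share a token land in the same inverted-list entry:

```python
import re
from collections import defaultdict

def tokenise(value: str):
    # Split on punctuation and spaces, as described in Section 5.1.
    return [t for t in re.split(r"[^\w]+", value) if t]

def index(source_values, target_values):
    """Inverted index: token -> values (from either column) containing it."""
    inv = defaultdict(list)
    for side, values in (("sa", source_values), ("ta", target_values)):
        for v in values:
            for tok in set(tokenise(v)):
                inv[tok].append((side, v))
    return inv

inv = index(["730 Grand Ave"],
            ["Grand Ave, Nr. 730", "Grand Central Ave, Nr. 331"])
print(inv["Ave"])
# [('sa', '730 Grand Ave'), ('ta', 'Grand Ave, Nr. 730'),
#  ('ta', 'Grand Central Ave, Nr. 331')]
```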
5.1. Weighted examples generation
Consider a pair of columns (S.sa, T.ta), described in Fig. 3, between which there is a matching correspondence as returned by the Matches function. The objective is to identify which values in sa and ta are likely to be equivalent. Starting from the assumed matching relationship between the two columns, we can pair values based on their string representations and determine that, for example, 730 Grand Ave and Grand Ave, Nr. 730 represent the same address.

FIGURE 3. Matching column pair.

More formally, assume we have two datasets, source S and target T. S has the attributes (sa1, …, san), and T has the attributes (ta1, …, tam). We want values from S to be formatted as in T. Further, assume we have a function, Matches, that returns a set of pairwise matches between the attribute names in S and T. Then, we can use Algorithm 2 to pair values from sai and taj that are likely to be equivalent. The obtained pairs can then be used as examples for synthesis algorithms to generate transformation programs that will format values from sai as in taj. Figure 3 depicts a pair of matching columns, each with four values. Notice that in a real-world case, the numbers of tuples in sa and ta will most likely be different. For the rest of this paper, we define the example set as the collection of pairs (e_i.in, e_i.out), with e_i.in a value from sa and e_i.out its equivalent value from ta. For example, if e_i.in is Row 1 from sa, then e_i.out is Row 4 from ta. Analogously, we define the test set as the collection of values from sa for which there are no corresponding values identified in ta, i.e. the set of values that will be transformed by the transformation program synthesized using the example set.

Algorithm 2 Weighted examples discovery using string similarities.
 1: function wEgsGen(S, T)
 2:  Egs ← {}
 3:  for all ⟨sa, ta⟩ ∈ Matches(S, T) do
 4:   Toks ← Tokenise(sa)
 5:   Tokt ← Tokenise(ta)
 6:   Idx ← Index(Toks, Tokt)
 7:   Egs ← Egs ∪ WeightedPairing(Idx)
 8:  end for
 9:  return Egs
10: end function
When applied on the column pair from Fig. 3, Algorithm 2 comprises the following steps:

Tokenise—lines 4,5: For each value in columns sa and ta, the Tokenise method transforms the string representation of the value into an array representation, where each element of the array is a token as defined by the list of regular expression-based primitives from Table 2. The last primitives, punctuation and space, are used as separators. The intuition is that a punctuation sign has small significance in determining the similarity of two values, e.g. the separators have a small weight in determining the equivalence of 24/04/1989 and 04.24.1989. To illustrate this, Fig. 4 describes the tokenized representations of values from Fig. 3.

FIGURE 4. Tokenized values.

Index—line 6: The tokens are then used to create an inverted index I(sa, ta) for the pair of matching columns. For each token t, the inverted list I[t] is a list of all values from sa and ta which contain t. For example, if t = Ave, I[t] = [sai.'730 Grand Ave', taj.'Grand Ave, Nr. 730', …], i.e. all values from sa and ta containing token Ave. Figure 5 shows three index entries for tokenised values from Fig. 4; each row denotes an index entry with the Token column being the key and Inverted list being the list of values containing the token.

FIGURE 5. Indexed values.

Weighted pairing—line 7: Each source value in each index entry is paired, by Algorithm 3, with the most similar target value according to a confidence measure. For instance, if t = Ave and I[t] = [sa.'730 Grand Ave', ta.'Grand Ave, Nr. 730', ta.'Grand Central Ave, Nr. 331'], then the returned pair would be (sa.'730 Grand Ave', ta.'Grand Ave, Nr. 730'). Similar examples can be seen in Fig. 6 for indexed values from Fig. 5.

Algorithm 3 The WeightedPairing function of Algorithm 2.
 1: function WeightedPairing(Idx)
 2:  Egs ← {}
 3:  for all e ∈ Idx.entries do
 4:   pairs ← SimPairing(e)
 5:   maxPair ← null
 6:   for all p ∈ pairs do
 7:    if maxPair = null or maxPair.weight < p.weight then
 8:     maxPair ← p
 9:    end if
10:   end for
11:   Egs ← Egs ∪ {maxPair}
12:  end for
13:  return Egs
14: end function

Each pair produced by SimPairing carries a weight: the confidence, based on the string similarity of the two values, that they are equivalent. For each index entry, the pairing with the highest weight is retained.

5.2. Examples selection
The weighted pairs returned by Algorithm 2 are candidate examples, but they typically contain redundancy (many pairs describing the same format) and occasional incorrect pairings. Algorithm 4 therefore incrementally selects a minimal subset of pairs from which FlashFill can synthesize an effective transformation program. Its steps are illustrated below on a US Congress scenario (Fig. 9), in which person names are to be transformed to last names. InitEgs: The example set is initialized with, for each format descriptor present among the candidate pairs, the pair carrying the highest weight. The candidate pairs in Fig. 9 describe three formats, covering Rows 1–3, Rows 4–7 and Rows 8–9. Note that the pairs with the same format are grouped together for clarity—this might not be the case in a real-world scenario. Note also that both punctuation and space count as token types rather than separators (as was the case for the tokenization step in the previous section). The three examples from Fig. 8 have the highest weights for their respective formats.

FIGURE 8. Initialization phase.

Synthesise—lines 4,10: This method, illustrated in Algorithm 5, takes as input a set of example pairs, e.g. the pairs with the highest weights as returned by the InitEgs method, and a set of test pairs, e.g. the pairs with lower weights than the ones considered as examples. Then, a transformation program is synthesized using FlashFill from the example set (line 2 in Algorithm 5). The resulting program is then tested against each pair from the test set. If the result of a transformation is different from the expected test output (line 4 in Algorithm 5), and if there has not been a previous value describing the same format that was correctly transformed (line 5 in Algorithm 5), then that failing pair is returned. The IsFirst(p) method at line 5 checks whether pair p has the highest weight for the format it describes.

Algorithm 5 The Synthesise method of Algorithm 4.
 1: function Synthesise(Egs, TestEgs)
 2:  τ ← FlashFill(Egs)
 3:  for all p ∈ TestEgs do
 4:   if τ(p.in) ≠ p.out then
 5:    if IsFirst(p) then
 6:     return p
 7:    end if
 8:   end if
 9:  end for
10:  return ϵ
11: end function

Increment—line 9: Every time the Synthesise method returns a failing pair, it is added to the previous set of examples. The intuition is that the initial set of examples did not cover the format described by the failing pair; therefore, by considering the pair an example, the format will be covered.

Halting condition—line 6: Algorithm 4 ends when there are no failing test pairs, or when the halting condition is met: all the pairs describing a failing format have been used as examples. This means that there are not enough examples for that format; the algorithm returns and the example set is rejected.

To continue with our US Congress example, when Synthesise is called at line 4 in Algorithm 4, the pairs from Fig. 8 will be used as the example set to synthesize a transformation program τ, which will then be tested against the rest of the values from Fig. 9. Notice that for Row 5 of Fig. 9, it is likely that the transformation synthesized from the examples depicted in Fig. 8 will fail, i.e. τ(Joe Ackerman-Specter) ≠ Ackerman-Specter, because Ackerman alone is not the last name. The pair at Row 4 has the highest weight of the test pairs describing the same format f, meaning that f has not been covered by the previous example set, so the pair will be added to the example set at line 9 in Algorithm 4, and a new iteration started. Of particular interest is Row 7 in Fig. 9. Notice that, given the values exemplified so far, there is no transformation program that can transform Walter Ben-Stock to Benjamin-Stock. If the previous pair (Row 6), which has a higher weight, is correctly transformed, i.e. τ(Tammy Liu-Vitter) = Liu-Vitter, then the pair at Row 7 will be considered a false positive returned by Algorithm 2 and ignored, and the algorithm will continue towards a successful result. This result will include Rows 1, 4, 5 and 8 of Fig. 9, i.e. the minimal set of examples that can produce a transformation program that correctly transforms the rest of the pairs. Otherwise, if τ(Tammy Liu-Vitter) ≠ Liu-Vitter and τ(Walter Ben-Stock) ≠ Benjamin-Stock, the halting condition is met, i.e. there are no more pairs for the format described by these two failing pairs, and the example set generated by Algorithm 2 is rejected. At this point, we consider that there is no need for further validation of the returned set of examples, e.g. by employing a k-fold cross-validation step similar to the one from Section 3. The intuition is that if Algorithm 4 was successful, there is enough evidence to consider that FlashFill can synthesize a transformation program that will correctly transform the source values covered by the obtained examples. Furthermore, if applied on the output of Algorithm 4, the cross-validation described in Section 3 would fail, because it requires at least two example pairs per format in the candidate example set. In other words, the validation technique requires redundant example candidates, which is exactly what Algorithm 4 tries to remove.
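The following Python sketch captures our reading of the selection loop (Algorithms 4 and 5 combined); flashfill is an assumed synthesizer callable, and each candidate pair is an (input, output, weight, format) tuple:

```python
def select_examples(candidates, flashfill):
    """Sketch of the incremental selection loop (our reading of
    Algorithms 4 and 5); `flashfill` maps example pairs to a
    string -> string program.

    candidates: (inp, out, weight, fmt) tuples from the weighted pairing.
    Returns a minimal example set, or None if the set is rejected."""
    candidates = sorted(candidates, key=lambda c: -c[2])  # by weight
    # InitEgs: the highest-weight pair per format descriptor.
    examples = [c for i, c in enumerate(candidates)
                if all(d[3] != c[3] for d in candidates[:i])]
    while True:
        program = flashfill([(c[0], c[1]) for c in examples])
        tests = [c for c in candidates if c not in examples]
        failing, covered = None, set()
        for c in tests:  # weight order: first of its format is "IsFirst"
            if program(c[0]) == c[1]:
                covered.add(c[3])
            elif c[3] not in covered:  # lower-weight failures are treated
                failing = c            # as false-positive pairings
                break
        if failing is None:
            return examples            # every format is now covered
        examples.append(failing)       # Increment
        if all(c[3] != failing[3] for c in candidates
               if c not in examples):  # Halting: no pairs left for a
            return None                # still-failing format; reject
```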
6. EVALUATION OF WEIGHTED SCHEME
In this section, we evaluate the effectiveness of Algorithms 2 and 4. As was the case with the functional dependency-based proposal, we hypothesize that the task of providing examples by the user, which is often required in current wrangling tools, can be partly replaced by Algorithms 2 and 4. Note that only the examples generation technique, from Algorithms 2 and 3, and the examples selection technique, from Algorithms 4 and 5, are claimed as our contributions. We use an off-the-shelf schema matcher to identify column-level matches between a source dataset and a target dataset, and the implementation of a synthesis algorithm described in [8] to determine the transformation programs, but we do not evaluate the effectiveness of these solutions. Rather, we carry out experiments that investigate the approaches to example generation and selection described in the previous section, and the quality of the transformation programs synthesized from the results of our approaches.

6.1. Experimental setup
In the experiments, we used publicly available data from nine different domains, describing five types of data in different formats: addresses, monetary values, person names, dates and numbers. Table 8 summarizes the datasets used. For each domain, the URLs represent the location of the source and target datasets, respectively. The last two columns give the size (cardinality and arity) of each dataset.

Table 8. Data sources used in experiments.
  Domain             URLs                      Card.   Arity
  Food Hygiene       ratings.food.gov.uk,      12323   7
                     data.gov.uk               61      10
  Lobbyists          data.cityofchicago.org,   7542    12
                     data.cityofchicago.org    210     19
  Doctors/Addresses  www.nhs.uk,               7696    19
                     data.gov.uk               17867   3
  Building Permits   app.enigma.io,            10000   13
                     data.cityofchicago.org    1245    13
  US Congress        opensecrets.org,          620     9
                     govtrack.us/congress      11870   10
  Restaurants        app.enigma.io,            25000   19
                     data.ny.gov               21000   18
  Movies             imdb.com,                 75000   4
                     imdb.com                  15000   5
  Employment         data.cityofchicago.org,   31000   8
                     data.cityofchicago.org    12000   7
Once again, similarly to the experiments in Section 4, for matching we used COMA 3.0 Community Edition with a similar set of parameters. For synthesis, we used the FlashFill implementation described in [8]. All experiments were run on Ubuntu 16.04.1 LTS on a machine with a 3.40 GHz Intel Core i7-6700 CPU and 16 GB of RAM.

6.2. Experiment 1

The question we try to answer in this section is: does the synthesis algorithm produce effective transformation programs from the example pairs generated by Algorithms 2 and 4? This is equivalent to the question analysed in Experiment 3 of Section 4. Analysing how effective the weighted scheme is at identifying candidate column pairs, i.e. an analysis similar to Experiment 1 of Section 4, would amount to measuring the effectiveness of Match, since Algorithm 2 produces example instances for each matching pair returned by Match. While the matching function is an essential part of our end-to-end method, schema matching evaluation is not the focus of this paper. As for validation, the weighted approach does not employ a k-fold cross-validation stage; it relies on Algorithm 4 to discard candidate pairs when there is not enough evidence for synthesizing transformations (as described in the previous section). Some candidate pairs for which example selection failed are discussed below.

For each matching column pair returned by the Matches function, we first run Algorithm 2, pass its result to Algorithm 4, and then, using the output of the latter and the FlashFill implementation, synthesize a transformation program and apply it to the test set. Recall that, for each matching column pair (sa, ta), the test set is the set of values from sa for which Algorithm 2 determined no corresponding values from ta. We report the precision and recall of the transformations applied to each test set using the following definitions: TP—the transformation produces the correct output; FP—the transformation produces an incorrect output; FN—the transformation produces an empty string.
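Under these definitions, precision is TP/(TP + FP) and recall is TP/(TP + FN). The helper below makes the scoring concrete; it is an illustrative sketch, not our actual evaluation harness, and assumes a transformation program tau that returns an empty string when it cannot transform a value.

def score_transformation(tau, test_pairs):
    """Compute precision and recall under the definitions above. Each element
    of `test_pairs` is a (source, expected_target) tuple: a result equal to
    the expected target is a TP, an empty string is an FN, and any other
    output is an FP."""
    tp = fp = fn = 0
    for source, expected in test_pairs:
        out = tau(source)
        if out == expected:
            tp += 1
        elif out == "":
            fn += 1
        else:
            fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall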
The results of the experiment are illustrated in Table 9. Note that the cases illustrated here include only candidate column pairs for which the synthesized transformation is different from the identity transformation, e.g. Rows 3 and 13 from Table 4. Our proposed algorithms can trivially identify examples for the identity transformation, i.e. where source and target strings are the same, and we have seen in Section 4 that FlashFill is able to synthesize such simple transformations. Therefore, our focus here is on more challenging cases.

Table 9. Evaluation results.

#   Domain        Type        Source                    Target             Alg.2  Alg.4  Prec  Rec
1   Food Hyg.     Date        2015-12-01                01/12/2015         57     1      1.00  1.00
2   Lobbyists     Pers. name  Mr. Neil G Bluhm          Bluhm Neil G       206    25     0.75  0.71
3   Docs./Addr.   Address     55 Swain Street, Watchet  Swain Street       2819   N/A    N/A   N/A
4   Build. Perm.  Address     1885 Maud Ave             Maud Ave,Nr. 1885  1051   14     0.90  1.00
5   Build. Perm.  Cost        6048                      $6048.00           257    3      0.99  1.00
6   US Congress   Pers. name  Pete Stark (D-Calif)      Stark              473    34     0.77  1.00
7   US Congress   Pers. name  Pete Stark (D-Calif)      Pete               279    25     0.75  1.00
8   US Congress   Pers. name  Pete Stark (D-Calif)      Calif              23     6      0.94  1.00
9   Restaurants   Address     41 Page Avenue, Delhi     Page Avenue        1401   N/A    N/A   N/A
10  Restaurants   Date        2014-06-23 00:00:00+00    06/23/2014         1148   1      1.00  1.00
11  Movies        Actor name  James Brown               james_brown        1872   N/A    N/A   N/A
12  Movies        Pers. name  Perry Lang                perry_lang         1680   17     0.79  0.98
13  Employment    Pers. name  OUTTEN, MIA G             Mia G Outtent      8903   22     0.97  0.99
14  Addresses     Address     2440 N Cannon Drive       Cannon Dr          21     4      0.87  1.00
The columns of Table 9 (starting with the second column) give the domain of the matching column pair, its data type, a source instance value, the corresponding target instance value, the number of example pairs returned by Algorithm 2, the number of example pairs returned by Algorithm 4, the precision, and the recall, respectively.

6.3. Discussion

Of the 14 cases in Table 9, Algorithm 4 rejected three example sets: Rows 3, 9 and 11. For Rows 3 and 11, the halting condition was met, while in the case of Row 9, FlashFill was unable to synthesize a transformation program. For the remaining matching column pairs, recall dropped below 0.98 in only one case: Row 2. This is because person names often contain common tokens, which can lead to highly weighted false positives being returned by Algorithms 2 and 4. The same holds for precision: all of the cases with precision lower than 0.80 involve person names as well. In other words, providing false-positive examples to FlashFill decreases the accuracy of the synthesized transformations.

Notice that columns 6 and 7 of Table 9 show large differences between the number of example pairs generated by Algorithm 2 and the number of pairs selected by Algorithm 4. This suggests that many of the pairs generated by the former cover a relatively small number of formats. For instance, for the Food Hygiene domain, the source values of all 57 example pairs generated by Algorithm 2 describe a single date format, e.g. Row 1; therefore, Algorithm 4 returned a single pair. It is important to note that while the source values of the example pairs may contain dates in a single format, the test set might contain values in different formats. When that is the case, recall decreases, e.g. Row 12: some names from the source column have a middle name, but this format is not represented in the examples. This suggests that, for all cases in Table 9 with perfect recall, each format present in the test set is represented at least once in the example set.
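The notion of a value's format can be made concrete with a character-class signature. The sketch below is only an illustration of the idea, assuming a deliberately simple signature (runs of letters become A, runs of digits become 9, whitespace is collapsed); the format abstraction actually used by our algorithms is defined earlier in the paper.

import re

def format_signature(value):
    """Rough, illustrative format descriptor: abstract runs of letters and
    digits and collapse whitespace, keeping punctuation literally."""
    sig = re.sub(r"[A-Za-z]+", "A", value)
    sig = re.sub(r"[0-9]+", "9", sig)
    return re.sub(r"\s+", " ", sig)

def formats_covered(example_sources, test_sources):
    """True when every format in the test set also occurs among the example
    sources, the coverage condition associated with perfect recall above."""
    covered = {format_signature(v) for v in example_sources}
    return {format_signature(v) for v in test_sources} <= covered

# For instance, format_signature("2015-12-01") == "9-9-9", while a name with
# a middle initial ("Perry B Lang" -> "A A A") has a different signature
# from "Perry Lang" ("A A"), which is the situation behind Row 12.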
6.4. Scheme comparison

In this section, we provide a comparative study of the two schemes for generating examples presented in this paper. We analyse the computational cost of each technique, i.e. Algorithms 1, 2 and 4, the computational cost of transformation synthesis using examples generated by each method, and the consequences (if any) of removing redundant examples with Algorithm 4. We do not analyse here the complexity of the other algorithms used in our experiments, e.g. Match and FlashFill, as this has been done in their original papers.

Table 10 gives details of the cases from Table 9 for which examples were generated using both schemes. The other cases from Table 9 were only used with the weighted technique because the conditions for the FD-based approach were not met, e.g. FDs could not be discovered. The first column reports the row identifier from Table 9; the remaining columns give the source/target cardinality, the number of example pairs returned by Algorithm 1, the numbers of example pairs returned by Algorithms 2 and 4, and the time in seconds required by each algorithm to generate the examples. Note that we only analyse cases for which the validation stage was passed (for the FD-based scheme) and for which the selection stage (weighted scheme) returned at least one example pair.

Table 10. Scheme comparison.

#  Source/target size  Alg.1  Alg.2  Alg.4  Alg.1 (s)  Alg.2 (s)  Alg.4 (s)
1  12 323/61           57     57     1      0.006      0.2        2.3
2  7542/210            206    206    25     0.005      0.15       42.2
4  10 000/1245         26     1051   14     0.006      12.5       8.9
5  10 000/1245         22     257    3      0.005      0.15       2.9

The complexity of Algorithm 1 is determined by the number of matching relationships between the source and the target and by the number of functional dependency candidates for each dataset (lines 3 and 5–6). In practice, the numbers of matches and FDs tend to be small, so the time required to run Algorithm 1 is dominated by the select query at lines 8–10. Conversely, the complexity of Algorithm 2 is determined by the number of records in the two datasets: the more instance values two columns have, the more entries the index contains, which translates into more pairwise string similarity comparisons. The first two rows of Table 10 show that if we do not use functional dependencies and resort to fuzzy string matching, generating the same number of examples takes 30 times longer.
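The source of this gap can be seen schematically below. This is an illustrative sketch, not the implementation of Algorithms 1 and 2: pairing via a functional dependency amounts to a hash join on equal left-hand-side values, whereas similarity-based pairing has to score candidate string pairs individually, and the token index of Algorithm 2 prunes, but does not eliminate, that pairwise cost. The similarity function and threshold are stand-ins.

from collections import defaultdict

def fd_based_pairs(source_rows, target_rows):
    """FD-scheme pairing (sketch): right-hand-side values S.b and T.d are
    paired whenever their left-hand-side values S.a and T.c are equal,
    effectively a hash join, roughly linear in the number of rows. Each
    row is an (lhs, rhs) tuple; names are illustrative."""
    by_lhs = defaultdict(list)
    for lhs, rhs in source_rows:
        by_lhs[lhs].append(rhs)
    return [(s_rhs, t_rhs)
            for lhs, t_rhs in target_rows
            for s_rhs in by_lhs.get(lhs, [])]

def similarity_pairs(source_vals, target_vals, similarity, threshold=0.8):
    """Weighted-scheme pairing (sketch): each candidate value pair must be
    scored with a string similarity, so the cost grows with the number of
    candidate comparisons between the two columns."""
    pairs = []
    for s in source_vals:
        for t in target_vals:
            w = similarity(s, t)        # one comparison per candidate pair
            if w >= threshold:
                pairs.append((s, t, w))
    return pairs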
Often (last two rows of Table 10), the pairs generated by Algorithm 2 contain false positives, i.e. example instances whose two strings do not represent the same thing. To mitigate such cases we employ Algorithm 4, whose complexity is governed by FlashFill, which is exponential in the number of examples and highly polynomial in the length of the examples [9]. The time in seconds required to refine the examples generated by Algorithm 2 using Algorithm 4 is reported in the last column of Table 10.

The benefit of minimizing examples can be observed in Fig. 10, which reports the synthesis time, i.e. the time it takes FlashFill to synthesize a transformation program. We compare synthesizing transformations from the examples generated by Algorithm 1 against synthesizing transformations from examples generated by Algorithm 2 and refined by Algorithm 4. Eliminating redundant example instances substantially improves synthesis time, especially when there are many examples per format: the fact that, for Row 2, Algorithm 4 reduced 206 examples to 25 suggests that there are almost nine examples per format.

FIGURE 10. Synthesis time (s).

With respect to the transformations synthesized, for Cases 1 and 2 from Table 10 the transformation resulting from the examples returned by Algorithm 1 was equivalent to the transformation obtained from the examples returned by Algorithm 4. For the last two cases, the transformation synthesized from the examples returned by Algorithm 4 was more complex.6 This is unsurprising, as the set of candidate pairs generated by Algorithm 2 contains more formats than the set from Algorithm 1, and these formats are successfully identified by Algorithm 4. Recall from Section 2 that FlashFill is able to synthesize transformations that cover multiple cases, i.e. formats, through conditional expressions, e.g. Switch.
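As a schematic illustration of such a conditional program (not an actual output of the synthesizer), a two-branch transformation covering the US Congress name formats discussed earlier could take the following shape, with each branch guarding one format:

import re

def tau(value):
    """Schematic two-branch transformation of the kind FlashFill expresses
    with Switch. Each branch guards a format with a pattern and applies its
    own extraction; the empty-string fallback corresponds to the FN case in
    the definitions of Experiment 1."""
    m = re.match(r"(\w+) (\w+) \(.*\)", value)   # e.g. 'Pete Stark (D-Calif)'
    if m:
        return m.group(2)                        # plain last name: 'Stark'
    m = re.match(r"(\w+) (\w+-\w+)", value)      # e.g. 'Tammy Liu-Vitter'
    if m:
        return m.group(2)                        # hyphenated last name
    return ""                                    # no branch applies

# tau("Pete Stark (D-Calif)") == "Stark"; tau("Tammy Liu-Vitter") == "Liu-Vitter"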
Overall, the experiments from this section, together with the evaluation from Section 4, show that the functional dependency-based approach efficiently generates more effective example pairs, as long as FDs can be discovered. This approach can lead to an increase in synthesis complexity due to redundant example instances, but the problem can be solved in practice by combining Algorithm 1 with the example selection technique of Algorithm 4. For cases where functional dependencies are not available, the weighted solution is viable at some expense in effectiveness. For example, Row 2 of Table 10 represents a case where the weighted scheme proves less effective than the FD-based scheme: precision decreased by 23% and recall by 15% (Row 4 of Table 6 compared with Row 2 of Table 9).

7. RELATED WORK

In recent years, there has been an increasing number of proposals that use declarative, constraint-based quality rules to detect and repair data problems (e.g. [14, 15, 12, 13]; see [10] and [11] for surveys). For many of these heuristic techniques, rule-based corrections are possible as long as the repairing values are present either in an external reference dataset or in the original one. For example, in [13] the correct values are searched for in a master dataset using editing rules that specify how to resolve violations, at the expense of requiring high user involvement. While the semantics of editing rules can be seen as related to our approach described in Algorithm 1, there are at least two essential differences. First, editing rules address instance-level repairs, i.e. every tuple is checked against every rule, and the values of the attributes covered by the rule are replaced with correct ones from the reference dataset (if they exist). Our approach determines pairs of values from which we learn transformations that hold for entire columns, so we do not search for the correct version of every value that needs to be cleaned, but for a small number of values that describe the syntactic pattern of the correct version. Second, we determine these transformations automatically, without any intervention from the user.

A related proposal on program synthesis is the work by Contreras-Ochando et al. [16], which addresses the problem of the generality characteristic of domain-specific languages, such as the one used by FlashFill. Specifically, they propose the use of domain-specific background knowledge to synthesize more specific transformations for fixed data types such as dates, e-mails and names.

An important body of work close to our proposal for example selection, i.e. Algorithm 4, and built on the transformation synthesis methods, is the research by Wu and Knoblock [17]. They propose an example recommendation algorithm to assist the user in providing enough examples for the entire column to be transformed. To this end, the approach samples a set of records for automatic inspection, then uses machine learning techniques to identify potentially incorrect records and presents these records for the user to examine. This proposal can be considered orthogonal to our approach. In fact, one can see our technique as an automatic initialization phase of a pay-as-you-go process, while the work by Wu and Knoblock [17] could represent the refinement phase, in which previous results are improved with user support. We leave this idea for future work.

Recent work on transformation-driven join operations by Zhu et al. [18] resulted in a technique for automatically joining two tables using fuzzy value pairing and synthesis of transformation programs. Their approach leverages sub-string indexes to efficiently identify candidate row pairs that can potentially join, and then synthesizes a transformation program whose execution can lead to equi-joins. Although the principle is similar to our approach, their focus is on join operations, which can be considered part of the mapping generation phase of a data wrangling pipeline, while our proposal is part of the format transformation and normalization task of the pipeline. Moreover, the focus on join operations requires that the columns from which examples for the transformation are sought be candidate keys in their respective datasets (or one a key and the other a foreign key). Our approaches, especially the weight-based one, aim to enable normalization for any column which has a corresponding match in the target dataset.

Closer to our solution are the tools for pattern enforcement and transformation, starting with traditional ETL tools like Pentaho7 or Talend,8 but especially Data Wrangler [2], its ancestor Potter's Wheel [19], and OpenRefine.9 Data Wrangler stands out by proposing a transformation language and an inference algorithm that aids the user in transforming the data. Although the automated suggestion mechanism, described in [20], avoids manual repetitive tasks by actively learning from the decisions the user makes, or from the transformations the user writes, the user must know what transformation is required and how to express that operation. Our work is a first step towards a solution in which such transformations are synthesized without up-front human involvement.

An important body of related work is the research on program synthesis introduced in Section 2. The algorithms and languages proposed in [3, 4], and extended in [8], were developed with spreadsheet scenarios in mind. We build on these solutions and argue that such techniques can be applied to real-world big data as well, where the amount of inconsistency and format heterogeneity is higher.

8. CONCLUSIONS

Recent advancements in areas such as big data management have increased the importance of data integration and data wrangling techniques. The challenges caused by data volume and variety require (semi-)automatic, cost-effective processes to prepare the data before analysis. In this respect, data wrangling is important, as a precursor to data analysis, but is often labour intensive.
The creation of data format transformations is an important part of data wrangling, and in the data cleaning landscape there are important recent results on techniques to support the creation of such transformations (e.g. [2, 3]). These solutions are user-centric; many of the transformations that can be automatically learned from the examples provided by Algorithms 1 and 2 can be obtained using the above solutions as well, but with a much greater level of human involvement. In this paper we build on and complement these results by describing two approaches that can automate the creation of format transformations. Thus we do not replace human curators, but we reduce the number of cases in which manual input is required. In several domains, we have described how candidate sets of examples can be discovered and refined automatically, how these examples can be used to synthesize transformations using FlashFill, and how the resulting transformations can be validated automatically. The evaluation showed that, when certain conditions are met, Algorithm 1 can generate effective examples from which transformation programs can be synthesized. When the conditions are not fulfilled, a heuristic technique was proposed, which proved to have comparable effectiveness while being aligned with the performance requirements of synthesis.

The approaches described in this paper can potentially be applied in different settings. For example, transformations could be identified in the background by a platform such as Trifacta's Wrangler,10 and offered as candidate solutions to users. However, the approach can also be used as part of more general extract-transform-load platforms, in which format transformation is only one aspect. Indeed, the methods described here have been incorporated into the VADA system, to support a pay-as-you-go approach to wrangling in which the results of several initially automated steps are refined in the light of data context [22] and user feedback on the data product [21].

FUNDING

This work has been made possible by funding from the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.

REFERENCES

1 Furche, T., Gottlob, G., Libkin, L., Orsi, G. and Paton, N.W. (2016) Data Wrangling for Big Data: Challenges and Opportunities. Proc. 19th Int. Conf. Extending Database Technology (EDBT 2016), Bordeaux, France, March 15–16, pp. 473–478. OpenProceedings.org.
2 Kandel, S., Paepcke, A., Hellerstein, J.M. and Heer, J. (2011) Wrangler: Interactive Visual Specification of Data Transformation Scripts. Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI'11), Vancouver, BC, Canada, May 7–12, pp. 3363–3372. ACM, New York, NY, USA.
3 Gulwani, S. (2011) Automating String Processing in Spreadsheets Using Input–Output Examples. Proc. 38th Annual ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages (POPL'11), Austin, TX, USA, January 26–28, pp. 317–330. ACM, New York, NY, USA.
4 Singh, R. (2016) BlinkFill: Semi-supervised programming by example for syntactic string transformations. Proc. VLDB Endowment, 9, 816–827.
5 Bogatu, A., Paton, N.W. and Fernandes, A.A.A. (2017) Towards Automatic Data Format Transformations: Data Wrangling at Scale. Proc. 31st British Int. Conf. Databases (BICOD), London, UK, July 10–12, pp. 36–48. Springer International Publishing, Basel.
6 Papenbrock, T. and Naumann, F. (2016) A Hybrid Approach to Functional Dependency Discovery. Proc. 2016 Int. Conf. Management of Data (SIGMOD'16), San Francisco, CA, USA, June 26–July 1, pp. 821–833. ACM, New York, NY, USA.
7 Rahm, E. and Bernstein, P.A. (2001) A survey of approaches to automatic schema matching. VLDB J., 10, 334–350.
8 Wu, B. and Knoblock, C.A. (2015) An Iterative Approach to Synthesize Data Transformation Programs. Proc. 24th Int. Joint Conf. Artificial Intelligence, Buenos Aires, Argentina, July 25–31, pp. 1726–1732. AAAI Press.
9 Raza, M., Gulwani, S. and Milic-Frayling, N. (2014) Programming by Example Using Least General Generalizations. Proc. 28th AAAI Conf. Artificial Intelligence (AAAI'14), Quebec, Canada, July 27–31, pp. 283–290. AAAI Press.
10 Fan, W. (2008) Dependencies Revisited for Improving Data Quality. Proc. 27th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS'08), Vancouver, Canada, June 9–12, pp. 159–170. ACM, New York, NY, USA.
11 Fan, W. (2015) Data quality: from theory to practice. ACM SIGMOD Record, 44, 7–18.
12 Chu, X., Ilyas, I.F. and Papotti, P. (2013) Holistic Data Cleaning: Putting Violations into Context. IEEE 29th Int. Conf. Data Engineering (ICDE), Brisbane, QLD, Australia, April 8–12, pp. 458–469. IEEE.
13 Fan, W., Li, J., Ma, S., Tang, N. and Yu, W. (2012) Towards certain fixes with editing rules and master data. VLDB J., 21, 213–238.
14 Fan, W., Geerts, F., Jia, X. and Kementsietsidis, A. (2008) Conditional functional dependencies for capturing data inconsistencies. ACM TODS, 33, Article No. 6.
15 Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M. and Ilyas, I.F. (2011) Guided data repair. Proc. VLDB Endowment, 4, 279–289.
16 Contreras-Ochando, L., Ferri, C., Hernandez-Orallo, J., Martinez-Plumed, F., Ramirez-Quintana, M.J. and Katayama, S. (2017) Domain Specific Induction for Data Wrangling Automation. AutoML@ICML, Sydney, Australia, August 10.
17 Wu, B. and Knoblock, C.A. (2016) Maximizing Correctness with Minimal User Effort to Learn Data Transformations. Proc. 21st Int. Conf. Intelligent User Interfaces (IUI'16), Sonoma, CA, USA, March 7–10, pp. 375–384. ACM, New York, NY, USA.
18 Zhu, E., He, Y. and Chaudhuri, S. (2017) Auto-Join: Joining tables by leveraging transformations. Proc. VLDB Endowment, 10, 1034–1045.
19 Raman, V. and Hellerstein, J.M. (2001) Potter's Wheel: An Interactive Data Cleaning System. Proc. 27th Int. Conf. Very Large Data Bases (VLDB'01), Roma, Italy, September 11–14, pp. 381–390. Morgan Kaufmann, San Francisco, CA, USA.
20 Heer, J., Hellerstein, J.M. and Kandel, S. (2015) Predictive Interaction for Data Transformation. 7th Biennial Conf. Innovative Data Systems Research (CIDR), Asilomar, CA, USA, January 4–7.
21 Konstantinou, N. et al. (2017) The VADA Architecture for Cost-effective Data Wrangling. Proc. 2017 ACM Int. Conf. Management of Data (SIGMOD'17), Chicago, IL, USA, May 14–19, pp. 1599–1602. ACM, New York, NY, USA.
22 Koehler, M. et al. (2017) Data context informed data wrangling. 2017 IEEE Int. Conf. Big Data, Boston, MA, USA, December 11–14, pp. 956–963.
Footnotes

1 http://nyti.ms/1Aqif2X
2 http://bit.ly/2fLVvtl
3 http://bit.ly/2f5DwJW
4 In computing the overlap, we only consider alpha-numeric, lowercase characters.
5 In our experiments, we used the Euclidean distance metric.
6 We consider a transformation to be complex if it contains more conditional branches.
7 http://www.pentaho.com/
8 https://www.talend.com/
9 http://openrefine.org/
10 https://www.trifacta.com/products/wrangler/

© The British Computer Society 2018. All rights reserved.