Discovering XML Conditional Dependencies for Data Quality Issues

DOI: http://dx.doi.org/10.24018/ejece.2020.4.1.156

Abstract—Extensible Markup Language (XML) is emerging as the primary standard for representing and exchanging data: more than 60% of documents on the web are of XML type, making it the most dominant document format, yet their quality is not as expected. XML integrity constraints, especially XFDs, play an important role in keeping an XML dataset as consistent as possible, but their ability to solve data quality issues remains intangible. The main reason is that old-fashioned data dependencies were introduced primarily to maintain the consistency of the schema rather than that of the data. The purpose of this study is to introduce a method for discovering pattern tableaus for XML conditional dependencies, to be used for enhancing XML document consistency as part of the data quality improvement phases. The notations of the conditional dependencies as new rules are designed mainly for improving the data instance, and they extend traditional XML dependencies by enforcing pattern tableaus of semantically related constants. Subsequently, a set of minimal approximate conditional dependencies (XCFD, XCIND) is discovered and learned from the XML tree using a set of mining algorithms. The discovered patterns can be used as master data in order to detect inconsistencies that do not respect the majority of the dataset.


I. INTRODUCTION
Today, data has become the lifeblood of businesses. As different database applications, such as Decision Support Systems, Customer Relationship Management, Data Warehouses, Web Services, and eLearning Systems, are being used, beneficial information and knowledge can be gained from considerable amounts of data. However, investigations demonstrate that many such applications fail to run successfully and efficiently due to issues such as poor system design or weak query performance; yet nothing is more certain to cause application failure than carelessness about data quality issues [1].
According to reports presented by V12-Data in 2015, the cost of bad data might be considerably higher than 12% of lost revenue. About 28% of individuals who had issues related to the delivery of emails said that customer service had suffered accordingly, while 21% experienced reputation damage. The vast majority of organizations (86%) admitted that their data might be inaccurate in some way. About 44% of businesses and organizations reported that missing or imperfect data is the most frequent problem, alongside obsolete information [2].
The impact of poor data quality can be of three types: Operational (causing customer and employee dissatisfaction and increased costs), Tactical (affecting decision making and causing mistrust), and Strategic (affecting the overall organization's strategy). Overall, any system or enterprise that relies heavily on data is prone to experience problems if the data being handled does not possess the expected quality [3], [4].
Extensible Markup Language (XML) is rapidly standing out among essential data file formats. It has been used for scientific data such as DNA sequences, for annotating extensive documents such as the DrugBank database, and for exchanging data over the Web for e-commerce purposes.
Grijzenhout & Marx provide an in-depth analysis to answer the question "Is the quality of XML documents found on the web sufficient to apply XML technologies like XQuery, XPath, and XSLT?" The results show that 58% of the documents found on the web are in XML format; nevertheless, only one-third of these documents are accompanied by a valid XML Schema Definition (XSD) or Document Type Definition (DTD). Moreover, about 14% of the documents lack well-formedness. A simple error of a mismatched or missing tag renders the entire stack of XML technologies useless on such a document [5].
The growing interest in XML as the dominant way of exchanging data over the Web encourages researchers to address XML data cleaning as an open research problem [6] and to search for data cleaning approaches for XML (Weis, Monod, & Cedex, 2007), especially approaches based on integrity constraints (ICs) [7]-[9].

A. Motivational Example
Consider an XML dataset containing information about university libraries with a finite number of books and articles, used for an online borrowing system and for transfer transactions during daily orders, as shown in Fig 1. The system uses the information stored in the XML file to check the available resources (books, articles) and return the results of an order's query.
Book, Order, and Article are three complex-type elements, each with a sequence of elements and data values below it. For instance, the complex element Book represents a set of books; each one has Title, Author, Year, Publisher, and Genre as non-key elements and ISBN as a key. The same idea is shared by the Article and Order complex elements. A set of two XML Inclusion Dependencies (XINDs) holds in the XML tree T and is used for schema matching, in addition to a single XFD. The first XIND insists that if an order (text node) is requested by a student, it should find a matching data value (text node) among the library books, while the second XIND demands that the request find a match among the available articles. The given XFD requires that if two books agree on their Title value, they should share the same Author value. These dependencies are satisfied by the given XML document (Σ ⊨ T) and require full agreement between both sides of each XIND. However, the two inclusion dependencies do not make sense together: how can the same order node match nodes from two different XML subtrees at the same time? To remedy this contradiction, and without loss of generality, let us add a conditional element to the lhs of the dependencies to become: . This type of dependency makes a weaker assertion, yet it is far more applicable to real-world cleaning than traditional XINDs [10].
These enhanced dependencies are XINDs that hold only on the subset of the XML document which satisfies the condition, rather than on the entire document; also, each order is checked only against its related dependency, not both [11]. Another case concerns XFDs: as is known, an XFD checks consistency using at least two paths, so if a single path contains a wrong value, how can the inconsistency in that text node be captured? For instance, when the Title of the book is "First Course in Databases", the Author should be "Jennifer Widom", not "Thomas M Connolly" as appears in Fig 1. A traditional XFD is not able to discover this violation; adding a pattern tableau solves the problem [12]. This type is called a constant XCFD. In this research, a set of approximate constant patterns is used in order to detect inconsistencies with low frequencies.
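As a small illustration of the kind of violation a constant pattern targets, the following Python sketch (the toy data and function names are ours, not the paper's) checks the XFD Title → Author over a miniature library fragment like the one in the motivational example:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy library fragment mirroring the motivational example: the XFD
# Title -> Author should hold, but one book violates it.
XML = """
<library>
  <Book><Title>First Course in Databases</Title><Author>Jennifer Widom</Author></Book>
  <Book><Title>First Course in Databases</Title><Author>Thomas M Connolly</Author></Book>
  <Book><Title>Database Systems</Title><Author>Thomas M Connolly</Author></Book>
</library>
"""

def xfd_violations(root, lhs="Title", rhs="Author"):
    """Return the titles whose books disagree on the author (XFD lhs -> rhs)."""
    groups = defaultdict(set)
    for book in root.findall("Book"):
        groups[book.findtext(lhs)].add(book.findtext(rhs))
    return {title for title, authors in groups.items() if len(authors) > 1}

root = ET.fromstring(XML)
print(xfd_violations(root))  # {'First Course in Databases'}
```

Note that this only detects that *some* author value is wrong; deciding which one is the error is exactly what the approximate constant patterns of the paper are for.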

II. LITERATURE REVIEW
In recent years, XML ICs have attracted the attention of scientists who aim at ensuring data consistency, which is vital and indispensable [13]. The expression "integrity constraint" in XML is regularly used to mean extensions of relational ICs, such as primary keys, foreign keys, functional dependencies, and inclusion dependencies, which depend principally on the equality of data values within the same subtree or between different subtrees. Traditional XML dependencies (XFD, XIND, XAFD) have limited capability for cleaning the XML tree effectively. Recently, however, conditional versions of these dependencies have been presented to overcome this limitation and to open a new era of XML cleaning approaches [11], [12]. The sections below cover the basics of these conditional dependencies.

A. XML Conditional Dependencies
Inspired by the relational database world, conditional dependencies (CFD, CIND) were presented to overcome the limitations of traditional relational dependencies (FD, IND), especially regarding data quality issues [14], [15]. Conditional dependencies have more of the quality characteristics that direct them toward data cleaning, such as covering only a subset of the dataset. Furthermore, these dependencies have proved their efficiency in eliminating inconsistencies from relational databases and in detecting more inaccurate tuples, and fields within tuples, than traditional dependencies. Cleaning approaches that adopt these dependencies are among the most used techniques of the last ten years, alongside crowdsourcing and knowledge-base cleaning approaches [16].
Adding conditional values as pattern tableaus to traditional dependencies gives them more of the required characteristics. However, it does not make things easier in terms of time complexity and validation; moreover, the number of conditional dependencies equivalent to a single traditional one can increase. A pattern tableau can be defined as a subset of tree tuples on which the underlying conditional integrity constraints hold. It concisely summarizes the subsets of XML values that mostly satisfy, or fail, the constraint.
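A pattern tableau can be sketched concretely as rows of constants and wildcards over the conditional paths. The following Python fragment (our own illustrative encoding, not the paper's notation) shows how a tableau selects the subset of tuples it covers:

```python
# A minimal sketch of a pattern tableau: each row constrains some
# attributes with constants and leaves the rest as wildcards ('_').
WILDCARD = "_"

def row_matches(pattern_row, data_row):
    """A tableau row matches a tuple when every non-wildcard cell agrees."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern_row, data_row))

# Tableau over (Publisher, Genre); tuples covered by some row satisfy it.
tableau = [("Little Brown", "Science"), (WILDCARD, "Fiction")]
data = [("Little Brown", "Science"),
        ("Penguin", "Fiction"),
        ("Penguin", "History")]
covered = [t for t in data if any(row_matches(r, t) for r in tableau)]
print(covered)  # the first two tuples are covered, the third is not
```

The key point is that the constraint is evaluated only on `covered`, the subset selected by the tableau, not on the entire document.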

1) XML Conditional Functional Dependencies
XCFD emerges from the relational CFD. Traditional dependencies were basically introduced for schema design enhancement, which is a vital part of improving schema quality; however, it is certainly not adequate to consider schema quality and disregard data quality. The authors in [12] construct a conditional variant of XFD, called XML Conditional Functional Dependency (XCFD), and change the XFD notation to convey the new rules.

Definition 2:
Using the basic definitions mentioned in [17], consider an XML data tree T = ( , , , , ) and an XSD S = ( , , , ). An XCFD is an expression in which: • the first component is the pivot path; • ℯ is the conditional part, of the form ( 1 * 2 * … * ), where each component is a Boolean expression associated with a conditional element and * is an operator. Conformance between the tree and an XCFD, T ⊨ , is achieved if, whenever two paths agree conditionally on their lhs, their rhs values match as well. This type of functional dependency is important for data cleaning issues rather than schema design issues, and it holds on a subset rather than the entire document, as mentioned earlier.

2) XML Conditional Inclusion Dependencies
The syntax of XML Conditional Inclusion Dependencies (XCIND) [11] is very close to strong-path notation [18], as that notation is the closest to the relational one and gives the most explicit description of XML inclusion dependencies.

Definition 3: XCIND is a pair:
The idea behind tree tuples is to find a relational representation of the XML tree. Some important points about the mapping phase are the following. First, tuples holding related information can be easily accessed by different DBMSs, give a clearer conceptual view of the XML dataset, and allow composed values to be compared and verified effectively [19], [20]. Second, discovered patterns take tableau form, so it becomes much easier to handle the XML tree in table form and to combine tables at advanced levels.
Mapping the XML tree can be done in many ways. The first is shredding the whole tree into a single huge flat relation; this has many disadvantages, such as an unmatched number of unreal tuples and a large number of columns (attributes), and the direct result of this mapping is tremendous time complexity [21]. The second way is a semi-normalized form that shreds the XML into many related relations using the concept of a tuple class; this representation uses schema normalization and reduces the time and space complexity. The aforementioned sample dataset is represented by the following
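The tuple-class style of shredding can be sketched as follows (a minimal illustration with invented column names; the paper's actual mapping covers all tuple classes and key handling):

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <Book ISBN="111"><Title>T1</Title><Author>A1</Author></Book>
  <Book ISBN="222"><Title>T2</Title><Author>A2</Author></Book>
</library>"""

def shred(root, tuple_class, columns):
    """Shred one tuple class (e.g. Book) into a flat relation R_p:
    attributes and child text nodes become the columns of each row."""
    rows = []
    for node in root.findall(tuple_class):
        rows.append({c: node.get(c) or node.findtext(c) for c in columns})
    return rows

R_book = shred(ET.fromstring(XML), "Book", ["ISBN", "Title", "Author"])
print(R_book[0])  # {'ISBN': '111', 'Title': 'T1', 'Author': 'A1'}
```

Each tuple class (Book, Article, Order) yields its own small relation, which is what keeps the later partition computations tractable compared with one huge flat relation.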

A. Discovering Pattern Tableaus for Conditional Dependencies
Since the primary objective of this research is improving data quality, discovered data quality rules that obey the dataset schema might not be satisfied in the related instance. More precisely, fully agreed dependencies (XCFD, XCIND) are considered business rules that prevent inconsistent values from entering the database. On the other side, discovering inconsistencies in an existing database requires the definition of some error measure for the satisfaction of patterns; the difference between full and approximate dependencies then exposes spurious data.
To mine approximate dependencies, a confidence threshold is required to identify the percentage of right values. In addition, conditional dependencies hold on a subset of the dataset and therefore require a support threshold to identify the percentage of the dataset taken into consideration. Both thresholds are employed, since a set of approximate conditional dependencies (XCFD, XCIND) is required as a preceding step for data cleaning. To sum up, a set of approximate data dependencies, which need not be true for every instance, will be discovered and inspected to capture data inconsistencies, given predefined minimum support and confidence threshold values.
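The two thresholds can be made concrete with a small sketch (our own simplified formulation for a constant pattern over one shredded relation; the paper's path-based definitions follow in the next subsections):

```python
from collections import Counter

def support_confidence(rows, lhs, lhs_value, rhs):
    """Support: fraction of rows matching the constant lhs pattern.
    Confidence: largest fraction of those rows agreeing on one rhs value."""
    matched = [r for r in rows if r[lhs] == lhs_value]
    if not matched:
        return 0.0, 0.0
    support = len(matched) / len(rows)
    best = Counter(r[rhs] for r in matched).most_common(1)[0][1]
    return support, best / len(matched)

rows = [{"Publisher": "Little Brown", "Genre": "Science"},
        {"Publisher": "Little Brown", "Genre": "Science"},
        {"Publisher": "Little Brown", "Genre": "Fantasy"},
        {"Publisher": "Penguin", "Genre": "Fiction"}]
s, c = support_confidence(rows, "Publisher", "Little Brown", "Genre")
# s = 0.75 (3 of 4 rows match the pattern), c = 2/3 (2 of those 3 agree)
```

A full dependency corresponds to confidence 1; an approximate one accepts any confidence above the chosen minimum, and the disagreeing minority is what the cleaning phase inspects.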

1) Interesting Pattern Tableaus
The range of patterns discoverable from the XML tree T under the new notions and algorithms is quite broad. Clearly, it is not a good idea to return the set of all possible rules, as unnecessarily trivial and non-minimal dependencies would also appear. Therefore, only a minimal canonical cover (a non-redundant and minimal set of XCFDs and XCINDs) is taken into consideration, while other dependencies can be derived using implication rules. Moreover, real-life data are often dirty and contain errors and noise. Thus, to exclude patterns that match only errors and noise, only a small set of tableau patterns with both a high support threshold (many paths must match) and a high confidence value (few exceptions) in the XML tree is taken into account.

a) Non-Trivial Dependencies
A data dependency is called trivial if its rhs, as a single path, belongs to its lhs paths, i.e., P_y ∈ {P_x1 … P_xn} for XFD: {P_x1 … P_xn} → P_y, or if it does not hold on any subset of T. Mining non-trivial dependencies is an important aspect for all data models because it reduces the search space and, as a result, decreases the time complexity and memory consumption. However, XML dependency notation requires additional features to classify a discovered dependency as non-trivial: no null values are allowed for any path belonging to the lhs of the XFD, the dependency should rely on an essential tuple class, and the rhs paths must always match a descendant node of the pivot node's tuple.
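The first triviality clause is easy to state as code (a sketch of that clause only; the null-value and tuple-class checks need schema information and are not shown):

```python
def is_trivial(lhs_paths, rhs_path):
    """An XFD {P_x1 ... P_xn} -> P_y is trivial when P_y is among the lhs."""
    return rhs_path in lhs_paths

print(is_trivial(("Title", "Author"), "Title"))  # True: rhs repeats an lhs path
print(is_trivial(("Title",), "Author"))          # False: candidate worth mining
```

Filtering these candidates out before partition computation is what prunes the search space.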

b) Minimal Dependencies
A dependency (XCFD or XCIND) is considered left-reduced, or of minimal lhs, if the removal of any field path from its lhs makes the dependency unsatisfied, i.e., for any G ⊆ {P_x1 … P_xn}, T ⊭ ({P_x1 … P_xn}/{G} → P_y). This condition assures the minimality of the dependency.
An additional condition is required of pattern tableaus to enforce minimality: for any pattern tableau more general than the discovered one (under the ≼ order), T must not satisfy the corresponding dependency.

c) Frequency of the Dependency
Before any discussion about the support of a dependency begins, its cover should be defined. The cover can be defined as the number of paths under a given tree tuple, as the lhs of the dependency, that match the lhs of a given pattern tableau, divided by the total number of paths of the tree tuple.
The proportion of paths in a tree tuple that obey a specific dependency (pattern tableau) is known as the support of that dependency. In other words, a dependency is considered valid if its support is greater than or equal to a predefined requested threshold.
Therefore, the minimum number of paths required to consider a dependency valid, with the support given as a percentage, is:
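The elided formula amounts to rounding the support fraction up to a whole number of paths; a one-line sketch (our reading of the definition above, not the paper's exact notation):

```python
import math

def min_paths(total_paths, min_support):
    """Smallest number of matching paths needed for a dependency to be
    considered valid at the given support percentage."""
    return math.ceil(min_support * total_paths)

print(min_paths(400, 0.25))  # 100 paths must match in a 400-path tuple class
```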

d) Confidence of the Dependency
The confidence test is one of the strongest tests for ensuring the validity of a given dependency, because it requires finding the percentage of paths in which the lhs and rhs of the dependency appear together and match the related pattern tableau, over the percentage of paths whose lhs alone matches the pattern tableau. If the lhs always occurs with the rhs, the confidence value is one, which resembles traditional XML dependencies. An XFD has a support value of 1, whereas an XCFD with a specific pattern tableau has a support between 0 and 1. Furthermore, an exact XFD or XCFD is required to hold with confidence equal to 1, whereas an approximate XFD or XCFD holds with a confidence between 0 and 1. The above properties of pattern tableaus are required together to define the canonical cover of patterns. In other words, the conditional dependencies discovered on T with predefined thresholds form a set of minimal, frequent patterns that hold on tree T.

B. Discovering XCFD Pattern Tableaus
Once the XML document tree T is represented as temporary tables R_p, mining a set of XCFDs involves a set of procedures invoked from the main Algorithm 1: Mining XCFD. The pseudo-code of these procedures is omitted from this paper due to lack of space.
The algorithm starts by defining a set of parameters and variables to store values during execution. After that, each temporary relation builds a containment lattice [20] and calculates the partitions for its attributes; however, only non-key attributes participate in the containment lattice. Looking back at the library example, consider a temporary table R_Book with a finite set of columns (ISBN, Title, Author, Year, Publisher). Fig 2 shows how these columns are connected by edges as potential relationships. The set of candidates at level 1 is filled with all single attributes of R_p, and only attributes that contain data (non-key) are checked and added using the CandSelect procedure (line 3), while the others are kept for XML reconstruction. After assigning the level-1 candidates, their partitions are calculated using the SingletonCalculatePartition procedure at line 4. These candidates are considered initial XCFD lhs's, connected by edges to rhs candidates at level 2, which are generated and partitioned at lines 7 and 8 using the procedures AprioriGeneration and CalculatePartition, respectively. Finally, the main algorithm passes the partitions (Π_X, Π_Y) of two candidates at two consecutive levels, together with the predefined tableau support and confidence thresholds, to the XCFD Generator procedure.
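The partition step can be sketched in a few lines (an illustrative implementation of the standard equivalence-class partitioning the lattice levels rely on; the data is ours):

```python
from collections import defaultdict

def partition(rows, attrs):
    """Pi_X: group row indices into equivalence classes by their values
    on the attribute set X, as used by the containment-lattice levels."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return sorted(classes.values())

rows = [{"Publisher": "Little Brown", "Genre": "Science"},
        {"Publisher": "Little Brown", "Genre": "Science"},
        {"Publisher": "Penguin", "Genre": "Fiction"}]
print(partition(rows, ["Publisher"]))           # [[0, 1], [2]]
print(partition(rows, ["Publisher", "Genre"]))  # same classes here
```

When Π_X equals Π_{X∪Y}, every X-class is subsumed in Y, which is exactly the condition the XCFD Generator tests on consecutive lattice levels.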

1) XCFD Validity
Consider φ: X → Y as an XCFD. Since the proposed algorithm represents tree-tuple values as equivalence classes Π, and the research aim is detecting data violations, only X partition classes that split into two or more classes in Y are considered when discovering the set of approximate XCFDs. First, the partition-based view of XCFDs, adopted by [12], needs to be introduced. For any two element sets (X, Y) in the containment lattice, a class in Π_X is subsumed by a class in Π_Y if the former is a subset of, or exactly equal to, the latter. Let Ω_X ⊆ Π_X represent all subsumed classes of Π_X; each class in Ω_X presents a full XCFD, whose lhs contains a set of paths that are considered conditional paths if the same class appears in its partitions, and variable field paths otherwise. Moreover, the rhs path Y can likewise be conditional or a field path, based on the existence of a class in Π_Y subsumed in Ω_X. Consider the two-element candidate (Publisher, Genre).

Let the largest class be π₁₁₇ = {117, 118, 119, 120, 122} and the smallest be π₁₁₆ = {116, 121}, and recall that s is the minimum support and c the minimum confidence level. If |π| ≥ (s × N) and there exists π′ such that |π′| < ((1 − c) × |π|), and |π| ≠ |π′|, then the paths of π′ are reported as violations of an XCFD [Publisher = ' ' → Genre = ' ']. For instance, the Genre value of path 116 is a violation of the approximate XCFD [Publisher = ' ' → Genre = ' ']. The XCFD Generator procedure is invoked from the main algorithm to validate element partitions and decide whether they can generate an XCFD rule. Unless both sides of the edge share identical partitions, a non-strict relationship (XFD) needs to be checked (line 3). If there is a class whose size meets the support bound, and a class that contains some paths but fewer than the confidence threshold allows, then an approximate XCFD is created at line 8.
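The validity test just described can be sketched as follows (our simplified rendering of the rule, using the class sizes from the example above; parameter names are ours):

```python
def xcfd_violations(lhs_class, rhs_classes, n_rows, min_support, min_conf):
    """Flag paths breaking an approximate XCFD: the lhs class must be
    frequent enough, and any rhs class far smaller than the majority
    (below the (1 - confidence) bound) is reported as a violation."""
    if len(lhs_class) < min_support * n_rows:
        return []                      # pattern too rare to form a rule
    majority = max(rhs_classes, key=len)
    bad = []
    for cls in rhs_classes:
        if cls is not majority and len(cls) < (1 - min_conf) * len(lhs_class):
            bad.extend(cls)
    return bad

# The lhs class {116..122} splits on the rhs into a majority class and
# a small stray class, exactly as in the Publisher/Genre example.
lhs = [116, 117, 118, 119, 120, 121, 122]
rhs = [[117, 118, 119, 120, 122], [116, 121]]
print(xcfd_violations(lhs, rhs, 10, 0.5, 0.7))  # [116, 121] reported
```

The stray paths 116 and 121 are returned because their class is well below the (1 − c)·|π| bound while the lhs class itself is frequent enough to form a rule.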
Let Υ be the class of Λ_X used earlier to generate an approximate XCFD. If another partition in Π_Y maps to the same Υ and satisfies the support value, then there is no need to create a new dependency; rather, the two should be merged to produce a richer rule whose lhs value maps to more than one rhs value (lines 9-14). These two dependencies are merged into a single dependency using an operation that separates the rhs values and generates a disjunction query statement. This reduces the number of produced dependencies and prevents the contradiction of two rules whose lhs's match while their rhs's do not.

2) Merging Pattern for XCFD
The XCFD Generator procedure, at line 10, merges rules that share the same metadata based on the following cases. If two XCFDs share the same lhs and rhs selectors, the same lhs conditional paths with the same patterns, and the same rhs conditional path with different patterns, then these dependencies can produce one dependency with more than one pattern tableau. For instance, the above dependencies φ₁ and φ₂ share the book element as lhs and rhs selector, Publisher as an lhs conditional path, "Little Brown" as its pattern value, and Genre as the rhs conditional path; the following dependency appears after the merging: ((book, [Publisher = "Little Brown"]) → (book, [Genre = Science, Science fiction])).
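This merge case can be sketched as grouping rules by their shared metadata key (an illustrative encoding of the rules as tuples; the key layout is ours):

```python
def merge_rules(rules):
    """Merge XCFDs sharing lhs/rhs selectors and the same lhs pattern:
    their rhs patterns are collected into one disjunctive tableau."""
    merged = {}
    for key, rhs_pattern in rules:
        merged.setdefault(key, []).append(rhs_pattern)
    return merged

# Two rules over book: Publisher = "Little Brown" implies Genre = ...
rules = [(("book", ("Publisher", "Little Brown"), "Genre"), "Science"),
         (("book", ("Publisher", "Little Brown"), "Genre"), "Science fiction")]
merged = merge_rules(rules)
print(merged)  # one rule whose rhs tableau is ['Science', 'Science fiction']
```

Collapsing the two rules into one disjunctive rhs is what prevents the contradiction of two rules with matching lhs's and differing rhs's.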
The other case: if two XCFDs share the same lhs and rhs selectors, the same lhs conditional paths with different patterns, and the same rhs conditional path with the same patterns, then these dependencies can likewise produce one dependency with more than one pattern tableau.

C. Discovering XCIND Pattern Tableaus

In order to discover a set of XCINDs, the mining algorithm takes as input an XML document T, as a set of related relations R_P, discovers a set of approximate XINDs with given confidence thresholds, and then produces a set of minimal, non-trivial conditional inclusion dependencies. The reason why traditional approximate XINDs must be discovered before the search for XCINDs begins is the need to find common elements between subtrees from which to generate patterns later. Moreover, no attribute of R_P (repeatable path) can participate in both types of paths (conditional and field). However, in this article the Mining XIND phase is omitted and a set of traditional XIND dependencies is assumed to be already available; refer to [11] for the full XCIND mining algorithm.
Patterns of inclusion dependencies differ slightly from those discovered during the XCFD mining algorithm, as they have two kinds of conditions: selecting and demanding conditions. A selecting condition appears on the lhs subtree and is used to ensure the validity of the included dependency, whereas a demanding condition is used on the rhs subtree to ensure the completeness of the subtree.
The proposed XCIND mining algorithm focuses on the selecting condition because its requirements subsume those of the demanding condition. To convert discovered XINDs into conditional ones (XCINDs) using the XCIND mining algorithm, a search for patterns through the non-inclusion attributes is required. The full algorithms for discovering XCINDs are omitted from this paper for lack of space; all details are presented in [11].

D. Pattern Tableaus Table
A set of approximate XCFDs and XCINDs is mined using the algorithms mentioned above. The question is how all the discovered patterns can be arranged so that they can be accessed in the next cleaning phase. As noted, each discovered conditional rule is associated with a pattern tableau that contains the context of the dependency; as a result, the number of patterns will be huge (as large as the number of discovered dependencies). To solve this problem, a new structure for storing patterns, called the ConstraintsPatternTableaus table, is introduced in this research. It contains both metadata about the dependencies and the pattern data as well. Table 4 illustrates the schema of the proposed table, which holds all information about all discovered XML conditional dependencies (XCFD, XCIND). To show how this summary table is created, the metadata required for both dependencies and the pattern information needed will be explained.
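One plausible shape for a row of such a table is sketched below (the field names are our guess at the kind of metadata described; Table 4 in the paper defines the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ConstraintPatternRow:
    """One row of a ConstraintsPatternTableaus-style table: dependency
    metadata plus the pattern tableau it carries."""
    dep_id: int
    dep_type: str                 # "XCFD" or "XCIND"
    pivot: str                    # pivot / selector path
    lhs_paths: tuple
    rhs_path: str
    tableau: list = field(default_factory=list)   # rows of pattern constants
    support: float = 0.0
    confidence: float = 0.0

row = ConstraintPatternRow(1, "XCFD", "book", ("Publisher",), "Genre",
                           [("Little Brown", "Science")], 0.25, 0.8)
print(row.dep_type, row.tableau)
```

Keeping metadata and tableau rows in one table means the cleaning phase can retrieve every rule's context with a single lookup instead of joining per-dependency tableau files.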

IV. IMPLEMENTATION AND DISCUSSION
In order to test the effects of varying the tableau support and tableau confidence thresholds on the number of discovered approximate XCFDs, one of these thresholds is fixed while the other is varied. The support value of each dependency is relative to the number of tree tuples in R_p; for instance, a support value of 0.25 for a relation with 400 tree tuples will produce only dependencies supported by 100 or more tree tuples. On the other side, varying the confidence value produces a different number of XCFDs based on the number of distinct values. For instance, the algorithm may not discover any dependency at confidence 0.65, whereas two dependencies sharing the same metadata may be discovered at confidence 0.4; after the merge operation, the two together produce a dependency confidence of 0.8 (if they exist). With the tableau support fixed at 0.025, the number of dependencies decreases, as expected, when the support or confidence value increases; that is because only dependencies whose pattern tableaus cover the required support thresholds appear as rules, while the others are considered less important.
The total time required for running the mining XCFD algorithm with any support and confidence values is the sum of all sub-procedures. Firstly, the algorithm spends O(2^|R| · |R_p|) on partitioning; after that, the XCFD Generator searches for rules and performs on the order of half as many comparison operations per level. The worst-case complexity is therefore exponential, O(2^n), in the number of attributes. The aim of the AppXCFD algorithm is to discover a minimal set of approximate XCFDs in order to detect the inconsistencies hidden underneath, rather than to use them as association rules. Fig 4 A and B show the number of XCINDs discovered by varying the tableau support and confidence thresholds (different cases for Complete and Covering) [11]. As expected, the number of produced rules decreases when the tableau thresholds increase. The XCIND mining algorithm checks all generated queries for all XINDs, and only an XIND rule that matches the pattern tableaus and has support equal to or larger than the threshold is taken into consideration.

V. CONCLUSION
The main contribution of this research is providing algorithms to discover a minimal set of XCFDs and XCINDs. Firstly, an algorithm was presented to mine and discover a minimal set of constant approximate XCFDs between XML elements under the same pivot path. Moreover, a set of approximate XCFDs is able to detect almost all inconsistencies in an existing dataset, unlike full XCFD dependencies, which can only prevent inconsistencies during insert operations.
The scalability results show that the proposed algorithms for mining XCFDs and XCINDs behave exponentially with respect to the total number of non-repeatable elements, with an obvious sensitivity to the number of tree tuples gathered from the XML tree.