Historical dependencies

In addition to the pdfs, for each dependent group of uncertain attributes present in the dependency sets, we store its history. While dependency information (or joint pdfs) express intra-tuple dependencies at the schema level, history captures the inter-tuple dependencies at the instance level. The history of a given set of uncertain attributes denotes the attribute sets from which it is derived, and is used while performing operations on the tuples to ensure that the results are consistent with PWS [18]. History captures dependencies among attribute sets as a result of prior database operations. A similar concept is used in many tuple uncertainty models to track correlations between tuples. [1] uses lineage and [14] uses factor tables to capture such dependencies. As we are interested in capturing historical dependencies between attributes of tuples, our concept of dependencies is different from this related work, which captures these dependencies on a per tuple basis. We maintain the history of uncertain attributes by storing the top-level ancestors of each dependency set in a tuple. The function maps each pdf of a dependency set, to a set of pdfs that are its ancestors, i.e. the base pdfs which are inserted into the database by the user. We assume that the base tuples are independent. All the derived attributes point back to the base pdfs from which they are derived.

As an example, we show a base table $T$ and two derived tables $T_1$ and $T_2$ in Figure 2. $T$ has two uncertain attributes $A$ and $B$ that are jointly distributed. $T_1$ is derived by projecting $A$ on $T$ satisfying $A > 3$. $T_2$ is simply a projection of attribute $B$ on $T$. $T_{12}$ is a resulting table after joining $T_1$ and $T_2$. As we can see from $T_{12}$, the pair $(4, 5)$ has a probability $0.9$ rather than $0.9 * 0.9 = 0.81$ and the pair $(4, 3)$ is not even in the result. If we had looked at $T_1$ and $T_2$ only without checking their histories (both of them come from $T$ and their attributes are correlated in $T$), we could have made mistakes about the resulting tuples after joining the two tables. Since attribute $A$ and $B$ are historically jointly distributed, we need to refer to the original distribution in $T$ when performing the join operation on $T_1$ and $T_2$. The pair $(4, 5)$ has a probability $0.9$ according to the joint distribution in $T$ and the pair $(4, 3)$ is not a valid pair (cannot appear in any possible worlds) because it is not in the original distribution.

Figure 2: Example illustrating the history
Image history

Note that the deletion of a base tuple will cause dependency sets of its derived tuples to lose their ancestor information. Thus, while deleting a tuple from the base table, we first check if any other tuple in the database is referencing any dependency set within the tuple. If there is a reference, we delete the tuple but keep the dependency set and its pdf as a phantom node until its reference count falls to zero.

Rohit Jain 2011-08-02