Correlations and missing data

The simple distributions discussed above represent one-dimensional (1-D) pdfs. In many cases, however, the attributes within a tuple are correlated. Table 2 is an example in which the uncertain attributes $x$ and $y$ of the same tuple are correlated. Orion 2.0 supports these more complex distributions through joint probability distributions across attributes.

Intra-tuple dependencies are captured by the schema dependency information, which is a partition of all the uncertain attributes present in the table $T$. Each element of the partition is a dependency set: either a set of attributes that are correlated within a tuple, or a singleton set containing an uncertain attribute that does not depend on any other attribute. Attributes that appear in no dependency set are assumed to be certain.

To illustrate, consider a table $T$ with schema $(a_1:d_1, a_2:d_2, a_3:d_3, a_4:d_4)$, where $d_i$ is the data type of attribute $a_i$. If all the attributes in the table are certain, then there are no dependency sets. On the other hand, if $a_1$, $a_2$ and $a_3$ are uncertain and $a_1$ and $a_2$ are correlated, this information is captured by the dependency sets $\{a_1, a_2\}, \{a_3\}$. Attribute $a_4$ appears in neither set and is therefore certain.
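The example above can be sketched in a few lines of Python. This is only an illustration of the partition idea, not Orion's actual catalog representation: the function name `certain_attributes` and the use of plain Python sets for dependency sets are assumptions made here for clarity.

```python
from typing import List, Set


def certain_attributes(schema: List[str], dependency_sets: List[Set[str]]) -> List[str]:
    """Return the attributes of `schema` that appear in no dependency set.

    Hypothetical helper: it also checks that the dependency sets are
    non-empty, drawn from the schema, and pairwise disjoint, i.e. that
    they form a partition of the uncertain attributes.
    """
    schema_attrs = set(schema)
    uncertain: Set[str] = set()
    for dep in dependency_sets:
        if not dep:
            raise ValueError("a dependency set must not be empty")
        if not dep <= schema_attrs:
            raise ValueError(f"unknown attributes: {sorted(dep - schema_attrs)}")
        if uncertain & dep:
            raise ValueError(f"attributes in more than one set: {sorted(uncertain & dep)}")
        uncertain |= dep
    # Everything not covered by a dependency set is assumed to be certain.
    return [a for a in schema if a not in uncertain]


# Schema and dependency sets from the example: {a1, a2}, {a3}.
schema = ["a1", "a2", "a3", "a4"]
dependency_sets = [{"a1", "a2"}, {"a3"}]
print(certain_attributes(schema, dependency_sets))  # ['a4']
```

With an empty list of dependency sets, the same function returns the whole schema, matching the case in which every attribute is certain.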

In our first sensor data example (Table 1), the attribute Sensor ID is certain while Location is uncertain, so the dependency information is $\{Location\}$. There is only one dependency set because there is only one uncertain attribute. A table with several uncertain attributes may also have a single dependency set. For example, in Table 2, where the location is 2-D, the only dependency set is $\{x, y\}$ because $x$ and $y$ are jointly distributed and hence dependent on each other.

Consider the special case in which all the uncertain attributes in a table $T$ are jointly distributed (i.e. there is one dependency set). This extreme case captures tuple uncertainty, since the complete value of the tuple is uncertain [18], as illustrated in Table 2. The probability that the tuple is present (which we call the tuple probability) is then the total probability of the joint distribution. When there are multiple dependency sets within a tuple, we can compute the joint pdf of the tuple by multiplying the individual pdfs of the dependency sets, since different dependency sets are independent of each other. If the joint pdf sums (or, in the continuous case, integrates) to $p$, then $1-p$ is the probability that the tuple does not exist in the database. When $p < 1$, we call such pdfs partial pdfs. An important feature of Orion is its support for partial pdfs, which ensures that database operations such as selection are consistent with PWS [18].
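As a rough illustration of the tuple probability, the following Python sketch uses discrete pdfs stored as dictionaries. The representation, the function names, and the probability values are all hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from Orion. Under the independence assumption stated above, the joint pdf is the product of the per-set pdfs, so its total mass $p$ is the product of the per-set masses.

```python
from math import isclose, prod
from typing import Dict, List, Tuple

# A discrete pdf over one dependency set: value tuple -> probability.
# The masses may sum to less than 1 (a partial pdf).
DiscretePdf = Dict[Tuple, float]


def total_mass(pdf: DiscretePdf) -> float:
    """Sum of all probabilities in one (possibly partial) pdf."""
    return sum(pdf.values())


def joint_pdf(pdfs: List[DiscretePdf]) -> DiscretePdf:
    """Joint pdf of the tuple, assuming the dependency sets are independent."""
    joint: DiscretePdf = {(): 1.0}
    for pdf in pdfs:
        joint = {
            values + new_values: p * q
            for values, p in joint.items()
            for new_values, q in pdf.items()
        }
    return joint


def tuple_probability(pdfs: List[DiscretePdf]) -> float:
    """Probability p that the tuple exists: total mass of the joint pdf."""
    return prod(total_mass(pdf) for pdf in pdfs)


# Hypothetical tuple with dependency sets {x, y} and {z}.
xy_pdf = {(1, 2): 0.5, (3, 4): 0.25}  # partial pdf, mass 0.75
z_pdf = {(7,): 0.5, (8,): 0.5}        # full pdf, mass 1.0

p = tuple_probability([xy_pdf, z_pdf])
joint = joint_pdf([xy_pdf, z_pdf])

print(p)      # 0.75
print(1 - p)  # 0.25, probability that the tuple does not exist
assert isclose(total_mass(joint), p)
```

Here the pdf over $\{x, y\}$ is partial, so the tuple exists with probability $p = 0.75$ and is absent with probability $1 - p = 0.25$.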

Rohit Jain 2011-08-02