## Correlations and missing data

The simple pdfs discussed above can represent only 1-D
distributions. In many cases, however, there exist
intra-tuple correlations between attributes. Table 2
is an example in which the two uncertain location attributes
in the same tuple are correlated.
These more complex distributions are supported in Orion 2.0 using
joint probability distributions across attributes.
The information about intra-tuple dependencies is captured
by the schema dependency information, which is a partition
of all the uncertain attributes present in the table. It consists
of dependency sets: sets of attributes that are correlated within a
tuple, and singleton sets for attributes that are uncertain but
do not depend on any other attribute. Attributes not listed
in any dependency set are assumed to be certain.
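
The partition structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Orion's actual representation: the function names, the use of `frozenset`, and the attribute names `id`, `x`, `y`, and `temp` are all hypothetical.

```python
def is_valid_dependency_info(schema, dependency_sets):
    """Check that the dependency sets are non-empty, pairwise disjoint,
    and contain only attributes that appear in the schema."""
    attributes = set(schema)
    seen = set()
    for dep in dependency_sets:
        if not dep or not dep <= attributes or seen & dep:
            return False
        seen |= dep
    return True


def certain_attributes(schema, dependency_sets):
    """Attributes not listed in any dependency set are assumed certain."""
    uncertain = set().union(*dependency_sets)
    return [a for a in schema if a not in uncertain]


# Hypothetical table: attribute names are illustrative only.
schema = ["id", "x", "y", "temp"]
dependency_sets = [
    frozenset({"x", "y"}),   # x and y are correlated within a tuple
    frozenset({"temp"}),     # uncertain, but independent of the others
]

valid = is_valid_dependency_info(schema, dependency_sets)
certain = certain_attributes(schema, dependency_sets)
```

Under these assumptions, `valid` is `True` and `certain` contains only `id`. An empty list of dependency sets would mark every attribute as certain.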

To illustrate, let us consider a table whose schema pairs each
attribute with its data type. If all the attributes in the table are certain,
then the dependency sets are empty. On the other hand, if two attributes are uncertain
and are correlated, this information is captured by a dependency set
containing exactly those two attributes.

In our first sensor data example (Table 1), attribute Sensor ID is certain while Location is
uncertain. Its dependency information consists of the single dependency set {Location}; there is only
one dependency set because there is only one uncertain attribute. However, even when there are multiple
uncertain attributes, there may still be only one dependency set. For example, in Table 2, where the
location is 2-D, the only dependency set contains both location attributes, because they are jointly
distributed and hence dependent on each other.

Consider the special case in which all the uncertain attributes in a table
are jointly distributed (i.e. there is one dependency set). This extreme
case captures tuple uncertainty, since the complete value of the
tuple is uncertain [18], as illustrated in Table 2. The probability that a tuple is present (we call it the tuple probability)
is then the total probability of the joint distribution. When there are multiple
dependency sets within a tuple, we can compute the joint pdf of the tuple
by multiplying the individual pdfs of the dependency sets (the justification is
that different dependency sets are independent of each other).
If the probabilities over the joint pdf sum up (integrate) to
$x$, then $1 - x$ is the probability that the tuple does not exist
in the database. When $x < 1$, we call such pdfs partial pdfs.
An important feature of Orion is its support
for partial pdfs, which ensures that database
operations such as selection are consistent with possible worlds semantics (PWS) [18].
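
The tuple-probability computation described above can be sketched for the discrete case as follows. This is a minimal illustration, not Orion's implementation: it assumes each dependency set's pdf is stored as a dictionary from values to probabilities, and the function names and sample values are hypothetical.

```python
from itertools import product


def total_probability(pdf):
    """Sum of all probability mass in a discrete pdf."""
    return sum(pdf.values())


def joint_pdf(pdfs):
    """Multiply the pdfs of independent dependency sets into one
    joint pdf over the whole tuple."""
    joint = {}
    for combo in product(*(pdf.items() for pdf in pdfs)):
        values = tuple(value for value, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[values] = prob
    return joint


# Hypothetical pdfs, one per dependency set (values are illustrative).
location_pdf = {(0, 0): 0.5, (1, 1): 0.25}   # partial pdf: mass is 0.75
temp_pdf = {20: 0.5, 25: 0.5}                # complete pdf: mass is 1.0

tuple_pdf = joint_pdf([location_pdf, temp_pdf])
tuple_probability = total_probability(tuple_pdf)
absence_probability = 1.0 - tuple_probability
is_partial = tuple_probability < 1.0
```

With these sample values the joint pdf carries a total mass of 0.75, so the tuple is absent from the database with probability 0.25 and the joint pdf is partial.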

Rohit Jain
2011-08-02