Background and Motivations

Most databases deal with data that has certain values. However, for many applications data is inherently uncertain: The values of attributes may not be known precisely, or the presence of a tuple in a relation may be uncertain. Examples of applications with uncertain data include sensor databases (measured values have errors), text annotation (annotations are rarely perfect), information retrieval (the match between a document and a query is often a question of degree or confidence), scientific data (model outputs, estimates, experimental measurements, and hypothetical data), and data cleansing (multiple alternatives for an incorrect value) [18].

Due to the importance of uncertain data management, many researchers have addressed the problem of uncertain database modeling as well as querying with uncertain data. While some work deals with the fuzzy aspect of uncertain data [9], we focus on the probabilistic aspect, where uncertainty falls into two categories: One is the discrete uncertainty where an attribute has multiple alternative values, each with a corresponding probability to take that particular value; the other is the continuous uncertainty where an attribute is associated with a probability distribution function (pdf) over the range of the possible values. Such uncertainty comes from attributes, hence we call it attribute uncertainty. However, another source of uncertainty comes from tuples (we call it tuple uncertainty), where the presence of a tuple is probabilistic.

Our first example is the sensor data containing sensor IDs and locations of objects that the sensor measured. This is shown in Table 1. Here Location is uncertain (i.e. it is an uncertain attribute) with continuous uncertainty represented by a Gaussian distribution. Our second example is also from sensor readings. However, instead of a one-dimensional (1-D) location and a continuous pdf like the Gaussian distribution in Table 1, the location is now two-dimensional (2-D) with $x$ and $y$ as the coordinates jointly distributed over a list of possible pair values. This is shown in Table 2. The reason that we have a single joint pdf over these two attributes instead of specifying two independent pdfs over $x$ and $y$ is that the two coordinates are closely dependent on each other. Note that for the joint distribution of $x$ and $y$, the probabilities may not add up to 1. For example, in Tuple 2, the probabilities of the $2-D$ location add up to 0.7, hence the presence of the tuple itself is uncertain with 0.3 probability that the tuple does not exist.

Table 3: Example: Sensor data
Table 1: 1-D Location
Sensor ID Location
1 Gauss(20, 5)
2 Gauss(25, 3)
3 Gauss(15, 6)
Table 2: 2-D Location
Sensor ID x y
<#1045#> (1, 0): 0.2
  (1, 0): 0.5
  (1, -1): 0.3
<#1048#> (3, 4): 0.3
  (2, 3): 0.4
<#1050#> (-3, 4): 0.1
  (-2, 3): 0.4
  (-1, -1): 0.1

Many probabilistic data models have been proposed in recent years [2,1,4]. However, none of them deals with probabilistic data that can have both discrete and continuous domains and allows uncertainty at both attribute and tuple levels. The uncertain database model proposed in [18] is the first to address all these issues at the same time. Furthermore, the new model handles arbitrary correlations between attributes from the same tuple or across different tuples.

Besides probabilistic modeling, a lot of research has been done on efficient query processing over probabilistic data, including range queries [3,6], nearest-neighbor queries [7,10,13], ranking queries [19,11], skyline queries [12], and so on. Most of the work assumes single table queries and simple uncertain data models (e.g. either tuple uncertainty or attribute uncertainty, either discrete uncertainty or continuous uncertainty, no correlation between attributes, etc.). It lacks a comprehensive model to handle complex database queries consisting of selects, projects and joins in a consistent manner [18]. This motivated us to implement a general-purpose uncertain Database Management System (DBMS) that unifies the modeling of probabilistic data across applications, i.e. Orion 2.0. We followed the probabilistic data model proposed in [18] because it is a general and natural model based on attribute uncertainty that is also capable of handling tuple uncertainty as well as dealing with correlations between uncertain attributes whose data types can be discrete or continuous. Orion 2.0 not only offers a systematic way to manage different kinds of uncertainty, but also provides additional opportunities to the query engine for indexing and optimization [15].

Rohit Jain 2011-08-02