|
|
|
 |
|
The Tukwila Data Integration System
- A data integration system is an automated method for querying across
multiple heterogeneous databases in a uniform way. In essence, a
mediated schema is created to represent a particular
application domain and data sources are mapped as views over the
mediated schema. The user asks a query over the mediated schema
and the data integration system reformulates this into a query over the
data sources and executes it. The system then intelligently
processes the query, reading data across the network and responding to
data source sizes, network conditions, and other factors.
- Query Processing Challenges
Query processing in data integration occurs over network-bound,
autonomous data sources ranging from conventional databases on the LAN
or intranet to web-based sources across the Internet. The Tukwila
data integration system is designed to scale up to the amounts of data
transmissible across intranets and the Internet (tens to hundreds of
MBs), with large numbers of data sources. This requires extensions
to traditional optimization and execution techniques for three reasons:
there is an absence of quality statistics about the data, data transfer
rates are unpredictable and bursty, and slow or unavailable data sources
can often be replaced by overlapping or mirrored sources.
Additionally, the natural data format for data integration is XML --
which creates new challenges in query processing and optimization.
(See the discussion of the x-scan
operator for more information.)
- The Tukwila Solution

The Tukwila data integration system can make use
of a mediated schema by taking the original query and applying a highly
efficient query reformulation algorithm, MiniCon,
which maps the input query from the mediated schema to the data
sources. Next, the query processing system takes a three-pronged
approach to providing efficient query answering: (1) use of
operators with flexible scheduling policies, to mask latencies,
(2) overlapping of query operations using pipelining, even for
XML data, with the x-scan operator, and (3) convergent query
processing, which allows the system to choose a query plan, execute
it for some amount of time, then revise statistics and cost estimates
and generate an improved plan -- all in mid-stream. Experiments
have shown that the Tukwila approach results in superior performance
versus traditional techniques.
- Our architecture extends previous innovations in adaptive execution
(such as query scrambling, mid-execution re-optimization, eddies, and
choose nodes), and we have experimentally demonstrated the benefits of
our system (see the papers below for more information).
Recent Work
The recent focus of the Tukwila project has been in three areas:
- Fully integrated support for efficient processing of XML data,
based on the x-scan
operator. XML provides a common encoding for data from many
different sources; combined with standardization of schemas (DTDs)
across certain domains, it greatly reduces the needs for wrappers and
even query reformulation. The latest versions of Tukwila are
built around an adaptive query processing architecture for XML, and
can seamlessly combine XML and relational data into new XML content.
- Support for output of incremental results in a meaningful
fashion. For web-based applications, we would like to see output
incrementally, and as early as possible. (This is in contrast to
the typical batch-oriented paradigm, in which overall query completion
time is key.)
- Providing "preview" answers to queries. In many cases, it
may be useful to quickly return some sort of approximate or partial
answers during a long-running query, so the user gets early
feedback. We have developed a general framework for expressing
the semantics of preview results, as well as optimization and
execution strategies for "preview queries."
- Development of techniques and algorithms for "convergent query
processing." Our convergent query processing framework allows
the system to continuously re-optimize a query at any point as it gets
improved information, for fairly complex classes of queries (including
grouping, sorting, etc.).
Project Members
- Professor Alon
Halevy
- Professor Dan
Weld
- Zack Ives is
responsible for the Tukwila execution engine and its adaptive operation,
as well as the optimizer for previewing query results. His work
largely relates to adaptive query processing, processing of XML data,
XML and zero-knowledge query optimization, and XML query languages.
Project Alumni
- Hakim Weatherspoon worked on a web wrapper toolkit. Hakim is
now a graduate student at U.C.
Berkeley.
- Marc
Friedman constructed the original optimizer for the Tukwila
system. He recently graduated and joined the Viathan Corporation.
- Daniela
Florescu, INRIA (France) advised during the original Tukwila design
stages. She is now at Propel
Software..
- Rachel
Pottinger has developed the Tukwila query reformulator, which uses a
new query reformulation algorithm, MiniCon,
that is significantly more efficient than previous approaches.
Rachel's current work involves the construction of systems for managing
models (e.g. schemas).
Web Pages
- The x-scan
operator, a new primitive for XML data integration
- The MiniCon
query reformulation algorithm
- High-level system
overview
- Tukwila query execution engine/optimizer interface
specifications (UPDATED for XML extensions)
- Spring 1999 CSE
544, using Tukwila as the core of the class project
Presentations
Publications
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld.
Convergent Query Processing. Submitted for
publication, 2001.
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. An XML
Query Engine for Network-Bound Data. Submitted for
publication, 2001.
- Rachel Pottinger, Alon Levy. A
Scalable Algorithm for Answering Queries Using Views. VLDB
2000, Cairo, Egypt.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu,
Marc Friedman. Adaptive Query
Processing for Internet Applications. IEEE Data
Engineering Bulletin, Vol. 23 No. 2, June 2000.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld. Efficient
Evaluation of Regular Path Expressions on Streaming XML Data.
Technical Report UW-CSE-2000-05-02, University of Washington.
- Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel
S. Weld. An
Adaptive Query Execution System for Data Integration. ACM
SIGMOD Conference on Management of Data, Philadelphia, PA, June 1-3,
1999.
This research has been funded in part by NSF grant 9978567.
|