Date of Award
Doctor of Philosophy
Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. The most widely adopted query optimizers use synopses to tune up the databases in manners of optimization and re-optimization. From Chapter 1 to Chapter 3, we focus on the synopses for query optimization. We propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design an innovative estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget. In Chapter 4, we focus on the synopses for query re-optimization on repetitive queries. Repetitive queries refer to those queries that are likely to be executed repeatedly in the future, such as those that are used to generate periodic reports, perform routine maintenance, summarize data for analysis, etc. They can constitute a large part of daily activities of a database system and deserve more optimization efforts. In this paper, we propose to collect information about how tuples are joined in a query, called the query or join trace, during execution of a query. We intend to use this join trace to compute the selectivities of joins in all join orders for the query. We use existing operators, as well as new operators, to gather such information. We show that the trace gathered from a query is sufficient to compute the exact selectivities of all plans of the query. To reduce the overheads of generating a trace, we propose a sampling scheme that generates only a sample of the trace. Experimental results have shown that with only a small sample of the trace, accurate estimates of join selectivities can be obtained. The sample trace makes re-estimation of join selectivities of a repetitive query efficient and accurate.
This dissertation is only available for download to the SIUC community. Current SIUC affiliates may also access this paper off campus by searching Dissertations & Theses @ Southern Illinois University Carbondale from ProQuest. Others should contact the interlibrary loan department of your local library or contact ProQuest's Dissertation Express service.