SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries.

abstract

(1) The concept of OLAP query processing is now being widely adopted in various applications.FK-FK joins increases in those applications.

(2) However, the existing in-memory OLAP systems tend not to handle such complex queries efficiently since they generate a large amount of intermediate results or incur a huge amount of probe cost.

(3) we propose an effective query planning method for complex OLAP queries.It generates a query plan contain_x0002_ing n-ary join operators based on a cost model.。The plan does not generate intermediate results for processing FK-FK joins and significantly reduces the probe cost

Keywords

n Connection operators, query planning, query optimization, FKFK connection, complex OLAP query processing, collaborative processing

INTRODUCTION

Sorting algorithms: SPRINTER uses different sorting algorithms and techniques depending on the cardinality of a table to be sorted, the number of sorting columns, and the capacity of GPU memory in case of GPU sorting.Generally, comparison-based algorithms are not used, because non-comparison-based algorithms are O(k.N), and the speed limit is much lower than that of comparison-based algorithms, which is O(N.logN). SPRINTER uses heterogeneous sorting to effectively utilize CPU and GPU.as much
as possible by overlapping three kinds of low-level GPU op_x0002_erations, H2D copy, sort, and D2H copy.

Worst-case optimal join algorithm: The algorithms are mainly proposed to process a graph pattern query (e.g.,triangle query) efficiently by avoiding the generation of in_x0002_termediate results.。If a query has L join variables, the number of possible global variable orders is L!factorial. the dominating cost of TJ is the sorting cost of input relations,TJ uses binary search, and so, the cost of a single seek is O(logN), and estimates the cost of join processing for each possible order and choose the best one.

PRELIMINARIES

Sorting algorithms

Parallel sorting algorithms and techniques used in SPRINTER. Most GPU sorting algorithms can only sort data in GPU memory. However, the input array group to be sorted can be much larger than the GPU memory capacity in the OLAP system. In that case, SPRINTER used heterogeneous sorting, effectively using CPU and GPU.

2.2 Worst-case optimal join algorithm

Worst-case optimal join algorithm usually uses a fixed global ordering for all connection variables.Assume that the order of global variables in Figure 2 is x<y<z. The TJ algorithm is sorted by (x, y), and R is sorted by x<y.. etc. Then, TJ first scans all the relationships on the first connection variable x (ie R and T).

3 MOTIVATION

we present a motivating query that shows the drawbacks of the existing OLAP query processing systems. The query is the one on the TPC-DS benchmark database and widely used for testing the query performance of the OLAP systems.

The query contains three fact tables {SS, SR, CS} in blue rectangles (fact tables, which record specific events, such as user transaction flow tables)

​ The three-dimensional table is of green rectangle {D, I, C}. (Each dimension may contain multiple attributes, such as time dimension, insert time and update time)
The blue line has 2 FK-FK connection operations, and the green line has 5 PK-FK connection operations

4 QUERY PLANNING METHOD

Figure shows the query plan and its performance generated by System-X, OmniSci and our SPRINTER.

System-x is a fully functional state-of-the-art commercial in-memory database system

In Figure, It generates a left-deep join tree for the query plan, which is equivalent to combining from left to right. C1, C2, and C3 are the same relationship, namely relationship C. The plan generates 22.3B intermediate tuples as the left operand before the last join. , And take the 144 million M-tuples of CS as the right operand,
Because the connection operation is an FK-FK connection, and there are many duplicate key values in the hash table, the detection cost of num in the process of detecting the connection is huge, reaching 160.9B.

SPRINTER is a system prototype, that is, a method we propose to integrate seamlessly across all relevant layers and modules into OmniSci.

As shown in the figure, SPRINTER generates a query plan consisting of multiple white binary join operators and a red n-ary join operator.
It executes three connected subtrees respectively, S1, S2, S3, and a pipeline method like OmniSci. When we calculate, the total detection cost of S1, S2, and S3 is only 758M.

The performance results show that OmniSci eliminates the connection of a single large block through pipelining. SPRINTER improves the performance of OmniSci by dividing a single large connection tree into multiple smaller connection subtrees and performing n connections. Split into multiple smaller connected subtrees, and perform n connections on the result of the connected subtrees.

OmniSci is an open source collaborative processing database system, where collaborative processing means using cpu and gpu for query processing at the same time.

It generates a left deep join query plan the same as a.
However, the plan is executed in a non-blocking and pipelined manner, and it does not generate and store intermediate results to be connected. As shown in Figure (b), each tuple of the second connection detection passes the CS table of the first connection 1.44M tuple for the hash, and the probe cost is 20.7 B. The total probe cost is 25.8 B.

4 QUERY PLANNING METHOD

​ Introduce a query planning method used to generate a query plan containing a single n-ary join operator. Then, we generalize the method to more complex queries, so that it can generate an operator that contains multiple n-ary joins in Section 4.2

We consider a connection graph from a given query Q. The connection graph G=(V, E, f(v∈V), g(e∈E), h(e∈E)) is an undirected multigraph. There are three marking functions f(v), g(e) and h(e).

The function f(v) returns the type of relation v, that is, fact or dimension,

Function g(e) the type of connection operation, namely PK-FK or FK-FK,

And the connection predicate e such as function h(e), namely h(e)=(X[i], Y[j] ).


The subgraph {SS, SR, CS} is a core subgraph, because all vertices are fact tables

We can decompose G into a core subgraph core and a set of non-core subgraphs that do not intersect each other at the edges. The above two decomposition methods are expressed as D1={core, G1, G2, G3} and D2={core, G1}. Make a core subgraph the root node of a query plan and mak Use a non-core subgraph to represent the child nodes of the root node in the query plan.

Algorithm 1 provides a basic query planning method for queries containing only a single core subgraph.

Algorithm 1. Determine a core subgraph, then traverse its non-core subgraph, which is generally a dimension table, and finally calculate the calculation of the fact table to minimize the amount of results formed by the intermediate connection operation.

The intuition behind this method is to make a core subgraph the root node of a query plan and mak. Use non-core subgraphs to represent the child nodes of the root node in the query plan. Considering the heuristic method, the FK-FK join on the fact table is processed in the last step of the query plan, thereby reducing the amount of intermediate results generated from the join operation in the non-core subgraph.

4.1 Query of a single n-ary join operator

Figure shows two possible query plan trees generated by the above two decompositions D1={core, G1, G2, G3} and D2={core, G1}.

When comparing the two query plans in Figure 6, we can say that plan P1 usually costs less than plan P2 because P1 scans a dimension table C twice, and P2 scans two facts SR and CS again.

4.2 Query of multiple n-ary join operators

We will explain the query planning method when there are multiple core subgraphs in a query. We denote the number of core subgraphs as Ncore, which means that a specific core in the connection graph is regarded as the only core in the connection graph.

Apply Algorithm 1 recursively to all non-core subgraphs. For the generalized version, we only need to modify the sixth row in Algorithm 1.

If Gj has one or more core subgraphs, algorithm will generate a query plan Pnew
and then estimates Pold cost and PNew cost,
After comparing the two costs, only the lower planned cost is returned as an output.

4.3 Search space

The number of decompositions depends on which core subgraph is selected as the root node. Generally speaking, there is Ncore(G) core subgraph in the connection graph G.

Calculate the number of possible plans with a single n-ary join operator (as a root node) generated by a naive m

Each decomposition can generate multiple query plans. For a specific decomposition Dj (1≤j≤Ndcmp(Ci)), we assume that there are Nsubg(Dj) non-core subgraphs

5 n-ARY JOIN PROCESSING METHOD

We will introduce SPRINTER’s n connection processing method. We will implement a naive method of handling n concatenation operators with a set of sub-join subtrees. Please perform the following six steps.

  • -Step 1: Evaluate {S1, ·, ·, Sn} one by one.Si is the connected subtree
    -Step 2: Calculate the necessary statistical data of {S1,·,··,Sn} results.
    -Step 3: Estimate the cost of each global sequence of connected variables.
    -Step 4: Choose the best global variable order.
    -Step 5: Sort the results of {S1, ·, ·, Sn} in a global order.
    -Step 6: Merge connections on n sorted relations.

Steps 2-4 are completed at time t1 in Figure 7(a), which determines the order in which the TJ algorithm is globally applied between the connected columns.
Step 5 Sort each result according to the global order. Step 6 is completed at time t2 using the TJ algorithm described in Section 2.2 2.2.
The total elapsed time in the graph will not change

Therefore, even if we use the GPU for the sorting step, it is difficult to greatly reduce the total running time. Therefore, we use an improved method to deal with an n-ary join operator, which is performed by of the following four steps.

  • Step 1: Estimate the cost of each global sequence of connected variables.
  • Step 2: Choose the best global variable order.
  • Step 3: Perform overlapping sorting on the results. **difference
  • Step 4: Combine connections on n sorted relations.and using GPU(i, j) to sort the results of Sj.

In step 3, it overlaps using the CPU to evaluate Si,

Determination of Global Order

The figure shows a core sub-figure of TPC-DSQ17, which has three connected variables, item, cust and ticket.

5.1 Determination of Global Order

The figure shows a core sub-figure of TPC-DSQ17, which has three connected variables, item, cust and ticket.

For item, |Estem|=2, and Citem=||CS||+||SR||+||SS||.
Then, the algorithm sorts the triplet list in descending order of (|Ew|, Cw).
Intuitively, the algorithm chooses a join variable having more join conditions and potentially more tuples to be processed as higher pri_x0002_ority to reduce the total amount of binary search on sorted relations.

5.2 Strategy for Sorting

Three factors: GPU availability, the size of the amount of data to be sorted, the amount of data fields,

If GPU is not available, sprinter uses cpu non-comparative sorting number of sorting columns is only one,

Secondly, if the data to be sorted can fit into GPU memory, SPRINTER uses a non-comparative or comparison-based GPU sorting algorithm, depending on the number of sorting columns.

Third, if the data to be sorted cannot fit into the GPU memory, we need to choose carefully

A sorting algorithm The problem of lack of good implementation of GPU sorting

The implementation speed of a comparison-based algorithm of GPU is only slightly faster than that of CPU.

5.3 COST MODEL FOR QUERY OPTIMIZATION

Query processing using n-ary join operators may not always achieve better performance than traditional query processing methods using only binary join operators. Therefore, we only choose it when the cost model is favorable.

EQ3 shows the cost function of Pold
Eq4 is the cost function for building a hash table
Eq5 is the cost function of detecting tuples

5.4 Cost model of SPRINTER

Eq.7 represents the cost of n-ary connection processing, which includes the sorting cost of n connected subtree results and the connection cost of using the TJ algorithm.

Eq8 shows the cost of merging n sort relationship algorithms

6 EXPERIMENTAL EVALUATION

First, we compare SPRINTER with the existing OLAP query processing systems in terms of the elapsed times for complex OLAP queries in the TPC-DS benchmark.
Second,we show the characteristics of SPRINTER.

Experimental setup

Query and data set: In the TPC-DS benchmark test, there are a total of 26 TPC-DS queries. There is at least one FK-FK connection environment: We conduct all experiments on one machine, equipped with two Intel Xeonen 10-core CPUs,
512GB main memory and an 11GB NVIDIAGTX1080TiGPU.
Operating system OS7.5

Compared with the sprinter system, OLAP systems are divided into two types: cpu-based systems (such as Syssem-x) and collaborative processing systems (such as OmniSci). Each system is set to use the main memory and GPU device memory as much as possible at the same time (only for common processing systems)

For cpu-based systems, System-Y is one of the most advanced commercial OLAP database systems, supporting index-driven query execution and query optimization technology, which is similar to System-X.
However, it generates a binary (more balanced tree, different from making a deep tree) query plan, which is one of the main differences from System-x.
In addition, System-Y processes query plans in a pipelined manner and supports the covariance of query execution. We use the latest version of system x and system y for experiments.

using only CPU as SPRINTER(C) and the one of SPRINTER,using both CPU and GPU as SPRINTER(G).

The performance of most queries of sprinter(C) and sprinter(G) is better than all systems, this is due to their different query plans and different connection processing
For Q64, SPRINTER(G) has improved performance compared with Syssem-Y, System-X, MonetDB and Actia. n vectors are 6.6, 7.4, 20.1 and 5.1 times, respectively
We note that the current SPRINTER does not use ad_x0002_vanced query optimization techniques that System-X and System-Y use, since its base system, OmniSci, does not sup_x0002_port them yet. Thus, the performance of SPRINTER can be further improved by applying the optimization techniques to OmniSci or SPRINTER.

Compare all the tested queries. Compared with cpu-based systems, existing collaborative processing systems have more failures during processing, and generally have worse performance.
This is because co-processing systems usually do not use advanced query optimization techniques, and are not mature enough to use GPUs to effectively process complex queries.
For OmniSci, even when using main memory to execute a query, if the query becomes more complex, it tends to incorrectly estimate the amount of memory required, so the connection tree with the deeper left becomes deeper and deeper. As a result, OmniSci often fails due to the large detection cost of the m.E. of incorrect memory allocation or FK-FK connection.

In contrast, although SPRINTER is based on OmniSci, it did not fail and the query performance was significantly improved. The left deep join subtree in the query plan generated by SPRINTER is much smaller than OmniSci’s join tree. At the same time, there is almost no fact table used to build a hash table.

In contrast

This paper proposes a fast n-ary join query processing method for complex OLAP queries with FK-FK joins.

It will generate a query plan containing n-ary join operators if it is better than the traditional left deep binary join tree based on our cost model.

This plan can significantly reduce the cost of detection by putting the FK-FK connection on the fact table into an n-ary connection operator.

We also have proposed an effi cient n-ary join processing method which is based on the TJ algorithm and heuristic algorithm selecting a good global variable order.

The sprinter we proposed can already be integrated into the open source memory OLAP system, OmniSci, across all related layers and modules.

Through experiments using the TPC-DS benchmark test, we have proved that even without GPU sequencing, the performance of SPRINTER is better than the most advanced OLAP system, although its basic system OmniSci achieves the second-worst performance among them.

Algorithm execution process:

Algorithm 1. Determine a core subgraph, then traverse its non-core subgraph, which is generally a dimension table, and finally calculate the calculation of the fact table to minimize the amount of results formed by the intermediate connection operation.

Algorithm 2 is a generalization of Algorithm 1, there are multiple core subgraphs, and it is traversed.

调查目的 MOTIVATION

调查对象 Intro

调查内容 4和5

调查的实验评估

调查结果 实验结果

调查体会 In conclude