oracle三种表的连接模式hash join、nested loop,sort merge join的分析
(2014-08-11 14:22:02)
标签:
佛学 |
此博文转自http://www.2cto.com/database/201309/245953.html,稍后我会有补充以及修改。
hash join、nested loop,sort merge join
在oracle 执行计划中存在三种表的连接方式,hash join、nested loop,sort merge join
1、hash join
hash join
A join in which the database uses the smaller of two tables or data
sources to build a hash table in memory. The database scans the
larger table, probing the hash table for the addresses of the
matching rows in the smaller table.
也就是说hash join是 在两个表中连接的时候存在的,散列连接是CBO
做大数据集连接时的常用方式,可以把一个小表或是数据源整合到内存中建立一个hash table,然后数据库开始扫描这个大表,使用内存中的hash table的地址来匹配外表中的数据行。
一个小的表开始joined一个大表的时候,oracle的hash join与nested loop
join相比显的非常快。oracle对hash
join占用的内存是有限制的,这个值是5%pga_aggregate_target。
The Oracle DBA controls the optimizers'
propensity to invoke hash joins because the DBA must allocate the
RAM resources to Oracle (using the hash_area_size and
pga_aggregate_target parameters) for the optimizer to choose a hash
join. The CBO will only choose a hash join if you have allocated
Oracle enough RAM area in which to perform the hash join.
我们可以使用use_hash强制使用hash join
另外,oracle 的hash
join需要很多的内存来建立hash表,为了提高效率,我们必须设置hash_area_size足够大,如果hash表占用的内存超出了hash_area_size的大小,那么就会分页到临时表空间,这会带来一定的消耗影响性能。
eg:
SQL> select count(*) from amy_emp;
COUNT(*)
----------
7340032
SQL> select count(*) from amy_dept;
COUNT(*)
----------
4
SQL> set autotrace on explain
SQL> select count(*) from amy_emp,amy_dept where
amy_emp.deptno=amy_dept.deptno;
COUNT(*)
----------
7340032
Elapsed: 00:00:03.35
Execution Plan
----------------------------------------------------------
Plan hash value: 653649851
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 4 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | HASH JOIN | | 14 | 84 | 4 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | REVERSE_INDEX | 4 | 12 | 1 (0)| 00:00:01
|
| 4 | TABLE ACCESS FULL| AMY_EMP | 14 | 42 | 3 (0)| 00:00:01
|
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
SQL> select count(*) from amy_emp,amy_dept where
amy_emp.deptno=amy_dept.deptno;
COUNT(*)
----------
7340032
Elapsed: 00:00:25.10
Execution Plan
----------------------------------------------------------
Plan hash value: 1472773524
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
| 2 | NESTED LOOPS | | 14 | 84 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| AMY_EMP | 14 | 42 | 3 (0)| 00:00:01
|
|* 4 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01
|
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
SQL>
从中,可以看出在这种情况下hash join运行的速度远远大于nested loops;
什么时间optimizer 使用 hash joins呢?
The optimizer uses a hash join to join two tables if they are
joined using an equijoin and if either of the following conditions
are true:
1)A large amount of data must be joined.
2)A large fraction of a small table must be joined.
2、nested loop joins
这个东西叫嵌套循环链接,使用它的条件是
1)数据库获取的数据的集合不能太大
2)能够有效的访问内部表,也就是说内部表最好有索引。
看一下这句话:
It is important to ensure that the inner table is driven from
(dependent on) the outer table. If the inner table's access path is
independent of the outer table, then the same rows are retrieved
for every iteration of the outer loop, degrading performance
considerably. In such cases, hash joins joining the two independent
row sources perform better.
它包含的步骤如下:
1)optimizer 决定驱动表或是外部表
2)optimizer决定外部表
3)内表被外表驱动,外表返回的每一行都要在内表中检索找到与它匹配的行
如果我们想强制使用nested loop joins可以使用use_nl(table1,table2)
eg:
SQL> set autotrace trace;
SQL> r
1* select empno,ename,job,mgr from amy_emp,amy_dept where
amy_emp.deptno=amy_dept.deptno and amy_emp.ename
in('KING','TURNER')
Elapsed: 00:00:00.05
Execution Plan
----------------------------------------------------------
Plan hash value: 2114037428
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01
|
|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01
|
-------------------------------------------
驱动表为amy_emp,使用use_nl转换驱动表;
注意:
use_nl(table1,table2)并不能让optimizer选择这两个表那个做为驱动表,如果想明确指定需要加ordered参数,对于ordered的参数指明from后边的表顺序为从左到右,也就是左边为驱动表,右边为被驱动表,另外如果use_nl参数仅仅指定一个表,那么这个表就为被驱动表,但指定的这个表为外部表那么则会忽略这个hint。
eg:
QL> select empno,ename,job,mgr from amy_dept,amy_emp where
amy_emp.deptno=amy_dept.deptno and amy_emp.ename
in('KING','TURNER');
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 2114037428
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01
|
|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01
|
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("AMY_EMP"."ENAME"='KING' OR
"AMY_EMP"."ENAME"='TURNER')
3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
驱动表为:amy_emp,被驱动表为:amy_dept,use_nl无法判断。
SQL> select empno,ename,job,mgr from amy_dept,amy_emp where
amy_emp.deptno=amy_dept.deptno and amy_emp.ename
in('KING','TURNER');
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
Plan hash value: 2114037428
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01
|
|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01
|
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("AMY_EMP"."ENAME"='KING' OR
"AMY_EMP"."ENAME"='TURNER')
3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
驱动表为:amy_emp,被驱动表为:amy_dept,use_nl忽略hint
SQL> select empno,ename,job,mgr from amy_dept,amy_emp where
amy_emp.deptno=amy_dept.deptno and amy_emp.ename
in('KING','TURNER');
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
Plan hash value: 2114037428
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |
|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01
|
|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01
|
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("AMY_EMP"."ENAME"='KING' OR
"AMY_EMP"."ENAME"='TURNER')
3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
SQL> select empno,ename,job,mgr from amy_dept,amy_emp where
amy_emp.deptno=amy_dept.deptno and amy_emp.ename
in('KING','TURNER');
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3588874585
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 56 | 4 (0)| 00:00:01 |
|* 1 | HASH JOIN | | 2 | 56 | 4 (0)| 00:00:01 |
| 2 | INDEX FULL SCAN | REVERSE_INDEX | 4 | 12 | 1 (0)| 00:00:01
|
|* 3 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01
|
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")
3 - filter("AMY_EMP"."ENAME"='KING' OR
"AMY_EMP"."ENAME"='TURNER')
SQL>
驱动表为:amy_dept,我们使用了ordered进行了指定。full() index()
3、sort meger join
Hash Join Hints
Apply the USE_HASH hint to instruct the optimizer to use a hash
join when joining two tables together. See "PGA Memory Management"
to learn how to size SQL work areas. See "Hints for Join
Operations" to learn about the USE_HASH hint.
11.3.5 Sort Merge Joins
Sort merge joins can join rows from two independent sources. Hash
joins generally perform better than sort merge joins. However, sort
merge joins can perform better than hash joins if both of the
following conditions exist:
?The row sources are sorted already.
?A sort operation does not have to be done.
However, if a sort merge join involves choosing a slower access
method (an index scan as opposed to a full table scan), then the
benefit of using a sort merge might be lost.
Sort merge joins are useful when the join condition between two
tables is an inequality condition such as <, <=, >, or
>=. Sort merge joins perform better than nested loop joins for
large data sets. You cannot use hash joins unless there is an
equality condition.
In a merge join, there is no concept of a driving table. The join
consists of two steps:
1.Sort join operation: Both the inputs are sorted on the join
key.
2.Merge join operation: The sorted lists are merged together.
If the input is sorted by the join column, then a sort join
operation is not performed for that row source. However, a sort
merge join always creates a positionable sort buffer for the right
side of the join so that it can seek back to the last match in the
case where duplicate join key values come out of the left side of
the join.
11.3.5.1 When the Optimizer Uses Sort Merge Joins
The optimizer can choose a sort merge join over a hash join for
joining large amounts of data if any of the following conditions
are true:
?The join condition between two tables is not an equijoin.
?Because of sorts required by other operations, the optimizer finds
it is cheaper to use a sort merge than a hash join.
11.3.5.2 Sort Merge Join Hints
To instruct the optimizer to use a sort merge join, apply the
USE_MERGE hint. You might also need to give hints to force an
access path.
There are situations where it makes sense to override the optimizer
with the USE_MERGE hint. For example, the optimizer can choose a
full scan on a table and avoid a sort operation in a query.
However, there is an increased cost because a large table is
accessed through an index and single block reads, as opposed to
faster access through a full table scan.
参考:http://docs.oracle.com/cd/E11882_01/server.112/e41573/optimops.htm#PFGRF94645
总结一下,在哪种情况下用哪种连接方法比较好:(一下摘自摘自:SunnyXu)
排序 - - 合并连接(Sort Merge Join, SMJ):
a) 对于非等值连接,这种连接方式的效率是比较高的。
b) 如果在关联的列上都有索引,效果更好。
c) 对于将2个较大的row source做连接,该连接方法比NL连接要好一些。
d) 但是如果sort merge返回的row
source过大,则又会导致使用过多的rowid在表中查询数据时,数据库性能下降,因为过多的I/O。
嵌套循环(Nested Loops, NL):
a) 如果driving row source(外部表)比较小,并且在inner row source(内部表)上
有唯一索引,或有高选择性非唯一索引时,使用这种方法可以得到较好的效率。
b) NESTED LOOPS有其它连接方法没有的的一个优点是:可以先返回已经
连接的行,而不必等待所有的连接操作处理完才返回数据,
这可以实现快速的响应时间。
哈希连接(Hash Join, HJ):
a) 这种方法是在oracle7后来引入的,使用了比较先进的连接理论,
一般来说,其效率应该好于其它2种连接,但是这种连接只能用在
CBO优化器中,而且需要设置合适的hash_area_size参数,
才能取得较好的性能。
b) 在2个较大的row source之间连接时会取得相对较好的效率,在一个
row source较小时则能取得更好的效率。
c) 只能用于等值连接中
后一篇:oracle的物理读和逻辑读