oracle三种表的连接模式hash join、nested loop，sort merge join的分析_550341599

http://blog.sina.com.cn/u/1802614403

首页博文目录关于我

个人资料

550341599

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：0

博客访问：140
关注人气：0
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

oracle三种表的连接模式hash join、nested loop，sort merge join的分析

(2014-08-11 14:22:02)

标签：

佛学

此博文转自http://www.2cto.com/database/201309/245953.html，稍后我会有补充以及修改。

hash join、nested loop，sort merge join

在oracle 执行计划中存在三种表的连接方式，hash join、nested loop，sort merge join

1、hash join

hash join

A join in which the database uses the smaller of two tables or data sources to build a hash table in memory. The database scans the larger table, probing the hash table for the addresses of the matching rows in the smaller table.

也就是说hash join是在两个表中连接的时候存在的，散列连接是CBO 做大数据集连接时的常用方式，可以把一个小表或是数据源整合到内存中建立一个hash table，然后数据库开始扫描这个大表，使用内存中的hash table的地址来匹配外表中的数据行。

一个小的表开始joined一个大表的时候，oracle的hash join与nested loop join相比显的非常快。oracle对hash join占用的内存是有限制的，这个值是5%pga_aggregate_target。

The Oracle DBA controls the optimizers' propensity to invoke hash joins because the DBA must allocate the RAM resources to Oracle (using the hash_area_size and pga_aggregate_target parameters) for the optimizer to choose a hash join. The CBO will only choose a hash join if you have allocated Oracle enough RAM area in which to perform the hash join.

我们可以使用use_hash强制使用hash join

另外,oracle 的hash join需要很多的内存来建立hash表，为了提高效率，我们必须设置hash_area_size足够大，如果hash表占用的内存超出了hash_area_size的大小，那么就会分页到临时表空间，这会带来一定的消耗影响性能。

eg：

SQL> select count(*) from amy_emp;

COUNT(*)

----------

7340032

SQL> select count(*) from amy_dept;

COUNT(*)

----------

SQL> set autotrace on explain

SQL> select count(*) from amy_emp,amy_dept where amy_emp.deptno=amy_dept.deptno;

COUNT(*)

----------

7340032

Elapsed: 00:00:03.35

Execution Plan

----------------------------------------------------------

Plan hash value: 653649851

-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 6 | 4 (0)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 6 | | |

|* 2 | HASH JOIN | | 14 | 84 | 4 (0)| 00:00:01 |

| 3 | INDEX FULL SCAN | REVERSE_INDEX | 4 | 12 | 1 (0)| 00:00:01 |

| 4 | TABLE ACCESS FULL| AMY_EMP | 14 | 42 | 3 (0)| 00:00:01 |

-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

SQL> select count(*) from amy_emp,amy_dept where amy_emp.deptno=amy_dept.deptno;

COUNT(*)

----------

7340032

Elapsed: 00:00:25.10

Execution Plan

----------------------------------------------------------

Plan hash value: 1472773524

-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 6 | 3 (0)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 6 | | |

| 2 | NESTED LOOPS | | 14 | 84 | 3 (0)| 00:00:01 |

| 3 | TABLE ACCESS FULL| AMY_EMP | 14 | 42 | 3 (0)| 00:00:01 |

|* 4 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01 |

-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

4 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

SQL>

从中，可以看出在这种情况下hash join运行的速度远远大于nested loops；

什么时间optimizer 使用 hash joins呢？

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following conditions are true:

1）A large amount of data must be joined.

2）A large fraction of a small table must be joined.

2、nested loop joins

这个东西叫嵌套循环链接，使用它的条件是

1）数据库获取的数据的集合不能太大

2）能够有效的访问内部表，也就是说内部表最好有索引。

看一下这句话：

It is important to ensure that the inner table is driven from (dependent on) the outer table. If the inner table's access path is independent of the outer table, then the same rows are retrieved for every iteration of the outer loop, degrading performance considerably. In such cases, hash joins joining the two independent row sources perform better.

它包含的步骤如下：

1）optimizer 决定驱动表或是外部表

2）optimizer决定外部表

3）内表被外表驱动，外表返回的每一行都要在内表中检索找到与它匹配的行

如果我们想强制使用nested loop joins可以使用use_nl(table1,table2)

eg：

SQL> set autotrace trace;

SQL> r

1* select empno,ename,job,mgr from amy_emp,amy_dept where amy_emp.deptno=amy_dept.deptno and amy_emp.ename in('KING','TURNER')

Elapsed: 00:00:00.05

Execution Plan

----------------------------------------------------------

Plan hash value: 2114037428

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |

| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |

|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01 |

|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01 |

-------------------------------------------

驱动表为amy_emp，使用use_nl转换驱动表；

注意：

use_nl(table1,table2)并不能让optimizer选择这两个表那个做为驱动表，如果想明确指定需要加ordered参数，对于ordered的参数指明from后边的表顺序为从左到右，也就是左边为驱动表，右边为被驱动表，另外如果use_nl参数仅仅指定一个表，那么这个表就为被驱动表,但指定的这个表为外部表那么则会忽略这个hint。

eg：

QL> select empno,ename,job,mgr from amy_dept,amy_emp where amy_emp.deptno=amy_dept.deptno and amy_emp.ename in('KING','TURNER');

Elapsed: 00:00:00.01

Execution Plan

----------------------------------------------------------

Plan hash value: 2114037428

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |

| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |

|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01 |

|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01 |

------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("AMY_EMP"."ENAME"='KING' OR "AMY_EMP"."ENAME"='TURNER')

3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

驱动表为：amy_emp,被驱动表为：amy_dept,use_nl无法判断。

SQL> select empno,ename,job,mgr from amy_dept,amy_emp where amy_emp.deptno=amy_dept.deptno and amy_emp.ename in('KING','TURNER');

Elapsed: 00:00:00.00

Execution Plan

----------------------------------------------------------

Plan hash value: 2114037428

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |

| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |

|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01 |

|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01 |

------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("AMY_EMP"."ENAME"='KING' OR "AMY_EMP"."ENAME"='TURNER')

3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

驱动表为：amy_emp,被驱动表为：amy_dept,use_nl忽略hint

SQL> select empno,ename,job,mgr from amy_dept,amy_emp where amy_emp.deptno=amy_dept.deptno and amy_emp.ename in('KING','TURNER');

Elapsed: 00:00:00.00

Execution Plan

----------------------------------------------------------

Plan hash value: 2114037428

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 2 | 56 | 3 (0)| 00:00:01 |

| 1 | NESTED LOOPS | | 2 | 56 | 3 (0)| 00:00:01 |

|* 2 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01 |

|* 3 | INDEX RANGE SCAN | REVERSE_INDEX | 1 | 3 | 0 (0)| 00:00:01 |

------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("AMY_EMP"."ENAME"='KING' OR "AMY_EMP"."ENAME"='TURNER')

3 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

SQL> select empno,ename,job,mgr from amy_dept,amy_emp where amy_emp.deptno=amy_dept.deptno and amy_emp.ename in('KING','TURNER');

Elapsed: 00:00:00.01

Execution Plan

----------------------------------------------------------

Plan hash value: 3588874585

------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 2 | 56 | 4 (0)| 00:00:01 |

|* 1 | HASH JOIN | | 2 | 56 | 4 (0)| 00:00:01 |

| 2 | INDEX FULL SCAN | REVERSE_INDEX | 4 | 12 | 1 (0)| 00:00:01 |

|* 3 | TABLE ACCESS FULL| AMY_EMP | 2 | 50 | 3 (0)| 00:00:01 |

------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - access("AMY_EMP"."DEPTNO"="AMY_DEPT"."DEPTNO")

3 - filter("AMY_EMP"."ENAME"='KING' OR "AMY_EMP"."ENAME"='TURNER')

SQL>

驱动表为：amy_dept，我们使用了ordered进行了指定。full() index()

3、sort meger join

Hash Join Hints

Apply the USE_HASH hint to instruct the optimizer to use a hash join when joining two tables together. See "PGA Memory Management" to learn how to size SQL work areas. See "Hints for Join Operations" to learn about the USE_HASH hint.

11.3.5 Sort Merge Joins

Sort merge joins can join rows from two independent sources. Hash joins generally perform better than sort merge joins. However, sort merge joins can perform better than hash joins if both of the following conditions exist:

?The row sources are sorted already.

?A sort operation does not have to be done.

However, if a sort merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.

Sort merge joins are useful when the join condition between two tables is an inequality condition such as <, <=, >, or >=. Sort merge joins perform better than nested loop joins for large data sets. You cannot use hash joins unless there is an equality condition.

In a merge join, there is no concept of a driving table. The join consists of two steps:

1.Sort join operation: Both the inputs are sorted on the join key.

2.Merge join operation: The sorted lists are merged together.

If the input is sorted by the join column, then a sort join operation is not performed for that row source. However, a sort merge join always creates a positionable sort buffer for the right side of the join so that it can seek back to the last match in the case where duplicate join key values come out of the left side of the join.

11.3.5.1 When the Optimizer Uses Sort Merge Joins

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:

?The join condition between two tables is not an equijoin.

?Because of sorts required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.

11.3.5.2 Sort Merge Join Hints

To instruct the optimizer to use a sort merge join, apply the USE_MERGE hint. You might also need to give hints to force an access path.

There are situations where it makes sense to override the optimizer with the USE_MERGE hint. For example, the optimizer can choose a full scan on a table and avoid a sort operation in a query. However, there is an increased cost because a large table is accessed through an index and single block reads, as opposed to faster access through a full table scan.

参考：http://docs.oracle.com/cd/E11882_01/server.112/e41573/optimops.htm#PFGRF94645

总结一下，在哪种情况下用哪种连接方法比较好：（一下摘自摘自：SunnyXu）

排序 - - 合并连接(Sort Merge Join, SMJ)：

a) 对于非等值连接，这种连接方式的效率是比较高的。

b) 如果在关联的列上都有索引，效果更好。

c) 对于将2个较大的row source做连接，该连接方法比NL连接要好一些。

d) 但是如果sort merge返回的row source过大，则又会导致使用过多的rowid在表中查询数据时，数据库性能下降，因为过多的I/O。

嵌套循环(Nested Loops, NL)：

a) 如果driving row source(外部表)比较小，并且在inner row source(内部表)上

有唯一索引，或有高选择性非唯一索引时，使用这种方法可以得到较好的效率。

b) NESTED LOOPS有其它连接方法没有的的一个优点是：可以先返回已经

连接的行，而不必等待所有的连接操作处理完才返回数据，

这可以实现快速的响应时间。

哈希连接(Hash Join, HJ)：

a) 这种方法是在oracle7后来引入的，使用了比较先进的连接理论，

一般来说，其效率应该好于其它2种连接，但是这种连接只能用在

CBO优化器中，而且需要设置合适的hash_area_size参数，

才能取得较好的性能。

b) 在2个较大的row source之间连接时会取得相对较好的效率，在一个

row source较小时则能取得更好的效率。

c) 只能用于等值连接中

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：rman的备份与还原里面的参数分析

后一篇：oracle的物理读和逻辑读

新浪BLOG意见反馈留言板　欢迎批评指正