记一次oracle等待事件enq:ko-fast object checkpoint
(2016-10-18 14:19:40)
标签:
'enqko-fastobjectch |
You see this event because object level checkpoint which
happens when you run direct path reads on a segment (like with
serial direct read full table scan or parallel execution full
segment scans).
Now the question is how much of your sessions response time is spent waiting for that event - and whether the winning in query speed due direct read/parallel execution outweighs the time spent waiting for this checkpoint.
Waiting for KO enqueue - fast object checkpoint means that your session has sent CKPT process a message with instruction to do object level checkpoint and is waiting for a reply.
CKPT in turn asks DBWR process to perform. the checkpoint and may wait for response.
So you should check what the DBWR and CKPT processes are doing - whether they're stuck waiting for something (such controlfile enqueue) or doing IO or completely idle.
This you can do using v$session_wait or v$session_event...
You can run this:
select sid,program from v$session where program like '%CKPT%' or program like 'DBW%';
and then run
select sid,event,state,p1,p2,p3,seconds_in_wait from v$session_wait where sid in (select sid from v$session where program like '%CKPT%' or program like 'DBW%');
Please post the output here - if the CKPT or DBWR processes are hung waiting for something, these queries would show it.
If they're not completely hung it may be that they're just very slow (due bad disk write performance for example) or a rare chance that the CKPT has somehow missed the message your session sent (or vice versa). But that would be rare...
Now the question is how much of your sessions response time is spent waiting for that event - and whether the winning in query speed due direct read/parallel execution outweighs the time spent waiting for this checkpoint.
Waiting for KO enqueue - fast object checkpoint means that your session has sent CKPT process a message with instruction to do object level checkpoint and is waiting for a reply.
CKPT in turn asks DBWR process to perform. the checkpoint and may wait for response.
So you should check what the DBWR and CKPT processes are doing - whether they're stuck waiting for something (such controlfile enqueue) or doing IO or completely idle.
This you can do using v$session_wait or v$session_event...
You can run this:
select sid,program from v$session where program like '%CKPT%' or program like 'DBW%';
and then run
select sid,event,state,p1,p2,p3,seconds_in_wait from v$session_wait where sid in (select sid from v$session where program like '%CKPT%' or program like 'DBW%');
Please post the output here - if the CKPT or DBWR processes are hung waiting for something, these queries would show it.
If they're not completely hung it may be that they're just very slow (due bad disk write performance for example) or a rare chance that the CKPT has somehow missed the message your session sent (or vice versa). But that would be rare...
半夜接到数据库告警CPU使用率达到50%,正常空闲时间为1-8%,交易查询缓慢,查询数据等待事件存在大量的enq:ko-fast
object checkpoint等待事件,伴随着有DPR(Direct Path
Read),查询到所有等待事件对应的都是交易进库的一条查询语句,五分钟后CPU负载下降
解释下DPR:
ORACLE使用Direct Path Read(以下简称DPR)的原因主要是防止物理读对buffer cache的泛
洪,所以一般情况下很少使用这种方式来读取数据,常见的原因有表大于buffer cache的5%,
并行,对temporary段的读等.
ORACLE在做DPR之前会对将要执行的对像做一次对像级别的checkpoint,因为direct path read
读取的数据是直接从物理磁盘中读取的,所以需要保证物理磁盘中数据是最新的.比如磁盘中数
据块的版本是T1,而数据块在buffer cache中经过修改后的版本是T2,在T3的时候会话A发布一个DPR,
这个时候把所有在buffer cache中该对像小于T3的块都要写到磁盘上,使会话A读取到的数据是一
个最新的版本.
ORACLE使用Direct Path Read(以下简称DPR)的原因主要是防止物理读对buffer cache的泛
洪,所以一般情况下很少使用这种方式来读取数据,常见的原因有表大于buffer cache的5%,
并行,对temporary段的读等.
读取的数据是直接从物理磁盘中读取的,所以需要保证物理磁盘中数据是最新的.比如磁盘中数
据块的版本是T1,而数据块在buffer cache中经过修改后的版本是T2,在T3的时候会话A发布一个DPR,
这个时候把所有在buffer cache中该对像小于T3的块都要写到磁盘上,使会话A读取到的数据是一
个最新的版本.
===========问题定位开始
获取该时间段的会话信息
create table test09 as
select *
from
dba_hist_active_sess_history
where sample_time between
to_timestamp('2016-10-12 18:00:00', 'yyyy-mm-dd hh24:mi:ss')
and
to_timestamp('2016-10-12 18:20:00', 'yyyy-mm-dd hh24:mi:ss');
select *
会话信息查询汇总
select dbid,
instance_number,
sample_id,
sample_time,
count(*) session_count
from test09 t
group by dbid, instance_number, sample_id,
sample_time
order by dbid, instance_number,
sample_time;
查询堵塞会话id
select dbid,
instance_number,
sample_id,
sample_time,
blocking_inst_id,
blocking_session,
count(*) session_count
from test09 t
where event = 'enq: KO - fast object
checkpoint'
group by dbid,
instance_number,
sample_id,
sample_time,
blocking_inst_id,
blocking_session
order by dbid, instance_number,
sample_time;
堵塞会话信息
select t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.sql_id,
t.session_type,
t.program,
t.event,
t.seq#,
t.session_state,
t.blocking_inst_id,
t.blocking_session,
t.blocking_session_status
from test09 t
where instance_number = 1
and session_id = 5265
order by dbid, instance_number,
sample_time;
select t.dbid,
写进程对应信息
select t.dbid,
t.sample_id,
t.sample_time,
t.instance_number,
t.session_id,
t.session_type,
t.program,
t.event,
t.seq#,
t.time_waited,
t.session_state
from test09 t
where program like 'DBW%'
order by session_id, dbid, instance_number,
sample_time;
select t.dbid,
=====查询到这里定位到问题出在写入过慢,导致入库语句进行checkpoint时出现等待
--判断写入速度
select t.*, ((t.waits * 8192 / 1024) / t.time) * 1000000
from (select t.dbid,
t.instance_number,
t.sample_time,
sum(t.p1) waits,
sum(t.time_waited) time
from test09 t
where event = 'db file async I/O submit'
and time_waited > 0
group by dbid, instance_number, sample_time) t
order by dbid, instance_number, sample_time;
查出结果只有300-400KB/s(这个速度是相当慢的)
判断写入量
select t.*, (t.waits * 8192 / 1024) block
from (select t.dbid,
t.instance_number,
t.sample_time,
sum(t.p1) waits,
sum(t.time_waited) time
from test09 t
where event = 'db file async I/O submit'
group by dbid, instance_number, sample_time) t
order by dbid, instance_number, sample_time;
select t.*, (t.waits * 8192 / 1024) block
from (select t.dbid,
t.instance_number,
t.sample_time,
sum(t.p1) waits,
sum(t.time_waited) time
from test09 t
where event = 'db file async I/O submit'
group by dbid, instance_number, sample_time) t
order by dbid, instance_number, sample_time;
拿到这部分数据和平时时间段对比发现修改量有突变,然后跟据这个思路查找出修改该等待事件对应的对象sql语句,确认到是做了批量修改导致该问题。