cudaDeviceSynchronize(), Synchronous vs. Asynchronous Execution, and Timers in CUDA
Today I happened to test the simpleTexture example that ships with CUDA. In my previous blog post I took a close look at texture memory and modified this example; see that post for the full code.
Here I'll still paste the important part of the main function. When the program looks like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel (dimGrid/dimBlock as defined earlier in the simpleTexture sample)
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
checkCudaErrors(cudaDeviceSynchronize()); // synchronize first
sdkStopTimer(&timer);                     // then stop the timer
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of this version]
And when the program looks like this instead:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
sdkStopTimer(&timer);                     // stop the timer right after the kernel launch statement
checkCudaErrors(cudaDeviceSynchronize()); // and only then synchronize
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of this version]
The results:
1. Both versions produce data files that exactly match the reference file.
2. The second version's processing time is 1/1000 of the first's. My friends and I were stunned... that can't be right!!!
And in my previous post I was even complaining about how slow texture was. So what on earth is going on?!
1. Is it a problem with cudaDeviceSynchronize()???
2. Is my measurement method wrong????
3. Is it a texture problem????
After mulling it over, first the documentation:
From the CUDA Toolkit Reference Manual:
5.1.2.12 __cudart_builtin__ cudaError_t cudaDeviceSynchronize (void)
Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
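In other words, cudaDeviceSynchronize() also propagates errors from earlier asynchronous work back to the host. A minimal sketch of using its return value that way (the checkCudaErrors macro in the samples does essentially this):

// Minimal sketch: catching a failure from previously launched asynchronous work.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
{
    fprintf(stderr, "A preceding task failed: %s\n", cudaGetErrorString(err));
}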
From the CUDA C Programming Guide:
3.2.5 Asynchronous Concurrent Execution
3.2.5.1 Concurrent Execution between Host and Device
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
‣ Kernel launches;
‣ Memory copies between two addresses to the same device memory;
‣ Memory copies from host to device of a memory block of 64 KB or less;
‣ Memory copies performed by functions that are suffixed with Async;
‣ Memory set function calls.
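That first bullet is exactly what bit me: the launch statement returns immediately, so host code right after it runs while the kernel is still in flight. A minimal sketch of the pattern, where myKernel and do_unrelated_cpu_work are hypothetical placeholders:

// Minimal sketch, assuming a hypothetical kernel myKernel and CPU helper:
myKernel<<<dimGrid, dimBlock>>>(dData); // returns to the host immediately
do_unrelated_cpu_work();                // runs while the GPU is still busy
cudaDeviceSynchronize();                // only here does the host wait for the kernel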
Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably.
Kernel launches are synchronous in the following cases:
‣ The application is run via a debugger or memory checker (cuda-gdb, cuda-memcheck, Nsight) on a device of compute capability 1.x;
‣ Hardware counters are collected via a profiler (Nsight, Visual Profiler).
3.2.5.5.2 Default Stream
Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.
From B.18 Execution Configuration:
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
...
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
So when setting a kernel's execution configuration, we usually don't bother with the fourth parameter S, but it defaults to 0, i.e. the default stream.
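This means my <<<dimGrid, dimBlock>>> launch is shorthand for launching on stream 0. A minimal sketch of spelling the configuration out, and of using a non-default stream instead (myKernel is again a hypothetical placeholder):

// <<<Dg, Db>>> is equivalent to <<<Dg, Db, 0, 0>>>: no dynamic shared memory, default stream.
myKernel<<<dimGrid, dimBlock, 0, 0>>>(dData);

// Launching on an explicitly created stream instead:
cudaStream_t stream;
cudaStreamCreate(&stream);
myKernel<<<dimGrid, dimBlock, 0, stream>>>(dData);
cudaStreamSynchronize(stream);  // wait only for this stream
cudaStreamDestroy(stream);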
3.2.5.5.3 Explicit Synchronization
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.
To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
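cudaStreamWaitEvent() is the one that lets two streams depend on each other without blocking the host. A minimal sketch, where producer/consumer are hypothetical kernels and s1/s2 are pre-created streams:

// Minimal sketch: make stream s2 wait for work queued in stream s1,
// without blocking the host thread.
cudaEvent_t done;
cudaEventCreate(&done);

producer<<<dimGrid, dimBlock, 0, s1>>>(dData);
cudaEventRecord(done, s1);          // mark the point in s1 to wait on

cudaStreamWaitEvent(s2, done, 0);   // s2 stalls here until 'done' completes
consumer<<<dimGrid, dimBlock, 0, s2>>>(dData);

cudaEventDestroy(done);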
3.2.5.5.4 Implicit Synchronization
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
...
...
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution (see the sketch after this list):
‣ All independent operations should be issued before dependent operations,
‣ Synchronization of any kind should be delayed as long as possible.
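A minimal sketch of what those two guidelines look like in practice, assuming hypothetical kernels step1/step2 and an array of pre-created streams[]:

// Good: queue all independent work across streams first...
for (int i = 0; i < nStreams; ++i)
    step1<<<dimGrid, dimBlock, 0, streams[i]>>>(dData[i]);
for (int i = 0; i < nStreams; ++i)
    step2<<<dimGrid, dimBlock, 0, streams[i]>>>(dData[i]);

// ...and synchronize once, as late as possible.
cudaDeviceSynchronize();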
After that baptism by documentation, let's search around for how others see this problem:
The guy asking the question had exactly my problem, though he only saw a 7~12x speed difference... mine was a factor of 1000...
The top-voted answer, given by aland, explains this huge time difference very well:
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
This basically explains the time difference. We can go on to verify this answer's, and the documentation's, views on synchronization (explicit and implicit) and asynchrony ourselves.
Modify the code once more, like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);
// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
// Measure once before synchronizing
sdkStopTimer(&timer);
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
sdkResetTimer(&timer); // reset the timer
sdkStartTimer(&timer); // and start it again
checkCudaErrors(cudaDeviceSynchronize());
// Measure again after the synchronization
sdkStopTimer(&timer);
printf("After sync Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
       (width * height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
sdkDeleteTimer(&timer);
This produces the new verification result:
[Screenshot: before-sync and after-sync timing output]
All I can say is, a result like this brought me right back down to earth... o(╯□╰)o... and here I thought it was so fast...
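A side note for next time: instead of a host-side timer plus cudaDeviceSynchronize(), CUDA events can time the GPU work directly. A minimal sketch, reusing the kernel and launch configuration from above:

// Timing the kernel with CUDA events instead of a host timer.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the stop event has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in ms
printf("Kernel time: %f (ms)\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);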
bwb@HUST 2014.03.28
