
cudaDeviceSynchronize(), synchronous vs. asynchronous execution, and timers in CUDA

(2014-03-28 14:31:24)
Today I happened to test the simpleTexture sample that ships with CUDA. In my previous post I dug into texture memory in particular and modified the sample; see the linked post for the full code.

Here are the important parts of the main function. When the program looks like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);

// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
checkCudaErrors(cudaDeviceSynchronize()); // synchronize first
sdkStopTimer(&timer); // then stop the timer
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
(width *height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of the first version]
And when the program looks like this:
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);

// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
sdkStopTimer(&timer); // stop the timer right after the kernel launch
checkCudaErrors(cudaDeviceSynchronize()); // and only then synchronize
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
printf("%.2f Mpixels/sec\n",
(width *height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
[Screenshot: timing output of the second version]



The results:
1. Both versions produce data files that exactly match the reference file.
2. The second version reports a processing time 1/1000 of the first. My friends and I were stunned. That can't be right!

In the earlier post I was even complaining about texture memory being slow. So what is going on here?

1. A problem with cudaDeviceSynchronize()?
2. A problem with the measurement method?
3. A problem with texture memory?


After mulling it over, from the CUDA Toolkit Reference Manual:
5.1.2.12 __cudart_builtin__ cudaError_t cudaDeviceSynchronize (void)

Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work

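In other words, because the launch call returns immediately, it cannot itself report a failure that happens while the kernel runs; the error only surfaces from the next synchronizing call. A minimal sketch of the checking pattern (hypothetical kernel `step`; assumes nvcc and a CUDA-capable GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void step(float *d) { d[0] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, sizeof(float));
    step<<<1, 1>>>(d); // returns to the host immediately

    // cudaGetLastError catches launch-configuration errors;
    // cudaDeviceSynchronize surfaces errors from the kernel's execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}
```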
From the CUDA C Programming Guide:

3.2.5 Asynchronous Concurrent Execution

3.2.5.1 Concurrent Execution between Host and Device
In order to facilitate concurrent execution between host and device, some function calls 
are asynchronous: Control is returned to the host thread before the device has completed 
the requested task. These are:
‣ Kernel launches;
‣ Memory copies between two addresses to the same device memory;
‣ Memory copies from host to device of a memory block of 64 KB or less;
‣ Memory copies performed by functions that are suffixed with Async;
‣ Memory set function calls.

Programmers can globally disable asynchronous kernel launches for all CUDA
applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment 
variable to 1. This feature is provided for debugging purposes only and should never be 
used as a way to make production software run reliably.


Kernel launches are synchronous in the following cases:
‣ The application is run via a debugger or memory checker (cuda-gdb, cudamemcheck,
Nsight) on a device of compute capability 1.x;
‣ Hardware counters are collected via a profiler (Nsight, Visual Profiler).


3.2.5.5.2 Default Stream
Kernel launches and host <-> device memory copies that do not specify any stream 
parameter, or equivalently that set the stream parameter to zero, are issued to the default 
stream. They are therefore executed in order.
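This in-order behaviour of the default stream can be sketched as follows (hypothetical kernels `k1`/`k2`; assumes a CUDA-capable GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k1(int *d) { d[0] = 1; }
__global__ void k2(int *d) { d[1] = d[0] + 1; } // sees k1's write: same (default) stream

int main() {
    int *d, h[2];
    cudaMalloc(&d, 2 * sizeof(int));
    // Both launches go to stream 0, so k2 starts only after k1 finishes,
    // even though both launch calls return to the host immediately.
    k1<<<1, 1>>>(d);
    k2<<<1, 1>>>(d);
    // cudaMemcpy on the default stream waits for k2, then blocks the host.
    cudaMemcpy(h, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h[0], h[1]); // prints "1 2"
    cudaFree(d);
    return 0;
}
```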
From B.18 Execution Configuration:
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
where:
...
S is of type cudaStream_t and specifies the associated stream; S is an optional 
argument which defaults to 0.
When setting the kernel execution configuration, we usually ignore the fourth parameter S, but it defaults to 0, i.e. the default stream.
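For illustration, here is a sketch of launching on a non-default stream by passing S explicitly (hypothetical kernel `work`; assumes a CUDA-capable GPU):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Dg, Db, Ns (dynamic shared memory, here 0), S (the stream).
    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);

    cudaStreamSynchronize(s); // wait only for the work queued on stream s
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```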

3.2.5.5.3 Explicit Synchronization

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.

cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.

cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.

3.2.5.5.4 Implicit Synchronization
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
...
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
‣ All independent operations should be issued before dependent operations,
‣ Synchronization of any kind should be delayed as long as possible.
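These two guidelines can be sketched with two independent pipelines on two streams (hypothetical kernels `stage1`/`stage2`; assumes a CUDA-capable GPU):

```cuda
#include <cuda_runtime.h>

__global__ void stage1(float *d) { d[0] += 1.0f; }
__global__ void stage2(float *d) { d[0] *= 2.0f; }

int main() {
    float *a, *b;
    cudaMalloc(&a, sizeof(float));
    cudaMalloc(&b, sizeof(float));
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Issue the independent first stages of both pipelines up front...
    stage1<<<1, 1, 0, s1>>>(a);
    stage1<<<1, 1, 0, s2>>>(b);
    // ...then the dependent second stages, and synchronize as late as possible.
    stage2<<<1, 1, 0, s1>>>(a);
    stage2<<<1, 1, 0, s2>>>(b);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b);
    return 0;
}
```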

After this baptism by documentation, let's search for how others see this problem:

The asker had exactly the same problem as me, and he only saw a 7-12x difference in speed; I got 1000x.

The top-voted answer, given by aland, explains this huge timing gap nicely:
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
So, for example,

kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement

kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement

cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes

So in your example there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernel has caused an error (if there is any).

cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. Might be there is some problem with time measurement, or may be the kernels are really fast, and the overhead of explicit synchronization is huge relative to actual computation time.
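A more robust way to time just the kernel, instead of a CPU timer plus cudaDeviceSynchronize(), is CUDA events, which are timestamped on the GPU itself as the stream reaches them. A minimal sketch (hypothetical kernel `transform`; assumes a CUDA-capable GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void transform(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // enqueued in the same (default) stream
    transform<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);               // reached only after the kernel finishes

    cudaEventSynchronize(stop);          // block the host until 'stop' is recorded
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```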


That basically explains the timing gap. We can also go on to verify this answer, and the documentation's points about (explicit and implicit) synchronization and asynchrony.

Modify the code once more, as follows:

sdkCreateTimer(&timer);
sdkStartTimer(&timer);

// Execute the kernel
transformKernel_integer<<<dimGrid, dimBlock>>>(dData, width, height);

// Check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
// take one measurement before synchronizing
sdkStopTimer(&timer);
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));

sdkResetTimer(&timer); // reset the timer
sdkStartTimer(&timer); // start it again
checkCudaErrors(cudaDeviceSynchronize());
// take another measurement, after synchronizing
sdkStopTimer(&timer);
printf("After sync Processing time: %f (ms)\n", sdkGetTimerValue(&timer));


printf("%.2f Mpixels/sec\n",
(width *height / (sdkGetTimerValue(&timer) / 1000.0f)) / 1e6);
sdkDeleteTimer(&timer);
This gives the new verification result.

[Screenshot: timing output before and after synchronization]


I can only say this result brought me right back down to earth... o(╯□╰)o ... and here I was thinking how fast it was.


bwb@HUST 2014.03.28
