I have a fairly complex OpenCL implementation with a 2D NDRange as follows: number of work-groups {10, 7}, work-group size {64, 1}. With this I get a performance of 0.625 seconds. But when I decrease the number of work-groups to {10, 4}, the performance