I didn't say that.
I said, and you can check this in the text you quoted, that all of the original work scheduling is done serially and then executed in parallel. That very statement is in there.
Command given to driver -> driver requests data for CUDA -> driver sends data to CUDA -> driver tells GPU what to do -> GPU does the rest. (nVidia)
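If it helps, here is a rough Python sketch of what "scheduled serially, executed in parallel" means. It is a toy model only, not how any real driver or GPU works: `run_pipeline`, the queue standing in for the driver, the thread pool standing in for the GPU, and the squaring "work" are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def run_pipeline(commands, workers=4):
    """Toy model: submit commands one by one, then execute them concurrently."""
    # Stage 1 (serial): the hypothetical "driver" queues commands in order,
    # one at a time.
    submitted = Queue()
    for cmd in commands:
        submitted.put(cmd)

    # Drain the queue into a batch, preserving submission order.
    batch = []
    while not submitted.empty():
        batch.append(submitted.get())

    # Stage 2 (parallel): the hypothetical "GPU" works on the batch with
    # several workers at once. pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda x: x * x, batch))
    return batch, results


if __name__ == "__main__":
    order, results = run_pipeline([1, 2, 3, 4])
    print(order, results)
```

The point is only that the hand-off happens in one ordered stream while the work itself runs side by side.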
The only difference is that AMD doesn't need the driver as a middleman if low-level access is exposed, and Vulkan exposes more of it than DX12.
True, but it doesn't have to be. You can optimize software to ignore such commands if the hardware is capable of it.
This is the normal optimization method for GPU manufacturers: if you can expose it and use it, why limit it?
That said, it's quite a bit more difficult to get working than the official way, and that's where nVidia's strength in graphics cards lies.
Again, nVidia right now cannot do it any other way because their architecture literally CANNOT do it any other way. That doesn't mean it sucks; it's one limitation in a list of pros and cons.
I never suggested nVidia cannot execute in parallel, only that it cannot process in parallel because of its architecture design. Read the last paragraph of what you quoted; it's in there specifically.
You're on the same page as me, but you're either skimming the posts or misreading them (after you figured out it is indeed serial processing).