Because it's jarring to watch cutscenes where realistically proportioned 3D models are animated as if they're having a conversation, but no sound comes out of their mouths. If the devs wanted to avoid this dissonance, they should have stuck with the more abstracted 2D aesthetic.
Don't forget the old man singing on the village bridge!
There are PS2 games that have full voice acting crammed onto 4.7 GB DVDs. Switch cartridges can easily hold much more than that.