Since I haven't seen any of the comments mention this yet...
I think the big reason is storage/bandwidth.
Digital audio is an interesting form of media because the size of an audio file is determined almost entirely by 1) how long it is and 2) the bitrate/quality, and has a lot less to do with what the actual content is. Therefore, at the same bitrate, an audio track of a video that contains dialogue and music is pretty much the same size as one that only contains music. So if you were to, for example, separate the dialogue and music of a video into two tracks to allow a user to adjust the volume of either independently of the other (an amazing user experience IMO), the storage size (and bandwidth usage) of the audio virtually doubles despite no "additional" content being added.
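To put rough numbers on that, here's a quick back-of-the-envelope sketch. It assumes constant-bitrate audio, and the 2-hour length and 128 kbps figure are made-up example values, not anything a particular service actually uses. The function name is just something I picked for illustration.

```python
def audio_size_bytes(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate size of a constant-bitrate audio track in bytes.

    Size depends only on duration and bitrate, not on what the audio contains.
    """
    return duration_s * bitrate_kbps * 1000 / 8  # kilobits/s -> bits -> bytes


duration_s = 2 * 60 * 60  # hypothetical 2-hour video
bitrate_kbps = 128        # hypothetical bitrate, same for every track

# One track with dialogue and music mixed together.
mixed = audio_size_bytes(duration_s, bitrate_kbps)

# Dialogue and music as two separate tracks at the same bitrate.
split = 2 * audio_size_bytes(duration_s, bitrate_kbps)

print(f"one mixed track:         {mixed / 1e6:.1f} MB")  # 115.2 MB
print(f"dialogue + music tracks: {split / 1e6:.1f} MB")  # 230.4 MB
```

Variable-bitrate codecs would shave a bit off quieter or simpler tracks, so in practice it's probably a little under double, but the general shape holds.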
Multiple audio tracks are actually something I've wanted forever, especially for watching streams on Twitch. But I think it's a pretty hefty burden to place on the service, especially if a lot of people aren't even going to use or notice it.
I can usually do these no problem but this one was a real struggle (on Mobile). If you're having issues: