Can anyone here give me a concrete answer as to why rectangular bounding boxes were chosen for the object detection task, given that they do not actually enclose and localize objects precisely?
I think it might be because rectangles are easier to work with in practice.
If you need to bound objects with other shapes in a picture, it takes much more complex work.
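To make the trade-off concrete, here is a minimal sketch (plain Python, toy data) that derives the tightest rectangle from a per-pixel object mask and measures how much of that rectangle is really object. The helper names `mask_to_box` and `fill_ratio` are hypothetical, and the (x, y, w, h) convention with (x, y) as the top-left corner is an assumption. The rectangle needs only four numbers. The exact shape needs a value for every pixel, and for thin or diagonal objects the box ends up covering mostly background.

```python
def mask_to_box(mask):
    """Tightest axis-aligned box (x, y, w, h) around the 1-pixels of a binary mask.
    x is the column index, y is the row index (assumed convention)."""
    ys = [r for r, row in enumerate(mask) for v in row if v]
    xs = [c for row in mask for c, v in enumerate(row) if v]
    if not xs:
        return None  # empty mask: no object, no box
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1)


def fill_ratio(mask, box):
    """Fraction of the box's pixels that actually belong to the object."""
    x, y, w, h = box
    inside = sum(mask[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return inside / (w * h)


# Toy diagonal "object": any rectangle around it must also cover background.
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
box = mask_to_box(mask)
print(box)                    # (0, 0, 4, 4)
print(fill_ratio(mask, box))  # 0.25
```

Only a quarter of this box is object, which is exactly the imprecision the question points at; the box is still enough to say roughly where the object is.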
Annotations are expensive. Labeling the exact contours of objects gives better results in recognition tasks, but it requires several orders of magnitude more labor.
Object detection involves two tasks: deciding whether the image contains the object (classification) and, if it does, where the object is (localization). We use bounding boxes because we just want to locate the object's position in the image. A bounding box is simple and easy to compute with, since it has only 4 parameters: x, y, width and height. What you are describing is, I suppose, an image segmentation task, whose output covers the full object. I'm posting a link below which might help clear up your understanding.
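As an illustration of "easier for calculations", here is a minimal sketch of intersection over union (IoU), the usual overlap score for comparing a predicted localization with a ground-truth one. It assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner; some formats use the box center instead. The function names `box_iou` and `mask_iou` are hypothetical. For boxes, IoU is a handful of arithmetic operations on four numbers, while the segmentation equivalent has to visit every pixel of both masks.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h), (x, y) = top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped at 0 when the boxes do not intersect.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mask_iou(m1, m2):
    """IoU of two same-sized binary masks: must look at every pixel."""
    pairs = [(a, b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 10, 10), (5, 5, 10, 10)))           # 25 / 175, about 0.1429
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))      # 1 / 3, about 0.3333
```

The box version costs the same no matter how large the object is, which is part of why box-based losses and matching are cheap to run over many candidate detections.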
Also, bounding box annotations are cheaper, easier and less time-consuming to produce than semantic segmentation masks.