Fewer complex classes vs more simple classes in object detection

I am looking for guidance/literature on whether it's better to have more, simpler classes or fewer, more complex classes.

Let's say I want to recognize apples in a tree with high accuracy. I could build a dataset with a single class, apple, whose training data includes samples of red apples, green apples, yellow apples, and rotten apples, all assigned to that one class. Or I could have four classes, one for each type of apple.
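
To make the two labeling options concrete, here is a minimal Python sketch. The file names, box coordinates, label names, and the `collapse_labels` helper are all made up for illustration and are not tied to any particular detection framework. It only shows that the fine-grained labels can be mapped down to the single-class setup; it says nothing about which setup trains better.

```python
# Hypothetical mapping from the four fine-grained classes to one coarse class.
FINE_TO_COARSE = {
    "red_apple": "apple",
    "green_apple": "apple",
    "yellow_apple": "apple",
    "rotten_apple": "apple",
}


def collapse_labels(annotations, mapping=FINE_TO_COARSE):
    """Return a new list with each fine label replaced by its coarse class."""
    return [(image, box, mapping[label]) for image, box, label in annotations]


# Made-up annotations: (image file, (x_min, y_min, x_max, y_max), label).
annotations = [
    ("tree_01.jpg", (34, 50, 80, 96), "red_apple"),
    ("tree_01.jpg", (120, 40, 170, 92), "green_apple"),
    ("tree_02.jpg", (15, 22, 60, 70), "rotten_apple"),
]

# Option A: a single "apple" class, derived from the fine-grained labels.
single_class = collapse_labels(annotations)

# Option B: keep the fine-grained labels as separate classes.
multi_class = annotations
```

The mapping only goes one way: fine labels can be collapsed into `apple`, but a dataset annotated with just `apple` cannot be split back into the four types without relabeling.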

Fundamentally, I’m interested in understanding how best to choose between a single class with diverse samples and several more specific classes with less diversity within each. There is bound to be research/guidance on this topic, but I have not been able to locate it yet.