Visual-Linguistic Semantic Alignment: Fusing Human Gaze And Spoken Narratives For Image Region Annotation