Representative literature
Instance-sequence reasoning for video question answering
Abstract:
Video question answering (Video QA) involves a thorough understanding of video content and question language, as well as the grounding of the textual semantic to the visual content of videos. Thus, to answer the questions more accurately, not only the semantic entity should be associated with certain visual instance in video frames, but also the action or event in the question should be localized to a corresponding temporal slot. It turns out to be a more challenging task that requires the ability of conducting reasoning with correlations between instances along temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are firstly embedded into graph nodes, which benefits the integration of intra- and inter-modality. Then, we propose graph causal convolution (GCC) on graph-structured sequence with a large receptive field to capture more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on TVQA+ dataset, which contains the groundtruth of instance grounding and temporal localization, three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method. Specifically, our method outperforms the state-of-the-art methods on these benchmarks.
Keywords:
Chinese Library Classification (CLC) number:
Authors:
Rui LIU; Yahong HAN
Author affiliations:
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China; Tianjin Key Lab of Machine Learning, Tianjin University, Tianjin 300350, China
Source:
Citation format:
[1] Rui LIU; Yahong HAN. Instance-sequence reasoning for video question answering [J]. Frontiers of Computer Science (计算机科学前沿), 2022(06): 89-97
Class A:
TVQA+dataset, groundtruth
Class B:
Instance, sequence, reasoning, answering, Video, involves, thorough, understanding, content, language, well, grounding, textual, semantic, visual, videos, Thus, questions, more, accurately, not, only, entity, should, associated, certain, frames, but, also, action, event, localized, corresponding, temporal, slot, It, turns, challenging, task, that, requires, ability, conducting, correlations, between, instances, along, this, paper, network, localization, our, model, both, representations, are, firstly, embedded, into, graph, nodes, which, benefits, integration, intra, inter, modality, Then, causal, convolution, GCC, structured, large, receptive, field, capture, connections, vital, Finally, evaluate, contains, three, other, datasets, multimodal, processing, Extensive, experiments, demonstrate, effectiveness, generalization, proposed, Specifically, outperforms, state, art, methods, these, benchmarks
AB value:
0.504363