TY - JOUR
AU - Ye, Junjie
AU - Guo, Junjun
AB - Multi-modal neural machine translation (MNMT) mainly focuses on the use of image information to guide text translation. Recent MNMT approaches have shown that incorporating visual features into the textual translation framework helps to improve machine translation. However, visual features always contain text-unrelated information, and this noisy visual feature fusion problem is rarely considered in traditional MNMT methods. How to extract useful visual features to enhance textual machine translation is therefore a key question for MNMT. In this paper, we propose a novel Dual-level Interactive Multimodal-Mixup Encoder (DLMulMix) based on multimodal-mixup for MNMT, which extracts useful visual features to enhance textual-level machine translation. We first employ Textual-visual Gating to extract text-related regional visual features, as we believe regional features are crucial for MNMT. Visual grid features are then employed to establish the image context of the effective regional features. Moreover, an effective visual-textual multimodal-mixup is adopted to align textual and visual features in a common multi-modal space to improve textual-level machine translation. We evaluate the proposed method on the Multi30K dataset. The experimental results show that the proposed approach outperforms previous efforts on both EN-DE and EN-FR tasks in terms of BLEU and METEOR scores.
TI - Dual-level interactive multimodal-mixup encoder for multi-modal neural machine translation
JF - Applied Intelligence
DO - 10.1007/s10489-022-03331-8
DA - 2022-09-01
UR - https://www.deepdyve.com/lp/springer-journals/dual-level-interactive-multimodal-mixup-encoder-for-multi-modal-neural-g0Jp5dh0R3
SP - 14194
EP - 14203
VL - 52
IS - 12
DP - DeepDyve
ER -