TY - JOUR
AU - Li, Toby Jia-Jun
AB - Audio-visual learning seeks to enhance the computer's multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite their many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut's human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce the annotators' effort to process each frame and the number of manually annotated frames needed. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.
TI - PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
JF - Computing Research Repository
DO - 10.1145/3586183.3606776
DA - 2023-07-27
UR - https://www.deepdyve.com/lp/arxiv-cornell-university/peanut-a-human-ai-collaborative-tool-for-annotating-audio-visual-data-znzFE5TC48
VL - 2023
IS - 2307
DP - DeepDyve
ER -