{"ID":539908,"CreatedAt":"2026-03-04T20:59:09Z","UpdatedAt":"2026-03-04T20:59:09Z","DeletedAt":null,"paper_url":"https://paperswithcode.com/paper/anymal-an-efficient-and-scalable-any-modality","arxiv_id":"2309.16058","title":"AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model","abstract":"We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.","short_abstract":"We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses.","url_abs":"https://arxiv.org/abs/2309.16058v1","url_pdf":"https://arxiv.org/pdf/2309.16058v1.pdf","authors":"[\"Seungwhan Moon\", \"Andrea Madotto\", \"Zhaojiang Lin\", \"Tushar Nagarajan\", \"Matt Smith\", \"Shashank Jain\", \"Chun-Fu Yeh\", \"Prakash Murugesan\", \"Peyman Heidari\", \"Yue Liu\", \"Kavya Srinet\", \"Babak Damavandi\", \"Anuj Kumar\"]","published":"2023-09-27T00:00:00Z","tasks":"[\"Language Modeling\", \"Language Modelling\", \"Video Question Answering\"]","methods":"[]","has_code":false,"code_links":[{"ID":309138,"CreatedAt":"2026-03-04T21:00:12Z","UpdatedAt":"2026-03-04T21:00:12Z","DeletedAt":null,"paper_id":539908,"paper_url":"https://paperswithcode.com/paper/anymal-an-efficient-and-scalable-any-modality","paper_title":"AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model","repo_url":"https://github.com/nokia-bell-labs/papagei-foundation-model","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"framework":"pytorch","github_stars":0}]}
