{"ID":2850717,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21553","arxiv_id":"2510.21553","title":"Document Understanding, Measurement, and Manipulation Using Category Theory","abstract":"We apply category theory to extract multimodal document structure which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as to develop a solution to a new problem viz. exegesis resulting in an extension of the original document. Our question-answer pair methodology enables a novel rate distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints such as composability and closure under certain operations that stem naturally from our category theoretic framework.","short_abstract":"We apply category theory to extract multimodal document structure which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs...","url_abs":"https://arxiv.org/abs/2510.21553","url_pdf":"https://arxiv.org/pdf/2510.21553v1","authors":"[\"Jared Claypoole\",\"Yunye Gong\",\"Noson S. Yanofsky\",\"Ajay Divakaran\"]","published":"2025-10-24T15:12:08Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[]","has_code":false}
