{"ID":2844962,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.05295","arxiv_id":"2511.05295","title":"Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations","abstract":"The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \\emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \\emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \\emph{partial enumeration}, where the adversary reveals only an infinite subset $C \\subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $α$ in $K$, the algorithm's output achieves density at least $α/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \\emph{language identification} under partial enumeration. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \\subseteq M \\subseteq K$ -- and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.","short_abstract":"The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \\emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings...","url_abs":"https://arxiv.org/abs/2511.05295","url_pdf":"https://arxiv.org/pdf/2511.05295v1","authors":"[\"Jon Kleinberg\",\"Fan Wei\"]","published":"2025-11-07T14:56:04Z","proceeding":"cs.DS","tasks":"[\"cs.DS\",\"cs.CL\",\"cs.DM\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
