Found this paper by Stephen Fitz (Keio University, Tokyo) yesterday.
This paper presents a novel method, based on the ideas from algebraic topology, for the analysis of raw natural language text. The paper introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus. Results of experiments with a variety of natural and synthetic languages are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure.
The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics. In contrast to existing topological data analysis approaches, we do not rely on the apparatus of persistent homology. Instead, we develop a method of generating topological structure directly from strings of words.
These results show that the topology of the word manifold is influenced by linguistic structure expressed by the corpus. Furthermore, we can interpret dimensions of the word manifold by comparing natural and synthetic data.
Is this new?