Classical Chinese Sentence Segmentation as Sequence Labeling

dc.contributor.advisor	Sanchez-Aguilar, Antonio
dc.contributor.author	Hu, Yizhou
dc.date	2014-12-01
dc.date.accessioned	2016-02-19T15:38:18Z
dc.date.available	2016-02-19T15:38:18Z
dc.date.issued	2014
dc.identifier.uri	https://repository.tcu.edu/handle/116099117/10350
dc.description.abstract	Classical Chinese was the medium of writing in East Asia and has since become extinct, leaving a large number of texts inaccessible to the general public. Expert-produced sentence segmentations are crucial to understanding classical Chinese texts. This study proposes automating such segmentation by treating it as a sequence labeling problem and applying statistical models widely used in NLP. Results produced by automated models such as HMM, CRF, and bidirectional LSTM, along with comparable human attempts at the same task, are all validated against expert segmentation. The CRF models outperform the human work on accuracy metrics and are therefore promising for real-life use. Fast and accurate automated segmentation improves the accessibility of historical texts both in their home culture and in the rest of the world. Note: The source code, complete results, and sample segmented texts of this study can be found at github.com/xlhdh/classycn.
dc.subject	classical Chinese
dc.subject	natural language processing
dc.title	Classical Chinese Sentence Segmentation as Sequence Labeling
etd.degree.department	Computer Science
local.college	College of Science and Engineering
local.college	John V. Roach Honors College
local.department	Computer Science
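
The abstract frames sentence segmentation as sequence labeling. The following is a minimal sketch of what that framing could look like, not the thesis's actual code: each character receives one label, and sentence breaks are recovered from the labels. The two-label scheme ("E" for a character that ends a sentence, "I" for any other) and the function names to_labels, from_labels, and label_accuracy are assumptions made for illustration. The label set and accuracy metrics actually used in the study are not stated in this record.

```python
def to_labels(segments):
    """Flatten expert-segmented sentences into parallel character and label lists.

    Assumed scheme: "E" marks the last character of a sentence, "I" every other one.
    """
    chars, labels = [], []
    for seg in segments:
        for i, ch in enumerate(seg):
            chars.append(ch)
            labels.append("E" if i == len(seg) - 1 else "I")
    return chars, labels


def from_labels(chars, labels):
    """Rebuild sentences from characters and per-character labels."""
    segments, current = [], []
    for ch, lab in zip(chars, labels):
        current.append(ch)
        if lab == "E":
            segments.append("".join(current))
            current = []
    if current:
        # Trailing characters with no closing "E" still form a segment.
        segments.append("".join(current))
    return segments


def label_accuracy(predicted, gold):
    """Fraction of positions where the predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Two clauses from the opening of the Analects, used as a toy gold segmentation.
gold_segments = ["學而時習之", "不亦說乎"]
chars, gold_labels = to_labels(gold_segments)

# A trivial baseline that never predicts a break, standing in for a real model.
baseline_labels = ["I"] * len(chars)
baseline_score = label_accuracy(baseline_labels, gold_labels)
```

Under this framing, a model such as an HMM, CRF, or bidirectional LSTM would take the place of the baseline and predict one label per character. Its output could then be passed to from_labels to produce segmented text, or compared against the expert labels position by position.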

TCU Digital Repository