Classical Chinese Sentence Segmentation as Sequence Labeling

Hu, Yizhou

Date

2014

Additional date(s)

2014-12-01

Abstract

Classical Chinese was the medium of writing in East Asia and has since become extinct, leaving a large number of texts inaccessible to the general public. Expert-produced sentence segmentations are crucial to understanding classical Chinese texts. This study proposes automating such segmentation by treating it as a sequence labeling problem and applying statistical models widely used in NLP. Results produced by automated models such as HMM, CRF, and bidirectional LSTM, along with a comparable human attempt at the same task, are all validated against expert segmentation. CRF models outperform human work on accuracy metrics and are thus promising for real-life use. Fast and accurate automated segmentation improves the accessibility of historical texts both in their home culture and in the rest of the world. Note: The source code, complete results, and sample segmented texts of this study can be found at github.com/xlhdh/classycn.
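
To illustrate what "segmentation as sequence labeling" can mean in practice, the following is a minimal sketch, not the thesis's actual code. It assumes a two-label scheme in which each character is tagged "B" if a sentence break follows it and "O" otherwise; the label names, the helper functions to_labels, from_labels and boundary_scores, and the sample text are all hypothetical choices made for this example. A model such as an HMM or CRF would then be trained to predict the label sequence from the unpunctuated characters.

```python
def to_labels(segments):
    """Convert expert-segmented sentences into (character, label) pairs.

    Assumed scheme: "B" marks a character followed by a sentence break,
    "O" marks every other character.
    """
    pairs = []
    for seg in segments:
        for i, ch in enumerate(seg):
            pairs.append((ch, "B" if i == len(seg) - 1 else "O"))
    return pairs


def from_labels(chars, labels, sep="。"):
    """Rebuild segmented text by inserting `sep` after every "B" character."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab == "B":
            out.append(sep)
    return "".join(out)


def boundary_scores(gold, pred):
    """Precision, recall and F1 of predicted breaks against expert breaks."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "B" and p == "B")
    pred_b = pred.count("B")
    gold_b = gold.count("B")
    precision = tp / pred_b if pred_b else 0.0
    recall = tp / gold_b if gold_b else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Sample input: two clauses from the opening of the Analects.
pairs = to_labels(["學而時習之", "不亦說乎"])
chars = [c for c, _ in pairs]
gold = [l for _, l in pairs]

# A hypothetical model output with one spurious break after the third character.
pred = ["O", "O", "B", "O", "B", "O", "O", "O", "B"]

print(from_labels(chars, gold))
print(boundary_scores(gold, pred))
```

Scoring only the break positions, as boundary_scores does, is one way to compare a model or a human attempt against an expert segmentation; the metrics actually reported in the thesis may be defined differently.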

Subject

classical Chinese
natural language processing

Files

classical-chinese-sentence.pdf

Adobe PDF, 212.37 KB

Department

Computer Science

Advisor

Sanchez-Aguilar, Antonio

URI

https://repository.tcu.edu/handle/116099117/10350
