Abstract
Various methods have recently been proposed to address the weak domain adaptability of neural-network-based Chinese word segmentation (CWS) models. However, although some of these improved models achieve high segmentation accuracy in a specific domain, they must be retrained when applied to another. After rethinking domain adaptability, two criteria are suggested for measuring it: segmentation accuracy and universality. With these two criteria in mind, an improved neural CWS model is proposed that incorporates a common lexicon and unlabeled data into BERT. To make full use of the lexicon, a new method is proposed for constructing a lexicon-based feature vector. In addition, domain-specific words can be effectively extracted by pre-training a language model on the unlabeled data. Finally, a GRU-like gate structure is used to integrate the lexicon-based feature vector and the language model into BERT. Experiments on five different domains show that the domain adaptability of this model is very strong.
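The GRU-like gate mentioned in the abstract can be sketched as an element-wise convex blend of a contextual hidden state and an auxiliary feature vector. The weight shapes, the tanh projection, and the exact gating formula below are illustrative assumptions in NumPy, not the paper's published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h, feat, W_proj, W_gate):
    """GRU-like gate blending a contextual hidden state h (e.g. a BERT
    token representation) with an auxiliary feature vector feat (e.g.
    lexicon-based features). Illustrative sketch only."""
    f = np.tanh(feat @ W_proj)                             # candidate built from features
    z = sigmoid(np.concatenate([h, f], axis=-1) @ W_gate)  # update gate, values in (0, 1)
    return z * h + (1.0 - z) * f                           # element-wise convex blend

hidden, feat_dim, seq_len = 8, 4, 5
W_proj = rng.standard_normal((feat_dim, hidden)) * 0.1     # hypothetical weights
W_gate = rng.standard_normal((2 * hidden, hidden)) * 0.1
h = rng.standard_normal((seq_len, hidden))                 # mock token hidden states
feat = rng.standard_normal((seq_len, feat_dim))            # mock lexicon feature vectors
out = gated_fusion(h, feat, W_proj, W_gate)
print(out.shape)  # (5, 8): fused representation keeps the hidden dimension
```

Because the gate output lies strictly between 0 and 1, each fused component is a weighted average of the contextual state and the feature candidate, letting the model decide per dimension how much lexicon evidence to absorb.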
