Abstract
Various methods have recently been proposed to address the weak domain adaptability of neural-network-based Chinese word segmentation (CWS) models. However, although some of these improved models achieve high segmentation accuracy in a specific domain, they must be retrained when applied to another. After rethinking domain adaptability, two criteria are suggested for measuring it: segmentation accuracy and universality. With these two criteria in mind, an improved neural CWS model is proposed that incorporates a common lexicon and unlabeled data into BERT. To make full use of the lexicon, a new method is proposed for constructing a lexicon-based feature vector. In addition, domain-specific words can be effectively extracted by pre-training a language model on the unlabeled data. Finally, a GRU-like gate structure is used to integrate the lexicon-based feature vector and the language model into BERT. Experiments on five different domains show that the domain adaptability of this model is very strong.
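The GRU-like gate mentioned in the abstract can be sketched as an element-wise convex blend of a contextual hidden state and an auxiliary feature vector. The weight shapes, the tanh projection, and the exact gating formula below are illustrative assumptions in NumPy, not the paper's published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h, feat, W_proj, W_gate):
    """GRU-like gate blending a contextual hidden state h (e.g. a BERT
    token representation) with an auxiliary feature vector feat (e.g.
    lexicon-based features). Illustrative sketch only."""
    f = np.tanh(feat @ W_proj)                             # candidate built from features
    z = sigmoid(np.concatenate([h, f], axis=-1) @ W_gate)  # update gate, values in (0, 1)
    return z * h + (1.0 - z) * f                           # element-wise convex blend

hidden, feat_dim, seq_len = 8, 4, 5
W_proj = rng.standard_normal((feat_dim, hidden)) * 0.1     # hypothetical weights
W_gate = rng.standard_normal((2 * hidden, hidden)) * 0.1
h = rng.standard_normal((seq_len, hidden))                 # mock token hidden states
feat = rng.standard_normal((seq_len, feat_dim))            # mock lexicon feature vectors
out = gated_fusion(h, feat, W_proj, W_gate)
print(out.shape)  # (5, 8): fused representation keeps the hidden dimension
```

Because the gate output lies strictly between 0 and 1, each fused component is a weighted average of the contextual state and the feature candidate, letting the model decide per dimension how much lexicon evidence to absorb.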
