Crowdsourcing, Linguistic Analysis, and Language Resources
(GRF Project No. 544011)

Introduction

Empirical approaches to the scientific study of language have developed rapidly in the last few decades, thanks to the introduction of psychological experiments and electronic corpora. As experimental and measurement tools become more sophisticated, and corpora grow larger and more diverse, new research topics are frequently introduced and exciting discoveries are made. Despite the success of these two directions, however, one basic bottleneck in linguistic research remains: obtaining a reasonably representative sample size. Language is an ability shared by thousands, even millions, of speakers. So far, the experimental approach can access the language production data of no more than a few score speakers, while corpus sampling cannot reflect distributional variation across different speakers. Ideally, linguistic studies should be based on data produced by a substantial sample of speakers from different backgrounds. The recent development of crowdsourcing offers a new and unique opportunity to collect linguistic behaviour data from a substantial number of speakers effectively and economically.

The Internet has emerged as one of the most dominant media of linguistic communication, yet it remains under-explored for linguistic research. Recently, crowdsourcing has offered efficient tools for mining public opinion and for collecting and annotating large-scale language resources. Crowdsourcing allows researchers to collect reliable data on tasks requiring human intelligence from a much larger number of subjects than traditional experimental methods. This study aims to explore and establish a new research paradigm in the language sciences by applying Internet-based crowdsourcing tools. The overarching goal is to develop research methodologies that enable efficient collection of large-scale, felicitous linguistic judgements and/or behaviours. The success of our research will greatly increase the number of subjects in linguistic studies and allow generalizations to be made based on the linguistic judgements of a significantly larger number of native speakers. We plan to establish this new paradigm by comparing the data and generalizations obtained from Internet crowdsourcing with corresponding studies that use psycholinguistic experiments or corpus-based human annotation. We propose three sets of experiments, concerning the segmentation and transparency of compounds, to characterize Chinese native speakers' performance in identifying word boundaries and the internal composition of a word. Crowdsourcing data, collected using Mechanical Turk (MTurk), will be compared with both annotated corpus data and corresponding psycholinguistic research to establish a theoretical interpretation of the data. This cross-validation approach will not only create synergy among computational, psychological, and linguistic approaches, but will also bring new perspectives to the scientific study of language.

The study has three major impacts: on research methodology in the language sciences, on our understanding of how the concept of word works for the Chinese language, and on how large-scale Chinese language resources can be collected. The first major impact will be to introduce a new research methodology to the language sciences and to establish how the results of this methodology can be evaluated and interpreted in comparison with previous studies. The availability of Internet crowdsourcing tools, such as Mechanical Turk, allows researchers to design tasks that require human intelligence and to gather data from a large number of subjects performing these tasks. Second, this new methodology allows us to gain a more complete picture of the status of the concept of word in Chinese, a still contentious subject among Chinese linguists. Most Chinese speakers are able to identify words given some instruction. However, there is considerable variation among speakers as to what constitutes a word. Third, as crowdsourcing tools were developed in an English environment, our study will identify and resolve any issues that may arise when they are applied to the Chinese language. Our in-depth linguistic study will also inform future language technology research and applications using crowdsourcing in the Chinese context, and open a new way to construct large-scale annotated language resources for Chinese efficiently.

Research Team

Principal Investigator:

Prof. Chu-ren Huang, Chair Professor of Applied Chinese Language Studies, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University.

Co-investigators:

Dr. Yao Yao, Assistant Professor, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University.

Dr. Angel Wing-Shan Chan, Assistant Professor, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University.

Dr. Wenjie Li, Associate Professor, Department of Computing, The Hong Kong Polytechnic University.

Dr. Shoushan Li, Associate Professor, Department of Computing, Soochow University.

PhD Student:

Mr. Shichang Wang, PhD Candidate, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University.

(Last Modified: June 6, 2014)

PolyU - Linguistic Theory & Language Technology Group
