The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora of modern British and American English. The corpus is suitable for use in both monolingual research into modern Mandarin Chinese and cross-linguistic contrast of Chinese and British/American English. The corpus samples 15 written text categories, including news, literary texts, academic prose and official documents, published in P.R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The corpus is marked up for text categories, sample file numbers, paragraphs, sentences and tokens. Linguistic annotations undertaken on the corpus include tokenization and part-of-speech tagging. The whole corpus is annotated at the word level and includes orthographic and morphological annotations. The tagging system used was the Chinese Lexical Analysis System (ICTCLAS), produced by the Institute of Computing Technology of the Chinese Academy of Sciences. The corpus is encoded in Unicode (UTF-8) and marked up in XML. The corpus comes with a User Manual detailing corpus design specifications and part-of-speech tags. The XML structure of the corpus was validated using the parser built into Xaira. Part-of-speech tagging of all aspect markers was manually checked.
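Because the corpus is UTF-8 XML marked up for text categories, sample files, paragraphs, sentences and tokens, it can in principle be read with any standard XML parser. The following is a minimal sketch using Python's standard library. The element names (`text`, `file`, `p`, `s`, `w`, `c`), the attribute names (`TYPE`, `ID`, `n`, `POS`), the tag values and the sample sentence are all assumptions made for illustration; the actual markup scheme and tagset are specified in the User Manual.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature sample. Element names, attribute names and POS
# values are illustrative assumptions, not taken from the corpus itself.
SAMPLE = """<text TYPE="A">
  <file ID="A01">
    <p>
      <s n="0001">
        <w POS="r">我</w>
        <w POS="v">看</w>
        <w POS="u">了</w>
        <w POS="n">书</w>
        <c POS="ew">。</c>
      </s>
    </p>
  </file>
</text>"""


def pos_counts(xml_text):
    """Count part-of-speech tags over all word tokens (assumed <w> elements)."""
    root = ET.fromstring(xml_text)
    return Counter(tok.get("POS") for tok in root.iter("w"))


def sentences(xml_text):
    """Return each sentence as a list of (token, tag) pairs.

    Both word tokens (<w>) and punctuation tokens (<c>) are kept,
    in document order.
    """
    root = ET.fromstring(xml_text)
    return [
        [(tok.text, tok.get("POS")) for tok in s if tok.tag in ("w", "c")]
        for s in root.iter("s")
    ]


if __name__ == "__main__":
    print(pos_counts(SAMPLE))
    for sent in sentences(SAMPLE):
        print(" ".join(f"{word}/{tag}" for word, tag in sent))
```

For a real corpus file, `ET.parse(path).getroot()` would replace `ET.fromstring`, with the element and attribute names adjusted to match the User Manual.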
2023/8/3 6:27:31
5.15MB
LCMC
1