 
 
The CBACLE Project: A Progress Report
 
Gui Shichun
 
itshgui@scut.edu.cn
Guangdong University of Foreign Studies, China
 
 
 

1. Overview

CBACLE (Corpus-based Analysis of Chinese Learner English) is an ongoing project in the Ninth Five-Year Plan of the National Foundation of Social Sciences and Humanities.

The project has two phases:

-- Phase One: the development of CLEC (Chinese Learner English Corpus), which consists of one million words
-- Phase Two: an overall analysis of Chinese learner English, which aims to examine the relationship between error tags and grammatical tags
 

2. CLEC: Sampling

CLEC (Chinese Learner English Corpus) has collected one million words of written compositions from:

-- Secondary school students: 300,000 words
-- Non-major college students: 300,000 words
-- English majors (intermediate): 200,000 words
-- English majors (advanced): 200,000 words
 
Sampling principles:

-- The samples should be the genuine work of the students; all errors should be kept intact.
-- The samples should be well distributed and proportionate over all levels.
-- The samples should be collected from various sources, not only test papers, so that the written work is a spontaneous production of the learner; when writing a test paper, candidates tend to adopt an avoidance strategy.

3. CLEC: Error tags

Principles:

-- A concise but logical and systematic scheme
-- A clear conceptual framework for defining errors
-- Detailed categorization of common grammatical errors and rough categorization of those with a lower frequency of occurrence
-- Exclusion of stylistic errors (which cannot be tagged objectively and consistently)
-- Provision of an adequate amount of error information for further analysis
-- Open-endedness of error tags for the convenience of later addition or revision
-- Easy and natural recognition of tags, and consistent tagging by different project members

4. CLEC: Error classification

Errors are classified under six headings: word form, word class, wording, collocation, sentence and uncertainty.

-- Word form: errors concerning individual words only
-- Word class: errors found in a larger linguistic context (from phrase to discourse), analysed at two levels:

* level 1 = word class division: verb phrases, noun phrases, adjectival phrases, prepositional phrases, pronouns, adverbs and conjunctions
* level 2 = specific errors for each word class

-- Wording: errors concerning a larger linguistic context (from phrase to discourse), which include order, choice, quantity and clarity
-- Collocation: errors concerning the co-occurrence of notional words in a linguistic context (from phrase to discourse)
-- Sentence: structural and semantic errors concerning a whole sentence, punctuation included
-- Uncertainty: erroneous or doubtful expressions whose classification or tagging awaits further consideration

An error tag provides the following information:

   1) error type
   2) erroneous word(s)
   3) minimal error recognition context

Finally an error tagging scheme consisting of 63 error types was designed.

5. CLEC: Text Defining Tags

Another set of tags was designed to describe the corpus materials, using angle brackets (<>). For example, <ST2><SEX1><AGE15> means that the text was written by a student of type 2 (senior middle school) who is male and 15 years of age. All tags are put in the first line of the text.

The idea is that errors can be retrieved and studied in different contexts.
 
Code   Type
ST     (student)
       = 1 (Junior Middle School)
       = 2 (Senior Middle School)
       = 3 (Non-Major English, Level 4, undergraduate)
       = 4 (Non-Major English, Level 6, undergraduate)
       = 5 (English Major, 1st to 2nd year, undergraduate)
       = 6 (English Major, 3rd to 4th year, undergraduate)
       = 7 (English Major, postgraduate)
SEX    = 1 (Male)
       = 2 (Female)
Y      (number of years of learning English)
       = cumulative years (e.g. 6, 9)
       = DN (Don't know)
SCH    (the name of the school)
       = provided by the sample collector as an acronym formed from the first letters of the Chinese Pinyin name, no more than 3 letters long (e.g. Hanmin Senior Middle School as HMS)
       = DN (Don't know)
AGE    = natural age (e.g. 15, 20)
       = DN (Don't know)
WAY    (the way the paper is written)
       = 1 (test paper)
       = 2 (classroom assignment)
       = 3 (homework)
       = DN (Don't know)
DIC    (whether dictionaries were used)
       = 1 (Yes)
       = 2 (No)
       = DN (Don't know)
TYP    (essay type)
       = 1 (argumentative, expository)
       = 2 (narrative, descriptive)
       = 3 (practical: letters, diaries, notes, form-filling, etc.)
       = 4 (others)
       = DN (Don't know)
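
The header line described in this section can be read mechanically. The following is a minimal Python sketch, not the project's own software. It assumes that tags are written back to back exactly as in the example <ST2><SEX1><AGE15>, and that the values of Y and SCH directly follow the code inside the brackets; the function name parse_header is hypothetical.

```python
import re

# Codes listed in this section. Values are kept as strings because
# several codes allow the non-numeric value DN (Don't know).
KNOWN_CODES = ("ST", "SEX", "Y", "SCH", "AGE", "WAY", "DIC", "TYP")

# Longer codes are tried first so that a short code never shadows a long one.
_TAG = re.compile(
    "<(" + "|".join(sorted(KNOWN_CODES, key=len, reverse=True)) + ")([^<>]*)>"
)

def parse_header(first_line):
    """Return {code: value} for every text-defining tag on the first line."""
    return {code: value.strip() for code, value in _TAG.findall(first_line)}

header = parse_header("<ST2><SEX1><AGE15>")
print(header)
```

A retrieval tool could then select texts by any combination of these values, for example all texts where ST is 2 and SEX is 1.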
 

6. CLEC: The Error Marking Program

Error tags have to be marked manually, and it is difficult for the marker to remember the whole tagging scheme. A special error marking program has therefore been designed to facilitate the process: once the position of an error has been identified and the error type chosen from a pop-up menu on the screen, the computer inserts the tag in the right place.
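
The core step of such a program, inserting a chosen tag at an identified position, can be sketched in a few lines of Python. This is an illustration only: the square-bracket syntax, the placement of the tag immediately after the erroneous word, and the codes WF and CC are assumptions made for the example and are not taken from the CLEC scheme.

```python
# Hypothetical menu: label shown to the marker -> tag code to insert.
# The real scheme has 63 error types; these two codes are invented.
MENU = {"word form": "WF", "collocation": "CC"}

def insert_tag(text, position, tag):
    """Insert a bracketed error tag into text at a character offset."""
    if not 0 <= position <= len(text):
        raise ValueError("position is outside the text")
    return text[:position] + "[" + tag + "]" + text[position:]

sentence = "I recieve a letter."
# The marker identifies the misspelt word; the tag goes right after it.
end_of_error = sentence.index("recieve") + len("recieve")
marked = insert_tag(sentence, end_of_error, MENU["word form"])
print(marked)
```

Because the marker only chooses from a menu, the tag text itself is never typed by hand, which is one way to keep tagging consistent across project members.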

7. CLEC: More

-- A dedicated concordancer is under construction, so that researchers can easily retrieve texts according to the error tagging scheme

-- Researchers can also use other concordancers, such as Longman's, MicroConcord, TACT, etc., to retrieve the texts
-- Putting the corpus on the Internet
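
Retrieval by error tag can be illustrated with a small keyword-in-context routine in Python. It is a sketch under stated assumptions, not the concordancer under construction: it assumes that error tags appear in the text as a code in square brackets, and the codes WF and CC, like the function name concordance, are hypothetical.

```python
import re

def concordance(text, tag, width=20):
    """Return one keyword-in-context line for every occurrence of [tag]."""
    lines = []
    for match in re.finditer(re.escape("[" + tag + "]"), text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the tags line up in a column.
        lines.append(left.rjust(width) + " [" + tag + "] " + right.ljust(width))
    return lines

sample = "I recieve[WF] a letter. We did a mistake[CC]. She writed[WF] it."
hits = concordance(sample, "WF", width=10)
for line in hits:
    print(line)
```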

8. Phase Two

-- General Statistical Report of the Corpus
-- Individual Statistical Reports according to Learner Types (Secondary School, College, English Majors, etc.)
-- Contrastive Statistical Reports of Error Types in terms of level, sex, age group, essay type, etc.
-- Individual studies:

* The correlation between error tags and grammatical tags
* Stylistic differences in English between Chinese learners and native speakers
* Identification of errors specific to learner level
* Error sources (influence of the mother tongue, over-generalization, etc.)
* Qualitative studies of specific error types (verbs, prepositions, patterns, etc.)
* Implications for English teaching in the Chinese context
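
A contrastive report of error types by level, sex or essay type amounts to counting error tags per group of texts. The Python sketch below shows one way this could be done. It assumes that each text starts with a header line of text-defining tags such as <ST2><SEX1><AGE15>, and that error tags appear in the body as a code in square brackets; the bracket syntax, the codes WF, WC and CC, and the sample sentences are invented for illustration.

```python
import re
from collections import Counter, defaultdict

def error_counts_by(texts, code):
    """Count bracketed error tags, grouped by the value of one header code."""
    counts = defaultdict(Counter)
    for text in texts:
        header, _, body = text.partition("\n")
        found = re.search("<" + re.escape(code) + r"([^<>]*)>", header)
        # Texts without the requested code are grouped under DN (Don't know).
        group = found.group(1) if found else "DN"
        counts[group].update(re.findall(r"\[([A-Za-z0-9]+)\]", body))
    return counts

texts = [
    "<ST2><SEX1><AGE15>\nI recieve[WF] a letter. He make[WC] a decision[CC].",
    "<ST2><SEX2><AGE16>\nShe writed[WF] it.",
    "<ST5><SEX2><AGE19>\nWe did a mistake[CC].",
]
by_level = error_counts_by(texts, "ST")
by_sex = error_counts_by(texts, "SEX")
for level, tally in sorted(by_level.items()):
    print(level, dict(tally))
```

Raw counts of this kind would normally be normalized by the number of words in each group before levels are compared, since the sub-corpora differ in size.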
