Building a Corpus of 2L English for Automatic Assessment: The CLEC Corpus

  1. Tejada, Ma Ángeles Zarco (1)
  2. Gallardo, Carmen Noya (1)
  3. Ferradá, Ma Carmen Merino (1)
  4. López, Ma Isabel Calderón (1)

(1) Dpto. Filología Francesa e Inglesa, Universidad de Cádiz, Avda. Doctor Gómez Ulla, 1, 11003 Cádiz, Spain
Journal:
Procedia - Social and Behavioral Sciences

ISSN: 1877-0428

Year of publication: 2015

Volume: 198

Pages: 515-525

Type: Article

DOI: 10.1016/J.SBSPRO.2015.07.474 (open access via publisher)


Abstract

In this paper we describe the CLEC corpus, an ongoing project at the University of Cádiz that aims to build a large corpus of English as a second language, classified according to CEFR proficiency levels and designed for training statistical models for automatic proficiency assessment. The corpus serves two goals: it will be a data resource for developing automatic text classification systems, and it has already been used as a vehicle for teaching innovation.
