Development of lexical and grammatical competences in immigrant Swedish
As the number of immigrants in Sweden grows, the need for courses and coursebooks in Swedish as a second language (L2) is increasing, as is the demand for standardized tests and qualifications. This project intends to study the development of lexical and grammatical competences in L2 learners of Swedish. We intend to perform the study through two corpora: coursebook texts and learner essays, both marked up for proficiency levels according to the Common European Framework of Reference (CEFR). The corpora will be processed by computational methods, after which the results will be analysed by linguists, lexicographers, grammarians, teachers and language assessors, both linguistically and from the perspective of teaching theory. The goal is to find ways of identifying minimal or central (need-to-know) vocabulary and grammar scopes, as well as peripheral (good-to-know) grammar and vocabulary, at each level of proficiency, as a way to support teachers, test-makers, assessors and learners.
The aim of this project is, thus, to provide an extensive description of the lexical and grammatical competence that learners at each level possess, both receptively and productively, and to explore the relation between the receptive and productive scopes. The project will result in a number of practical digital tools: online sites for browsing and downloading lexical and grammatical inventories, and a set of algorithms and tools that can be re-used on other corpora to extract similar types of resources.
Final report
The MAIN AIM of this project was to provide an extensive description of the lexical and grammatical competence that learners at each level of proficiency possess, both receptively and productively. The aim has been successfully achieved NOT ONLY in the form of a long list of studies, experiments and publications, BUT ALSO, and maybe especially, in the form of a NOVEL TOOL, the Swedish L2 profile (https://spraakbanken.gu.se/larkalabb/svlp), which any interested user can access to explore Swedish L2 data in a user-friendly and transparent way. The Swedish L2 profile provides an opportunity to build new insights and formulate data-driven hypotheses; to apply it to teaching, test item generation, coursebook writing and CALL (computer-assisted language learning) development; and to use it in many other potential scenarios. In addition, the project has generated two UNIQUE RESOURCES for Swedish: a Morpheme Family and a Word Family resource. These are the first of their kind, since they not only include a morphological analysis of the words in relation to derivational morphemes and word formation patterns (e.g. compounding, derivation), but also link each morpheme (roots, prefixes, etc.) to the level of proficiency where it has been used in the data and to statistics of its usage. The Swedish L2 profile, openly available since April 2023, can be used by anyone, with the possibility to download filtered sets of data.
The MAIN RESEARCH FINDINGS, documented in numerous articles, reflect the complexity of learner language analysis. Among other things, they offer insights into multi-word expressions in L2 Swedish (Lindström Tiedemann et al., Submitted-a; Lindström Tiedemann et al., 2022), into the behavior of core and peripheral vocabulary per level of proficiency (Volodina et al., Accepted), into word formation in relation to proficiency and frequency (e.g. Ingves and Lindström Tiedemann, Submitted), into mixed-method approaches combining quantitative and qualitative empirical approaches with crowdsourcing (Alfter et al., 2021; Volodina et al., Accepted; Lindström Tiedemann et al., 2022), and into many other aspects of learner language and L2 research.
The project results were DISSEMINATED at numerous venues typical of the relevant research fields, such as Linguistics, Scandinavian Languages, Learner Corpus Research (LCR), Second Language Acquisition (SLA), Intelligent Computer-Assisted Language Learning (ICALL), Natural Language Processing (NLP), Lexicography and Onomastics. One PhD thesis and several MA and BA dissertations have been produced in relation to the project. Dissemination took the form of:
* publications in conference proceedings, journals, book chapters
* PhD thesis and MA/BA dissertations
* organization of workshops, meetings and events
* papers, posters and demos at conferences and workshops
* invited talks, guest talks and research seminars
To reach out to the general public, a number of blog posts presenting findings from the project were published.
All experiments, datasets and tools are well documented and openly available for reuse by other researchers and projects. A series of guidelines is also available, covering, among other things, annotation of multi-word expressions by type, use of the lexicographic tool LEGATO in annotation work, and word formation annotation, as well as a manual for the online SweL2P tool. Altogether, nine (9) reports have been produced in the guidelines series (see https://spraakbanken.gu.se/en/projects/l2profiles/l2p-project-output).
Workflow__________
The aim of the project was to gain insights into linguistic developmental patterns in non-native Swedish as language proficiency develops. The work was split into several packages (described briefly below) and relied on two corpora: a corpus of coursebook texts (COCTAILL, Volodina et al. 2014) and a collection of learner essays (SweLL-pilot, Volodina et al. 2016), both marked up for proficiency levels according to the Common European Framework of Reference (CEFR, COE 2001).
WP1: Data preparation: corpora, resources, tools
1. The first step included work with data preparation:
* First, the SweLL-pilot corpus reported in Volodina et al. (2016) was extended by 163 essays that were transcribed and anonymized following the same guidelines as the initial 339 essays.
* Second, the two corpora were automatically annotated, including lemmatization, part-of-speech tagging, dependency relations, multi-word detection and word sense disambiguation.
* Third, a subset of the learner essays was normalized (i.e. rewritten slightly so as to conform more closely to the standard language) and then annotated in the same way as the rest of the data as described above.
* Fourth, automatic annotation quality was manually examined on a subset of the corpora to make sure that the automatic tags provide a sound basis for further research (Volodina et al. 2022b).
* Finally, we generated lists of items, including sense-based vocabulary inventories, statistical representations of grammatical features (e.g. noun declension, adjectival declension & inflectional patterns, verbal conjugations, etc.), and statistical overviews of noun and verb patterns. All items were cross-linked with Språkbanken’s lexicographic resources to obtain rich lexicographic information for further re-use, such as information about inflectional patterns (Alfter, 2021).
2. The second step included manual annotation of the resources above:
* All automatically identified multi-word expressions (MWE), such as “dra slutsatser” (‘draw conclusions’), were classified into relevant subgroups by syntactic principles (contiguous/non-contiguous), lexical categories (e.g. nominal, verbal, non-lexical, etc.) and verbal subcategories (e.g. particle verb, reflexive verb, etc.) (Lindström Tiedemann et al., Submitted-a).
* All sense-based vocabulary items were analyzed for their morphological constituents (e.g. roots, prefixes, binding morphemes, etc.) (Volodina et al., 2021)
* To support (semi-)manual annotation in the steps above, the tool LEGATO (Alfter et al., 2019) was implemented and successfully employed.
3. The input from step 2 was visualized in the Swedish L2 profile tool, which offers searches, graphs, frequency statistics and actual corpus hits (Volodina et al., 2022c) and features:
* Lexical profile, including adjectival declension, adjectival and adverbial structure of comparison, Multi-Word Expressions (Lindström Tiedemann et al., Submitted-a), sense-based word list Sen*Lex (Alfter 2021)
* Grammatical profile, comprising 38 verb patterns and 143 noun patterns (Lindström Tiedemann et al., Submitted-b)
* Morphological profile, which includes Word Family and Morpheme Family (Volodina et al., Submitted)
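The item lists and per-level frequency statistics described in WP1 can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the token format (dictionaries with `lemma`, `pos` and `level` keys), the function names and the toy data are assumptions made for illustration, and treating coursebook texts as receptive data and learner essays as productive data is our reading of the two-corpus design.

```python
from collections import Counter, defaultdict

# CEFR levels in ascending order of proficiency.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def build_level_profile(tokens):
    """Count occurrences of each (lemma, pos) item per CEFR level."""
    profile = defaultdict(Counter)
    for token in tokens:
        profile[token["level"]][(token["lemma"], token["pos"])] += 1
    return dict(profile)


def first_level(profile, item):
    """Return the lowest CEFR level at which the item is attested, or None."""
    for level in CEFR_LEVELS:
        if profile.get(level, Counter())[item] > 0:
            return level
    return None


# Hypothetical toy data; real input would come from the annotated corpora.
coursebook_tokens = [
    {"lemma": "hus", "pos": "NN", "level": "A1"},
    {"lemma": "slutsats", "pos": "NN", "level": "B1"},
    {"lemma": "hus", "pos": "NN", "level": "A2"},
]
essay_tokens = [
    {"lemma": "hus", "pos": "NN", "level": "A1"},
    {"lemma": "slutsats", "pos": "NN", "level": "B2"},
]

receptive = build_level_profile(coursebook_tokens)   # coursebook texts
productive = build_level_profile(essay_tokens)       # learner essays

for item in [("hus", "NN"), ("slutsats", "NN")]:
    print(item, first_level(receptive, item), first_level(productive, item))
```

Comparing the first level of attestation in the two profiles is one simple way to relate receptive and productive scopes. A sense-based inventory would need a sense identifier in the item key in addition to lemma and part of speech.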
WP2-4: Lexical, grammatical and morphological competencies
Using input from WP1, especially points (2) and (3), and visualization from WP2, a range of studies was performed to discover actual patterns that are typical of learners at different levels of proficiency. Some of these studies are:
* about the nature of perceived lexical difficulty, comparing L2 learners and L2 professionals (Alfter et al., 2021)
* about the nature of core vocabulary versus peripheral vocabulary at different levels of proficiency (Volodina et al., Accepted)
* about Multi-Word expressions in L2 data (Lindström Tiedemann et al., Submitted-a)
* about proper names in L2 data (Lindström Tiedemann, Accepted)
* about word families, their growth from level to level and hypothetical priming effects of derivational knowledge (Volodina et al., Submitted; 2022)
* about prepositions, passive, verb phrases and noun phrases in learner language (presented at conferences, workshops and seminars but not yet published)
* …and many more, see Project publication list
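As an illustration of what tracking the growth of word families from level to level might involve, the following Python sketch computes the cumulative size of each family per CEFR level. It is a simplified assumption, not the method used in the cited studies: the input format (word, root, level) and the function name `family_growth` are hypothetical, and the level assignments in the toy data are invented.

```python
from collections import defaultdict

# CEFR levels in ascending order of proficiency.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def family_growth(entries):
    """Return {root: {level: cumulative number of family members}}.

    entries: iterable of (word, root, level) triples, where level is the
    CEFR level at which the word is attested.
    """
    members = defaultdict(lambda: defaultdict(set))
    for word, root, level in entries:
        members[root][level].add(word)

    growth = {}
    for root, by_level in members.items():
        seen = set()
        sizes = {}
        for level in CEFR_LEVELS:
            # Accumulate members attested at or below the current level.
            seen |= by_level.get(level, set())
            sizes[level] = len(seen)
        growth[root] = sizes
    return growth


# Hypothetical toy data: members of the family built on the root "skriv".
entries = [
    ("skriva", "skriv", "A1"),
    ("skrivbord", "skriv", "A2"),
    ("beskriva", "skriv", "B1"),
    ("beskrivning", "skriv", "B1"),
]

growth = family_growth(entries)
print(growth["skriv"])
```

A word attested at several levels is counted once, at the first level where it appears, because family members are accumulated as a set.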
More information: https://spraakbanken.gu.se/en/projects/l2profiles
REFERENCES____________
Council of Europe. Council for Cultural Co-operation. Education Committee. Modern Languages Division. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge University Press.
Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. (2016). SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia.
Elena Volodina, Ildikó Pilán, Stian Rødven Eide, Hannes Heidarsson. (2014). You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.
BLOGS_______
Elena Volodina (December 2021). God Jul from the Swedish Word Family (https://spraakbanken.gu.se/blogg/index.php/2021/12/20/god-jul-with-the-swedish-word-family/). Blog for Språkbanken Text.
Elena Volodina (April 2021). Swedish derivational morphology with CoDeRooMor (https://spraakbanken.gu.se/blogg/index.php/2021/04/14/swedish-derivational-morphology-with-coderoomor/). Blog for Språkbanken Text.
Elena Volodina (September 2020). How reliable is sense disambiguation in texts by native and non-native speakers? (https://spraakbanken.gu.se/blogg/index.php/2020/09/30/how-reliable-is-sense-disambiguation-in-texts-by-native-and-non-native-speakers/). Blog for Språkbanken Text.