An Approach to Assessing the Linguistic Difficulty of Tasks

This article proposes an approach to assessing the linguistic difficulty of tasks, that is, the linguistic features involved in performing a communicative task that may make it more or less challenging for language learners. The procedure follows the methodology proposed by Pallotti (2019) for operationalizing task interactional difficulty. This consists, firstly, in identifying what linguistic-communicative features are particularly difficult for language learners, based on previous research showing that they appear late in the course of acquisition. Secondly, native speakers’ performance is observed in order to determine which tasks most involve these difficult linguistic features. The dimensions observed in this study concern lexical diversity and sophistication, morphological complexity, and length and depth of syntactic constructions. Data come from 10 native speakers of Italian performing 5 communicative tasks. Results show that different dimensions of linguistic difficulty are relatively independent of each other, and that interindividual variation is rather limited as regards the lexicon and morphology, but more pronounced for syntax. Implications for SLA research, Task-Based Language Teaching and Task-Based Language Assessment are discussed.


Introduction
Over the last decades a considerable body of research has accumulated on the relationship between the characteristics of communicative tasks and their effects on second language performance. The results achieved to date, however, are not very clear and consistent, and this is frequently attributed to the fact that a large number of measures and operationalizations have been proposed, with little attention to construct validity and the replicability of results (Ellis, 2018; Long, 2015; Plonsky & Kim, 2016).
In recent years several works have appeared with the aim of clarifying key constructs, both with regard to the dependent variables of complexity, accuracy and fluency and the independent variable of task complexity, or difficulty (in this article the term difficulty will be preferred, for reasons that will be explained in the next section) (Bulté & Housen, 2012; Norris & Ortega, 2009; Pallotti, 2009; Révész, 2014; Révész, Michel et al., 2016; Sasayama, 2016). Continuing along this line, this article presents an approach to explicitly defining and operationalizing linguistic difficulty, one of the aspects that makes a task more or less challenging.
The argument follows the approach proposed by Pallotti (2019) to assess task interactional difficulty, extending it to a new domain, i.e. linguistic difficulty. First, some linguistic features will be deemed to be more difficult than others, based on previous research showing that they systematically appear later in interlanguage development. Indeed, "a language feature is more difficult than another if its processing and learning requires more time and/or more mental activity" (Housen & Simoens, 2016, p. 166). Then, the performance of native speakers on five tasks will be analysed, in order to assess whether different tasks elicit variable amounts of difficult features. These results will be used to establish, in an empirically grounded, explicit way, whether one task is more difficult than another from a linguistic point of view. The relative difficulty of the same tasks may change with respect to other dimensions, such as interactional difficulty, reasoning demands or pragmatic constraints. The point made here is that different dimensions of task difficulty can and should be assessed independently, in order to arrive at a clearer picture of the demands that different tasks make on task performers and, as a consequence, a better understanding of how these demands impact on participants' communicative behaviours.
Linguistic difficulty is a key element contributing to a task's global difficulty, and it is mentioned in virtually all accounts of L2 communicative tasks, beginning with

Complexity and difficulty
In this article the expression task difficulty will be preferred to task complexity, which has been prevalent in the SLA literature over the past two decades or so; this terminological choice thus needs some justification.
As a matter of fact, most early works referred to task "difficulty" (e.g. Brindley, 1987; Candlin, 1987; Nunan, 1989; Skehan, 1992). In the language testing literature, too, the term difficulty is almost exclusively employed (e.g. Elder et al., 2002; Fulcher & Márquez Reiter, 2003). One of the first to consistently use the expression "task complexity" was Peter Robinson (1995, 2001), after which the term gained more and more ground in SLA research. This terminological choice, however, is not without problems, mainly because of the polysemy of the term complexity, which can mean both an object's structural properties (the number of its parts and of the relations among them) and the cognitive demands faced by human beings when interacting with that object. In the interest of terminological clarity, some authors have proposed that the two notions should be labeled with different terms, such as complexity for the former, and difficulty for the latter (Bulté & Housen, 2012; Housen, in press; Housen & Simoens, 2016; Pallotti, 2009, 2015; Skehan, 2015), which, among other things, would also facilitate research on the relationships between them. This holds for both tasks and linguistic features, which can be said to be more or less complex (composed of several elements with intricate structural relationships) or difficult (posing higher demands on the users). It is certainly possible to study whether and to what extent more structurally complex objects are more difficult for human beings to deal with. However, this is not a reason for using the same term for the cause (structural complexity) and the effect (cognitive difficulty), but actually suggests that the two notions should receive different labels. Figure 1 graphically depicts the relationships among these constructs. The first column concerns complexity, defined by Rescher (1998, p. 1) as "the number and variety of an item's constituent elements and of the elaborateness of their interrelational structure".
Linguistic features or texts may be complex because they contain many different elements (e.g. a high variety of lexical items or morphological processes) or because their relationships are intricate (e.g. long syntactic structures with deeply embedded constituents) (Bulté & Housen, 2012; Pallotti, 2015). This structural linguistic complexity may contribute to linguistic difficulty, that is, to the effort required of a human being to process and master such structures or produce texts containing them (DeKeyser, 2005; Housen, in press; Housen & Simoens, 2016; Spada & Tomita, 2010). Linguistic difficulty in turn contributes to task difficulty when a task, in order to be adequately performed, requires many difficult linguistic features. However, this is just one source of task difficulty, which may also be increased by the structural complexity of the task itself, for instance when the task contains many elements related to one another in a variety of ways, or with constraints on their co-occurrence (Skehan, 1998, 2015; Robinson, 2001, 2011, 2015). Finally, according to some theoretical models (e.g. Robinson, 2011), task complexity itself may also lead to the production of more complex linguistic structures and thus contribute, in a more indirect way, to task difficulty.
Figure 1: Complexity and difficulty in language and tasks.
The arrows in Figure 1 should not be taken to imply that relationships are circular, as if everything caused everything. There is a clear directionality between complexity and difficulty: as Rescher puts it, "cognitive difficulty reflects rather than creates complexity" (1998, p. 17) or, with specific regard to second language acquisition, "structural complexity can contribute to psycholinguistic complexity or difficulty, but does not coincide with it" (Housen, in press, p. 2). As a matter of fact, the bottom right cell, task difficulty, has arrows pointing to it, but none pointing from it, which means that task difficulty is the (more or less direct) product of many factors, but not their cause.

Defining and assessing linguistic difficulty in SLA
Notions like "code complexity" (Candlin, 1987; Skehan, 1992, 1998) express the intuition, shared by researchers, teachers and lay people, that some communicative tasks are more difficult than others because they require more complex linguistic structures, such as a varied and sophisticated vocabulary or the use of intricate syntactic and textual structures, and that this complexity leads to higher difficulty for task performers. These intuitions have been developed in subsequent research, and several criteria have been proposed to establish whether linguistic features are more or less difficult (see reviews by Collins et al., 2009; DeKeyser, 2005; Housen & Simoens, 2016). Structural complexity is often cited as one of the causes of linguistic difficulty, together with frequency and saliency in the input. Acquisitional timing, on the other hand, is considered to be an effect of linguistic difficulty: A structure may be said to be more difficult if it is acquired late, that is, if it appears at relatively advanced levels of L2 development.
Based on these general criteria for establishing linguistic difficulty, the following aspects may be examined in order to identify more specific constructs and their measures. The list does not exhaust all the features that have been shown to develop over time in L2 acquisition, but selects only some, chosen among those most investigated in previous research and that are not limited to particular languages.

Lexicon
Several studies have shown that in initial interlanguage varieties the lexicon tends to be repetitive and mostly contains high frequency words; rarer words, which can also be called more sophisticated, are acquired later, as well as the ability to use a varied lexicon, i.e. a high proportion of lexical types compared to the tokens produced (De Clercq, 2015; Dóczi & Kormos, 2016; Kang, 2013; Treffers-Daller, 2013; Yu, 2010). Therefore, a task requiring a varied lexicon (higher structural complexity) with several low-frequency words (higher acquisition difficulty) will be considered to be more difficult than one implying just a small set of frequent words.

Morphology
One of the first and most replicated findings of SLA research is that inflectional morphology is absent or very limited in basic interlanguage varieties, as it develops later, with variable speed and outcomes depending on individual factors and on the structural complexity of the system to be acquired (for recent contributions and overviews of previous literature, see Brezina & Pallotti, 2019). For these reasons, a task involving the use of a wide range of morphological processes can be said to require greater skills, and thus be more difficult, than one involving just a few morphological processes. The range of morphological processes can be calculated using the Morphological Complexity Index (Pallotti, 2015; Brezina & Pallotti, 2019), which measures the variety of morphological types appearing in a text.

Syntax
Languages also differ as regards syntax, with some having just one or two basic word orders, and others displaying a wide array of constructions with several constraints on their occurrence, based for instance on the type of constituency relation or illocutionary force. Thus, a task involving certain linguistic constructions or speech acts can be easy in one language and difficult in another. Nevertheless, research shows that, in general, the initial phases of second language acquisition are characterized by syntactically simple constructions, i.e. short and relatively independent of each other; only later are learners able to control more far-reaching structures, consisting of a large number of words or clauses. Vercellotti (2018, p. 7), in her study of the longitudinal development of L2 English speech, provides the following examples: Next time I can pay them back (less complex); if I don't like this man and I don't want to have a next date I think they pay the bill first (more complex), and shows that more structurally complex constructions tend to increase over time.
Measures such as mean length of production unit, number of clauses per unit and subordination ratio all represent this greater complexity of syntactic structures, and they have been shown to steadily increase at least from initial to intermediate levels, while at more advanced levels there is stabilization with greater variability, probably linked to individual stylistic preferences (for recent contributions and overviews of previous literature, De Clercq & Housen, 2017; Kuiken et al., 2019; Vercellotti, 2018; for Italian, Chini, 2003). It can thus be maintained that tasks involving the production of long and complex syntactic structures, containing several elements linked together, require more skills and are therefore more difficult from a linguistic point of view.

Native speakers' task performance
After having established, on the basis of empirical research, which linguistic features are more difficult as they take longer to be acquired, it is necessary to observe which tasks most require these features. Since we are concerned with difficulty for additional language users, it would seem natural to observe their performance. However, this is more problematic than it seems. In fact, if these learners were not to produce difficult linguistic behaviours in a task, it would be impossible to say whether this is due to the fact that the task does not require them, or to the fact that their skills do not allow it. Previous research has in fact shown that L2 proficiency systematically mediates between task characteristics and linguistic performance (e.g. Malicka & Levkina, 2012; Sasayama, 2016).
To overcome this problem, one may look at the performance of native speakers (Ellis, 2011; Long, 2015, p. 239; Pallotti, 2019), who form a more homogeneous group than learners, at least as regards the fundamental structures of "basic language cognition" (Hulstijn, 2015, 2019). As far as this type of language structure is concerned, native speakers consistently reach the highest scores, representing a sort of ceiling with respect to the wider range of scores obtained by learners at different levels (Abrahamsson & Hyltenstam, 2009; Granena & Long, 2013). Of course, it may be possible for some non-native speakers to reach the same levels as the natives, at least in some areas, so that the whole category may be labelled, in more general terms, "top language performers", to refer to individuals whose performance is at or close to ceiling levels. In any case, observing which structures are used by these top language performers in different tasks provides an indication of how the tasks themselves, rather than the speakers' (in)capacities, favour or limit their use. In other words, the observation of top language performers, who have at their disposal the whole range of structures, from the easiest to the most difficult, makes it possible to more directly observe how different tasks involve the use of more or less difficult structures.
Some previous studies have looked at native speakers' performance on tasks, with the aim of comparing it with that of language learners. For instance, Skehan (2009) observed that native and non-native speakers behaved rather similarly as regards their use of infrequent words and of varied lexicon in personal information exchange and decision-making tasks, while differences were more noticeable in picture-story retellings. For both groups, the two measures varied independently of each other, thus demonstrating that lexical variety and sophistication are independent constructs. Foster and Tavakoli (2009) showed that native speakers' syntactic complexity varied across different narrative tasks depending on storyline complexity, while Ellis (2011) found that syntactic and lexical complexity were different in different types of tasks (reporting a car accident vs giving directions on a map), although manipulating each type of task in order to make it more or less cognitively demanding did not lead to any changes in native speakers' linguistic performance.
There is thus evidence showing that native speakers' linguistic behaviours do indeed vary (like those of second language learners, though sometimes in different ways) depending on task conditions. However, none of these studies saw these variations in native speakers' performance as indexing higher or lower levels of potential difficulty for language learners, which is the focus that will be taken in this article.

The study
The data come from the VIP (Pallotti et al., 2011) corpus, also used by Pallotti (2019). Participants were girls aged 15-20 at the beginning of data collection, attending high schools in Northern Italy. Fourteen were non-native speakers with a variety of L1s, while 10 were native Italian speakers; this study looks only at the latter (mean age = 18.0). The relatively small sample used in this study implies that quantitative analyses should be taken as illustrating how the procedure may be practically implemented and indicating areas worthy of further investigation, rather than as making inferential claims about the generalizability of results for this particular set of tasks and participants.
Participants performed a variety of oral communicative activities, so that their linguistic skills could be assessed in a range of contexts. The procedure consisted of two sessions on two different days. The first session involved a series of essentially monologic tasks and began with a semi-structured interview, followed by retelling a silent film and a picture story, then by a map task with the adult interviewer. The second session proposed more interactive tasks, with participants working in pairs. There was another map task, this time with the peer, and two information-seeking activities, one requiring them to plan a school trip, the other to select a present for a friend. Both these tasks involved making a number of phone calls to shops, travel agencies, restaurants and hotels, and to a list of "experts" (both youths and older adults) who were asked to provide advice and information. Apart from the initial ice-breaking conversation, all the other tasks were presented in a counter-balanced order in different sessions.
Tasks for this project were devised so that they would vary mostly on pragmatic and sociolinguistic dimensions, such as the type of communicative moves to be performed (e.g. initiating, responding, negotiating), monologic vs dialogic activities, social distance between interlocutors (acquaintances vs strangers, peers vs adults). No task manipulations were envisaged to target specific linguistic dimensions, so that all tasks were assumed to involve rather ordinary everyday language of comparable difficulty, a point that should be borne in mind when interpreting the results presented in the next pages.
In this article we will look at native speakers' data from the interview, film retelling, map task with a peer, and from phone calls and face-to-face negotiations during the school trip organization (total corpus size: 71,500 words). Given that the interviews and the school trip organization task lasted much longer than other activities, only the first ten minutes of the first task and the last ten of the second will be analysed here (the choice is due to the fact that these parts of the activities were more uniform across dyads, so that in the interpretation of results intra-task variability would have a lower impact than systematic inter-task variation). Transcription followed a modified version of the Chat-CA system.
Transcribed data were prepared for quantitative analysis by first dividing them into AS-units (Analysis-of-Speech Unit, Foster et al., 2000) and clauses. This segmentation was carried out by students and research assistants on about 60% of the data, and then checked by the principal investigator; after some initial training and discussions, inter-rater agreement was always over 85%. The remainder of the data were coded by the principal investigator only.
Morphological and lexical analyses were conducted using automatic tools, which implied standardizing orthography and removing from the original transcription all non-verbal behaviour markers, such as pauses, breaths and laughter. Given that Italian is a highly inflected language, lexical diversity and sophistication were calculated on lemmas, obtained with the part-of-speech analyser TreeTagger (Schmid, 1994) and subsequent manual revision.

Lexicon
Lexical variety was assessed with the Moving Average Type-Token Ratio (MATTR, Covington & McFall, 2010), that is, the average type-token ratio (TTR) in fixed-length samples taken from a text (in this case, 250 words, which was slightly less than the shortest text in the corpus). MATTR is calculated by averaging the TTR of multiple text samples, one after another, so that each sample includes all the words of the previous sample except the first, plus a new word, until the end of the text is reached.
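The sliding-window procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tool used in the study: the function name mattr is hypothetical, the input is assumed to be an already lemmatized list of tokens, and texts shorter than the window are simply rejected.

```python
from typing import Dict, Sequence


def mattr(lemmas: Sequence[str], window: int = 250) -> float:
    """Return the Moving Average Type-Token Ratio of a lemmatized text."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(lemmas) < window:
        raise ValueError("text is shorter than the window")

    # Frequency table for the lemmas inside the current window.
    counts: Dict[str, int] = {}
    for lemma in lemmas[:window]:
        counts[lemma] = counts.get(lemma, 0) + 1

    ttr_sum = len(counts) / window  # TTR of the first window
    n_windows = 1

    # Slide the window one token at a time: drop the oldest lemma, add the next.
    for i in range(window, len(lemmas)):
        oldest = lemmas[i - window]
        counts[oldest] -= 1
        if counts[oldest] == 0:
            del counts[oldest]
        newest = lemmas[i]
        counts[newest] = counts.get(newest, 0) + 1
        ttr_sum += len(counts) / window
        n_windows += 1

    return ttr_sum / n_windows
```

Updating the frequency table incrementally avoids recounting every window from scratch, so the cost grows linearly with text length; with window=250 the function corresponds to the MATTR-250 setting mentioned above.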
Lexical sophistication was calculated as the proportion of words not belonging to the most frequent 2,000 lexical types, deemed the "fundamental" lexicon of Italian (De Mauro, 2016) and computable with the online tool Dylan Text Tools 2.1.9 (www.ilc.cnr.it/dylanlab/apps/texttools). The 2,000 most frequent words list is considered to be an important threshold for lexical richness according to Laufer's (1995) notion of "Beyond 2000".
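As a rough illustration of this measure, the proportion can be computed as below. The sketch assumes a token-based count over lemmas and a caller-supplied set standing in for the fundamental vocabulary list; the function name and the case-insensitive matching are illustrative choices, not a description of how Dylan Text Tools works internally.

```python
from typing import Iterable, Sequence


def lexical_sophistication(lemmas: Sequence[str],
                           basic_vocabulary: Iterable[str]) -> float:
    """Proportion of tokens whose lemma is NOT in the basic vocabulary list."""
    if not lemmas:
        raise ValueError("empty text")
    # Assumption: matching is case-insensitive and done on lemmas.
    basic = {word.lower() for word in basic_vocabulary}
    uncommon = sum(1 for lemma in lemmas if lemma.lower() not in basic)
    return uncommon / len(lemmas)
```

Multiplying the result by 100 gives the percentage of non-basic words in a text.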
As shown in Table 1 and Figure 2, lexical diversity values, calculated using the Moving Average Type/Token ratio with a window of 250 tokens (MATTR-250), did not exhibit large differences across tasks, except for the map task, where a smaller range of types was used (MATTR-250 = 0.36). Another indicator for a task's lexical difficulty is the proportion of non-basic words used in its performance. This was operationalized as the percentage of words in the text not belonging to the 2,000 most frequent lexemes in Italian. In this domain, too, values do not change very much across tasks, with negotiations and film retelling having the lowest proportion of non-common words (Table 2 and Figure 3).
It is worth noting that the two measures of lexical difficulty, viz. type/token variety and low frequency lemmas, do not go exactly hand-in-hand. For instance, negotiations had the highest lexical diversity, but the lowest proportion of infrequent words; on the other hand, the lexicon for performing the map task was not very basic, but it was rather repetitive.

Morphology
The Morphological Complexity Index (MCI, Pallotti, 2015) was computed with the online Morpho Complexity Tool (Brezina & Pallotti, 2015; corpora.lancs.ac.uk/vocab/analyse_morph.php), and was calculated as the average within- and across-set diversity of samples of 10 verbal exponents, randomly sampled 100 times from each text. This measure thus gives an indication of the variety of verbal inflections used in different tasks. Table 3 and Figure 4 show that the MCI values are quite similar for phone calls, negotiations and interviews, which display relatively high scores, all over 12. The variety of verbal exponents was slightly lower in the film retelling (11.24), where the plot was typically told using a few persons of the present tense, and much lower (8.54) in the map task, where most verb forms were in the second person singular of the imperative or in the third person singular of the present tense.

Syntax
Among the many measures that have been proposed to assess syntactic development in an additional language, two were selected for this study. The first is the mean length of AS-Unit, defined as a main clause or sub-clausal unit with all the dependent clauses attached to it (Foster et al., 2000). This measure provides a general indication of the breadth of unitary syntactic structures. The second measure is the number of dependent clauses per AS-Unit, which more specifically represents the degree of syntactic embedding. Both measures have been extensively applied in the SLA literature on several languages and have been shown to increase at higher proficiency levels in L2 oral productions (e.g. De Clercq & Housen, 2017; Vercellotti, 2018).
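Once a transcript has been manually segmented, both measures reduce to simple averages. The sketch below assumes a hypothetical ASUnit record holding the words of one unit and the number of dependent clauses a coder has identified in it; the segmentation itself is not automated here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ASUnit:
    """One manually segmented AS-Unit (hypothetical record for illustration)."""
    words: List[str]         # tokens of the unit
    dependent_clauses: int   # dependent clauses identified by the coder


def mean_length_of_as_unit(units: Sequence[ASUnit]) -> float:
    """Average number of words per AS-Unit."""
    if not units:
        raise ValueError("no AS-Units given")
    return sum(len(unit.words) for unit in units) / len(units)


def dependent_clauses_per_as_unit(units: Sequence[ASUnit]) -> float:
    """Average number of dependent clauses per AS-Unit."""
    if not units:
        raise ValueError("no AS-Units given")
    return sum(unit.dependent_clauses for unit in units) / len(units)
```

Keeping the two functions separate mirrors the distinction drawn above between the breadth of a unit and its degree of embedding.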
Results for the syntactic analysis are more variegated than for other linguistic dimensions, with rather conspicuous variations across tasks. As regards the number of words per AS-Unit (Table 4 and Figure 5), the film retelling had the highest value (8.28), followed by the interview (7.12). The other three tasks elicited rather shorter units whose mean length ranged from 4.51 to 5.74 words.
The dependent clauses per AS-Unit ratio shows a similar picture, with even more marked differences (Table 5 and Figure 6): While in the film retelling about half of the AS-Units contained dependent clauses, and these were one out of three in the interview, the proportion drops to about 1/8 in the other tasks. Interestingly, the map task implies relatively long syntactic structures, but very little subordination: it seems that what is needed to perform it is to construct rich and detailed clauses describing the path and the landmarks, although it does not seem to be necessary to embed other clauses inside them.

Individual variation
The previous sections reported mean scores achieved by the ten participants across the five tasks. However, it is also important to look at inter-individual variation around these means, to assess whether it differed across tasks. Given that the measures came from different scales, with different value ranges, the coefficient of variation (CV: standard deviation/mean) was used to standardize fluctuations around the mean in order to make them comparable.
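The coefficient of variation can be computed as in the following sketch. The text does not specify whether the sample or the population standard deviation was used, so the sample flag is an assumption introduced here to make the choice explicit; the function name is likewise illustrative.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float], sample: bool = True) -> float:
    """Standard deviation divided by the mean (CV) for one measure in one task."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample SD (n - 1) by default; population SD if sample=False.
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return sd / mean
```

Because the result is a unit-free ratio, values obtained for measures on different scales, such as MATTR and words per AS-Unit, can be compared directly.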
What appears from Table 6 and Figure 7 is that CV values for lexical diversity and sophistication are rather low across participants and tasks, which means that all participants tended to behave similarly with regard to these dimensions. Variation in the use of morphological processes is slightly higher, especially in some tasks, like the map task, phone calls or film retelling, but still relatively modest. What appears to be highly variable across individuals are syntactic phenomena, which display, in all tasks, high and very high coefficients of variation for both mean length of AS-Unit and the number of dependent clauses per AS-Unit. Syntactic complexity thus seems to be more related to individual style, allowing for a rather wide range of inter-individual variation, while the lexicon and morphology are more related to task features and less subject to individual preferences.

Ranking tasks along different dimensions
A final question may be whether there are tasks with high or low levels of all or most dimensions of linguistic difficulty, so that they could be said to be more or less difficult in general, or whether different dimensions vary in a relatively independent manner, so that a task may score high in one and low in another, with a number of possible combinations. To answer this question, Table 7 sorts tasks in ascending order of difficulty according to the different dimensions considered.
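The sorting step behind such a table can be expressed as follows. This is a sketch only: the function name and the nested-dictionary layout (dimension, then task, then mean score) are assumptions made for illustration, and no scores from the study are reproduced here.

```python
from typing import Dict, List


def rank_tasks(scores: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """For each dimension, list the tasks in ascending order of mean score.

    scores[dimension][task] is the mean value of that measure for that task.
    """
    return {
        dimension: sorted(by_task, key=by_task.get)
        for dimension, by_task in scores.items()
    }
```

Comparing the resulting lists across dimensions shows at a glance whether a task keeps the same position everywhere or moves up and down from one dimension to another.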
Overall, the map task seems to be relatively easy on most dimensions: It does not require varied lexicon or morphology and it contains the lowest proportion of subordinate clauses. However, compared to other tasks, it elicited a relatively high proportion of infrequent words and its AS-Units were not among the shortest. The interview seemed to imply medium-high use of linguistically difficult structures on all dimensions, viz. lexicon, morphology and syntax, although in no cases did they reach the highest values. For other tasks, the picture is more varied. For instance, retelling the video clip implied on the one hand the highest levels of syntactic complexity, with long AS-Units containing a number of dependent clauses; on the other hand, the task could be performed with rather basic vocabulary and little lexical and morphological variation. Phone calls and negotiations offer quite an interesting picture. While both tasks required relatively simple syntactic constructions but a high degree of morphological variety, they sharply differed as regards the lexicon. Phone calls elicited a high number of infrequent words, but the lexicon was rather monotonous, as evidenced by the low MATTR-250 value. The opposite occurred with negotiations, where words were very varied (highest MATTR-250 score), but also very frequent (lowest proportion of uncommon words). This provides additional evidence for the claim that lexical diversity and lexical sophistication are indeed separate dimensions (Skehan, 2009). More generally, the various dimensions of potential linguistic difficulty investigated in this study seem to be rather independent of one another, so that a given task may require high levels in one but relatively low levels in another.

Discussion and conclusions
It has long been argued that task difficulty is a multidimensional construct, with many factors contributing to it (Brindley, 1987; Candlin, 1987; Ellis, 2018; Nunan, 1989; Robinson, 2001, 2015; Skehan, 1998, 2015). This article has proposed a procedure to empirically assess one of these dimensions, linguistic difficulty. Results show that this construct is in turn multidimensional and that its sub-components (lexicon, morphology, syntax) vary independently of one another, and there may even be variation in the same sub-domain, such as the lexicon, as evidenced by the different profiles of lexical diversity and sophistication. This implies that future research on task features and demands should take this multidimensionality as a starting point, by carefully manipulating difficulty dimensions one by one rather than pursuing a dichotomic view of tasks as being +/- difficult. Explicit, analytic and empirically grounded definitions of task linguistic difficulty are desirable for several reasons. First of all, they are necessary to continue, in a more principled way, research on the interactions between task difficulty and linguistic performance. Secondly, this line of investigation may contribute to Task-Based Language Teaching (TBLT), by offering more solid grounds to determine the linguistic and communicative demands of different tasks, which is a key aspect for syllabus progression (Baralt, Gilabert & Robinson, 2014). Indeed, it has been shown that teachers consider linguistic features one of the most relevant aspects in their evaluation of task difficulty (Révész & Gurzynski-Weiss, 2016). Finally, Task-Based Language Assessment (TBLA) is also concerned with establishing whether different tasks have comparable levels of potential linguistic difficulty, in order to ensure uniformity across multiple editions of the same test or to develop appropriate tasks for different proficiency levels (Elder et al., 2002).
The empirical study reported here was a pilot investigation with the main purpose of presenting an empirical approach to the assessment of task difficulty, and its results seem to be encouraging. It was possible to apply the proposed measures and procedures to the data, and analysis confirms the intuition that the selected tasks should have been rather uniform with respect to their linguistic difficulty. The main purpose of the VIP project on task-based language production was to elicit variation along interactional and sociolinguistic dimensions, while keeping linguistic aspects constant (Pallotti et al., 2011). Therefore, the relatively small range of variation found here for several linguistic dimensions should not be interpreted as a limitation of the approach, nor of the tasks used, but, on the contrary, as a validation of their choice. In other words, a procedure to empirically assess task difficulty can be employed not only to prove that tasks are different, but also that they are similar, or equivalent, at least in some respects.
Despite this overall similarity, the analysis shows that, among the tasks investigated, the map task is the easiest for most of the linguistic dimensions examined. Other tasks present a more complex picture, which suggests that different sub-dimensions of linguistic difficulty are relatively independent of one another. Telling a film (at least the film used in this project), for example, seems to require rather broad and complex syntactic constructions, but fairly basic vocabulary and morphology. Conversely, asking for information over the phone (again, as regards the phone calls used in this study) implies rather telegraphic syntax and repetitive vocabulary, but involves a rich range of morphological exponents and several low-frequency words. It is also worth noting that the map task, which was found here to be the one with the lowest levels of linguistic difficulty, proved to be the task with the highest interactional difficulty in Pallotti's (2019) study. All this confirms the idea that different dimensions of difficulty, linguistic or of other sorts, are relatively independent and can be manipulated autonomously. It is also possible, although this is just a hypothesis awaiting empirical verification, that trade-off effects may occur, so that as one dimension of difficulty increases, others tend to decrease, even in top language performers.
Finally, it is worth reflecting on inter-individual variability. Even in a relatively homogeneous population such as native speakers, not all individuals behave in the same way, as is to be expected (Andringa, 2014; Dąbrowska, 2019). The present study shows that this individual variability, which may be deemed stylistic, seems to be greater in syntax, as some participants tend to prefer broad and complex structures while others typically produce rather short and simple constructions. Variability is much more limited in the areas of vocabulary and morphology, which seem to be more directly linked to the nature of the task and less to individual preferences.
This study has limitations, too, and calls for further research. Observing top language performers, for instance, has the advantage of reducing the influence of a factor such as linguistic competence on task performance. However, the question remains to what extent learners are actually conditioned by a task's linguistic demands, so that their behaviour follows that of top language performers. This is the well-known distinction between task-as-workplan and task-in-process (Breen, 1987): a task may require, by its very nature, the use of a varied and sophisticated vocabulary, or a wide range of morphological processes, but in its concrete realization learners may resort to much simpler forms. In some cases, these more basic alternatives may still allow learners to achieve the task's goals, perhaps with some more effort and in a less efficient way. There might be other cases, however, in which these requirements are essential, so that the lack of linguistic or communicative skills may result in the impossibility of adequately performing the task. It would therefore be necessary to demonstrate the relationship between the use of certain linguistic behaviours and task success, taking into consideration functional adequacy among the criteria for assessing task performance (Kuiken & Vedder, 2018; Pallotti, 2009; Révész, Ekiert et al., 2016). Furthermore, future research should look at how closely learners' performance matches that of top language performers, and how L2 proficiency may systematically mediate this relationship.
Another outstanding issue is whether different dimensions of linguistic difficulty can be added together, in order to obtain a unitary index of linguistic difficulty, as Pallotti (2019) did for interactional difficulty. This would have clear advantages on a practical level but would require a careful examination of the construct validity of a highly multidimensional notion such as "linguistic difficulty".
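One possible way of adding dimensions together, offered here only as an illustrative sketch and not as the procedure used in this study or in Pallotti (2019), is to standardize each measure across tasks before averaging, since the measures are expressed on different scales. The function names, dimension labels, task labels and numeric values below are hypothetical placeholders, and the equal weighting of dimensions is an assumption that would itself need to be justified on construct-validity grounds.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of values (population standard deviation)."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # No variation across tasks: the dimension cannot discriminate them.
        return [0.0 for _ in values]
    return [(v - m) / sd for v in values]


def composite_index(scores_by_task):
    """Average the per-dimension z-scores of each task (equal weights assumed)."""
    tasks = list(scores_by_task)
    dimensions = list(next(iter(scores_by_task.values())))
    standardized = {task: [] for task in tasks}
    for dim in dimensions:
        column = [scores_by_task[task][dim] for task in tasks]
        for task, z in zip(tasks, z_scores(column)):
            standardized[task].append(z)
    return {task: mean(zs) for task, zs in standardized.items()}


# Hypothetical placeholder values, NOT data from the study.
example = {
    "task_a": {"lexical_diversity": 1.0, "morphological_complexity": 2.0, "syntactic_length": 3.0},
    "task_b": {"lexical_diversity": 2.0, "morphological_complexity": 4.0, "syntactic_length": 6.0},
    "task_c": {"lexical_diversity": 3.0, "morphological_complexity": 6.0, "syntactic_length": 9.0},
}

index = composite_index(example)
for task, value in index.items():
    print(task, round(value, 3))
```

An index of this kind hides exactly the information that the present results highlight: two tasks with opposite profiles (for instance, demanding syntax with basic vocabulary, and the reverse) could receive the same composite value.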
Finally, the study presented here needs to be replicated with larger samples and in different languages, observing a greater number of potentially difficult linguistic dimensions and employing a wider range of tasks. In particular, differences between oral and written productions should be explored and, for each modality, tasks should be controlled for register and genre. This may lead to the inclusion of other measures, for instance phrase length, that have been claimed to be more relevant for assessing syntactic variation in contexts such as academic writing (Biber, Gray & Poonpon, 2011; Ortega, 2012).
Despite these limitations, the present study can be seen as a first attempt at developing a principled, empirically based procedure to establish which tasks imply higher or lower levels of linguistic difficulty. It goes beyond current models that assume, on a theoretical level, that this dimension has an impact on global task difficulty, but do not indicate specific methodologies to quantitatively assess it. It also contributes to the debate on how task difficulty may be operationalized and measured, complementing and extending current endeavours based on different methodologies (e.g. subjective perceptions, raters' intuitions, dual task performance, as proposed by Révész, 2014; Révész, Michel et al., 2016; Sasayama, 2016). All these approaches, taken together, will provide a fuller picture of task demands and their potential effects on language performance by native and non-native speakers.

Note
1 Robinson (2001, 2011) calls difficulty only the challenges that a task poses to a specific individual, while he uses the term complexity to refer to both a task's structural features (e.g. number of elements) and the cognitive processes it requires of everyone (e.g. spatial or causal reasoning). This terminology, however, is confusing, as it employs different terms, difficulty and complexity, for similar constructs (challenges for a given person vs for everyone) and the same term, complexity, to refer to different constructs, such as a task's structural properties and the cognitive demands it makes on performers (Skehan, 2015).