1. Introduction

Language-learning aptitude has seen a resurgence of interest in recent years, with second language researchers increasingly turning towards aptitude as a factor in explaining individual differences (Wen, Biedroń & Skehan, 2017). Dörnyei and Skehan (2003) suggest a general working definition for aptitude: “there is a specific talent for learning foreign languages which exhibits considerable variation between individual learners” (Dörnyei & Skehan, 2003, p. 590). Beyond this definition, however, researchers disagree considerably about which components make up language-learning aptitude, although their accounts share many common elements. This has given rise to a number of different aptitude tests (e.g. MLAT (Carroll & Sapon, 1959); Pimsleur Aptitude Battery (Pimsleur, 1966); DLAB (Petersen & Al-Haik, 1976); CANAL-FT (Grigorenko, Sternberg & Ehrman, 2000); LLAMA (Meara, 2005); HiLAB (Linck et al., 2013)). These tests all have slightly different emphases in what they (claim to) measure and many are not currently available to researchers (see Skehan, 2016, for a fuller discussion).

This study focuses on the free, easily available LLAMA tests given their increasing popularity (over 700 citations on Google Scholar since 2013). Before explaining the LLAMA tests in detail, we briefly outline some of the areas investigated in terms of language-learning aptitude. As the tests have been used to address some of these questions, we then report on whether the tests are influenced by some of these factors themselves (e.g. age, bilingualism). We conclude by suggesting that norms are needed for instructed second language learners and also for different age groups.

2. Background on Language Aptitude

Language-learning aptitude research in SLA has generally been founded on the early work by Carroll and Sapon (1959). This approach to language-learning aptitude can be summed up by the following quote by Carroll (1990, p. 26):

The amount of time a student needs to learn a given task, unit of instruction, or curriculum to an acceptable criterion of mastery under optimal conditions of instruction and student motivation.

For Carroll, aptitude was a relatively stable, unchanging characteristic comprising four sub-components: phonemic coding ability, grammatical sensitivity, inductive language learning ability and associative memory (Carroll, 1973).2 This approach is epitomised in the Modern Language Aptitude Test (MLAT) (Carroll & Sapon, 1959). This concept of aptitude has been subject to criticism in terms of how memory is conceptualised (Wen, 2016), the role of implicit learning (De Graaff, 1997), the links with intelligence (Sasaki, 1999) and its stability over time (Kormos, 2013; Ganschow, Fluharty & Little, 1995). For a more in-depth look at the history of language learning aptitude research see Skehan (2016).

Li (2016) carried out a meta-analysis of 66 studies examining the construct validity of language learning aptitude. He concluded that aptitude was independent of other individual differences like motivation (contra Pimsleur, 1966) and classroom anxiety (indeed Sparks and Patton (2013) suggested that low aptitude may cause classroom anxiety). Li concluded that aptitude was a strong predictor of general proficiency but not of vocabulary learning or L2 writing, and that different test sub-components predicted different aspects of learning. This strongly supports a multi-component approach to aptitude.

In terms of memory, Li found that studies showed that executive working memory was more strongly associated with aptitude than phonological short-term memory. However, Linck et al. (2013) argued that phonological short-term memory was of relevance to advanced learners, suggesting that different aspects of memory or aptitude may be relevant at different ages (cf. Abrahamsson and Hyltenstam, 2008; Muñoz, 2014). In an attempt to bring together these different aspects of (working/short-term) memory, Wen (2016) has proposed the “Integrated Approach” in which phonological working memory is a “language learning device” and executive working memory is involved with “language processes” (Wen, 2016, p. 147). This links different types of working memory to different types of aptitude. This approach is along similar lines to Granena (2016), who argued that different types of aptitude are linked to different cognitive styles.

In addition to the general findings regarding aptitude that arise from Li’s (2016) meta-analysis, the range of aptitude tests and the variety of assumptions they make about the concept or construct of aptitude is clearly evident. One of these tests is the LLAMA test battery developed by Meara (2005). This test has gained in popularity as it is free, quick to administer and easily available, yet it has not been standardised or validated. In the following sections we outline some of the history and details of the LLAMA tests before turning to some practical, empirical questions that might influence LLAMA test results.3

3. LLAMA Aptitude Tests

The LLAMA tests were initially developed as part of a research training program for MA students at Swansea University. They are loosely based on the components that appear in Carroll & Sapon’s (1959) Modern Language Aptitude Test (MLAT) but the aim was to take advantage of developments in technology at the time to develop an easier, more appealing user interface.

The 2005 version of the LLAMA tests described in this paper consists of four sub-tests, conventionally referred to as LLAMA_B, LLAMA_D, LLAMA_E and LLAMA_F.

LLAMA_B is the vocabulary learning module of the LLAMA tests. It assesses the users’ ability to attach unfamiliar names to unfamiliar objects. Carroll and Sapon’s tests assess this ability by asking test-takers to remember a set of paired associates – in the English version of the MLAT, this involves English words paired with Kurdish words. One obvious disadvantage of this approach is that it assumes the English native speaker test-taker is unfamiliar with Kurdish. Moreover, multiple versions of the test are required for different first languages, which adds variables when comparing classes with multiple L1s. LLAMA_B solves this problem by presenting test-takers with a set of pictures that do not have obvious names, but can easily be described in any language. This approach breaks away from the paired-associate format, and it allows test-takers a lot of flexibility in the way they approach the vocabulary learning task. Figure 1 shows an example of the type of stimulus used for this task.

Figure 1 

Five of the pictures used in LLAMA_B.

There are twenty of these figures in the LLAMA_B test, all displayed simultaneously on screen. Clicking on an object causes its name to be displayed. Test-takers have two minutes to examine all 20 objects and learn their names. The program places no constraints on how they do this, so test-takers can adopt a number of different strategies to complete the task. At the end of the learning phase, LLAMA_B moves to a testing phase: the program displays the name of each of the 20 objects, and test-takers have to identify the object by clicking on it. Five points are scored for each object correctly identified, and there is no correction for guessing. This means that LLAMA_B scores range from 0 to 100, and the expected score for random guessing is 5 points.4
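The scoring arithmetic for LLAMA_B can be illustrated with a minimal sketch. The function names are hypothetical and are not taken from the LLAMA software; the chance calculation assumes that each guess is an independent, uniform choice among the 20 objects.

```python
from fractions import Fraction

N_ITEMS = 20          # objects displayed in LLAMA_B (from the text)
POINTS_PER_ITEM = 5   # points per correctly identified object (from the text)


def llama_b_score(n_correct: int) -> int:
    """Hypothetical helper: raw LLAMA_B score with no correction for guessing."""
    if not 0 <= n_correct <= N_ITEMS:
        raise ValueError("n_correct must lie between 0 and 20")
    return n_correct * POINTS_PER_ITEM


def expected_chance_score() -> Fraction:
    """Expected score if every response is a uniform random guess.

    Assumption: each guess hits the target with probability 1/20, so the
    expected number of correct responses over 20 items is exactly 1.
    """
    p_correct = Fraction(1, N_ITEMS)
    return N_ITEMS * p_correct * POINTS_PER_ITEM


print(llama_b_score(0), llama_b_score(N_ITEMS))  # minimum and maximum scores
print(expected_chance_score())                   # expected score under guessing
```

Under these assumptions a pure guesser expects one correct identification, which corresponds to the five points stated above.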

LLAMA_D is a new test that does not appear in MLAT. It is based on a suggestion that a core skill in language learning is the ability to recognise repeated sounds in spoken language: basically, a learner who is able to recognise repeated stretches of sound is more likely to notice small variations in speech, and this makes it easier for them to isolate the individual words and variants of these words that signal morphology. To this extent, it can be considered a measure of implicit learning. The test works in two phases. In Phase 1 the test-taker hears a series of short sound clips in an unfamiliar language. In Phase 2 the test-taker hears another set of sound clips. Some of these are new, but some are repeated from Phase 1. For each clip, the test-taker has to indicate whether they have heard it in Phase 1 before or not. Five points are scored for each correct answer, and test-takers are penalised for guessing. The entire test takes about five minutes. It generally gets positive comments from users but appears to be very hard in that very few test-takers score highly on it.5

LLAMA_E is an adaptation of MLAT’s sound-symbol correspondence task. The test interface consists of a series of 24 labelled buttons in a Roman alphabet, but one that uses these familiar symbols in an unfamiliar way (see Figure 2). Clicking a button plays the syllable that is represented by the label. Test-takers have two minutes to explore this interface. The programme then moves to its test phase. In this phase, test-takers hear a complex two-syllable “word” and have to decide which of two spellings is correct. Five points are scored for each correct answer, and five points are deducted for an incorrect answer.

Figure 2 

The syllabary used in LLAMA_E.

LLAMA_F is a grammar inferencing test. The presentation phase of the program shows the test-taker a series of pictures depicting shapes and objects, and a short sentence in an artificial language which describes each picture. An example is shown in Figure 3. The test-taker is expected to work out how the descriptions relate to the pictures. From this, they should be able to intuit some of the grammatical and morphological features of the language: word order, gender, singular, dual and plural numbers, conjugating prepositions, and so on. Test-takers have five minutes to explore this data set.6 Then they are presented with a new set of pictures that incorporate new elements. Each picture is accompanied by two sentences which might describe it, and test-takers indicate which is the correct description. They should be able to do this if they have internalised the grammatical rules evidenced in the presentation phase. Five points are awarded for a correct answer and five points deducted for an incorrect choice.7

Figure 3 

An example of the stimuli used in the presentation phase. The pictures are designed to highlight key grammatical features.
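The scoring rule shared by LLAMA_E and LLAMA_F (five points for a correct answer, five deducted for an incorrect one) can be sketched as follows. The number of items in each test is not stated above, so it is left as a parameter, and the function names are hypothetical. Whether the programs floor negative totals at zero is also not stated, so no floor is applied here.

```python
def signed_score(n_correct: int, n_incorrect: int, points: int = 5) -> int:
    """Hypothetical helper: +points per correct answer, -points per incorrect one."""
    if n_correct < 0 or n_incorrect < 0:
        raise ValueError("counts must be non-negative")
    return points * (n_correct - n_incorrect)


def expected_guess_score(n_items: int, p_correct: float = 0.5, points: int = 5) -> float:
    """Expected total when every item is answered by guessing.

    Assumption: two response options per item, so p_correct defaults to 0.5.
    """
    return n_items * points * (p_correct - (1.0 - p_correct))


# With an illustrative 20 items: 15 correct and 5 incorrect answers.
print(signed_score(15, 5))
# Gains and losses cancel on average for a pure guesser.
print(expected_guess_score(20))
```

Because each item offers two alternatives, the symmetric penalty means that random guessing has an expected score of zero, in contrast to LLAMA_B, where no correction is applied.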

4. Methodology

4.1. Research questions and hypotheses

This study arose out of limitations from our previous study investigating the factors which might influence LLAMA test performance (Rogers et al., 2016). In that study, we tested 229 participants aged 10–75 from a range of typologically distinct L1s, with various education levels. We found no effect for L1 or gender, but we did find an effect of education level on three subcomponents of the LLAMA tests (B, E & F). The younger participants were outperformed by the adults on LLAMA_E (sound-symbol correspondence). These findings provided an important early step in validation of the LLAMA tests but raised a number of issues, particularly in terms of their suitability for use with younger participants and the role of education in test results. However, many of these sub-groups had low participant numbers thus limiting their generalisability. These limitations led us to address the following questions.

  1. Are the LLAMA tests language neutral?
  2. What is the effect of monolingualism, early bilingualism and instruction in a L2 on LLAMA test scores?
  3. What is the effect of age on LLAMA test scores?
  4. How much variance do key background factors (e.g. age, gender, L1, L2 status) account for in the LLAMA test results?

As previously noted, one of the reasons for developing the LLAMA tests was to enable them to be used with a range of L1s. However, two of the tests, LLAMA_B and LLAMA_F, use words written in a Roman script. LLAMA_E also has letters and numbers from a Roman script. This led us to the first research question. Several studies suggest the degree of distance between an L1 and an L2 plays a fundamental role in word processing and retention in an L2 (Gholamain & Geva, 1999; Green & Meara, 1987; Hamada & Koda, 2008). If the language script of the L1 can influence the acquisition of the L2, then the question arises whether the L1 script of the learner influences their aptitude scores.

In Rogers et al. (2016), we looked at this question but had a small sample size (n = 14) and grouped Arabic and Chinese native speakers together as a non-Roman script group. This was less than satisfactory due to the differences between Chinese as a morphosyllabic (Tolchinsky, Levin, Aram & McBride-Chang, 2012, p. 1598) or logographic (Crystal, 1987, p. 200) script and Arabic, which as a consonant alphabetic script shares a common Semitic ancestor with Roman scripts (Sampson, 1985, p. 77). This current study addresses this limitation by comparing these two groups to each other and to L1 English participants. This enables us to formulate two hypotheses. The first is that the L1 English group will outperform the other two groups in LLAMA_B and LLAMA_F, as having the words in a Roman script will increase the processing load for the other two groups (Tan et al., 2003). Our second hypothesis is that the Arabic group will outperform the Chinese group as Arabic is an alphabetic system. We may also see an effect for the Chinese group on LLAMA_E, which contains a combination of words and letters: Akamatsu (1999) found that manipulating the way words were presented (use of capital letters, etc.) affected ESL speakers of L1 logographic languages more than other ESL learners.

Our second research question asks if having a second language or being bilingual would account for differences in LLAMA test performance and is motivated by previous research suggesting that aptitude can be trained (e.g. Grigorenko et al., 2000; McLaughlin, 1990; Sternberg & Grigorenko, 2002) or changed due to experience (e.g. Hyltenstam, 2016; Kormos, 2013; Safar & Kormos, 2008; Sawyer, 1992; Sparks et al., 1995; Thompson, 2013). Our previous study did not find any significant differences in a post-hoc analysis of reported language experience. However, this did not take into account the level of language proficiency. This study specifically targets bilinguals (two L1s before age five) and instructed L2 learners in comparison with monolinguals. Our first hypothesis is that following Sparks et al. (1995), the instructed L2 group will outperform the other groups on the explicit measures (LLAMA_B, LLAMA_E and LLAMA_F) as they will have developed strategies for learning vocabulary (LLAMA_B) and grammar or pattern detection (LLAMA_E & LLAMA_F).8 We do not expect a difference between the bilinguals and the monolinguals, as they will not have been instructed in any language-learning strategies. Our second hypothesis is that due to purported bilingual cognitive advantage effects (Bialystok, Luk & Kwan, 2005; Kaushanskaya & Marian, 2009), the bilingual participants will outperform the monolingual group due to their greater language awareness.

Our third research question arose as the question of aptitude and age of onset has been contested in the literature (e.g. Abrahamsson & Hyltenstam, 2008; Muñoz, 2014). Although the LLAMA tests were not originally designed for use with children, it seems appropriate to investigate the use of these tests with younger populations. While many researchers have used aptitude tests retrospectively, i.e. tested adults who started learning another language at a young age, there has also been a trend to test younger participants to determine how their aptitude can predict subsequent language results. This conflates the age of onset with the age of testing.9 Here we focus on the issue of age at testing to establish if the LLAMA tests can be used with younger learners in the same way as with adolescents and adults. To address this third research question, we examined the test results from the vocabulary test (LLAMA_B) and the sound recognition measure (LLAMA_D) from three groups of learners. Group 1 comprises 10–11-year-olds, Group 2 comprises 20–21-year-olds and Group 3 comprises adults over 30 years of age.10 These age groups were chosen to examine a range of ages, including both younger and older participants who are long past any possible critical period and are cognitively mature. The decision to concentrate on these two tests was both principled and practical. On principle, we wished to investigate both an area in which these tests might reveal age differences, as predicted by cognitive development or critical/sensitive period views (Bley-Vroman, 1989, 2009; Patkowski, 1980), and an area in which we would not expect age-related differences. Practically, we were constrained by the amount of time available with each child and so could not administer the whole LLAMA test battery.

We make two contrasting hypotheses for LLAMA_B (vocabulary learning). Our first hypothesis is that we would not see any differences between the groups as we continue to learn new words throughout our lives and so cognitive development or critical-period effects should not be evident. Our second hypothesis follows work by Miralpeix (2006, 2009) that older participants (over 11) would outperform the younger learners due to their increased cognitive advantages and maturity.11 Our third hypothesis relates to LLAMA_D. As it is purported to be a measure of implicit learning (Granena, 2013) and if younger learners are claimed to make greater use of implicit learning12 in comparison to adult learners (DeKeyser, 2000), then we would predict that the younger learners would outperform the older learners.

For the fourth research question, we have combined these results with Rogers et al. (2016) as the data were collected under similar conditions, with similar background questionnaires giving a total of 404 participants. This allows us to carry out a more powerful statistical analysis to consider the effects of various individual background variables on LLAMA test performance. These variables are age, gender, L1 script, L2 status, highest education qualification and whether or not the participant regularly plays logic puzzles. The rationale for the inclusion of these variables will be discussed in more detail in the results section.

4.2. Tasks and administration

The four sub-components were administered to all participants over the age of 18. They were administered on Windows computers, either on an individual basis or in larger computer classrooms. The latter were drop-in sessions advertised to the students at a UK university. Tests were scored automatically by the programs (as outlined above) and the results noted on a piece of paper.

Participants also took a background questionnaire. This was computer based using our university’s Lime Survey software. Participants were given a URL and asked to give the same name that was on their LLAMA results sheet to allow for subsequent matching. Unfortunately, not all participants did so and their data were discarded. The questionnaire software automatically coded the results, which were imported into SPSS v. 20 for analysis.

Before participants took the LLAMA tests or completed the questionnaire, they were briefed on the nature of the research project,13 given an information sheet and asked to complete a consent form. For participants under the age of 18, a parental consent form was given with the information sheet. These had to be returned before data collection from the children could be carried out. A simplified background questionnaire was given to the children under 18 in paper format. For the 10–11-year-olds, data-collection time was restricted to a maximum of 30 minutes, so as not to place an undue burden on the children.

4.3. Participants

Participants were recruited either through university-wide emails and posters or individually by some of the research team. No participants were paid for their time; they therefore represent a generally opportunistic sample.

Data were collected from a total of 240 participants (128 female and 112 male). Participants’ ages ranged from 10 to 75, but 148 were aged 18–24, with a total of 211 over the age of 18. This was due to the majority of data collection taking place with undergraduate students as outlined above.

Of the participants over 18, we also had a range of educational backgrounds: 14 had left school at the end of compulsory education (aged 16), 112 had obtained qualifications at age 18 (A-level or equivalent), 70 already had an undergraduate degree and 13 had postgraduate qualifications. Again from the participants over 18, we had a range of prior language experience: 142 participants had learnt another language in school, 46 were monolingual English native speakers and 23 were bilingual speakers. Bilingualism was self-reported but defined as having acquired both languages before the age of five. In addition to English native speakers (n = 136), we had 56 L1 Chinese, 32 L1 Arabic and fewer than five each of German, Japanese, Welsh, Greek and Polish.

5. Results and Discussion

In this section, we present the results relating to each research question in turn before discussing them in terms of the hypotheses outlined previously.

5.1. Research question 1: language neutrality

The first question examined the role of L1 script in LLAMA tests results and whether they could be considered language neutral. To investigate this, we examined three groups: L1 Arabic speakers, L1 Chinese speakers and L1 English speakers as shown in Table 1. All participants over age 18 took all four tests, giving a total of 195 participants.

Table 1

Results of LLAMA tests according to L1 script.


Group              Statistic   LLAMA_B   LLAMA_D   LLAMA_E   LLAMA_F
English (n = 107)  M           45.28     27.94     68.32     36.40
                   s.d.        21.608    16.653    29.065    24.618
Chinese (n = 56)   M           55.89     31.16     56.34     46.96
                   s.d.        27.288    24.458    28.034    25.984
Arabic (n = 32)    M           53.75     34.38     62.19     49.06
                   s.d.        24.163    15.748    25.207    24.933

The results of a one-way between groups ANOVA show that there were significant differences between the groups for all of the LLAMA sub-components except LLAMA_D: LLAMA_B F (2, 192) = 4.212, p = 0.016; LLAMA_E F (2, 192) = 3.389, p = 0.036 and LLAMA_F F (2, 192) = 5.000, p = 0.008 but not for LLAMA_D F (2, 192) = 1.563, p = 0.212.
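For readers who wish to check such figures, a one-way between-groups F statistic can be recovered from group sizes, means and standard deviations alone. The sketch below is not the authors' SPSS procedure; it assumes that the reported standard deviations are sample standard deviations (n − 1 denominator), and the function name is hypothetical. It is applied to the LLAMA_B values reported in Table 1.

```python
from math import fsum


def anova_from_summary(groups):
    """One-way ANOVA F from summary statistics.

    groups: list of (n, mean, sd) tuples, one per group.
    Assumption: sd is the sample standard deviation (n - 1 denominator).
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = fsum(n * m for n, m, _ in groups) / n_total
    # Between-groups sum of squares: weighted squared deviations of group means.
    ss_between = fsum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares: pooled from each group's variance.
    ss_within = fsum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# LLAMA_B by L1 group (English, Chinese, Arabic), values from Table 1.
llama_b = [(107, 45.28, 21.608), (56, 55.89, 27.288), (32, 53.75, 24.163)]
f_stat, df1, df2 = anova_from_summary(llama_b)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

Given the rounding of the published means and standard deviations, the result should agree closely with, though not necessarily match to the last decimal place, the value F (2, 192) = 4.212 reported above. Obtaining the p value additionally requires the F distribution, which is omitted here to keep the sketch dependency-free.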

Results from a post-hoc Games-Howell test (unequal variances) showed that for LLAMA_B, the L1 Chinese group (M = 55.89, s.d. = 27.288) significantly outperformed (p = 0.035) the L1 English group (M = 45.28, s.d. = 21.608). There were no significant differences between the L1 Arabic and either L1 Chinese group or L1 English group. For LLAMA_E, again there was a significant difference between the L1 Chinese and L1 English groups (p = 0.032) but this time the L1 English group (M = 68.32, s.d. = 29.065) outperformed the L1 Chinese group (M = 56.34, s.d. = 28.034). Again there were no significant differences between the L1 Arabic and either L1 Chinese group or L1 English group. For LLAMA_F, there were significant differences between the L1 English group (M = 36.40, s.d. = 24.618) and both the L1 Chinese group (M = 46.96, s.d. = 25.984) and the L1 Arabic group (M = 49.06, s.d. = 24.933). The L1 English group performed significantly worse than both the L1 Chinese (p = 0.036) and L1 Arabic (p = 0.038) groups. There was no significant difference between the L1 Chinese and L1 Arabic groups.
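The Games-Howell procedure compares each pair of groups without assuming equal variances: it uses a Welch-type standard error and Welch-Satterthwaite degrees of freedom, and refers the resulting statistic to the studentized range distribution. The following sketch computes only the pairwise statistic and degrees of freedom from summary data. The function name is hypothetical, the p value is deliberately not computed, and this is not a reproduction of the SPSS output.

```python
from math import sqrt


def games_howell_pair(n1, m1, sd1, n2, m2, sd2):
    """Pairwise Games-Howell quantities from summary statistics.

    Returns (t, df, q): the Welch t statistic, the Welch-Satterthwaite
    degrees of freedom, and the studentized range statistic q = |t| * sqrt(2).
    """
    v1 = sd1 ** 2 / n1   # squared standard error contribution of group 1
    v2 = sd2 ** 2 / n2   # squared standard error contribution of group 2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    q = abs(t) * sqrt(2)
    return t, df, q


# LLAMA_B: L1 Chinese (n = 56) versus L1 English (n = 107), values from Table 1.
t, df, q = games_howell_pair(56, 55.89, 27.288, 107, 45.28, 21.608)
print(f"t = {t:.2f}, df = {df:.1f}, q = {q:.2f}")
```

The p value for a given comparison would then be obtained from the studentized range distribution with the number of groups (here three) and the computed degrees of freedom.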

We were concerned that perhaps the three groups were not comparable as many of the L1 English group were monolinguals. Table 2 shows the results of the participants over 18 who reported having studied another language. As Table 2 shows, this reduces the L1 English group from n = 107 to n = 48. It also reduces the L1 Arabic group to 30, as two participants were bilingual with English and had not studied another language.

Table 2

Results of LLAMA tests according to L1 script for L2 learners.


Group              Statistic   LLAMA_B   LLAMA_D   LLAMA_E   LLAMA_F
English (n = 48)   M           52.40     28.33     69.90     42.19
                   s.d.        20.499    15.890    29.867    27.789
Chinese (n = 56)   M           55.89     31.16     56.34     46.96
                   s.d.        27.288    24.458    28.034    25.984
Arabic (n = 30)    M           54.17     34.83     62.33     49.00
                   s.d.        24.917    15.838    24.835    25.643

The results of a one-way between groups ANOVA for these L2 groups show that there were no significant differences on any of the LLAMA sub-components: LLAMA_B F (2, 131) = 0.263, p = 0.769; LLAMA_D F (2, 131) = 0.986, p = 0.376; LLAMA_E F (2, 131) = 3.021, p = 0.052; LLAMA_F F (2, 131) = 0.714, p = 0.492. While none of these results show overall significant differences, the results for LLAMA_E approach significance. This is due to differences between the L1 Chinese group (M = 56.34, s.d. = 28.034), who scored lower than the L1 English group (M = 69.90, s.d. = 29.867). This is in line with the findings of Akamatsu (1999) regarding the extra difficulties faced by speakers of logographic languages (like Chinese) when Roman alphabet text is manipulated.14

In terms of our hypotheses for this question, our first hypothesis predicted that the L1 English group (Roman script) would outperform the other groups. As shown in both Tables 1 and 2, this is not the case regardless of whether the role of language instruction experience is considered or not. Our second hypothesis suggested that L1 Arabic participants would outperform the L1 Chinese group as Arabic is a consonant alphabetic language. This hypothesis was also not supported by the data; there were no differences between the groups in Table 2. This suggests that the LLAMA tests are indeed language neutral as there were no differences between groups once other factors (e.g. L2 instruction) were controlled for. This result follows Granena (2013), who also found no difference between her 187 Chinese, English and Spanish subjects. If the LLAMA tests can be used across participants of different language backgrounds and language pairings, as these results suggest, then this opens up aptitude testing to a much wider audience. Most of the existing aptitude tests are designed for homogeneous groups and require multiple versions for different L1s (e.g. MLAT).

5.2. Research question 2: L2 status

The second research question asked if bilingualism, monolingualism or instructed second language learning would impact on LLAMA scores. We divided the participants into three groups based on their answers in our background questionnaire; monolinguals, bilinguals (prior to age five) and instructed L2 learners. We compared the results of participants over the age of 18 who had completed all four of the LLAMA tests’ sub-components (n = 211). The results are given in Table 3.

Table 3

Results of LLAMA tests according to L2 status.


Group                 Statistic   LLAMA_B   LLAMA_D   LLAMA_E   LLAMA_F
L2-er (n = 142)       M           53.24     30.85     63.31     45.25
                      s.d.        24.234    19.902    28.434    27.310
Monolingual (n = 46)  M           39.57     25.65     65.11     31.20
                      s.d.        20.759    17.720    28.800    20.033
Bilingual (n = 23)    M           42.39     32.83     66.52     38.26
                      s.d.        22.303    14.834    30.243    25.876

The results of a one-way between groups ANOVA for these L2 status groups show that there were significant differences on two of the LLAMA sub-components: LLAMA_B F (2, 208) = 7.032, p = 0.001 and LLAMA_F F (2, 208) = 5.366, p = 0.005 but not for LLAMA_D F (2, 208) = 1.604, p = 0.204 or LLAMA_E F (2, 208) = 0.164, p = 0.849. Post-hoc Games-Howell (unequal variances) tests showed that for LLAMA_B the L2-er group (M = 53.24, s.d. = 24.234) significantly outperformed (p = 0.001) the monolingual group (M = 39.57, s.d. = 20.759). There were no significant differences between the bilingual group (M = 42.39, s.d. = 22.303) and either of the other groups. For LLAMA_F the pattern is the same: the L2-er group (M = 45.25, s.d. = 27.310) significantly outperformed (p = 0.001) the monolingual group (M = 31.20, s.d. = 20.033). Again there were no significant differences between the bilingual group (M = 38.26, s.d. = 25.876) and either of the other two groups.

Earlier, following Sparks et al. (1995), we hypothesised that the instructed L2 group would outperform the other two groups on explicit measures LLAMA_B, LLAMA_E and LLAMA_F, as they would have developed strategies for learning vocabulary and grammar/pattern recognition. This hypothesis was partially confirmed. There were overall effects of group on both LLAMA_B and LLAMA_F, but significant differences were only found between the instructed L2 group and the monolinguals – not the bilinguals. The instructed L2 group did outperform the bilinguals in both LLAMA_B and LLAMA_F, but this did not reach significance.

Our second hypothesis suggested that if bilinguals have a cognitive advantage, then we would expect them to outperform the monolingual group. This hypothesis was not confirmed statistically; there were no significant differences between the bilinguals and the monolinguals. However, the bilinguals did perform better than the monolinguals.

Granena (2013) found that LLAMA_B, E and F all weighted on the same component and suggested that these measured more explicit aspects of language-learning aptitude. In this respect, it is perhaps not surprising the instructed L2 learners perform best on these measures, as learning vocabulary and grammar rules are core elements of much L2 classroom instruction. To this extent, the idea of a training effect in aptitude testing (Grigorenko et al., 2000; Kormos, 2013) is perhaps not a surprise. However, whether this suggests that aptitude itself is trainable or whether it is test performance that is affected remains an open question and one that would be difficult to empirically address. Nayak, Hansen, Krueger and McLaughlin (1990) suggest that multilingual learners are more adept at using strategies in taking the tests rather than being more successful overall, and this may be the case with our participants as well.15

This question of training, however, does lead to certain methodological consequences. It appears that irrespective of whether you regard aptitude as stable or trainable, the LLAMA tests seem to be influenced by prior experience or training (instruction). This leads us to suggest the caveat that when using the LLAMA tests (particularly B and F), researchers should be aware of the language-learning background of their participants. By this we mean that where a sample mixes participants with no prior L2 instruction experience (L2-ers) and those who have had instruction (L3-ers), we would anticipate that the learners with prior instruction would outperform the others. Their results therefore cannot be taken as a whole or compared to each other as a single measure, particularly in high stakes situations.

5.3. Research question 3: age

The third research question considered the effect of age on LLAMA scores. We used two of the LLAMA sub-tests (vocabulary and sound recognition) with three different age groups: Group 1 aged 10–11, Group 2 aged 20–21 and Group 3 aged 30–70. We also matched for gender. The results are given in Table 4.

Table 4

Results of LLAMA tests according to age.


Group                    Statistic   LLAMA_B   LLAMA_D
Group 1: 10–11 (n = 30)  M           28.67     18.50
                         s.d.        14.920    13.528
Group 2: 20–21 (n = 44)  M           45.68     29.32
                         s.d.        21.529    17.206
Group 3: 30–70 (n = 30)  M           44.33     24.50
                         s.d.        24.380    17.536

The results of a one-way between groups ANOVA for these age groups show that there were significant differences on both of the LLAMA sub-components tested: LLAMA_B F (2, 101) = 6.741, p = 0.002 and LLAMA_D F (2, 101) = 3.919, p = 0.023. Post-hoc Games-Howell (unequal variances) tests showed that for LLAMA_B Group 1 (aged 10–11, M = 28.67, s.d. = 14.920) performed worse than both Group 2 (aged 20–21, M = 45.68, s.d. = 21.529, p < 0.001) and Group 3 (aged 30–70, M = 44.33, s.d. = 24.380, p = 0.012). There were no significant differences between Group 2 and Group 3 for LLAMA_B. For LLAMA_D again Group 1 (aged 10–11, M = 18.50, s.d. = 13.528) performed significantly worse than Group 2 (aged 20–21, M = 29.32, s.d. = 17.206, p = 0.010). There were no significant differences between Group 3 (aged 30–70, M = 24.50, s.d. = 17.536) and either of the other groups.

Our first hypothesis was that we would not see any differences between the age groups for LLAMA_B because vocabulary is a skill thought to be relatively independent of critical or sensitive period effects (Milton, 2009). This hypothesis was disconfirmed; the youngest group performed significantly worse than the two older groups. Our alternate hypothesis for LLAMA_B was that the older participants would outperform the younger ones (Miralpeix, 2006, 2009) due to their superior cognitive abilities, and this hypothesis was confirmed. Our third hypothesis was that the younger group (10–11-year-olds) would outperform the older groups on LLAMA_D because this taps into implicit learning processes (Granena, 2013, 2016; Skehan, 2016), which may be subject to critical period effects. The results disconfirmed this hypothesis as well; the younger learners (10–11-year-olds) performed significantly worse than the 20–21-year-olds.

Overall the younger learners scored lower on both tests. We therefore advise caution when using the LLAMA tests with children. Separate norms may be required for younger age groups. Alternatively, we may have to conclude that the current LLAMA tests are not suitable for use with younger learners. This would be particularly relevant for researchers investigating the role of aptitude in different age groups, as purported differences between younger versus older learners in the relevance of aptitude to their learning situation may be artefacts of the test rather than any comment on aptitude itself (cf. Abrahamsson & Hyltenstam, 2008). Further investigation with larger groups across the whole LLAMA test battery would be required to fully address this. It should be noted that while the 10–11-year-olds did not report any problems in actually taking these two tests,16 these tests are based on the original MLAT tests (Carroll & Sapon, 1959), and alternate versions of the MLAT for younger learners have since been specifically developed.

5.4. Research question 4: individual differences

For this final research question, we combined the results from this study with Rogers et al. (2016) to examine how much of the variance in LLAMA test scores can be accounted for by six individual background variables. These variables are the three we have examined so far – L1 script (language neutrality), L2 status and age plus three other variables – highest formal education qualification, gender and logic puzzles. In total we tested 404 participants, although the 10–11-year-olds did not take the whole test battery. We included these additional variables to examine whether the tests were influenced by formal education or by logic training (e.g. playing chess or sudoku) because previous research into aptitude has suggested links between IQ and MLAT scores (Sasaki, 1999; Wesche, 1981).17

The multiple regression results from all 404 participants show that for LLAMA_B (vocabulary), these six variables accounted for 9.1% of the variance but only L2 status (i.e. whether the participant was bilingual, monolingual or had received L2 instruction) reached significance (β = –0.250, p < 0.05) and contributed 6.0% to the overall variance.

In total, 375 participants took the LLAMA_D test of implicit learning. The multiple regression results show that together these six factors accounted for 4.8% of the overall variance. In terms of the individual factors, L2 status and gender both reached significance. L2 status contributed 1.8% to the overall variance (β = 0.136, p = 0.012). Gender contributed 1.3% to the overall variance (β = 0.116, p = 0.030).

LLAMA_E is the measure of sound-symbol correspondence, and 370 participants took this test. The multiple regression shows that the six factors account for 3.4% of the overall variance. Only the playing of logic games reached significance (β = 0.152, p = 0.004) and contributed 2.3% to the overall variance.

Finally, 346 participants took LLAMA_F, the grammatical inferencing measure. Overall, the multiple regression shows that the six factors accounted for 6.6% of the overall variance with two individual factors reaching significance. These were L2 status and L1 script. L2 status contributed 2.6% to the overall variance (β = –0.165, p = 0.002) and L1 script accounted for 1.3% of the total variance (β = 0.114, p = 0.036).
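The text does not state how each variable's contribution to the overall variance was calculated. One plausible reading, offered here strictly as an assumption, is that the contributions are squared semipartial correlations, which are approximately equal to β² when the predictors are close to uncorrelated. The sketch below compares β² with the reported percentages for the significant predictors whose exact β is given; the names `reported` and `beta_squared_pct` are hypothetical.

```python
# (test, predictor, standardized beta, reported % of variance), as given in
# the regression results above. Only significant predictors are listed.
reported = [
    ("LLAMA_B", "L2 status", -0.250, 6.0),
    ("LLAMA_D", "L2 status", 0.136, 1.8),
    ("LLAMA_D", "gender", 0.116, 1.3),
    ("LLAMA_E", "logic games", 0.152, 2.3),
    ("LLAMA_F", "L2 status", -0.165, 2.6),
    ("LLAMA_F", "L1 script", 0.114, 1.3),
]


def beta_squared_pct(beta):
    """Squared standardized coefficient expressed as a percentage.

    ASSUMPTION: this approximates a squared semipartial correlation only
    when the predictors are nearly uncorrelated with one another.
    """
    return 100 * beta ** 2


comparison = [
    (test, predictor, round(beta_squared_pct(beta), 2), pct)
    for test, predictor, beta, pct in reported
]
for test, predictor, approx, pct in comparison:
    print(f"{test} / {predictor}: beta^2 = {approx}% vs reported {pct}%")
```

Under this assumption, the approximated and reported values differ by well under half a percentage point in every case, which is consistent with low correlations among the six background variables, although it does not confirm the method actually used.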

Overall, the results of the multiple regression analysis suggest that the LLAMA tests can generally be used across different L1s, with male and female participants of differing education levels and with different ages, as these do not consistently affect the overall variance in LLAMA scores. The only consistent finding is that prior instruction in a second language can account for significant amounts of variance in LLAMA_B (6.0%) and LLAMA_F (2.6%). This suggests that the LLAMA tests are robust and not subject to significant external factors or individual variables that would influence their results, although we make no claims regarding how well they measure aptitude (however defined).

6. Conclusions and Future Research

Overall, using a large sample, we have shown that the LLAMA aptitude tests are robust, as they are not subject to external individual differences. Our results confirm previous studies by Granena (2013) and Rogers et al. (2016). Additionally, we have identified two possible limitations of the tests: their use with younger children and their use in mixed L2/L3 groups. This study represents a significant step in the ongoing validation of the LLAMA tests, as we have recruited a large number of participants and provided a thorough examination of the tests in terms of targeted individual differences that could affect test performance.18 However, the LLAMA tests still need to be validated in terms of their ability to predict language learning. Skehan (2016) highlights the changing role in aptitude validation work from the macro (large-scale) predictive studies to the more micro studies looking at language-learning processes. Within this latter framework, Skehan (2016) links aptitude to stages of acquisition and Wen (2016) considers whether working memory is the key component in language-learning aptitude. Our scrutiny of the LLAMA tests establishes a strong platform from which to conduct a large-scale macro validation study for the LLAMA tests, to put them on a level playing field with the other tests (e.g. MLAT) and to provide crucial norming data. But as our knowledge of the interaction between different components of aptitude grows, we will also need to consider how the LLAMA tests interact with other areas of intelligence and memory (Sasaki, 1999; Wen et al., 2017).

As one of the few free aptitude tests available to researchers, the LLAMA tests continue to grow in popularity, judging by the number of citations on Google Scholar (over 700 since 2013). As the LLAMA tests are currently being developed for cross-platform online access (unlike the current Windows-only downloadable versions), with LLAMA_B already available online, we expect this interest in and use of the LLAMA tests to continue. It is within this context that we hope this study provides researchers using the LLAMA tests with some useful background and helpful caveats for their use.