Language neutrality of the LLAMA test explored: The case of agglutinative languages and multiple writing systems

The ability to learn a foreign language, language aptitude, is known to differ between individuals. To better understand second-language learning, language aptitude tests, tapping into the different components of second-language learning aptitude, are widely used. For valid conclusions on comparisons of learners with different language backgrounds, it is crucial that such tests be language neutral. Several studies have investigated the language neutrality of the freely available LLAMA tests (Granena, 2013; Rogers et al., 2016, 2017). So far, comparing a number of L1 backgrounds, including those using different writing systems such as Arabic and Mandarin, no significant differences between participants have been found. However, until now, neither participants with agglutinative language backgrounds nor with first-language backgrounds that use multiple writing systems have been included. Therefore, this study selected participants from three different first-language backgrounds: Dutch (non-agglutinative, phonogram/Latin alphabet), Hungarian (agglutinative, phonogram/Latin alphabet), and Japanese (agglutinative, phonogram/syllabic alphabet and logogram/Japanese kanji). The participants performed three subsets of the LLAMA test. Significant differences between the groups were found on two of these tests: The ability to implicitly recognize sounds (LLAMA_D subtest) and inductive grammar learning ability (LLAMA_F), but no differences were found on vocabulary learning ability (LLAMA_B). Additionally, for


INTRODUCTION
Language aptitude has been regarded as a specific talent for learning foreign languages (Carroll, 1981; Skehan, 2002) and is known to vary considerably among individuals (Dörnyei & Skehan, 2003). Language aptitude is generally seen as consisting of several cognitive components, distinct from verbal intelligence and motivation (e.g., Carroll, 1973, 1981). Throughout the years, a number of language aptitude tests have been developed, such as the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1959), Pimsleur Aptitude Battery (Pimsleur, 1966), Defense Language Aptitude Battery (DLAB; Petersen & Al-Haik, 1976), Test of Cognitive Ability for Novelty in Acquisition of Language-Foreign (CANAL-FT; Grigorenko et al., 2000), Swansea Language Aptitude Tests (LAT; Meara et al., 2002), LLAMA (Meara, 2005) and High-Level Language Aptitude Battery (HiLAB; Linck et al., 2013). The different aptitude tests also reflect somewhat different views on the construct itself (for an overview, see Li, 2016). In recent years, the LLAMA test has been widely used to measure the language aptitude of second-language (L2) learners. A number of studies have investigated the validity of the LLAMA test (Granena, 2013; Rogers et al., 2016). These studies have concluded that the test is valid with respect to participants' language background. However, to date, agglutinative languages have not been taken into account as a first language (L1), nor have researchers investigated any language that makes use of more than one writing system. Therefore, the present study addresses these language characteristics to explore the validity of the LLAMA test with respect to L1 background by comparing Hungarian (agglutinative language with phonogram writing system), Japanese (agglutinative language with phonogram and logogram writing systems) and Dutch (non-agglutinative language with phonogram writing system).

LANGUAGE APTITUDE IN THE LLAMA TEST
Language aptitude is seen as one of the characteristics that may explain individual differences in success in learning an L2 (Grigorenko et al., 2000). Language aptitude has been theorized to consist of a number of cognitive abilities (e.g., Carroll, 1973; Dörnyei, 2014). Phonemic coding ability, grammatical sensitivity, inductive language learning ability and associative memory are seen as key cognitive abilities for learning an L2 (Carroll, 1973), and these components are included in the MLAT developed by Carroll and Sapon (1959). Based on the MLAT, Meara et al. (2002) published the LAT tests that consist of five subtests, measuring the following elements: Phonetic memory skills, capacity of vocabulary learning, grammatical inferencing ability, memory ability for sequences of unknown sounds, and ability for sound-symbol association. The LAT test was initially developed for L1 speakers of English, and it may be less suited for L1 speakers who use orthographic systems other than the Latin alphabet, such as Greek or Japanese. Moreover, one of the subtests of the LAT test is designed based on existing languages, which means the test cannot be used to measure the aptitude of L1 speakers of these languages.
Based on the LAT test, Meara (2005) developed the LLAMA Language Aptitude Test, which is freely available (http://www.lognostics.co.uk/tools/llama/). The LLAMA test battery consists of four sub-components: The LLAMA_B (vocabulary learning ability), LLAMA_D (sound recognition), LLAMA_E (sound-symbol association) and LLAMA_F (grammatical inferencing). Recently (after data collection for this study had been completed), a new version of the LLAMA tests was published (Meara & Rogers, 2020).

PREVIOUS STUDIES
Language independence or language neutrality is a crucial characteristic for aptitude measures, meaning that the scores on the language aptitude test should not be affected by the writing systems, phonetic features, or grammar characteristics of the L1 of the test users. Especially if the LLAMA test is to be used in studies that measure language learning aptitude of L2 learners with diverse L1 language backgrounds, language independence is crucial. Likewise, when comparing across studies with diverse L1 backgrounds (e.g., when comparing Artieda & Munoz, 2016; Kepinska et al., 2017; Saito, 2017; Yalçin et al., 2016; Yilmaz & Granena, 2016), language independence of the test for aptitude is important. For this reason, several studies have investigated the validity and the reliability of the LLAMA test itself. These studies have focused on internal validity of the tests (Bokander & Bylund, 2020) and on a number of individual variables of the participants, such as gender, age of onset (Granena & Long, 2013), multilingualism (Rogers et al., 2017), prior L2 instruction (Rogers et al., 2017) and L1 language background (Granena, 2013; Rogers et al., 2016, 2017). Here we briefly review the studies focusing on L1 language background. Granena (2013) investigated the LLAMA test with 186 participants (79 males, 107 females) who were under the age of 40 (M = 25, SD = 5.29, range = 18-39) with different language backgrounds: English, Spanish, and Chinese. The results showed that the scores did not differ depending on the L1. Furthermore, her results led to the conclusion that the LLAMA test measures two different dimensions of language aptitude: the LLAMA_B (vocabulary learning capacity), LLAMA_E (sound-symbol association) and LLAMA_F (grammatical inferencing) invoke explicit learning ability, whilst the LLAMA_D (sound recognition) is more related to implicit learning ability.
Moreover, in their study of 350 participants with diverse L1s (L1 groups with more than 20 participants were Afrikaans, English, German, and Wolof), Bokander and Bylund (2020) carried out a principal component analysis that also identified the subtest LLAMA_D (sound recognition) as the odd one out. This finding supports the conclusion that LLAMA_D, unlike the other tasks, taps into implicit processing. Rogers et al. (2016) examined 229 participants ranging in age from 10 to 75 with diverse L1 backgrounds and educational levels. Because they wanted to compare participants who had learned different types of scripts in their L1, participants were divided into three groups: English Latin script (n = 99), non-English Latin script (n = 18), and non-Latin script (mostly Chinese; n = 17).¹ They investigated whether differences in L1 writing systems impacted the scores of language aptitude. The results did not reveal any specific effects for the L1, but they did find an effect for educational level in three components: LLAMA_B (vocabulary learning capacity), LLAMA_E (sound-symbol association) and LLAMA_F (grammatical inferencing).
Subsequently, Rogers et al. (2017) carried out an even broader validation study of the LLAMA test. They examined the role of the L1 writing system and the language neutrality in general of the LLAMA test. They chose L1 speakers of English (n = 107), Chinese (n = 56) and Arabic (n = 32) as participants. The L1 writing systems of the participants therefore consisted of the English Latin alphabet, Chinese morphosyllabic, Chinese logographic system (Tolchinsky et al., 2012), and the Arabic consonantal alphabet. The results showed that there were significant differences among these three groups in the LLAMA_B, LLAMA_E and LLAMA_F, but not in the LLAMA_D. It turned out, however, that these groups were not comparable because the number of monolinguals was different across groups. Therefore, they analysed a new dataset of L1 speakers of English (n = 48), Chinese (n = 56), and Arabic (n = 30), excluding all monolingual participants. For these new groups, there were no significant differences. They concluded that with respect to these writing systems and language backgrounds, the LLAMA test is language neutral. Rogers et al. (2017) also followed up on research and hypotheses by Sparks et al. (1995), among others, that showed that language learning experience may have a positive influence on language learning aptitude scores. Among the variables investigated, they found that prior L2 instruction was the strongest predictor of aptitude, especially in LLAMA_B (vocabulary learning ability) and LLAMA_F (grammatical inferencing).
These studies have examined diverse L2 learners varying in L1 typology and different writing systems in order to investigate the validity of the LLAMA test, yet none of them has taken agglutinative languages into account as L1s, nor have they investigated languages with multiple writing systems. The current study therefore includes three languages as L1 background (Hungarian, Japanese and Dutch), which are distinctive in terms of orthographic systems and typology.

SYNTACTIC CHARACTERISTICS OF HUNGARIAN, JAPANESE, DUTCH AND THE ARTIFICIAL LANGUAGE IN LLAMA_F (GRAMMATICAL INFERENCING)
Hungarian is a member of the Finno-Ugric language family, whilst most European languages are categorized as Indo-European languages (Kiss, 2002). One of the distinctive aspects of the Finno-Ugric languages is their agglutinative character. In agglutinative languages, words contain multiple morphemes that mostly remain unchanged. It is also noteworthy that Hungarian has Subject-Verb-Object (SVO) as its main structure, followed by SOV (Kas et al., 2016). Hungarian, however, is often said to be a free word order language due to the variety of structures: SVO, SOV, OVS, OSV, VSO and VOS (Kiss, 2002).
Like Hungarian, Japanese is categorized as an agglutinative language. It is said to be part of the Altaic language family, but the exact origin of Japanese has not been determined (Maeda, 2007). The basic structure of Japanese is SOV, but the word order is usually flexible. With respect to verb position, the verb is placed at the end of a sentence. Japanese verbs are not conjugated for person, but they are conjugated for accompanying information such as tense. The Japanese sentence (1) is an example of a sentence without a subject and with the conjugation of the verb, showing the agglutinative nature of the language:²
(1) 答えさせられたくなかったら
Kotae-sase-rare-tak-na-katta-ra
Answer CAUS PASS DES NEG PST COND
'If (you) don't want to be made to answer' (Hasegawa, 2018, p. 4)
Dutch belongs to the West-Germanic group of Indo-European languages. Dutch is closely related to English and German and has two syntactic structures, SVO and SOV (Bennis & Israel, 2010). SVO mostly appears in the main clause, whilst SOV is found in subordinate clauses. Instead of grammatical cases, relations between words are indicated by using pronouns mostly (van der Sijs, 2005), and unlike Hungarian and Japanese, Dutch is not agglutinative in nature.
Upon analysing the sentences in the LLAMA_F (grammatical inferencing) test, it was found that the syntax of this artificial language has a rich morphology, using inflection with adpositions. For instance, the sentence 'inut-ek ipot-arap' is paired with a picture of 'two red rounded objects on a board'. The sentence would be broken down as follows:
Inut-ek ipot-arap
inut: on a board or above something
-ek: two objects
ipot: red colour
arap: rounded form
A morphological element '-ek' follows the adposition 'inut', which can be interpreted to mean that the adposition is inflected. This suggests that the artificial language might be similar to an agglutinative language such as Hungarian in which postpositions can be attached to the stem of the word.

ORTHOGRAPHIC CHARACTERISTICS OF HUNGARIAN, JAPANESE, DUTCH, AND THE SYSTEMS USED IN LLAMA_B (VOCABULARY LEARNING)
Hungarian and Dutch are written using the Latin alphabet, which is a phonogram writing system. The two Japanese syllabic alphabets (hiragana and katakana) are also categorized as phonogram systems, whilst Japanese kanji is regarded as a logogram because each character of kanji depicts a specific meaning (Coderre et al., 2008; Tanaka, 2015). The most distinctive aspect of Japanese compared to both other languages in the present study may therefore be the number of writing systems used: Japanese uses two syllabic phonogram systems and one logographic writing system. The Japanese writing system motivated the reinvestigation of the language neutrality of the LLAMA test because of its combination of orthographies. It is noteworthy that using both kana and kanji could affect cognitive processing (Coderre et al., 2008; Sakurai et al., 2000) because syllabic writing systems connect visual forms (graphemes) to sounds (phonemes), whilst logographic writing systems join forms (characters) with meanings (morphemes) and sounds (phonemes) (Coderre et al., 2008). It has also been shown that phonograms and logograms are processed differently (Tanaka, 2015). Such differential processing may transfer when reading in the L2, such that L2 learners with an alphabetic L1 background and L2 learners with nonalphabetic L1 backgrounds rely on different types of information during lexical processing in the L2. Moreover, L1 speakers with logographic writing systems tend to rely more on visual information than on phonological information, while on the other hand, L1 readers with alphabetic language backgrounds tend to directly analyse phonological information (Wang & Geva, 2003). Because the writing experience in the L1 affects decoding and learning a new vocabulary in the L2 (Hamada & Koda, 2008), a language learning aptitude test may (unwantedly) be affected by the writing system used. L2 learners whose L1 writing system is closely related to that of the L2 are able to perform better than those who have less similar L1 language backgrounds (de Groot et al., 2002).
In previous validation studies, Chinese was included as a non-alphabetic language (Granena, 2013; Rogers et al., 2016, 2017). Although the ascendant of Japanese kanji is the Chinese character hanji, it is important here to emphasize that they have developed differently and they should not be classified in the same category (Hashimoto et al., 2017). The Japanese kanji is more graphic than the recent simplified Chinese characters mainly used in the mainland, and the written forms do not always match. Therefore, these writing systems should not be regarded as one type of logogram (Yokoyama, 2016). Furthermore, the Chinese hanji has only one mora and one pronunciation per character. By contrast, each Japanese kanji has one to three morae and usually has more than two pronunciations. These differences argue for a separate investigation of Japanese as an L1. In addition to this need to study Japanese separately from Chinese, the L1 Japanese speakers may perform the LLAMA_B (vocabulary learning) subtest in a more efficient way because they are accustomed to processing two different writing systems simultaneously. In the LLAMA_B subtest, a picture (graphic) and a word (spelled in the Latin alphabet, thus phonographic) have to be matched and memorized, so a graphic and a phonographic system need to be processed at the same time. Given that the L1 Japanese speakers are used to processing two different writing systems in which one is graphic and the other phonographic, they may be found to score higher on this subtest.

RESEARCH QUESTIONS AND HYPOTHESES
The present study investigates the validity with respect to language neutrality of the LLAMA test, taking into account differences in syntactic typology and in writing systems. It focuses on two LLAMA subtests and uses one LLAMA subtest as a control test. To maximize potential contrasts between the agglutinative and non-agglutinative groups, whilst keeping the groups homogeneous and maximally comparable, the following L1 participant groups were chosen:
Agglutinative language groups:
- L1 Hungarian speakers who are majoring in Japanese
- L1 Japanese speakers who are majoring in Hungarian
Non-agglutinative language group:
- L1 Dutch speakers majoring in a Romance language who have never studied any agglutinative language
Regarding the writing systems, Hungarian and Dutch both use the Latin alphabet (a so-called phonogram writing system), whereas Japanese uses two syllabic alphabets and a logographic writing system.
The abovementioned differences between Hungarian, Japanese and Dutch in terms of syntax and writing systems led to the following research questions and hypotheses:

(a) To what extent do speakers with L1 agglutinative language backgrounds outperform speakers with non-agglutinative L1 backgrounds on the LLAMA_F (grammatical inferencing) test?
Since the LLAMA_F subtest was found to have agglutinative characteristics, we hypothesize that the L1 Japanese and L1 Hungarian speakers will outperform the L1 Dutch speakers.

(b) To what extent do speakers with an L1 background that consists of multiple writing systems (phonogram and logogram) outperform speakers with L1 backgrounds using a single writing system (only phonogram) on the LLAMA_B (vocabulary learning) test?
Given that the L1 Japanese speakers are used to processing two different writing systems, we hypothesize that they will outperform L1 Dutch participants and likely the L1 Hungarian speakers (who are learning Japanese). In addition, because the L1 Hungarian participants have started to learn a logographic writing system, they may outperform the L1 Dutch group.
Finally, the LLAMA_D subtest was added as a control. Because research has indicated that the LLAMA_D (sound recognition ability) is a measure for implicit learning ability (Granena, 2013), we expect to find no differences between the three language groups for this subtest.
Because earlier research (Rogers et al., 2017) has shown that prior L2 instruction is a significant variable in especially the LLAMA_B and LLAMA_F subtests, we added the number of languages known as a covariate in this study, serving as a proxy for amount of instruction in second (foreign) languages.

PARTICIPANTS
Three groups of university students were recruited. All participants received a small gift as a token of appreciation for their participation. The first group consisted of Hungarian students (studying at Eötvös Loránd University or Károli Gáspár University) majoring in Japan Studies.
The second group consisted of Japanese students (studying at Osaka University) majoring in Hungarian language and culture. The third group consisted of Dutch students (studying at Leiden University) majoring in a Romance language (e.g., French language and culture or Latin American Studies). Recruitment was initiated by e-mails to the relevant language teachers at the four universities. These teachers shared the call for participation in class or on online platforms. Additionally, the first author recruited participants in situ, by asking students just before or after class to participate. In total, 86 participants were recruited. However, seven participants were excluded from the analyses because their datasets were incomplete or because they did not meet the inclusion conditions. For instance, one of the Dutch students reported having learned Finnish, an agglutinative language, so this participant's data were not included. Table 1 shows the background information of all included participants in the three groups. All participants were informed of the purpose of the study and signed a consent form. Data were anonymized prior to data storage and analysis.
co.uk/tools/llama/ onto a desktop or laptop and executed. The LLAMA_B was designed to measure the capability to learn and memorize new vocabulary items within a relatively short time. The LLAMA_B consists of a study phase and a test phase. In the study phase, participants have two minutes to study a combination of 20 abstract pictures and corresponding words on the screen. These words are adopted from existing Central American languages (Meara, 2005) and presented in the Latin alphabet. Participants are required to learn the combinations of pictures and corresponding words within the given time frame. In the test phase, participants see one of the 20 words on the screen and are asked to choose the corresponding picture by clicking on one of the pictures. Participants receive five points for each correct answer; the total score of the LLAMA_B can range from 0 to 100 (thus, the score is calculated as a percentage).
The LLAMA_D test was developed to measure the ability to recognize repeated sounds in spoken languages, which is an essential ability for learning a foreign language (Service, 1992; Skehan, 2002; Speciale et al., 2004). The LLAMA_D is designed to be as language neutral as possible. The produced words are based on names, flowers and other natural objects in a British Columbian Indian language, and these sounds were synthesized using a text transformer (AT&T Natural Voices) to make the sounds difficult to recognize (Meara, 2005). In this test, a set of ten words in an unfamiliar language is played only once. After that, participants need to click the arrow in the middle of the screen to perform the test. There is no study phase in this test. Compared to other LLAMA subtests, the LLAMA_D barely invokes analytical abilities and could be considered an implicit learning task (Granena, 2013). In the test phase, participants must indicate whether they have just heard the sound in the string of sounds by clicking on an icon on the screen. The sound recognition test consists of 30 questions and the obtained score is indicated on the screen. The score is converted by subtracting points for incorrect answers, and the highest possible score of the LLAMA_D test is 75.
The LLAMA_F was developed to measure explicit inductive learning ability, in other words, the analytical ability to extract the grammar rules of an unknown language from given information or materials. In this test, participants must discover the grammar rules of an artificial language in the study phase lasting five minutes. They are allowed to take notes. There are 20 buttons on the screen and by clicking one of them a combination of one sentence and one picture is displayed. This artificial language has a number of characteristics that might be difficult for L1 English speakers and it is based on languages such as Welsh (e.g., postpositions '-sa' to show 'one person') (according to Meara, personal communication). In the test phase, two sentences and one picture are displayed at the same time and participants must choose one corresponding sentence. There is no time limit in the test phase. For correct answers five points are added, and as in the LLAMA_D subtest, five points are subtracted from the score for each wrong answer. The total score ranges from 0 to 100.
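The point rules described above for LLAMA_B and LLAMA_F can be written out as a short sketch. This is only an illustration of the arithmetic in the text, not the LLAMA software's own code: the function names are hypothetical, and clamping the LLAMA_F score to the stated 0-100 range is an assumption, since the text gives the range but not how it is enforced.

```python
def score_llama_b(n_correct: int, n_items: int = 20) -> int:
    """Five points per correct answer over 20 items, giving 0-100."""
    if not 0 <= n_correct <= n_items:
        raise ValueError("n_correct must lie between 0 and n_items")
    return 5 * n_correct


def score_llama_f(n_correct: int, n_wrong: int) -> int:
    """Five points added per correct answer, five subtracted per wrong one.

    Assumption: the result is clamped to the 0-100 range stated in the text.
    """
    if n_correct < 0 or n_wrong < 0:
        raise ValueError("answer counts cannot be negative")
    raw = 5 * n_correct - 5 * n_wrong
    return max(0, min(100, raw))


if __name__ == "__main__":
    print(score_llama_b(15))     # 15 correct picture-word matches
    print(score_llama_f(16, 4))  # 16 correct and 4 wrong sentence choices
```

Under this reading, the penalty in LLAMA_F means that a participant who answers half of the items correctly and half incorrectly ends up at zero, whereas the same proportion correct on LLAMA_B yields 50.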

Questionnaire
Participants were asked to give information about their gender, age and the languages that they have learned (including a self-reported evaluation on their proficiency in these languages).

Informal interview
After participation, a subset of participants was asked to reflect on their performance and potential strategies that they used. We aimed to ask mainly participants with relatively high scores and determined 'relatively high' according to the score interpretation indicated in the manual of the LLAMA test (Meara, 2005). For LLAMA_B and LLAMA_F, the manual says the scores between 75-100 are interpreted as "outstandingly good scores" (Meara, 2005, pp. 6, 17). Based on this interpretation, we have interviewed a selection of participants who were assessed as performing 'outstandingly good' on either LLAMA_B or LLAMA_F. Because of time constraints and practical constraints, for the Hungarian participants the selection was not based on the scores. Since the interview was held in an informal way, no set list of questions was used. However, the questions always concerned the following themes: (1) What are your thoughts on the subtest? (2) How did you work on the subtest? Did you use a strategy? And with respect to LLAMA_F (grammatical inferencing), we also asked a question along the following lines: (3) Did you notice characteristics of the artificial language in the subtest? Specifically, do you think the artificial language is similar to any existing languages?

PROCEDURE
Each participant first read and signed the consent form and filled out the language background questionnaire. The first author then provided instructions about the experiment (orally), using a script based on the LLAMA manual (Meara, 2005). For carrying out the aptitude tasks, we used the same order as Saito (2017): LLAMA_D, then _B, and finally _F. This order ensures that the most implicit task (LLAMA_D, sound recognition) is carried out first, such that potential conscious strategies are less likely to be used. Most participants carried out the experiment in a quiet environment, such as a classroom or computer room. This was not possible for a couple of L1 Hungarian and L1 Dutch participants, who carried out the tasks in a common room. However, all participants used headphones during the experiment for LLAMA_D and LLAMA_B (vocabulary learning). To be able to make notes, participants received pen and paper before the start of LLAMA_F (grammatical inferencing). Scores for all the tests were recorded by the software and noted down by the first author. The first author asked 16 students in an informal interview how they had proceeded in carrying out the task. The answers to these questions were written down by the first author. The entire experiment including the interview after the test lasted around 30 minutes.

DATA ANALYSIS
To test whether there were differences between the three groups, we carried out a MANCOVA, with the three scores for the LLAMA tests as dependent variables, language group as the independent factor (with three levels) and number of languages known as covariate. When checking the assumptions for running a MANCOVA, because some participants noted knowing eight languages but none noted seven, we rank-transformed the number of languages known. Three Levene's tests showed that for the dependent variables, homogeneity of variance could be assumed (ps > 0.19), and Box's M test revealed that equality of covariance could be assumed (p = 0.91). Finally, no anomalies were found when analysing the standardized residuals. Whenever a significant difference was indicated for one of the dependent variables, we carried out post-hoc t-tests to check which groups differed from each other. Taking into account the group sizes, the analyses would reach a power of 0.88, if we assumed the effects to be large (f = 0.4). However, if we assume effects to be medium or small (thus given an f of 0.25 or 0.1), power in this study is calculated to reach only 0.48 or 0.11. The quantitative data are available on this webpage: https://osf.io/ve7xa/. The notes made during the informal interviews were summarized, and exemplary findings from these summaries are reported in the next section.
Table 2 shows the means and standard deviations of the original scores on the LLAMA tests by the three participant groups. Boxplots of these data can be found in Figure 1. As explained, the covariate number of languages known was rank-transformed prior to analyses.³ Table 3 shows the estimated marginal means and standard errors of all subtests and groups after correcting for (rank) number of languages known.
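The rank transformation of the covariate mentioned above can be illustrated with a small sketch. It shows how ranking removes the gap between six and eight languages known. Two points are assumptions rather than details taken from the study: ties are given their average rank (the text does not state which tie rule was used), and the input values are made up for illustration and are not the study's data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions.

    Assumption: average ranking for ties (the study's tie rule is not stated).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


if __name__ == "__main__":
    # Hypothetical counts of languages known; note that no one reports seven.
    languages_known = [2, 3, 3, 4, 8, 6]
    print(average_ranks(languages_known))
```

In this made-up example, the participants reporting six and eight languages receive adjacent ranks (5 and 6), so the empty category at seven no longer stretches the scale of the covariate.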

EXPLORING DIFFERENCES IN STRATEGIES
Informal interviews were held with a selection of participants, most of whom had scored relatively high on either LLAMA_B or LLAMA_F. In total, four L1 Hungarian, eight L1 Japanese, and four L1 Dutch participants answered informal questions immediately after finishing LLAMA_F.⁴ All 16 participants reported that LLAMA_D (sound recognition) had been particularly demanding but did not mention any conscious strategies they used. Regarding LLAMA_B (vocabulary learning), the most used (self-reported) strategy was to mentally connect the "words" they read to similar sounding words from their L1 (or an L2). For instance, L1 Hungarian speakers connected "CIB" to the existing commercial bank CIB. Some L1 Hungarian speakers who were learning Japanese as an L2 specifically mentioned that they tried to use strategies similar to those they had been using for learning Japanese kanji. L1 Japanese speakers memorized "KABAN" (bag in Japanese) with the accompanying picture of a log by remembering "a wooden bag". Similarly, the L1 Dutch speakers noted that they used their L2 English to associate "MEN" with the picture of a little man. Furthermore, an L1 Dutch speaker mentioned the resemblance of the words used in the LLAMA_B to a Mayan language. The participant thus made use of personal knowledge of a Mayan language to complete this subtest.⁵

4
As mentioned in the Method section, because of time constraints and practical constraints, we were not able to only select high-scoring participants for the Hungarian group. One of the Hungarian participants did have a 'relatively high' score (90) on both LLAMA_B and LLAMA_F. The other three scored at least 60 on either the LLAMA_B or LLAMA_F subtest.

5
Excluding this participant from the inferential statistical analyses did not lead to any changes in conclusions drawn from these analyses.

Figure 1
Boxplots of scores on the LLAMA_D, LLAMA_B, and LLAMA_F subtests by the three L1 background groups (Boxes include 25%-75% of the data. Horizontal lines mark medians. Whiskers are +/-1.5* interquartile range. The dots are outliers below or above the interquartile range).

Finally, regarding LLAMA_F (grammatical inferencing), the L1 Japanese speakers noted that the artificial language resembled Hungarian, especially in the way that the artificial language used affixes. L1 Hungarian speakers noted that the artificial language resembled an agglutinative language. In contrast, the L1 Dutch speakers mentioned no resemblances of the artificial language to a language that they knew.
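The boxplot conventions given in the caption of Figure 1 (box from the 25th to the 75th percentile, median line, whiskers at 1.5 times the interquartile range, dots for outliers) can be made concrete with a small sketch. The function name and the scores are hypothetical and are not taken from the study's data. The quartile rule used here (the "inclusive" method of Python's statistics module) is also an assumption, since plotting packages differ in how they compute quartiles.

```python
from statistics import quantiles


def boxplot_summary(scores):
    """Compute the quantities a standard boxplot displays for one group."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach the most extreme observations within the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Observations beyond the fences are drawn as separate dots.
        "outliers": [s for s in scores if s < low_fence or s > high_fence],
    }


if __name__ == "__main__":
    # Hypothetical subtest scores for one group (multiples of five).
    example_scores = [10, 40, 45, 50, 55, 60, 65]
    print(boxplot_summary(example_scores))
```

With these made-up scores the box spans 42.5 to 57.5, so any score below 20 or above 80 would be plotted as a dot; here only the score of 10 falls outside.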

DISCUSSION AND CONCLUSION
Several studies have investigated the extent to which the LLAMA language aptitude tests can be considered language neutral (Granena, 2013; Rogers et al., 2016, 2017). So far, comparing a number of L1 backgrounds, including those using different writing systems such as Arabic and Mandarin, no significant differences between participants with different language backgrounds have been found. However, until now, neither agglutinative language backgrounds nor L1 backgrounds in which multiple writing systems are used have been taken into account. Therefore, this study selected participants from three different L1 backgrounds: Dutch (non-agglutinative, phonogram/Latin alphabet), Hungarian (agglutinative, phonogram/Latin alphabet), and Japanese (agglutinative, phonogram/syllabic alphabet and logogram/Japanese kanji). The three groups were otherwise comparable: All participants were university students majoring in language and culture. All Japanese students studied Hungarian, all Hungarian students studied Japanese, and all Dutch students studied a Romance language (and had never learnt any agglutinative language).
Three subtests of the LLAMA aptitude battery were administered: LLAMA_D (an implicit learning test gauging the ability to recognize new sounds), LLAMA_B (a vocabulary learning test that uses icons combined with a Latin script) and LLAMA_F (a grammar inferencing test that includes agglutinative characteristics). Due to the agglutinative characteristics in LLAMA_F, the combination of icons and Latin script in LLAMA_B and the implicitness of LLAMA_D, we hypothesized (a) that Japanese and Hungarian participants would outperform the Dutch participants in LLAMA_F, (b) that Japanese speakers would outperform Dutch speakers and would possibly outperform Hungarian speakers in LLAMA_B and (c) that no differences would be found for LLAMA_D.
For the first hypothesis concerning LLAMA_F (grammatical inferencing), results showed that the Japanese participants outperformed both other groups. Although we expected the Japanese participants to outperform the Dutch participants, because they have experience with two agglutinative languages, we cannot explain why the Japanese participants outperformed the Hungarian participants. The L1 Hungarian speakers likewise have an agglutinative language as L1 background and are majoring in the language and culture of another agglutinative language, namely Japanese. The informal interview did reveal that both Japanese and Hungarian participants noted that the grammar was similar to their own language or to the language they were learning.
Regarding the second hypothesis, results showed that there were no differences between the participant groups with respect to the scores on LLAMA_B (vocabulary learning). This finding confirms the conclusions from the studies by Granena (2013) and Rogers et al. (2016, 2017) that LLAMA_B can be seen as a language-neutral aptitude test. The conclusion can now be broadened to include a language such as Japanese, which uses multiple writing systems.
Finally, the hypothesis regarding LLAMA_D (sound recognition) was also not borne out. Contrary to our expectation that this part of the LLAMA test would be language neutral, the Japanese participants outperformed both the Dutch and the Hungarian participants. No further differences between the groups were found. This test has been argued to be a test of implicit learning and therefore we hypothesized that no differences between the three groups would occur. However, considering the task of recognizing novel arbitrary sounds, it may be the case that L1 Japanese speakers were more adept at listening for and recognizing new sounds because Japanese is known as a language with a very rich inventory of sound-symbolic or mimetic words (Hamano, 1998; Kita, 1997). For instance, Japanese uses not only onomatopoeia, but also mimetic words for emotions or states, such as being excited ('wakuwaku'), in love ('kyunkyun'), fidgety ('sowasowa') or edgy ('iraira'). They are expressed by sounds that are repeated and their sound-to-meaning correspondence is not (perceived as) arbitrary. Hirata (2013) says that Japanese speakers will typically attach an impression to a sound. Because of the large inventory of mimetic words and onomatopoeic words that L1 Japanese speakers have, a new sound is likely to induce a specific impression or semantics.
From this, we may speculate that participants with Japanese as an L1 background are more attuned to recognizing new sounds, as they have possibly already related the sound to (a kind of) meaning when hearing it the first time. There is no confirmation of this hypothesis in the comments made by the participants in the informal interview. However, such a lack of comments is to be expected if the LLAMA_D does indeed mainly tap into implicit learning. An alternative explanation for our results may be that the L1 Japanese group scored higher due to an unknown general superiority in language aptitude abilities or another covariate not included in our questionnaire, leading them to have a higher aptitude in general. It is worth noting, however, that no differences were found with respect to the LLAMA_B subtest, which indicates that this unknown general confound would therefore not hold for all subtests, or at least not to the same degree.
To account for possible differences in the L1 groups with respect to amount of L2 instruction received, we added the number of languages known, as reported by our participants, as a covariate. This covariate was significant in the analyses of LLAMA_B (vocabulary learning) and approached significance in the analyses of LLAMA_F (grammatical inferencing). This is in line with the conclusions of Rogers et al. (2017), who explicitly investigated whether receiving L2 instruction would lead to advantages on the LLAMA test.
As mentioned in the Method section, the power calculation showed that the current study was suitably sensitive only to large differences between groups (L1 backgrounds only). This may be specifically troublesome in the case where research is attempting to 'prove' a null hypothesis, as is the case in the current study and in similar studies comparing the scores of participants with different language backgrounds. Such studies would conclude a test to be valid if there were no significant differences between the groups. Another potential limitation of the study might be the fact that gender was not balanced in our study: In all three groups, female participants outnumbered male participants. This means that the results of this study can only be generalized to a similar population.
We conclude that the results of the current study have implications for the use of the LLAMA tests, especially when participant scores are compared across different language backgrounds. First of all, the amount of instruction and the number of languages that the participants know need to be taken into account when comparing participants. The more instruction participants have received or the more languages they know, the higher they will score on LLAMA_B (this study and Rogers et al., 2017) and on LLAMA_F (Rogers et al., 2017; marginal effect in this study). Secondly, we found no differences between the different L1 backgrounds regarding the scores on the LLAMA_B (vocabulary learning) subtest, which suggests that LLAMA_B is indeed language neutral. From the differences we found on the scores of the LLAMA_D (sound recognition) and LLAMA_F (grammatical inferencing) subtests between the groups, no firm conclusions can be drawn yet. With respect to the LLAMA_D subtest, we speculated that for participants who have Japanese as an L1, a language that uses many non-arbitrary sound-to-meaning mappings, recognizing new sounds in a new language may be easier. This speculation needs to be investigated in future studies. If the finding were to hold, it would mean that participants who were more attuned to listening to new sounds because of their L1 background would indeed perform better in the LLAMA_D subtest. Likewise, the differences we found for the LLAMA_F subtest also warrant further research. If it is the case that knowing agglutinative languages gives learners an advantage for learning another language with agglutinative characteristics, such as the artificial language used in LLAMA_F, one could argue that the LLAMA_F subtest is not entirely language neutral. However, since we found that the L1 Japanese participants outperformed not only the L1 Dutch participants (non-agglutinative language) but also the L1 Hungarian participants (agglutinative language), no firm conclusion can be drawn.
We can only conclude that L1 Japanese speakers may be at an advantage for both the LLAMA_F and LLAMA_D subtests. 6

6 In April 2020, a new version of the LLAMA test was published (Meara & Rogers, 2020). The fact that this version is even more easily accessible, and still free to use, will no doubt inspire new research using the LLAMA test, including research investigating the language neutrality of this new version. As one of the reviewers pointed out, a number of issues that have been raised with the version of the LLAMA tests used in the current study have been corrected in this latest version of the online LLAMA tests (e.g., an issue with incorrect coding in some items of the LLAMA_F subtest, see also Bokander & Bylund, 2020; and how LLAMA_D and LLAMA_F are scored, correcting for wrong answers yet being a binary choice compared to LLAMA_B, which has a much lower guessing possibility).
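The scoring issue raised in the footnote above (a binary choice with a correction for wrong answers, versus a format with a much lower guessing possibility) can be illustrated with a small expected-value calculation. This is a sketch under stated assumptions: the function name `expected_guessing_score`, the item counts, the number of answer options, and the point values are all hypothetical and are not taken from the LLAMA manual or from this study.

```python
def expected_guessing_score(n_items, n_options, reward, penalty):
    """Expected total score of a participant who guesses at random on every item.

    All arguments are hypothetical illustration values, not LLAMA specifications.
    """
    p_correct = 1 / n_options
    expected_per_item = p_correct * reward - (1 - p_correct) * penalty
    return n_items * expected_per_item


# Binary choice without any correction: guessing alone earns half the points.
binary_uncorrected = expected_guessing_score(20, 2, reward=5, penalty=0)

# Binary choice with a penalty equal to the reward: guessing earns nothing on average.
binary_corrected = expected_guessing_score(20, 2, reward=5, penalty=5)

# Many answer options and no correction: guessing earns little even uncorrected.
many_options = expected_guessing_score(20, 20, reward=5, penalty=0)

print(binary_uncorrected, binary_corrected, many_options)
```

Under these assumed values, a symmetric penalty removes the expected gain from guessing on a binary item, while a format with many answer options keeps that gain small even without a correction.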