Teaching to the test: The effects of coaching on English-proficiency scores

Despite arriving with the required language qualifications, many international students struggle with the linguistic demands of a university degree. Using the International English Language Testing System (IELTS) as an example, this study explored how short but intensive preparation programmes may affect the high-stakes English-proficiency test scores with which students apply for university places. The participants were 89 Chinese speakers of English as a foreign language in Shanghai. They were tested twice, four weeks apart, on IELTS and three other measures of English ability: the Oxford Online Placement Test, a vocabulary test, and a test of the speed and accuracy of sentence comprehension. Between the two testing points, 45 participants underwent test-specific training consisting of previous IELTS papers, offered by a large test-preparation establishment with a network of over 1,000 training centres. The remaining 44 participants did not engage in any test preparation at the time. Teaching to the test led to a rise of half a band in IELTS scores above the gain from test repetition alone, suggesting that the training was effective. Importantly, the IELTS gain did not generalise to the other measures of English ability; the groups performed similarly on all other language tests at both times. This suggests that test-specific, curriculum-narrowing courses could be inflating the scores with which international students apply for university places, with important consequences for test developers, universities and students.

DANIJELA TRENKIC, RUOLIN HU


INTRODUCTION
As millions of international students prepare to demonstrate their readiness to pursue university education in English as a condition for enrolment, a powerful language test-preparation industry has emerged to meet the rising demand. In a bid to help their clients achieve the required scores on high-stakes exams in a time-efficient way, many language schools and training centres offer dedicated test-preparation programmes.
But do such programmes work? Do they reliably improve scores, and are the scores improved in this way trustworthy? In a context where many international students struggle with the linguistic demands of their programmes (Murray, 2010) and where, as a group, they experience lower academic success than home students (Morrison et al., 2005), the question of whether, and to what extent, the language test-preparation industry may be subverting the validity of scores is an important one to address. Here we explore this question in the context of China, a country which currently sends the largest number of students abroad and has a well-established test-preparation industry.

LANGUAGE PROFICIENCY AS A CRITERION FOR UNIVERSITY ENTRY
In countries where admission to higher education is academically selective, universities set criteria to choose students with the capacity to learn quickly and perform well in their studies. Scholastic aptitude is typically the chief criterion, usually indexed by candidates' previous academic success, and in some cases by scholastic aptitude tests. However, because the capacity to acquire new knowledge also depends on proficiency in the language of instruction (Elder et al., 2007; Daller & Phelan, 2013; Trenkic & Warmington, 2019), most universities require international students to demonstrate their linguistic readiness to study on one of the approved language-proficiency tests.
At universities teaching in English, one of the most widely accepted tests is the academic version of the International English Language Testing System (IELTS Academic, henceforth IELTS), which is owned and administered by the Cambridge Assessment English, the British Council and the International Development Program Australia Consortium (henceforth, the Consortium). It is a test constructed, trialled and validated for university admissions purposes, and it operationalises English proficiency as the ability to communicate through listening, speaking, reading and writing in academic contexts. It comprises four sections testing each of these skills, with the results reported on a 9-band scale (including half-point scores) for each skill and as the overall test score (the average of four skills scores).
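The overall score described above can be illustrated with a short sketch. It averages the four skill bands and reports the result in half-band steps. The rounding convention used here (averages ending in .25 or .75 are rounded up) is an assumption based on the commonly published IELTS guidance; the present text states only that the overall score is the average of the four skill scores. The function name is hypothetical.

```python
import math


def overall_band(listening, reading, writing, speaking):
    """Average four skill bands and express the result in half-band steps.

    Assumption (not stated in the text above): an average ending in .25
    or .75 is rounded up to the next half or whole band.
    """
    mean = (listening + reading + writing + speaking) / 4
    # Work in half-band units; floor(x + 0.5) sends exact halves upward.
    return math.floor(mean * 2 + 0.5) / 2


# Illustrative profile only, not data from the study.
print(overall_band(6.5, 6.5, 5.0, 7.0))
```

Because skill bands are multiples of 0.5, the mean is always a multiple of 0.125 and is represented exactly in floating point, so the rounding step behaves predictably.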
The test does not have a pass/fail boundary. Instead, in its guidance to educational institutions, the Consortium lists scores between 7.5 and 9.0 as acceptable for studying linguistically demanding academic courses, and scores of 7.0 and above as sufficient for linguistically less demanding courses (IELTS, 2019). Each university, however, sets its own minimum requirements, with the decisions usually guided by pragmatism and compromise (Deygers & Malone, 2019). The requirements thus differ by institution and by programme, with scores between 6.0 and 7.0 (corresponding to the range from the mid B2 to low C1 level of the Common European Framework of Reference) typically accepted for unconditional entry (Feast, 2002; Green, 2007).

THE EFFECTS OF TEST PREPARATION ON THE VALIDITY OF SCORES
When the question of validity in high-stakes language assessment is raised, the concern is typically with score validity: Whether trustworthy inferences about a candidate's language proficiency can be drawn from the behaviours elicited by the test and from the resulting test scores. In relation to test preparation, the question is whether particular activities could reliably lead to higher scores, and whether in the process, they could be undermining the validity of scores.
Theoretically, test preparation can improve test scores via three main mechanisms, each with different consequences for score validity. First and foremost, on tests that appropriately measure the intended construct, scores should improve along with the underlying skill. Thus, test-preparation activities that focus on the development of language proficiency should improve scores on language-proficiency tests with no consequence, positive or negative, for their validity. The limitation of this approach as a mechanism for improving test scores, however, is that it is not quick. Language proficiency is an ability that develops over a long time (typically over many years of education and practice), and therefore significant progress is difficult to achieve through short programmes of intervention.
The second mechanism that leads to higher scores involves countering some of the common threats to score validity. For example, familiarising students with the format of the test can both reduce evaluative anxiety and ensure that the manner in which the skill is tested does not take them by surprise. Although such activities do not change the underlying ability being tested, they do make it easier for candidates to demonstrate it on the test, and in so doing, stand to improve the validity of the score (Messick, 1982).
The third mechanism, in direct contrast to the last, leads to higher scores by actively exploiting threats to validity. For example, training students in test-taking strategies and answer-selection tricks seeks to achieve high scores by taking advantage of the construct-irrelevant properties of a test (Messick, 1982). Furthermore, narrowing the curriculum to only focus on practising previous tests and parallel items (for example, teaching writing in only one genre that is assessed on the test) may result in higher scores through exploiting both the construct-irrelevant properties of a test and the test's construct under-representation (Haladyna & Downing, 2004).
Unlike activities that target language-proficiency development, both those that minimise and those that exploit common threats to validity have shown promise in quick score gains. For example, acquiring even a basic familiarity with the format through sitting a test once seems to statistically improve the chances of getting a higher score during the following attempt. Zhang (2008) analysed 2,000 candidates who took the internet-based Test of English as a Foreign Language (TOEFL iBT, henceforth TOEFL), a test which, like IELTS, measures academic English proficiency in listening, reading, speaking and writing. The study showed small but reliable gains in scores when the test was repeated within a single month.
Curriculum-narrowing test preparation has also been shown to lead to score gains over a period of one to two months. For example, focusing on the relationship between test preparation and test performance of 14,593 students taking the TOEFL in China, Liu (2014) found that after controlling for participants' English proficiency on an independent test prior to the study, the strongest predictor of TOEFL scores was preparation activities that involved intense and short-term practice using similar test formats (TOEFL practice tests and previously used authentic test items). Similar findings were reported by Xie (2013) in a study that followed a group of 1,003 undergraduate students as they prepared for the national College English Test Band 4 in China. This study recorded participants' test scores both before and after a two-month preparation period and collected detailed information about their preparation activities. As in Liu (2014), the effects of test preparation on test performance were significant but small, explaining 2.4% of the variance in the final scores after controlling for the proficiency with which the participants started the preparation. Crucially, the only test-preparation practices that made a significant positive contribution to the final scores were those that were of a curriculum-narrowing nature: The intensive practice and memorisation of specific test items and of earlier versions of the test. Xie (2013) argued that as activities that enhance the target construct in a very narrow domain, they effectively weaken the extrapolation link from the test scores to untested behaviours.

IELTS PREPARATION AND SCORE VALIDITY
When it comes to the effects of IELTS preparation on its test scores, the previous literature was predominantly focused on understanding how long it takes to observe reliable improvements in scores following different training regimes. Much of this research was conducted in Anglophone countries (United Kingdom, Australia, New Zealand), often involving the English for Academic Purposes (EAP) programmes offered by the receiving universities (e.g., Brown, 1998; Archibald, 2001; Elder & O'Loughlin, 2003; Rao et al., 2003; Read & Hayes, 2003; Green, 2005, 2007). Of those, the study that touched most directly on the question of test preparation and score validity was that reported in Green (2005, 2007), which explored IELTS writing score gains amongst 476 international students across 15 institutions in the UK. The study focused on learners enrolled in courses preparing them for academic study in the United Kingdom over periods of between three and ten weeks. Most of the participants (n = 331) were attending EAP pre-sessional courses provided by universities in the United Kingdom as an enrolment condition for international students with language scores falling short of the requirements for unconditional entry. Although these courses involved some training resembling an IELTS writing test (e.g., timed writing practice), they were not actively preparing students for the test. The rest of the participants were either enrolled in programmes whose curriculum combined a general EAP training with IELTS test preparation (n = 60) or were following special IELTS-preparation programmes (n = 85).
In line with research exploring the link between test preparation and score gains in other high-stakes tests, Green's (2005, 2007) study found a small but significant gain in IELTS writing scores (roughly 2/10 of an IELTS band) across programmes. Echoing the findings from Xie (2013) and Liu (2014), the only course parameter correlating significantly with score gains was class activities similar to the test (Table 5 in Green, 2007, p. 90). Despite this, no discernible differences in score gains were observed between the three different programme types.
What this means for the validity of scores is not entirely clear. Green (2007) argued, on the strength of the finding that different programmes led to similar score gains, that teaching to the test was not necessarily more effective in boosting IELTS scores than teaching the targeted skill, thus dismissing the power of dedicated test-preparation programmes to exploit test characteristics and undermine the validity of IELTS scores.
There are, however, at least two other plausible explanations. Theoretically, and as discussed earlier, test scores could improve through different mechanisms, with different consequences for score validity. Thus, despite the similar level of gain, the extrapolation link from tested to untested behaviours may still have been weaker for scores that improved through dedicated test-preparation programmes (assuming they improved only a subset of the writing domain, as tested by IELTS) compared to those achieved through broader EAP courses (assuming they improved academic writing more broadly construed). Alternatively, and probably more likely here, the three nominally different programme types may, in fact, not have been too dissimilar from each other. They could have led to the improvement in scores via the same mechanisms: Through striving to develop the underlying skill, fortified by activities that, by design or otherwise, familiarised students with the test format. Although the actual practices were not documented, given that all of the courses were described as preparing international students for academic study, it seems reasonable to assume that they were all making an earnest effort to help the students improve their English, and that none was designed to actively game the test. Furthermore, the finding that class activities similar to the test were the only part of the training that was significantly associated with score gains strongly suggests the possibility that different programmes, more intent on gaming the test, may still hold the potential to subvert score validity.
In sum, findings from previous research suggest that concentrating one's efforts on the repetitive practice of specific test formats can lead to quick and significant, if not very large, score gains in high-stakes language assessment. But whether, and if so to what extent, this actually undermines score validity remains unresolved. Specifically, Xie's (2013) argument that curriculum-narrowing training weakens the extrapolation link from the test performance to what one is capable of demonstrating in a different format requires empirical verification in studies that combine both tests for which the specific training was provided and alternative measures of language proficiency. Furthermore, to be able to confidently attribute score gains to test preparation, we need to be able to tease apart the effects of test preparation from the effects that arise through test repetition and from the baseline gain through a growth in ability regardless of test preparation. This is why studies with a control group are needed (Messick & Jungeblut, 1981). We incorporated both features (a control group and alternative measures of proficiency) in the design of the present study.

LANGUAGE TEST PREPARATION IN CHINA
Large-scale written examinations and test-driven classroom instruction appear firmly rooted in Chinese society (Spolsky, 1995). Language test-preparation centres are widely used and argued to be sites of "the most egregious negative washback". Negative washback is understood here as test-preparation activities that do not actively encourage language learning but focus instead on raising scores through exploiting test characteristics. Typical activities include test-taking strategies, repetitive practice, and memorisation of parallel test items. In addition to practice tests and retired test papers that are publicly available, some centres are also reported to collate leaked items from active tests (Yan, 2015). Even though some of the major test-preparation providers have incurred large fines for the breach of copyright and hundreds of high-stakes test results have been annulled, their services remain highly praised and sought after by both students and their parents. As Matoush and Fu (2012, p. 114) observe, the competitive nature of attaining success in a heavily populated nation with limited university places has contributed to a disproportionate focus on test-preparation and parents willing to pay high prices for classes taught by those who seem able to predict test items.
Students themselves acknowledge that their goal in attending such programmes is not to improve their language skills but to hone the techniques needed to pass the test (Ma & Cheng, 2016).

OVERVIEW OF THE PRESENT STUDY
Given the ubiquity of test-preparation centres in China and assuming that the intensity and focus of test-driven instruction may, at least in part, be culturally specific, we set the present study in a large training centre in Shanghai. The study addressed two research questions:
(1) How much can IELTS scores improve through a brief but intensive preparation course in China?
(2) Do such gains generalise to other measures of language proficiency?
To evaluate the effectiveness of coaching and address the first question, we recruited a group of students undertaking a four-week preparation course and compared their gains on the IELTS test with those of a control group of uncoached participants. We approached the second question by examining whether the gains in IELTS scores generalised to participants' performance on three other English-ability tests.

PARTICIPANTS
Eighty-nine Chinese speakers of English as a foreign language participated in the study. Forty-five were recruited through a large training centre in Shanghai where they were enrolled on an IELTS-preparation course (the coached group). The other 44 served as a control group and were recruited through snowball sampling, asking the coached group participants to invite friends of a similar age and educational background. Participants in the control group were not attending any test-preparation training at the time.
All participants were from mainland China, received their education in Mandarin Chinese and spoke it as their dominant language. They started learning English in school at the age of 11 and had never lived abroad. None reported having had previous experience with the IELTS test or IELTS test-preparation courses. At the time of testing, the majority of participants were in full-time education (10 final-year high school students and 67 university undergraduates), six were recent university graduates preparing for further studies abroad, and six were university graduates in employment. The median age of the participants was 21 years (range 18-37). The groups were similar in the demographic factors of age, gender balance and level of education (see Table 1).

DESIGN
Participants were tested twice, four weeks apart, on IELTS and on three other tests measuring English proficiency and related enabling skills. These were the Oxford Online Placement Test (OOPT), a vocabulary test, and a test of the speed and accuracy of sentence comprehension (the latter two from Baddeley, Emslie & Nimmo Smith, 1992). Different versions of test papers were counterbalanced across participants and between times.
Between the two testing points, the coached group underwent an intensive four-week IELTS-preparation programme. It was offered by one of the leading schools for language teaching and test-preparation training in China, with a national network of over 1,000 providers. The course curriculum consisted of previous IELTS papers from the Cambridge IELTS book series (Books 6, 7, 8, 9 and 10 were used at the time of data collection), supplemented by a collection of speaking and writing topics featured at recent IELTS exams (a practice known as Ji Jing, 机经), put together by candidates and tutors. Eighteen hours of preparation and practice were devoted to each part of the test (Reading, Listening, Writing and Speaking). The instruction involved the identification of common topics across papers, a close analysis of the format, grammar and lexis of previous tests and model answers, and a range of test-taking strategies, from predicting where in the text the answer to a question is likely to be, to memorising paragraphs and practising to repurpose them for similar topics in speaking and writing tasks. Participants followed the course at different points during 2015 and 2016, but the overall teaching and learning objectives, the course length and the main materials remained consistent.
The participants in the control group were not undergoing any test preparation at the time. The design allowed us to tease apart the effects of test preparation from the effects of test repetition or any growth in ability (e.g., through the consolidation of previous knowledge) that may have occurred regardless of the coaching programme.

IELTS academic test
The format of the IELTS test consists of four components: Listening, Reading, Writing and Speaking. The components are administered in this order, with the first three being tested in a controlled group setting, and the last one in a face-to-face individual interview. The Listening and Reading tests are allocated 40 and 60 minutes, respectively. The Writing test takes 60 minutes and the Speaking test is between 11 and 14 minutes long. The results are expressed on a 1 to 9 scale, in half-point increments.
We used authentic past IELTS examination papers from the Cambridge IELTS book series (Books 1 and 2; Jakeman & McDowell, 1996; University of Cambridge Local Examination Syndicate, 2000). The selected papers were not used by the training centre in teaching. However, they were from the same book series as the papers used in teaching and were similar to them in length, format and content. The format of the active IELTS test at the time of our study (2015-2016) did not differ substantially from that of the retired papers we used. When administered in a controlled environment, retired tests are expected to provide an accurate indication of the candidate's likely performance on the active IELTS test (Cambridge University Press, 2020). Test-retest reliability of the papers in our study was good, r = .79.
Writing tasks were scored by two experienced IELTS instructors, following the IELTS procedures and criteria. Interclass correlations (ICC, [...]). Performance on the speaking component was assessed by one experienced IELTS speaking instructor using the published IELTS criteria, and the listening and reading tests were scored using the answer key provided with the tests.

Oxford Online Placement Test (OOPT)
The OOPT is a widely used test in second-language learning research to evaluate participants' proficiency in English, as well as by language schools, universities or employers to determine the level of training that their recruits need. It tests knowledge of grammar, as well as the ability to understand literal and implied meanings of English words, phrases and sentences. It assesses this knowledge through listening and reading only. Compared to IELTS, which taps directly into the four language skills, the OOPT can be seen as an index of the linguistic knowledge that underpins those skills.
The OOPT is a computer-adaptive test; it adapts to the ability level of each test-taker. It does so by selecting items based on how the test-taker answered the previous question. If the last question was answered correctly, the next one is selected to be more difficult. If the wrong answer is given, the test goes on to select an easier question next. Because of its adaptive nature, the number of items and the length of the test are not fixed. There is no set time-limit, but the test usually takes between 30 and 40 minutes to complete (Purpura, 2009).
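The adaptive principle described above can be illustrated with a minimal sketch. This is a simplified up/down staircase and not the actual OOPT item-selection or scoring algorithm (for which see Pollitt, 2009). The difficulty scale, the fixed number of items and all names are assumptions introduced purely for illustration; as noted above, the real test does not have a fixed length.

```python
def run_adaptive_test(answers_correctly, start_level=5,
                      min_level=1, max_level=10, n_items=8):
    """Simulate a simple up/down adaptive item sequence.

    answers_correctly: callable that takes an item difficulty level and
        returns True if the (simulated) test-taker answers correctly.
    Returns the list of difficulty levels presented, in order.

    Assumption: difficulty is an integer scale and the test length is
    fixed, both simplifications made for this sketch.
    """
    level = start_level
    presented = []
    for _ in range(n_items):
        presented.append(level)
        if answers_correctly(level):
            # Correct answer: the next item is harder (up to the ceiling).
            level = min(max_level, level + 1)
        else:
            # Wrong answer: the next item is easier (down to the floor).
            level = max(min_level, level - 1)
    return presented


# Hypothetical test-taker who can answer items up to difficulty 7.
levels = run_adaptive_test(lambda difficulty: difficulty <= 7)
print(levels)
```

In this toy version the sequence climbs until the test-taker starts to fail and then oscillates around the boundary of their ability, which is the behaviour that lets an adaptive test concentrate its items where they are most informative.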
Scores are generated automatically on a scale of 120. For an in-depth account of the score calculation algorithm, see Pollitt (2009); for a summary, see Hu (2018). On its website (www.oxfordenglishtesting.com), the OOPT is reported as validated on 19,000 test-takers in 60 countries. Test-retest reliability in our study was r = .61.

Speed and Capacity of Language-Processing (SCOLP) test
The SCOLP (Baddeley et al., 1992) has two components that between them test vocabulary and the speed and accuracy of sentence comprehension. Although not a test of English proficiency per se, SCOLP provides measures of those abilities that are critical for language use. Vocabulary knowledge, in particular, is a precondition for all other language skills and is strongly correlated with other measures of language proficiency (Roche & Harrington, 2013; Trenkic & Warmington, 2019). The speed at which written language is processed and how much of it is understood are also strong indicators of developing language proficiency.
The vocabulary component of the test, also known as the Spot the Word Test, is a version of the yes-no vocabulary task (Meara & Buxton, 1987), providing an index of receptive vocabulary knowledge in English. Participants performed a silent lexical decision task on 60 pairs of items, each containing one real word and one pseudo word (e.g., kitchen - harrick). The target words ranged in frequency of occurrence from common to extremely rare, thus providing a measure of the richness of the participants' vocabulary (number of correctly identified words out of 60). Previous research has established that measuring vocabulary size through yes-no judgements correlates highly with other standard measures of vocabulary knowledge and is a good measure of L2 proficiency (Roche & Harrington, 2013). The test-retest reliability was r = .88.
The sentence-processing component of SCOLP contains 100 short sentences, half of which are true (e.g., Dogs have four legs; Birds have wings) and half of which are false (e.g., Dogs have wings; Birds have four legs). The participants' task is to verify the statements as quickly as they can. The test was administered in a pen-and-paper format. The total reading times and accuracy scores (scale 0 to 100) were used in the analyses. The test-retest reliability was r = .94 for the speed of reading and r = .92 for accuracy. The full SCOLP test (vocabulary task + sentence-processing component) took participants between 10 and 15 minutes to complete.
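The test-retest reliabilities reported for each measure are correlations between participants' Time 1 and Time 2 scores. A minimal sketch of that calculation follows, assuming a Pearson product-moment coefficient (the text reports r values but does not name the coefficient explicitly). The scores in the example are invented for illustration and are not data from the study.

```python
import math


def pearson_r(time1, time2):
    """Pearson correlation between paired Time 1 and Time 2 scores."""
    n = len(time1)
    mean1 = sum(time1) / n
    mean2 = sum(time2) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean1) * (b - mean2) for a, b in zip(time1, time2))
    sxx = sum((a - mean1) ** 2 for a in time1)
    syy = sum((b - mean2) ** 2 for b in time2)
    return sxy / math.sqrt(sxx * syy)


# Invented vocabulary scores (out of 60) for five hypothetical participants.
t1_scores = [41, 35, 50, 44, 38]
t2_scores = [43, 34, 51, 46, 37]
print(round(pearson_r(t1_scores, t2_scores), 2))
```

A coefficient close to 1 indicates that participants keep roughly the same rank order across the two sittings, which is what makes a gain on one measure and its absence on another interpretable.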
Although clearly different in format, scope and purpose, both IELTS and these additional measures test aspects of the same trait (i.e., English-language ability). Score gains on one, therefore, may be expected to be reflected, at least to some degree, in score gains on the others. If corresponding improvements are absent, this should raise validity concerns.

PROCEDURE
Participants sat the IELTS Listening, Reading and Writing components, in this order, in a controlled group setting similar to the authentic IELTS exam. Later that day or the following morning, they were individually tested on the IELTS Speaking component by an experienced IELTS speaking instructor. The following day, the OOPT was administered in a controlled group setting. Finally, the participants sat the two components of the SCOLP Test. Testing at both Time 1 (T1) and Time 2 (T2) followed the same steps.

ANALYSIS
Preliminary analysis showed that the scores on most measures were not normally distributed. As the assumption of normality was not met, we used non-parametric tests in our main analyses: The Wilcoxon signed-rank test to compare within-group performance between T1 and T2, and the Mann-Whitney U test to compare the groups on the gain in scores achieved between T1 and T2.
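To make the two tests concrete, the sketch below computes their test statistics from first principles using only the standard library. It is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the scores are invented, zero differences are dropped before ranking (one common convention), and no p-values or effect sizes are computed. In practice a statistics package would be used, for example `scipy.stats.wilcoxon` and `scipy.stats.mannwhitneyu`.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(t1, t2):
    """Return (W_plus, W_minus) for paired scores; zero differences dropped."""
    diffs = [after - before for before, after in zip(t1, t2) if after != before]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


def mann_whitney_u(gains_a, gains_b):
    """Return U for group A: each pair with a > b counts 1, each tie 0.5."""
    u = 0.0
    for a in gains_a:
        for b in gains_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Invented overall band scores for two small hypothetical groups.
coached_t1 = [5.5, 6.0, 5.0, 6.5, 6.0]
coached_t2 = [6.0, 6.5, 6.0, 6.5, 6.5]
control_t1 = [6.0, 5.5, 6.0, 6.5, 5.0]
control_t2 = [6.0, 6.0, 6.0, 6.0, 5.0]

# Within-group change from T1 to T2 (Wilcoxon signed-rank statistic).
print(wilcoxon_signed_rank(coached_t1, coached_t2))

# Between-group comparison of gains (Mann-Whitney U statistic).
coached_gain = [b - a for a, b in zip(coached_t1, coached_t2)]
control_gain = [b - a for a, b in zip(control_t1, control_t2)]
print(mann_whitney_u(coached_gain, control_gain))
```

Both statistics depend only on the ordering of scores, not on their distribution, which is why they suit band scores that move in half-point steps and are not normally distributed.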

IELTS SCORES AT T1 AND T2
The median overall score of the participants at the start of our study (T1) was band 6.0. According to the IELTS band descriptors, this level denotes a competent user who has an effective command of the language despite some inaccuracies, inappropriate usage and misunderstandings and who can use and understand fairly complex language, particularly in their own field (IELTS, 2019).
Both groups saw some gain in IELTS scores from T1 to T2. The coached group's median scores rose by half a band: The Listening, Reading and the Overall scores moved from 6.0 to 6.5, and the Writing and Speaking scores moved from 5.5 to 6.0 (Table 3). The mean of the overall score improved by 6/10 of a band, from 5.66 (SD = 0.87) to 6.26 (SD = 0.60). The shift was evident for the majority in the coached group. For example, the Overall IELTS score improved for 36 out of 45 participants, and for the remaining nine participants it remained unchanged. No participant saw a drop from T1 to T2 (see Figure 1). The effect sizes were large (medium in the case of Writing), and the Wilcoxon signed-rank test confirmed all the gains as statistically significant (Table 3).
One participant in the coached group received an overall IELTS score of 2.0 at T1. This was lower than the scores of all other participants (whose initial scores ranged from 4.0 to 7.5). This participant also showed a higher gain at T2 (2.5 bands) compared to the rest (range -0.5 to 2.0). To check the effect that this has had on the results, we repeated all the analyses with this participant excluded; the results did not change. For the sake of completeness, the full set of data is reported here.
In contrast, the control group's gains were small (Table 4). The median values remained unchanged: The Listening, Reading, Speaking and the Overall scores were 6.0 and the Writing score was 5.5 both times. A small shift was only observable in the means, which for all scores increased by about 1/10 of a band. The change reached significance for the Writing and the Overall scores, but the effect sizes were small and driven by a handful of participants (nine in each case); for the majority, the scores remained unchanged, and for some they were reduced at T2 (Figure 2).
The Mann-Whitney test statistics confirmed that the IELTS gains experienced by the coached group were significantly larger than those seen in the control group (Table 5). The effect size for group differences was medium in the case of Writing and large for the other three skills.

Table 4
Comparison of Time 1 and Time 2 scores in the control group (n = 44).

Figure 1
The proportion of participants in the coached group who improved the overall IELTS score at T2, broken down by the initial IELTS score.

Figure 2
The proportion of participants in the control group who improved the overall IELTS score at T2, broken down by the initial IELTS score.

The results of the control group demonstrate that simply repeating an IELTS test may in some cases lead to an improvement of overall score or some of the sub-scores. This could be a reflection of test-takers' improved familiarity with the test, the growth in the underlying ability or a measurement error. In the context of the present study, this is important because it suggests that a small part of the gain observed in the coached group may have stemmed from the same sources and may not be related to the test-preparation activities. The group difference in score gains, however, suggests that the largest part of the coached group's gain is most likely coming from the test-preparation activities.
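The reasoning above amounts to subtracting the control group's gain from the coached group's gain. A brief worked sketch, using the overall means reported in this section (5.66 to 6.26 for the coached group) and taking the control group's gain as 0.10, which is an approximation of the "about 1/10 of a band" stated in the text. The function name is hypothetical.

```python
def gain_attributable_to_coaching(coached_t1, coached_t2, control_gain):
    """Coached group's gain minus the gain seen without coaching.

    The control gain stands in for everything unrelated to coaching:
    test familiarity, growth in ability and measurement error.
    """
    coached_gain = coached_t2 - coached_t1
    return coached_gain - control_gain


# Means from the text; 0.10 approximates "about 1/10 of a band".
effect = gain_attributable_to_coaching(5.66, 6.26, 0.10)
print(round(effect, 2))
```

On these figures, roughly half a band of the coached group's improvement remains after the repetition-only gain is removed, in line with the half-band rise reported in the abstract.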

OTHER SCORES AT T1 AND T2
To assess whether the IELTS gains generalise to alternative measures of English ability, we considered the participants' performance on the other three tests. Both groups made similarly large gains on their sentence reading speed: The sentences were read faster at T2 than at T1 (Tables 3 and 4). However, neither group saw an improvement in how many sentences they understood correctly. The increase in reading speed, therefore, is most likely a reflection of familiarity with the task. But the absence of improvement in comprehension suggests that the underlying knowledge on which the comprehension is based has not changed substantively between T1 and T2.
The control group also showed small but significant changes to the vocabulary score, which the coached group did not. However, this effect was small, and the group difference in gain was not significant. Finally, no significant improvement in English proficiency from T1 to T2 was detected for either group on the widely used measure of English proficiency, the OOPT.
In sum, in contrast to the IELTS test, both groups performed very similarly on all other tests at both T1 and T2. Put differently, the coached group's improvement on IELTS scores was not accompanied by a corresponding improvement in English-language ability as measured by the other tests in our study.

SUMMARY OF FINDINGS AND THEIR SIGNIFICANCE
In this study, we followed a group of 45 students as they prepared for an IELTS exam in a large test-preparation centre in China. After an intensive four-week preparation programme, their scores improved by 6/10 of an IELTS band on average. Although the participants with the lowest scores on entry stood to gain the most, even the participants with initial IELTS results of 6.0 and 6.5 reliably increased their scores after the training. In a control group of 44 students who were not undergoing any test preparation at the time, repeating the test four weeks apart also led to a statistically significant gain in the overall score, but at 1/10 of an IELTS band it was substantially smaller than that observed in the coached group. On the tests of vocabulary, sentence comprehension and the OOPT, the two groups performed similarly to each other at both testing points, with no measurable improvement from the first test. The large gains in IELTS scores observed in the coached group did not generalise to other measures of English ability.
The findings extend the current state of research in several important ways. First, they demonstrate that test-preparation centres in China may be more efficient in raising IELTS scores than preparation programmes in the United Kingdom, Australia and New Zealand, where much of the previous research was conducted. Although some of the earlier studies report a similar level of gain to the one observed here, this was typically achieved over periods of training two to three times longer (cf. Brown, 1998; Elder & O'Loughlin, 2003). In other studies, the reported gains were substantially smaller (Green, 2005, 2007; Read & Hayes, 2003). Furthermore, while previous studies observed training-induced gains primarily in students starting with IELTS scores of 5.5 and below, here we show that even students with initial scores of 6.0 and 6.5 can reliably improve their results. This has particularly important consequences for university admissions, as IELTS scores just half a band higher (6.5 and 7.0) are a typical requirement for unconditional entry at many universities. It suggests that for students who need a final push to get them over this important threshold, some short preparation courses may, indeed, be effective.
Second, by evaluating the gains of the coached group against a control group, we were able to tease apart the effects of test preparation from those of test repetition. Similar to the results reported for TOEFL (Zhang, 2008), the results of our study show that for test-takers not previously familiar with the IELTS test format, merely repeating the test four weeks later statistically improves the chance of getting a higher score. This highlights the importance of attaining familiarity with the format of a high-stakes test in order to perform to one's true level of ability (Messick, 1996). Critically, the difference in score gains between the coached and the control group demonstrates that teaching to the test can lead to a half a band rise in IELTS scores, over and above the improvement that could be attributed to familiarity with the test format or experiential growth in ability.
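The arithmetic behind the half-band figure can be sketched as a simple difference in average gains. This is an illustrative calculation only, assuming the rounded averages given in the text (6/10 of a band for the coached group and 1/10 for the control group); the function name `net_coaching_gain` is hypothetical, and the sketch does not reproduce the analysis that was actually run.

```python
from fractions import Fraction


def net_coaching_gain(coached_gain, control_gain):
    """Gain in the coached group beyond the gain seen from retesting alone."""
    return coached_gain - control_gain


# Rounded average overall gains as stated in the text, in IELTS bands.
coached_mean_gain = Fraction(6, 10)
control_mean_gain = Fraction(1, 10)

net = net_coaching_gain(coached_mean_gain, control_mean_gain)
print(f"Net gain attributable to coaching: {float(net)} band")
```

Exact fractions are used instead of floats so that the subtraction is not affected by binary rounding.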
Finally, by including the additional measures of English ability and finding that the IELTS gain did not generalise here, our study empirically confirms that even in high-stakes language tests designed to minimise threats to validity through authentic and direct assessment, curriculum-narrowing practices and instruction in test-taking strategies have the power to weaken the extrapolation link from coached scores to what students are able to demonstrate in alternative contexts (Xie, 2013). Thus, while some courses may indeed be effective in raising scores, they appear to undermine the validity of such scores.

LIMITATIONS, ALTERNATIVE EXPLANATIONS, AND DIRECTIONS FOR FURTHER RESEARCH
As our study was conducted in a single test-preparation centre, it is important to consider whether and how far the findings generalise. In another study with 153 Chinese university students in the United Kingdom (Hu & Trenkic, 2019), we observed the same effect: Among students arriving with an identical IELTS score, those who achieved it by attending IELTS-preparation programmes did less well on alternative tests of English proficiency compared to students who achieved the same score without coaching. This occurred even though the participants attended a range of IELTS-preparation programmes, differing in length, provider and location. The finding suggests that similar curriculum-narrowing test-preparation programmes, with similar outcomes for IELTS scores and their validity, may be the norm across different training centres.
At present we do not know whether the same practices are employed by the language test-preparation industry outside of China and if so, whether they have the same effects. In addition to being the result of the properties of a training regime, the effects we observed here and in Hu and Trenkic (2019) could be a consequence of the broader cultural milieu in which these programmes were embedded. Memorisation is hugely important in Chinese education, not least because of the need for verbal rote learning of hundreds of logographic characters. But it is also greatly valued and admired in other contexts, as evidenced by the popularity of TV contests based on rote memory performance (Mattys et al., 2018). It is therefore possible that test-preparation courses that rely in great part on the narrowing of the curriculum and memorising answers to probable test questions are particularly effective for quick score gains in cultures that put a premium on verbal rote memory.
Unlike IELTS, which tests academic reading, writing, listening and speaking, the alternative measures of English ability in this study all tested linguistic knowledge that underpins these skills. Could it be that coaching improved participants' academic reading, writing, listening and speaking skills without changing their general language proficiency, thus explaining the gain in IELTS scores without a change on other measures? This might be feasible but is theoretically unlikely. Academic language skills are tightly linked to and directly dependent on well-developed general language proficiency (Hoover & Gough, 1990). In fact, general language proficiency becomes more, not less, critical as academic language skills develop. For example, in both monolingual and bilingual school-age populations, measures of general language proficiency (such as vocabulary knowledge) explain greater variance in academic language skills (such as reading comprehension) in higher grades than in lower grades (Geva & Farnia, 2012). This is why a substantial gain on measures of academic reading, writing, speaking or listening without an associated improvement on measures of general language proficiency raises questions about the robustness and the interpretation of the gain. Future research could probe this further by including both measures of general language proficiency and alternative tests of academic language skills (e.g., TOEFL).

IMPLICATIONS AND CONCLUSIONS
Using IELTS as an example, our findings indicate that short but intensive curriculum-narrowing courses can reliably improve scores in high-stakes language-proficiency tests, but that such test-specific gains do not readily generalise to other measures of English ability. This has important implications for test-developers, test-users and test-takers alike.
We wish to stress that our results do not question the soundness of IELTS as a test of English proficiency, nor its appropriateness for university admissions purposes. Rather, and to paraphrase Goodhart's law (Strathern, 1997), they underscore that every measure that becomes a target also becomes a target for gaming. English-language tests for university entry, even when designed to measure language proficiency directly and using tasks that are as authentic as possible, appear to be no exception.
One way to lessen the attractiveness of gaming a test, and to make the alternative of more gradual gains through language development more appealing, is to space out the opportunities for taking the test. Until 2006, IELTS had a 90-day resit rule, which acknowledged that language development takes time and that quick gains in scores are unlikely to reflect a corresponding improvement in English proficiency. The removal of this rule seems to have played directly into the hands of the extreme end of the test-preparation industry and encouraged other gaming behaviours (e.g., see Hamid, 2016, for a case study of a serial repeater who took IELTS 14 times within eight months, including three attempts within a single month). Given the results of the present study and of Hu and Trenkic (2019), re-instating the resit rule to protect the validity of the score interpretation and use would be a step in the right direction.
For receiving universities that use proficiency tests for admissions purposes, our results suggest that many students may have weaker English than their qualifications indicate, and therefore more extensive measures need to be in place to support their learning. Furthermore, universities themselves may be fuelling the dubious practices of the test-preparation industry by making offers that are conditional on unrealistic improvements in English proficiency within an application cycle.
For test-takers, our study confirms that test-specific preparation programmes may help them to cross the threshold needed for a university place. However, it also raises concerns that without a corresponding improvement in English proficiency, this may be putting them at a distinct disadvantage in their studies.
Previous research has demonstrated that well-developed language and literacy skills are essential for success in every academic subject. Specifically in the context of higher education, students who arrive with English-proficiency scores recommended by test-developers do on average better than students who only meet the (typically lower) minimum requirements set for their programmes (Trenkic & Warmington, 2019); the latter, however, outperform peers who bypass these requirements altogether by gaining a direct entry into university through different pathway options (Eddey & Baumann, 2011; Oliver et al., 2012). If, as our study suggests, the test-preparation industry helps candidates improve test scores with