Annotated Bibliography of Machine Grading of Essays, Part 2

Ericsson, Patricia Freitag & Haswell, Richard H. (Eds.). (2006). Machine Scoring of Student Essays: Truth and Consequences. Logan, UT: Utah State University Press.*

A compilation of seventeen original essays by teachers of composition discussing the assessment methodology and educational impact of commercial computer-based essay-rating software such as the College Board’s WritePlacer Plus, ACT’s e-Write, ETS’s e-rater, and Measurement, Inc.’s Project Essay Grade (PEG), as well as essay feedback software such as Vantage Learning’s MY Access! and ETS’s Criterion. Addresses many issues related to the machine scoring of writing: historical understandings of the technology (Ken S. McAllister & Edward M. White; Richard Haswell; Bob Broad); investigation into the capability of the machinery to “read” student writing (Patricia F. Ericsson; Chris M. Anson; Edmund Jones; William Condon); discussions of how students have reacted to machine scoring (Anne Herrington & Charles Moran); analysis of the poor validity in placing students with machine-produced scores (Richard N. Matzen, Jr. & Colleen Sorensen; William W. Ziegler; Teri T. Maddox); a comparison of machine scores on student essays with writing-faculty evaluations (Edmund Jones); a discussion of how writers can compromise assessment by fooling the computer (Tim McGee); the complicity of the composition discipline with the methods and motives of machine scoring (Richard Haswell); writing instructors’ positive uses of some kinds of computer analysis, such as word-processing text-checkers and feedback programs (Carl Whithaus); an analysis of the educational and political ramifications of using automated grading software in a WAC content course (Edward Brent & Martha Townsend); and an analysis of commercial promotional material of software packages (Beth Ann Rothermel). Includes a 190-item bibliography of machine scoring of student writing spanning the years 1962-2005 (Richard Haswell), and a glossary of terms and products.

Wilson, Maja. (2006). Apologies to Sandra Cisneros: How ETS’s computer-based writing assessment misses the mark. Rethinking Schools 20(3).*

Wilson tested Educational Testing Service’s Critique, the part of Criterion that provides “diagnostic feedback,” by sending it Sandra Cisneros’s chapter “My Name,” from The House on Mango Street. Critique found problems in repetition, sentence syntax, sentence length, organization, and development. Wilson then rewrote “My Name” according to Critique’s recommendations, which required adding an introduction, a thesis statement, a conclusion, and 270 words, turning it into a wordy, humdrum, formulaic five-paragraph essay.

Sandene, Brent, Horkay, Nancy, Bennett, Randy Elliot, Allen, Nancy, Braswell, James, Kaplan, Bruce & Oranje, Andreas. (2005). Part II: Online writing assessment. Online assessment in mathematics and writing: Reports From the NAEP Technology-Based Assessment Project, Research and Development Series (NCES 2005–457). U.S. Department of Education, National Center for Education Statistics. Washington, DC: U.S. Government Printing Office.

While not a traditional peer-reviewed publication, the NAEP research report is considered a high-quality scholarly source. It describes the results of the 2002 Writing Online study, in which a national sample of eighth graders wrote online, and compares those results with the results of students taking the traditional pencil-and-paper format of the test. The report is a comprehensive comparison, which includes the machine scoring of essays using e-rater 2.0, with one subsection on AES (pp. 37-44). Results of the study “showed that the automated scoring of essay responses did not agree with the scores awarded by human readers.” Moreover, AES “produced mean scores that were significantly higher” than those awarded by human readers, and the human readers “agreed with each other” at higher rates than the AES scores agreed with the scores produced by the human readers. In rank ordering essays, human readers and AES again did not agree at the same rates as human readers did with each other.

Penrod, Diane. (2005). Composition in Convergence: The Impact of New Media on Writing Assessment. Mahwah, NJ: Lawrence Erlbaum.*

Argues that since writing and writing assessment are intertwined, and since writing and writing standards are rapidly changing under the impact of digital technology, machine scoring cannot keep up: “The current push for traditional assessment standards melding with computer technology in forms like the Intelligent Essay Assessor, E-rater, and other software programs provides a false sense of establishing objective standards that appear to be endlessly repeated across time and space” (p. 164).

Shermis, Mark D., Burstein, Jill & Leacock, Claudia. (2005). Applications of computers in assessment and analysis of writing. In Charles A. MacArthur, Steve Graham & Jill Fitzgerald (Eds.), Handbook of Writing Research (pp. 403-416). New York: Guilford Press.* 

A review of what the authors call “automated essay scoring” (AES) from the perspective of the testing industry. There is a brief history of the development of the most successful software, a very informed discussion of reliability and validity studies of AES (although validity is restricted to correlations with other assessments of student essays), a useful explanation of the different approaches of Ellis Page’s Project Essay Grade (PEG), ETS’s e-rater, Vantage’s IntelliMetric, and Thomas Landauer and Peter Foltz’s Intelligent Essay Assessor, and a shorter discussion of computerized critical feedback programs such as Criterion and c-rater. The authors conclude that teachers need to understand how the technology works, since “the future of AES is guaranteed, in part, by the increased emphasis on testing for U. S. schoolchildren” (p. 414).

Whithaus, Carl. (2005). Teaching and Evaluating Writing in the Age of Computers and High-Stakes Testing. Mahwah, NJ: Lawrence Erlbaum.* 

The larger argument of this book is that digital technology changes everything about the way writing is or should be taught. That includes evaluating writing. Whithaus critiques high-stakes writing assessment as encouraging students to “shape whatever material is placed in front of [them] into a predetermined form” (p. 11) rather than encouraging thinking through how to communicate to different audiences for different purposes and through different modalities. He argues that if the task is to reproduce known facts, then systems such as Project Essay Grade (PEG) or Intelligent Essay Assessor (IEA) may be appropriate; but if the task is to present something new, then the construction of electronic portfolios makes a better match. Suggests that using e-portfolios creates strong links between teaching and assessment in an era when students are being taught to use multimodal forms of communication. Argues that scoring packages such as e-Write or e-rater, and the algorithms that drive them, such as latent semantic analysis or multiple regression on countable traits, may serve to evaluate reproducible knowledge or “dead” text formats such as the 5-paragraph essay (p. 121), but cannot fairly assess qualities inherent in multimedia and multimodal writing of blogs, instant messaging, or e-portfolios, where the production is epistemic and contextual and where the evaluation should be situated and distributed (judged by multiple readers). Making this book particularly useful is its extended analysis of contemporary student texts.

Cheville, Julie. (2004). Automated scoring technologies and the rising influence of error. English Journal 93(4), 47-52.* 

Examines the theoretical foundations and practical consequences of Criterion, the automated scoring program that the Educational Testing Service is still developing. Cheville bases her critique on information provided by ETS as part of an invitation to participate in a pilot study. Contrasts the computational linguistic framework of Criterion with a position rooted in the social construction of language and language development. Links the development of the program with the high-stakes large-scale assessment movement and the “power of private interests to threaten fundamental beliefs and practices underlying process instruction” so that the real problem, “troubled structures of schooling” (p. 51), will remain.

Burstein, Jill & Marcu, Daniel. (2003). A machine learning approach for identification of thesis and conclusion statements in student essays. Computers and the Humanities 37, 455-467.*

Explains how a machine may be able to evaluate a criterion of good writing (organization) that many teachers think cannot be empirically measured. Argues that essay-based discourse-analysis systems can reliably identify thesis and conclusion statements in student writing. Explores how systems generalize across genre and grade level and to previously unseen responses on which the system has not been trained. Concludes that research should continue in this vein because a machine-learning approach to identifying thesis and conclusion statements outperforms a positional baseline algorithm.

Shermis, Mark D. & Burstein, Jill (Eds.). (2003). Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum.*

Thirteen original essay-chapters on the development of computer programs to analyze and score “free” or essay-like pieces of discourse. The bulk of the book documents and promotes current computerized methods of text analysis, scoring software, or methods to validate them: Ellis Batten Page on Project Essay Grade (PEG); Scott Elliot on IntelliMetric; Thomas K. Landauer, Darrell Laham, & Peter W. Foltz on Intelligent Essay Assessor; Jill Burstein on e-rater; Leah S. Larkey & W. Bruce Croft on binary classifiers as a statistical method for text analysis; Gregory J. Cizek & Bethany A. Page on statistical methods to calculate human-machine rater reliability and consistency; Timothy Z. Keith on studies validating several programs by correlating human ratings and machine ratings; Mark D. Shermis & Kathryn E. Daniels on use of scales and rubrics when comparing human and machine scores; Claudia Leacock & Martin Chodorow on the accuracy of an error-detection program called ALEK (Assessment of Lexical Knowledge); and Jill Burstein & Daniel Marcu on the accuracy of a computer algorithm in identifying a “thesis statement” in an open essay. Although the chapters are highly informative (data-based and well documented), conspicuously absent are studies of the use and impact of machine scoring or feedback in actual classrooms. The introduction argues that “Writing teachers are critical to the development of the technology because they inform us as to how automated essay evaluations can be most beneficial to students” (p. xv), but no new information along those lines is presented.

Williamson, Michael M. (2003). Validity of automated scoring: Prologue for a continuing discussion of machine scoring student writing. Journal of Writing Assessment, 1(2), 85-104.*

Reviews the history of writing-assessment theory and research, with particular attention to evolving definitions of validity. Argues that researchers and theorists in English studies should read and understand the discourse of the educational measurement community. When theorists and researchers critique automated scoring, they must consider the audiences they address and must understand the discourse of the measurement community rather than write only in terms of English studies theory. Argues that while common ground exists between the two communities, writing teachers need to acknowledge the complex nature of validity theory and consider both the possibilities and problems of automated scoring rather than focus exclusively on what they may see as threatening in this newer technology. Points out that there is a divide in the way writing assessment is discussed among professionals, with the American Psychological Association and the American Educational Research Association discussing assessment in a decidedly technical fashion and the National Council of Teachers of English and Conference on College Composition and Communication groups discussing writing assessment as one aspect of teaching and learning about assessment. Williamson points out that the APA and AERA memberships are much larger than those of NCTE and CCCC, and that writing studies professionals would do well to learn more about the assessment discussions happening in APA and AERA circles.

Powers, Donald E., Burstein, Jill, Chodorow, Martin S., Fowles, Mary E. & Kukich, Karen. (2002). Comparing the validity of automated and human scoring of essays. Journal of Educational Computing Research 26(4), 407-425.*

The authors compared e-rater scores with students’ self-reports of writing ability, writing accomplishment, grades in writing-intensive courses, and other “non-test” variables, and found that expert human ratings of essays correlated better with these variables than did e-rater ratings, although both correlations were low. They conclude that e-rater scores are “less valid than are those assigned by trained readers” (p. 421), but only assuming that the “non-test” variables are valid measures of writing skill.

Shermis, Mark D. & Barrera, Felicia. (2002). Automated essay scoring for electronic portfolios. Assessment Update, 14(4), 1-11.*

Provides an update on a grant from the Fund for the Improvement of Postsecondary Education (FIPSE) that explores the use of automated essay scoring (AES) for electronic portfolios. Argues that large numbers of e-portfolios necessitate the use of AES evaluative systems. Presents data showing the validity of three AES systems: Project Essay Grade (PEG), IntelliMetric, and Intelligent Essay Assessor (IEA). Reports that project researchers were creating national norms for documents; norms will be available through automated software online for a period of five years.

Shermis, Mark D., Mzumara, Howard R., Olson, Jennifer & Harrington, Susanmarie. (2001). On-line grading of student essays: PEG goes on the world wide web. Assessment and Evaluation in Higher Education 26(3), 247-260.*

Describes two studies of using Project Essay Grade (PEG) software for placement of students into college-level writing courses. In the first study, students’ papers were used to create a scoring schema for the software; in the second, scores provided by PEG and human readers were compared. Argues that PEG works because the computer scores and raters’ scores had high correlations; in addition, PEG is an efficient and low-cost way to do low-stakes writing assessment like placement. Although the authors note that a good writer could fool the system by submitting a nonsensical essay, the article does not address other potential problems with machine scoring of student essays. In fact, it ends by pointing out how PEG’s use could be expanded beyond placement assessment into the grading of essays in programs like Write 2000, which promotes more writing in grades 6-12.

Herrington, Anne & Moran, Charles. (2001). What happens when machines read our students’ writing? College English, 63(4), 480-499.*

Provides a short history of the field of composition’s response to machine scoring and examines two programs now heavily marketed nationwide: IntelliMetric, the platform of WritePlacer Plus, and Intelligent Essay Assessor. Herrington and Moran each submit work to both scoring programs and discuss the different outcomes. Argues that machine scoring does not treat writing as a rhetorical interaction between writers and readers. Calls into question the efficiency and reliability claims companies make as the primary basis for marketing their programs. Argues that machine scoring may send the message to students that human readings are unreliable, irrelevant, and replaceable, and that the surface features of language matter more than the content and the interactions between reader and text, a message that sabotages composition’s pedagogical goals.

Powers, Donald E., Burstein, Jill C., Chodorow, Martin, Fowles, Mary E. & Kukich, Karen. (2001). Stumping e-rater: Challenging the validity of automated essay scoring (GRE Report, No. 98-08bP).*

Reports on a study in which writing specialists, linguists, language testing experts, and computer software experts were encouraged to write and submit essays they believed would trick e-rater into giving higher or lower scores than the essays deserved. Human readers scored the essays, as did e-rater. The study found that readers agreed with one another within one point of the scoring scale 92% of the time, while e-rater and readers agreed within one point of each other 65% of the time. Further, e-rater was more likely to give inflated scores than to give lower than warranted scores. Some of the essays given the highest score (6) by e-rater but very low scores by human readers were those that repeated whole paragraphs or that used key phrases from the question but merely agreed with the writing prompt instead of analyzing it, as directed. Essays earning lower than warranted scores were those that included subtle transitions between ideas or frequent literary allusions. Concludes that e-rater should not be used without human scorers and that more could be done to train human scorers in the aspects of writing that e-rater overlooks. This is a technical report by ETS, so not a peer-reviewed publication, but it offers useful insight into AES.

Jones, Brett D. (1999). Computer-rated essays in the English composition classroom. Journal of Educational Computing Research, 20(2), 169-186.*

Reports on a study designed to determine how middle and high school teachers would use computer-generated ratings of student writing if they were available. Discusses the potential for computer-rated essays to help teachers give feedback on student essays. Reviews the types of feedback students find most helpful, suggests that teachers do not have enough time to provide this type of feedback, and argues that Project Essay Grade (PEG) is capable of rating the overall quality of an essay, thus leaving more time for teachers to provide more specific and content-based feedback on student papers. Stresses that PEG ratings do not give information on why an area of writing is weak (for instance, content, organization, style, mechanics, creativity), but alert teachers to areas that need attention.

Whittington, Dave & Hunt, Helen. (1999). Approaches to the computerized assessment of free text responses. Proceedings of the Third Annual Computer Assisted Assessment Conference (pp. 207-219). Loughborough, England: Loughborough University.*

Provides clear, brief descriptions of how a number of machine scoring software programs operate, including Project Essay Grade (PEG), Latent Semantic Analysis (LSA), Microsoft’s Natural Language Processing Tool, and Educational Testing Service’s e-rater. Also describes two other, potentially beneficial, software initiatives: Panlingua, which is based on the assumption that there is a universal language that reflects understanding and knowledge and on several levels would map onto a software program the way the brain understands language/ideas, and Lexical Conceptual Structure (LCS), which is based on the idea that a machine “must be capable of capturing language-independent information — such as meaning, and relationships between subjects and objects in sentences—whilst still processing many types of language-specific details, such as syntax and divergence” (p. 10). Points out that there are many important limitations of all of these software initiatives but that they hold promise and, together, represent the dominant ways of thinking about how to build software to address the scoring of complex writing tasks.

Breland, Hunter M. (1996). Computer-assisted writing assessment: The politics of science versus the humanities. In Edward M. White, William D. Lutz & Sandra Kamusikiri (Eds.), Assessment of Writing: Politics, Policies, Practices (pp. 249-256). New York: Modern Language Association.*

Briefly reviews the development of computer-based evaluation of writing by “scientists” and the resistance to this approach by those in the “humanities.” Addresses programs such as Bell Labs’ Writer’s Workbench as well as the author’s own research into Educational Testing Service’s WordMAP program. Concludes that although many writing teachers still oppose the focus on error and mechanics that characterizes the computer-based approach, a “certain amount of standardization, particularly in writing mechanics, is an essential part of writing and writing assessment,” and to deny this fact “is not good for writing instruction” (p. 256).

Huot, Brian A. (1996). Computers and assessment: Understanding two technologies. Computers and Composition, 13(2), 231-243.*

Examines the problems and possibilities of using assessment technologies, and argues that we must base decisions for using any technology on sound theory and research. Includes a literature review on computer scoring. Considers theoretical assumptions of assessment practices and computer practices with respect to teaching and communicating, paying special attention to the debate about computers as value-free versus value-laden tools. Examines validity and reliability arguments of machine scoring and the theoretical implications of using computers for assessment of and response to student writing.

Brock, Mark N. (1995). Computerized text analysis: Roots and research. Computer Assisted Language Learning 8(2-3), 227-258.*

Focuses on computerized text analysis programs, such as Writer’s Workbench, Edit, and Critique, that provide feedback to writers to prompt revision. Explains the way these programs function, summarizes how they were developed, and reviews research about their efficacy. Identifies the “exclusive focus on surface-level features of a text” as the “most severe limitation” of computerized text analysis because it directs students away from meaning making (p. 236). Concludes that the beneficial claims about these programs as writing aids are “at best controversial and at worst simply untrue” (p. 254). Describes how the programs are used to give feedback to writers and contrasts this use with how the programs grade writing.

Prepared by the NCTE Task Force on Writing Assessment

Chris Anson, North Carolina State University (chair)

Scott Filkins, Champaign Unit 4 School District, Illinois

Troy Hicks, Central Michigan University

Peggy O’Neill, Loyola University Maryland

Kathryn Mitchell Pierce, Clayton School District, Missouri

Maisha Winn, University of Wisconsin
