Too many parents and students can remember a bad experience from school where they felt their work was unfairly evaluated. If different teachers look for different things, it seems improbable that a standardized system could fairly grade something like the SAT essay section. Many wonder how SAT essays are graded and how the process could be objective.
SAT Essay Reader Qualifications
In 2005 I was selected as one of the first graders for the essay section when it was introduced on the March SAT. To qualify as a reader one must:
- Hold a bachelor’s degree or higher
- Have taught a high school or college course with a writing component
- Have at least three years of teaching experience
- Not have received payment for SAT preparation from a test-prep company or from individual students in the past 12 months
- Satisfy the technical and security requirements for grading online
To be considered or to learn more, visit the College Board’s Professional site.
Grading Standards For SAT Essays
Once approved as an SAT reader, I entered a process of training and evaluation designed to teach readers what to look for and how SAT essays are graded. Teachers who have graded writing for AP exams or other standardized tests will understand that scores are NOT subjective.
The SAT essay scoring key is available online. It clearly states what readers are looking for and what standards must be met to score anywhere from a 1 to a 6. I had a laminated copy of the grading scale taped to my computer monitor. It was the final word on scoring.
To ensure essays are graded according to standards, all readers must successfully complete hours of training and pass multiple grading tests. Every time I logged in to read essays, I had to pass a quiz that required me to accurately score a set of sample essays. Additionally, my accuracy was tracked as I scored. Personally, I found it very stressful for a part-time job that left me reading student drafts for hours on end.
Pearson Educational Measurement and the College Board are concerned with how SAT essays are graded and work to ensure consistent and accurate evaluations.
Grading Process For SAT Essays
Beyond issues of what is graded, there are issues of how SAT essays are graded logistically. Once answer documents are sent back, each essay is electronically scanned into the system.
Readers log in to grade from their own computers and evaluate essays “on screen.” It is different from my days as a teacher, when I would sit down with a composition, mark errors, and make notes before assigning a grade. For the SAT essays I would spend a minute or two reading the essay and comparing it to the scoring rubric. Then I would enter one numerical score, with no comments or explanation needed.
Two readers score each student’s essay. If the two scores differ by more than a point, a supervisor evaluates the essay and corrects the scoring.
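The two-reader rule just described can be sketched in a few lines of code. This is an illustrative sketch only: the function name, the exception-based flagging, and the summing of the two readings into a 2–12 total are my assumptions, not details stated in the scoring key.

```python
def combine_scores(reader1: int, reader2: int) -> int:
    """Sketch of the two-reader rule: each reader assigns a 1-6 score.

    If the readings differ by more than one point, flag the essay for
    supervisor review; otherwise return the summed score (2-12).
    (The names and the summing convention here are illustrative assumptions.)
    """
    if not (1 <= reader1 <= 6 and 1 <= reader2 <= 6):
        raise ValueError("each reading must be between 1 and 6")
    if abs(reader1 - reader2) > 1:
        raise ValueError("readings differ by more than a point: supervisor review")
    return reader1 + reader2

print(combine_scores(4, 5))  # readings agree within a point -> prints 9
```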
How SAT Essays Are Graded: Conclusion
While no system is perfect, I believe how SAT essays are graded is generally a fair and accurate process. Yes, readers will have off days. Yes, some graders may be turkeys looking to cut corners, but most readers are trying very hard to provide scores that follow grading guidelines.
You may also want to read about one of the most controversial SAT essay topics here.
Education activists are increasingly becoming concerned about the computer grading of written portions of new Common Core tests. Can a computer really grade written work as well as a human being? Here’s a piece on the issue by Leonie Haimson, a co-founder of the Parent Coalition for Student Privacy, a national alliance of parents and advocates defending the rights of parents and students to protect their data.
[The astonishing amount of data being collected about your children]
(Correction: Earlier version said the GRE is only scored by machine. It is also scored by humans.)
By Leonie Haimson
On April 5, 2016, the Parent Coalition for Student Privacy, Parents Across America, Network for Public Education, FairTest and many state and local parent groups sent a letter to the Education Commissioners in the states using the federally funded Common Core tests known as PARCC and SBAC, asking about the scoring of these exams.
We asked them the following questions:
- What percentage of the ELA exams in our state are being scored by machines this year, and how many of these exams will then be re-scored by a human being?
- What happens if the machine score varies significantly from the score given by the human being?
- Will parents have the opportunity to learn whether their children’s ELA exam was scored by a human being or a machine?
- Will you provide the “proof of concept” or efficacy studies promised months ago by Pearson in the case of PARCC, and AIR in the case of SBAC, and cited in the contracts as attesting to the validity and reliability of the machine-scoring method being used?
- Will you provide any independent research that provides evidence of the reliability of this method, and preferably studies published in peer-reviewed journals?
Our concerns had been prompted by seeing the standard contracts that Pearson and AIR had signed with states. The standard PARCC contract indicates that this year, Pearson would score two-thirds of the students’ writing responses by computer, with only 10 percent of these rechecked by a human being. In 2017, the contract said, all PARCC writing samples were to be scored by machine, with only 10 percent rechecked by hand.
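The contract’s percentages imply that most responses would never reach human eyes. A quick back-of-the-envelope calculation makes this concrete; the cohort size of 300 is my own illustrative assumption, not a figure from the contract:

```python
total = 300                                # illustrative cohort, not from the contract
machine_scored = total * 2 // 3            # two-thirds scored by computer -> 200
rechecked = machine_scored * 10 // 100     # 10 percent of those rechecked -> 20
human_first_pass = total - machine_scored  # remaining third scored by people -> 100

seen_by_a_human = human_first_pass + rechecked
print(seen_by_a_human, "of", total)        # prints: 120 of 300
```

Under these terms, only 120 of every 300 responses would be read by a person at any stage.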
This policy appears to contradict the assurances on the PARCC scoring FAQ page, which says, “Writing responses and some mathematics answers that require students to explain their process or their reasoning will be scored by trained people in the first years.”
On another Pearson page, linked to from the FAQ, called “Scoring the PARCC Test,” the informational sheet goes on at great length about the training and experience levels of the individuals selected for scoring these exams (which is itself quite debatable) without even mentioning the possibility of computer scoring. In fact, we can find nowhere on the PARCC website in any page that a parent would be likely to visit that makes it clear that machine-scoring will be used for the majority of students’ writing on these exams.
In an Inside Higher Ed article from March 15, 2013, Smarter Balanced representatives said that they had retreated from their original plans to switch rapidly to computer scoring, “because artificial intelligence technology has not developed as quickly as it had once hoped.” Yet the standard AIR contract with the SBAC states calls for all the written responses to be scored by machine this year, with half of them rechecked by a human being; next year, only 25 percent of writing responses will be re-checked by a human being.
In both cases, however, for an additional charge, states can opt to have their exams scored entirely by real people.
The Pearson and AIR contracts also promised studies showing the reliability of computer scoring. After we sent our letter and a reporter inquired, Pearson finally posted a study from March 2014. The SBAC automated scoring study is here. Both are problematic in different ways.
According to Les Perelman, retired director of a writing program at MIT and an expert on computer scoring, the PARCC/Pearson study is particularly suspect because its principal authors were the lead developers for the ETS and Pearson scoring programs. Perelman said: “It is a case of the foxes guarding the hen house. The people conducting the study have a powerful financial interest in showing that computers can grade papers.”
In addition, the Pearson study, based on the spring 2014 field tests, showed that the average scores received from either a machine or a human scorer were “very low: below 1 for all of the grades except grade 11, where the mean was just above 1.” This chart shows the dismal results:
Given the overwhelmingly low scores, the results of human and machine scoring would of course be closely correlated in any scenario.
Les Perelman concludes: “The study is so flawed, in the nature of the essays analyzed and, particularly, the narrow range of scores, that it cannot be used to support any conclusion that Automated Essay Scoring is as reliable as human graders. Given that almost all the scores were 0’s or 1’s, someone could obtain close to the same reliability simply by giving a 0 to the very short essays and flipping a coin for the rest.”
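Perelman’s point about the narrow score range is easy to see with a toy example. The score distribution below is my own illustrative assumption, echoing the study’s finding that almost all scores were 0s or 1s; even a “scorer” that ignores the writing entirely agrees with the human readers most of the time:

```python
# Hypothetical distribution for 100 essays, skewed toward 0 as in the study.
human_scores = [0] * 70 + [1] * 25 + [2] * 5

# A trivial "scorer" that never reads anything and always outputs 0.
trivial_scores = [0] * len(human_scores)

exact_agreement = sum(h == t for h, t in zip(human_scores, trivial_scores)) / len(human_scores)
within_one = sum(abs(h - t) <= 1 for h, t in zip(human_scores, trivial_scores)) / len(human_scores)

print(exact_agreement)  # prints 0.7
print(within_one)       # prints 0.95
```

When nearly everyone scores at the floor, high “agreement” statistics say almost nothing about whether the machine can actually evaluate writing.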
As for the AIR study, it makes no particular claims as to the reliability of the computer scoring method, and omits the analysis necessary to assess this question.
As Perelman said: “Like previous studies, the report neglects to give the most crucial statistics: when there is a discrepancy between the machine and the human reader, when the essay is adjudicated, what percentage of instances is the machine right? What percentage of instances is the human right? What percentage of instances are both wrong? … If the human is correct, most of the time, the machine does not really increase accuracy as claimed.”
Moreover, the AIR executive summary admits that “optimal gaming strategies” raised the score of otherwise low-scoring responses by a significant amount. The study then concludes that, because one computer scoring program was not fooled by the most basic gaming strategy of repeating parts of the essay over again, computers can be made immune to gaming. The Pearson study doesn’t mention gaming at all.
Indeed, research shows it is easy to game machine scoring by writing long, nonsensical essays with abstruse vocabulary. See, for example, this gibberish-filled prose that received the highest score from the GRE computer scoring program. The essay was composed by the BABEL generator, an automatic writing machine that generates gobbledygook, invented by Les Perelman and colleagues. [A complete pair of BABEL-generated essays, along with their top GRE scores from ETS’s e-rater scoring program, is available here.]
In a Boston Globe opinion piece, Perelman describes how he tested another automated scoring system, IntelliMetric, which similarly was unable to distinguish coherent prose from nonsense and awarded high scores to essays containing the following phrases:
“According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.’’
Unable to analyze meaning, narrative, or argument, computer scoring instead relies on length, grammar, and arcane vocabulary to assess prose. Perelman asked Pearson if he could test its computer scoring program but was denied access. Perelman concluded:
If PARCC does not insist that Pearson allow researchers access to its robo-grader and release all raw numerical data on the scoring, then Massachusetts should withdraw from the consortium. No pharmaceutical company is allowed to conduct medical tests in secret or deny legitimate investigators access. The FDA and independent investigators are always involved. Indeed, even toasters have more oversight than high stakes educational tests.
A paper dated March 2013 from the Educational Testing Service (one of the SBAC sub-contractors) concluded:
Current automated essay-scoring systems cannot directly assess some of the more cognitively demanding aspects of writing proficiency, such as audience awareness, argumentation, critical thinking, and creativity…A related weakness of automated scoring is that these systems could potentially be manipulated by test takers seeking an unfair advantage. Examinees may, for example, use complicated words, use formulaic but logically incoherent language, or artificially increase the length of the essay to try and improve their scores.
The inability of machine scoring to distinguish between nonsense and coherence may lead to a debasement of instruction, with teachers and test prep companies engaged in training students on how to game the system by writing verbose and pretentious prose that will receive high scores from the machines. In sum, machine scoring will encourage students to become poor writers and communicators.
Only five state officials responded to our letter after a full month. Dr. Salam Noor, the deputy superintendent of Oregon; Deputy Commissioner Jeff Wulfson of Massachusetts; Henry King of the Nevada Department of Education; and Dr. Vaughn Rhudy from the Office of Assessment in West Virginia informed us that their states were not participating in automated scoring at this time. Wyoming Commissioner Jillian Balow also replied to our letter, saying that she shared our concerns about computer scoring and that Wyoming was not using the SBAC exam, as we had mistakenly believed.
In contrast, Education Commissioner Richard Crandall responded to local parent activist Cheri Kiesecker that Colorado would be using computer scoring for two-thirds of students’ PARCC writing responses:
“Automated scoring drives effective and efficient scoring of student assessments, resulting in faster results, more consistent scoring, and significant cost savings to PARCC states. This year in Colorado, roughly two-thirds of computer-based written responses will be scored using automated scoring, while one-third will be hand-scored. Approximately 10 percent of all written responses will receive a second hand scoring for quality control.”
He added that parents would never know if their child’s writing was scored by a machine or a human being, because different items on each individual test sheet are apparently randomly assigned to machines and humans.
On April 5, 2016, the same day we sent the letter, Rhode Island Education Commissioner Ken Wagner spoke publicly to the state’s Council on Elementary and Secondary Education about the automated scoring issue. He claimed that “the research indicates that the technology can score extended student responses with as much reliability, if not more reliability, than expert trained teacher scores …” (Here’s the video; watch from about 11 minutes in.)
He repeated this claim once again, that the machines outperform even the most highly trained and experienced teachers:
“The research has … not just looked at typical teacher scores but expert trained teacher scores and then compared the automated scoring results to the expert trained teacher scores and the results are either as good or if not…better….”
This appears, on the face of it, to be an absurd claim. How can a machine do better than an expert trained teacher in scoring a piece of writing?
Wagner went on to insist that “SAT, GRE, GMAT, those kinds of programs have been doing this stuff for a very long time.” Yet on the GRE, every writing sample is scored by both a computer and a human being. And to its credit, the College Board uses trained human scorers exclusively on writing samples for the SAT and AP exams.
The following 18 states and districts have failed to respond to our letter or those of other parents as to whether they are using computers to score writing samples on their PARCC and SBAC exams: CA, CT, DE, DC, HI, ID, IL, LA, MD, MI, MT, NH, NJ, NM, ND, SD, VT, and WA.
The issue of computer scoring — and the seeming reluctance of the states and companies involved in the PARCC and SBAC consortia to be open with parents about this — is further evidence that the ostensible goal of the Common Core standards to encourage the development of critical thinking and advanced skills is a mirage. Instead, the primary objective of Bill Gates and many of those promoting the Common Core and allied exams is to standardize both instruction and assessment and to outsource them to reductionist algorithms and machines, in the effort to make them more “efficient.”
Essentially, the point of this grandiose project imposed upon our nation’s schools is to eliminate the human element in education as much as possible. As a recent piece by Pearson on artificial intelligence (AI) argues,
True progress will require the development of an AIEd infrastructure. This will not, however, be a single monolithic AIEd system. Instead, it will resemble the marketplace that has developed for smartphone apps: hundreds and then thousands of individual AIEd components, developed in collaboration with educators, conformed to uniform international data standards, and shared with researchers and developers worldwide. These standards will enable system-level data collation and analysis that help us learn much more about learning itself and how to improve it.
If we are ultimately successful, AIEd will also contribute a proportionate response to the most significant social challenge that AI has already brought – the steady replacement of jobs and occupations with clever algorithms and robots. It is our view that this phenomena provides a new innovation imperative in education, which can be expressed simply: as humans live and work alongside increasingly smart machines, our education systems will need to achieve at levels that none have managed to date.
Here, Pearson appears to be suggesting that a robust marketplace in data-mining computer apps supplied with artificial intelligence will lead to a proliferation of jobs for ed-tech entrepreneurs and computer coders, making up for the proportional loss of jobs for teachers. This seems to be further evidence that their ultimate goal, as well as that of their allies in the foundation and corporate worlds, is to maximize the mechanization of education and minimize the personal interaction between teachers and students, as well as among students themselves, in classrooms throughout the United States and abroad.
Is this the future we want for American public school students?
More information about the lack of evidence for machine scoring is in this issue brief here. If you are a parent from one of these states: please send in your questions, especially bullet points #1 to #3 above. The email addresses of your state commissioners are posted here. And please let us know if you get a response by emailing us at email@example.com .