NEW WORD ORDER

THE ATTACK OF THE INCREDIBLE GRADING MACHINE

BY CLIVE THOMPSON

Darrell Laham, Peter Foltz, and Thomas Landauer got up early on April 16, 1998, and found their work cited on the front page of USA Today. "SOFTWARE MAKES THE GRADE ON ESSAYS," blared the headline. The article went on to discuss the psychologists' Intelligent Essay Assessor (IEA), a computer program that attempts to judge essays by analyzing statistical patterns in word usage. More to the point, as USA Today breathlessly noted, the program can actually grade papers, conferring marks that closely match the average professor's.

Laham, a Ph.D. candidate at the University of Colorado, Foltz, a professor at New Mexico State University, and their intellectual mentor Thomas Landauer, also a professor at Colorado, had spent almost ten years improving the word-crunching method behind the software. They had tested it on two thousand student essays. And the day the story appeared in USA Today, they were in San Diego to present their findings at a meeting of the American Educational Research Association (AERA).

They almost didn't make it to the meeting. "The phone started ringing in our hotel room," Laham recalls. Reporters from Dallas and Dayton, London and Toronto, jammed the phone lines. "It was just crazy," Laham says. "We spent most of the day in the pressroom. We barely got down to give the talk in time." The media inquiries didn't let up until two o'clock the next morning. Everyone had the same questions: What did this mean for education? Would teaching assistants become obsolete? Would professors stop grading papers altogether?

Few scenarios provoke more hysteria than the one in which the machines take over, especially when it involves machines' replicating the arguably unique human capacity for language and thought. Not that such anxieties have kept us from trying to build smart machines; on the contrary, researchers have tried for decades to produce essay-grading software. But few have claimed the success that Landauer and his colleagues have—and none has applied such unorthodox assumptions.

Traditionally, computer techniques for analyzing written language relied on grammar and syntax. Since the 1960s, programmers have been designing software that searches for familiar features of good writing, such as sentence length, spelling, and the rhetorical signposts of argumentation—words like "might," "therefore," or "however." The goal was to get computers to "read" texts with the aid of dictionaries, spell checks, and grammar software. Most of these efforts have been disappointing.

But the technique Landauer came up with is an altogether different beast. His software completely ignores style, grammar, and syntax. In fact, it relies on none of the familiar rules of language at all. It concerns itself solely with the following question: Does the student's essay use words appropriate to the subject matter? To answer this question, Landauer's software performs a series of operations. First, it assembles a customized database of texts on the assigned topic. It measures the spatial relationships among all the words in these texts, noting where each word appears and which other words it is near. The software then performs a similar analysis on the student essay. Finally, it compares the student essay with the texts in its database. The theory behind the method is this: For any given essay, good content is a function of using certain words in the vicinity of certain other words, and that accomplishment can be expressed numerically.
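To make the comparison concrete: the sketch below, written for this piece in Python, scores an invented one-line essay against two invented topic sentences using nothing more than word counts and cosine similarity. It is a deliberately crude stand-in for the richer spatial analysis described above, not the team's actual code.

    # A crude, hypothetical stand-in for the comparison step described
    # above: score an essay by how similar its word usage is to texts
    # on the assigned topic.
    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two bag-of-words vectors."""
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Invented texts, purely for illustration.
    topic_texts = [
        "the invasion of panama was ordered by president bush",
        "united states forces deposed general noriega in panama",
    ]
    essay = "bush ordered united states forces into panama"

    reference = Counter(w for t in topic_texts for w in t.split())
    print(round(cosine(reference, Counter(essay.split())), 2))

Raw counts like these collapse as soon as a student writes "Noriega's ouster" instead of "deposed Noriega"; the semantic space the article turns to below is Landauer's answer to exactly that problem.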

It sounds too complicated (or too simple) to be any good, but Landauer and his team claim their program can crank out grades as accurately as any professor. Indeed, flushed with the success of their experimental results, the psychologists have formed a company—Knowledge Analysis Technologies—to market their tool to schools. They have already signed up a professor at Florida State University to test drive the grading machine in his classroom next fall.

But cornering the instructional technology market is just part of the plan. When it comes to making an intellectual impact, the psychology professors' ambitions are grander still. The ideas behind IEA, its inventors claim, provide nothing less than a revolutionary insight into the mysteries of human thought. Declaring that their software "can offer a close enough approximation to people's knowledge to underwrite theories and tests of theories of cognition," the psychologists argue that their invention "constitutes a fundamental computational theory of the acquisition and representation of knowledge." If Landauer's software can evaluate texts by identifying statistical patterns of word usage, could it be that humans make sense of language in a similar—purely statistical—way? What if our brains are nothing more than extremely powerful number crunchers?

This is a radical notion, and not everyone is buying it. Linguists and philosophers in particular remain deeply skeptical. Landauer and his team want us to think of their software's "knowledge of the world [as] analogous to a well-read nun's knowledge of sex." But that means that much of what the linguists have taught us about how language works—especially about the importance of syntax and grammar—is wrong. Are the linguists mistaken? Or are IEA's inventors—on this issue at least—wildly off the mark?

Landauer, sixty-seven, is an unlikely revolutionary. Most of his career has been devoted to small, focused problems rather than big theories. "I start with the practice first," he says, "then work my way out." The project that evolved into his essay-grading software began in the mid-1980s, out of a seemingly mundane question: Why are computers so hard to use? At the time, personal computers were just beginning to catch on, and Landauer—a Harvard Ph.D.—was working for Bell Labs, heading up the company's cognitive-science division. For a research scientist, Bell was a fairly delirious place to work; projects were wide-ranging and not necessarily focused on the bottom line, a luxury that many universities could not afford. When Landauer decided to look into the interaction of humans and computers, it was still a relatively new field. "No one," says Landauer, "was asking why people found computers so difficult." The software creators didn't realize there was a problem; they thought computers were supposed to be hard.

Landauer's team began conducting tests on computer users. They discovered that user confusion was almost always a matter of language. Computer commands were often unintuitive. For example, the command for listing files was typically "DIR," an abbreviation for "directory." Users would often forget this term, trying more intuitive words like "list" instead. But "list" didn't work, of course, and frustrated users were all too often inclined to give up. "People were literally quitting Bell because UNIX"—a common operating system for mainframe computers—"was such a nightmare," Landauer notes.

Computerized search engines posed similar problems. Anyone who's used a computerized database knows the hair-pulling frustration of keyword searches. Even a finely honed search can often fail to produce a worthwhile document because you're not using quite the right keyword. In information-retrieval circles, this is referred to as the synonymity problem. A colleague of Landauer's at Bell Labs offers an example: "Say I'm looking for articles on man-machine interaction. If there's a paper on human-computer interaction, I'm not going to find it, even though they're dealing with the same concepts."

Synonyms, of course, are a basic feature of language. Different people use different words to say the same thing. And the same person is likely to use different words to say the same thing on different occasions. We do this without thinking, unconsciously selecting synonyms, effortlessly substituting one idiomatic expression for another. But computers are helpless at this task. A computer has to match one keyword to another precisely; it can't determine that man-machine is saying the same thing, generally, as human-computer. In the late 1980s, Landauer's team began the hunt for a way around this problem. "We needed a system that would learn automatically what words were related to each other," Landauer says. "It was an issue of semantics."

Within a year, his research team had developed an intriguing solution. Employing a complex mathematical technique, they were able to examine how every single word in a given body of texts is related to every other word in the texts—and express those relationships numerically. This technique involved taking a handful of texts on a particular topic and creating a "semantic space" out of their words. An elaborate, multidimensional model, the semantic space is difficult to visualize. But you could think of it this way: In a simple two-dimensional grid, a point is represented by two numbers, its x and y coordinates. In Landauer's semantic space, a word is plotted in three hundred or more dimensions simultaneously. By comparing the strings of coordinate values for two words, you can theoretically tell how related they are—how likely they are to appear together in a sentence or a text—on a scale of −1.0 to 1.0. If the value for two words is close to 1.0, the words are virtually interchangeable; if it is close to −1.0, chances are they will never appear in the same text.
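The article doesn't give the mathematics, but published accounts of the technique describe it as a truncated singular value decomposition (SVD) of a word-by-document count matrix. The toy sketch below—two dimensions instead of three hundred, four invented sentences instead of a real corpus—shows how such a space ends up assigning any pair of words a similarity between −1.0 and 1.0.

    import numpy as np

    # Four invented one-line "documents"; real spaces are built from
    # millions of words.
    docs = ["copper is extracted by strip-mining",
            "strip-mining scars the land near copper deposits",
            "the taxi idled outside the theater",
            "a taxi driver learns the city streets"]

    vocab = sorted({w for d in docs for w in d.split()})
    # Word-by-document count matrix: rows are words, columns documents.
    counts = np.array([[d.split().count(w) for d in docs] for w in vocab],
                      dtype=float)

    # Truncated SVD: keep only k dimensions (three hundred or more in
    # the system the article describes).
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    k = 2
    word_vecs = U[:, :k] * s[:k]   # each word becomes a point in k-space

    def sim(w1: str, w2: str) -> float:
        """Relatedness of two words, on a scale of -1.0 to 1.0."""
        a, b = word_vecs[vocab.index(w1)], word_vecs[vocab.index(w2)]
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(sim("copper", "strip-mining"))   # high: shared contexts
    print(sim("copper", "taxi"))           # low: disjoint contexts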

Language, Landauer's team argued, is fundamentally a matter of such statistics. Any word we know is potentially related to any other—it's just a matter of how likely or unlikely the two words are to be used in the same context. The average writer, for example, would rarely use "taxi" and "strip-mining" in the same sentence; in contrast, "strip-mining" and "copper" would have a far more intimate relationship. It doesn't mean that "taxi" and "copper" are utterly unrelated—they're just distantly related in the writer's semantic universe. Over a lifetime of seeing and reading words, Landauer argues, our tacit knowledge of these relationships enables us to determine the meaning of words and sentences. In his analysis, word order in a given sentence or text doesn't matter; words establish a significant relationship merely by appearing in close proximity to one another.

Landauer's team dubbed their technique Latent Semantic Analysis (LSA), a name that refers to the hidden connections they believe create meaning in a given document.

Early results obtained with LSA exceeded the researchers' expectations. The software demonstrated an uncanny ability to emulate certain aspects of human reading comprehension. In the early 1990s, Landauer and a colleague created a massive semantic space by crunching 4.6 million words—roughly the amount of text a student would have read by the time he or she reached the eighth grade—from entries in the Grolier's Academic American Encyclopedia. The space took up five hundred megabytes of their computer's memory, an enormous amount at the time.

When the LSA space was ready, the researchers fed it a synonym test that was part of the Test of English as a Foreign Language offered by the Educational Testing Service. Each question poses a word or phrase and asks the testee to pick the closest match from among four other words. LSA scored 65 percent. Not perfect, but still, the researchers were delighted to note, this was the average score for non-English-speaking college applicants. "We could've gotten it into college," Landauer laughs.
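In code, each synonym question reduces to picking whichever option sits nearest the stem in the space. A short sketch, reusing the toy sim() from the example above (the stem and options here are invented, not ETS test items):

    def synonym_answer(stem: str, options: list[str]) -> str:
        """Pick the option whose vector lies closest to the stem's."""
        return max(options, key=lambda w: sim(stem, w))

    print(synonym_answer("copper", ["taxi", "deposits", "theater"]))
    # "deposits" wins in the toy space: it shares contexts with "copper"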

After this success, remarkable things began to happen. As Landauer's team experimented with the encyclopedia-based semantic space, he says, the program became even more—to use his word—"humanlike." In addition to picking out synonyms, the researchers discovered that LSA could make correct inferences about sentences it had never encountered before.

Consider the following sentence: "The player caught the high fly to left field." No single word in the sentence actually mentions the game of baseball. The referent is merely implied. Yet when Landauer and a colleague analyzed the sentence using their encyclopedia LSA space, the software seemed to recognize the implicit—but entirely absent—subject. LSA noted that the words "ball," "baseball," and "hit" were all closely related to the sentence. "Ball" was rated 0.37 in similarity to the sentence; "baseball" received 0.31; "hit" got 0.27—all relatively strong associations.

Equally astonishing to the researchers, LSA could accurately determine that other words weren't closely related to the sentence. Consider the word "fly" as it appears in the phrase "high fly." "Zipper," of course, is a synonym for "fly" when we're talking about a pair of pants. But LSA rated "zipper" merely a 0.03. In other words, it signaled that in a sentence about baseball, the word "fly" did not mean "zipper." "Insect," another potential word relative, scored just 0.17. "Airplane" got only 0.18.
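A whole sentence can be dropped into the same space and compared against candidate words, which is essentially what the baseball experiment did. Published accounts of LSA represent a passage as a weighted sum of its word vectors; the sketch below, reusing the toy mining-and-taxis space built earlier, settles for a plain average.

    def sentence_vec(sentence: str) -> np.ndarray:
        """Average the vectors of a sentence's known words (a simple
        stand-in for the weighted sums used in published LSA work)."""
        vs = [word_vecs[vocab.index(w)] for w in sentence.split() if w in vocab]
        return np.mean(vs, axis=0) if vs else np.zeros(word_vecs.shape[1])

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / n) if n else 0.0

    # How close is each candidate word to the whole sentence?
    s = sentence_vec("strip-mining scars the land")
    for w in ["copper", "taxi"]:
        print(w, round(cos(word_vecs[vocab.index(w)], s), 2))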

No previous computer technique had gone this far. LSA could apprehend a close relationship between any two sentences in a text, even if they had no words in common. And it did so without recourse to preprogrammed dictionaries or grammars. In a 1997 paper on LSA, Landauer and a co-author invoked the "magical appearance of its performance."

Quantifying relationships between individual words and sentences is one thing; grading essays is another. Or is it? By the mid-1990s, Landauer's LSA project had piqued the interest of several younger researchers, including Peter Foltz. While a graduate student in psychology at the University of Colorado, Foltz spent a year as a research intern at Bellcore, the successor to Bell Labs after the breakup of AT&T. There, Foltz met Landauer and began learning about LSA. At the end of the year, Foltz returned to Boulder to finish his doctorate. Landauer, coincidentally, moved there, too, after obtaining a teaching position in the university's psychology department.

Back at school, Foltz became increasingly interested in finding practical applications for LSA. At a rudimentary level, Foltz believed, the principle behind LSA is the same as that an instructor uses to grade a student essay. In theory, he argued, the more information a student extracts from a textbook, the more his or her writing will resemble—at least in factual content—that original text. This is particularly true for the humble short-essay answer, where factual information is at a premium and where opportunities for stylistic embellishment—metaphor, simile, analogy—are often at a minimum. With the short-essay answer, Foltz concluded, a teacher isn't looking for inspired prose, just some evidence that the student has mastered the subject. Could LSA tell if this was the case?


Foltz decided to conduct an experiment. After finishing his Ph.D., he received a postdoctoral appointment at the University of Pittsburgh. There, he collected a handful of short essays that Pittsburgh history students had written in response to the question "To what extent was the U.S. intervention in Panama justified?" Before writing the essays, the students had read twenty-one short history texts on the subject, a total of 6,097 words. Foltz took those texts and used them to create an LSA space. Then he added 4,800 words from encyclopedia entries and another 17,000 from books on the topic. Satisfied that the LSA space was fully "trained" on the subject of Panama, Foltz set it to work on the papers, allowing LSA to make comparisons.

The process worked like this: To grade the essays, Foltz compared each, sentence by sentence, with the LSA space. Since LSA looks for semantic similarity, not identity, it didn't matter if the students' wordings diverged from the original texts. A sentence with high content about Panama would receive a high score, no matter how it was phrased. To determine the final mark for the paper, Foltz averaged the similarity scores across all of its sentences. The highest averages got As, the slightly lower ones got Bs, and so on.
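As a sketch of that procedure—reusing sentence_vec and cos from the previous example, with letter-grade cutoffs invented purely for illustration:

    def grade_essay(essay: str, reference_sentences: list[str]) -> str:
        """Average each essay sentence's best similarity to the reference
        material, then bucket the average into a letter grade."""
        ref_vecs = [sentence_vec(r) for r in reference_sentences]
        scores = [max(cos(sentence_vec(sent), r) for r in ref_vecs)
                  for sent in essay.split(".") if sent.strip()]
        avg = sum(scores) / len(scores)
        for cutoff, letter in [(0.8, "A"), (0.6, "B"), (0.4, "C"), (0.2, "D")]:
            if avg >= cutoff:
                return letter
        return "F"

Foltz's refinement, described below, amounts to swapping the full reading list for ten hand-picked sentences in reference_sentences.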

The central risk of this approach, Foltz admitted later, was that plagiarism would be lavishly rewarded: If a student repeated an original source text verbatim, LSA would give the essay a high score. One could raise another, more profound objection: The test question required the students to make a persuasive judgment about the Panama invasion based on their readings, but LSA could assess merely the presence of relevant information, not the strength and coherence of an argument.

Despite these weaknesses, Foltz suspected LSA's grades could match those of human graders. To find out, he hired four history graduate students and had them mark the essays, too. When he compared the human- and computer-generated grades, the results were impressive: LSA approximated the human graders. It wasn't a perfect match; in many cases, the LSA grades differed greatly from those given by a human. But then again, the human graders didn't always agree with each other either. On average, LSA produced grades that agreed almost as often with the humans as the humans agreed among themselves. Thus, LSA passed computer science's celebrated Turing test: A machine can be considered intelligent if its behavior is indistinguishable from that of a human. In other words, the point behind an automatic grading system is not to grade perfectly; it's to grade as well as a human.

Nonetheless, Foltz wasn't satisfied with LSA's performance. LSA was almost as good as a human grader but not quite, and Foltz thought he knew how to improve it. He had the four graduate-student graders comb through the Panama reading materials and pick ten sentences that they felt imparted the most essential information on the topic. The LSA space then graded the student essays based on how closely they matched the content of those ten sentences alone. As Foltz noted, the ten sentences represented the material that the graders regarded as most important, the content of an ideal answer to the Panama question. Foltz figured that human graders are looking—consciously or unconsciously—for very specific information. By having the graders identify their preferred content, Foltz could train the LSA space to look for that material, too.

Foltz's assumptions were right. With its marks based on the ten sentences, LSA came even closer to the humans' scores. In fact, this time LSA slightly outperformed its human counterparts, agreeing with the human graders more closely than they agreed with each other. LSA, it appeared, was hitting a golden mean that the human graders wandered around but seldom found. When Landauer learned of Foltz's results, he was initially surprised; later, he thought they actually made sense. Human instructors are subject to "grader drift," a well-documented propensity to raise or lower grading standards unconsciously as they work. Come across a series of really good papers, and you adjust your standards upward to compensate; come across a series of bad papers, and the reverse happens. A computer's objectivity is never compromised in this way.

When I first reached Darrell Laham by phone, he was reeling from a discussion with his lawyers over intellectual-property questions. "Things are really shaping up as far as the commercialization of this goes," he says. "I've spent countless hours these last six or seven months talking to lawyers, just to make sure we have all our t's crossed and our i's dotted." Laham is a fervent convert to LSA. As a student of Landauer's at the University of Colorado, he made LSA the subject of his master's research. Four years later, Laham is nearing the end of his Ph.D. in psychology and is a partner in Knowledge Analysis Technologies with Foltz and Landauer.

With Laham's assistance, IEA's inventors continued to tinker with the software, programming it to flag an essay with too high a score, for example, since very high scores signaled potential plagiarism. The team also experimented with alternate ways to grade papers, such as having LSA compare students' essays to a "gold standard"—a "perfect" essay on the topic written by a professor. IEA, they discovered, required a fair bit of tweaking when it first learned a subject. "We decided this couldn't be a point-and-click piece of software, because of the training it takes to get the system to the state where it's performing well," Laham says. "We have to be involved in calibrating it." After combining several grading methods, they eventually brought LSA up to the accuracy level (80 percent agreement with human graders) achieved by professionally trained markers at the Educational Testing Service, arguably the academic benchmark for grading reliability.

But it wasn't until they attended the 1998 AERA conference that the trio became serious about selling their software. "After we presented our research, people just wanted it," Laham said. When Myke Gluck, a professor of Information Studies at Florida State University, asked the researchers to set up an automatic grading system for his master's course, Foundations of Information Studies, they readily agreed. The psychologists have yet to decide how much they will charge schools for their services; about their deal with Gluck, Laham will say only, "We're subsidizing the costs quite a bit to get the thing going."

That professors should be eager for automatic essay grading is, from one perspective, a rather pointed indictment of higher education. As college enrollments have soared, so have course sizes; yet, at the same time, university administrations are demanding greater efficiency from their faculties. As a result, the essay test has frequently gone the way of the dodo, superseded by its poor cousin, the multiple-choice exam. "I have two hundred students," Gluck reports. "I know essays are the best way to get students to learn, but let's get serious: Grading two hundred papers is just impossible to do with any quality at all. If I can't figure out another way to do it, I'll have to start using multiple-choice. That's the worst thing I could imagine."

Indeed, its inventors are marketing IEA as a multipurpose pedagogical tool, one that can teach as well as grade. Foltz, for example, fed IEA the textbook he used in his upper-level psychology class. After reading a chapter in the textbook, students wrote an essay and submitted it to the software. In return, IEA gave them an instant grade, identifying material they had missed and suggesting further reading. Students were allowed to rewrite their essays as often as they wished with the understanding that only their highest grade would be recorded. On average, students rewrote their essays three times. The typical grade for the first submission was an 85; by the third rewrite, the class average was 92. At the end of the semester, 98 percent of Foltz's students said they would choose to use the mechanical grader again if given the chance.

"It's easily demonstrated in five minutes that even children use syntax to determine meaning," says Bill Nagy.

Many observers remain unimpressed. The history of artificial intelligence is littered with bright new ideas that turned out to be duds. And the computerization of teaching is a volatile issue in academe, cutting to the heart of professors' self-image. Even as they moan about the mind-numbing drudgery of teaching and grading, they bristle at the suggestion that a computer could perform these tasks as well as they do.

When news of the Intelligent Essay Assessor reached the press, the backlash was immediate. Professors and pundits fired off letters to newspapers denouncing IEA as quackery. Kevin McNamara, an English professor at the University of Houston at Clear Lake, called the grader "a parasite that will thrive at the expense of education." Amy E. Schwartz, an editorial-page staffer at The Washington Post, inveighed against it in an Op-Ed article, claiming that it was symptomatic of the many "bad trends in education.... [It] harks back to the once firmly established view that the role of the teacher in the classroom was to make sure the student mastered and could recite back a set body of information."

A frequent and largely justified complaint about IEA is that it cannot appreciate creative uses of language. Trained to recognize the most common usages of words, the program is unable to appreciate deliberate but uncommon usages—metaphor, for example. In other words, IEA can't tell that the phrase "a horse of a different color" is not referring to horses at all. "It's a classic problem of standardized testing," says Monty Neill, the executive director of FairTest, a watchdog organization for test-taking technologies. "We set up tools to look for only one criterion, and students who answer something creatively get punished."

The Colorado team doesn't contest the point. Instead, they've attempted to incorporate the weakness into the software. If IEA detects creative phrasing, it is designed to admit defeat and turn the text over to a human for proper grading. In practice, however, the software seldom admits defeat. After all, its inventors argue, creativity is moot in short-essay answers. Professors who cling to the idea of the finely crafted essay obviously haven't marked one in years. The point, Laham maintains, is to convey the maximum amount of content; brevity trumps style. Critics, he adds, suffer from a misplaced romantic ideal that the student essay is always "a piece of literature, like a Charles Lamb kind of piece."

LSA's inventors also allow that word order and syntax present significant blind spots for the software. LSA completely ignores a word's grammatical role in a sentence, an approach that led one Stanford University professor to ask Landauer, "How much should you predicate on a semantic system that considers 'the cat ate the thumbtack' equivalent to 'the thumbtack ate the cat'?" Couldn't a student fool LSA by writing an essay that used the expected words but scrambled their syntax—or used no syntax at all? The previous sentence, for example, could be represented: "student fool words expected syntax essay."

In response, Landauer points to experiments he's conducted offering bogus, ungrammatical essays to IEA. All scored poorly or failed, he claims. Obvious words, he speculates, are not the only important markers of content and meaning. Put another way, words that we might consider unessential are anything but. In one experiment, Landauer and Foltz used LSA to analyze a handful of medical students' essays on the heart. They found that supposedly nonessential words—articles and conjunctions—were contributing as much to the content as the keywords. "Perhaps the only way that a student can generate an accurate set of [keywords] is to compose a good essay," Landauer mused in a 1997 paper on the experiment.

The critics still aren't convinced. One managed to humiliate IEA's inventors by using their own tool against them. In 1998, Landauer's team set up a demonstration of IEA with sample essay questions on the company's Web site (www.knowledge-technologies.com). Last November, Dennis Baron, a professor of English and linguistics at the University of Illinois at Urbana-Champaign, visited the site. He was writing a scathing Op-Ed piece about IEA for The Chronicle of Higher Education. To buttress his criticisms, he took the first twelve paragraphs of his article and presented them as answers to two of the Web site's essay questions—one on business practices and one on marketing. He received passing grades of 3.36 and 4.82, on a scale of 1 to 6. IEA didn't flag the business-practices essay as deviant in any way: "None of the confidence measures detected anything inappropriate with this essay," it responded. The marketing essay, though it received a good grade, was flagged as suspect; the software referred it to a human grader.

Baron was triumphant. "I'm always interested in stretching the capabilities of the technology," he says sarcastically. "So I said, okay, Star Trek device, assimilate this!" On a more serious note, he worries that essay-grading software may actually do more harm than good. Machines may be intended to imitate human behavior, but what if they also shape it? "Essay-grading programs may give both students and teachers a false picture of how readers evaluate what they read, and that in turn will lead to false ideas about how we should write," he wrote in the Chronicle. Baron points to his experiments in the 1980s with The Writer's Workbench, a software program that analyzed grammar with the goal of helping writers clean up their syntax. "It was always telling me that I was no good, because I used too many long words and complex constructions." FairTest's Monty Neill agrees that such tools—now frequently included in word-processing programs—can adversely influence teaching: "If the kid writes something the machine won't accept, the teacher will say, 'Well, I won't accept it either.'"

Baron's review stung the Colorado team. They backpedaled, hastily pointing out that the on-line demo was "very, very beta." The most current version of LSA—the one that is going on the market this summer—is far better at smoking out cheats, says Laham. It offers enhanced statistical tools to assess whether the patterns of language in the essays match those found in conventional syntactic writing. One tool, for example, checks the essay's ratio of common words like "and" and "but" to the more uncommon, subject-specific words. Normal writing has a fairly standard ratio; mere lists of keywords stand out. "We've got red flags," Laham says. He cheerfully professes to be glad that Baron and others have tried to hack his system: "It's like free debugging."
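That ratio check is easy to picture: ordinary prose keeps a fairly steady share of connective words, while a bare keyword list has almost none. A minimal sketch—the stop-word list and threshold below are invented for illustration, not the company's actual values:

    # Hypothetical red-flag check for keyword-stuffed "essays."
    COMMON = {"the", "a", "an", "and", "but", "or", "of", "to", "in", "is"}

    def keyword_stuffing_flag(essay: str, min_ratio: float = 0.25) -> bool:
        """Flag essays whose share of common connective words falls far
        below what ordinary prose exhibits."""
        words = essay.lower().split()
        ratio = sum(w in COMMON for w in words) / len(words)
        return ratio < min_ratio

    print(keyword_stuffing_flag("panama invasion noriega bush forces"))
    # True: no connective tissue at all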

LSA's most serious challenges to date, however, come not from people like Baron but from experts in linguistics and language theory. "The debate has gotten heated," says Arthur Graesser, the editor of the journal Discourse Processes, which devoted an issue last year to LSA. Graesser himself is experimenting with LSA as a tutoring aid. "The debates on LSA are some of the hottest around," he says. "Tom has had a lot of bad reactions over the last two years. He's had to take a lot of flak."

Understandably, LSA troubles language experts. What does it mean that a machine can respond usefully to a text without recognizing grammar or syntax? The experts have long assumed that syntax is more important to language's operations than semantics; sentence meaning is predicated on sentence structure. This is why Landauer's program is so threatening. If you ignore grammar, it shouldn't be possible to emulate human language abilities with any degree of accuracy. "After I give a demonstration, I keep on having linguists stand up and tell me, 'What you've just done is impossible,'" Landauer says.

One way Landauer has tried to persuade the skeptical linguists is by invoking Plato's paradox, one of their own puzzles. The paradox derives from a simple question: How do you know all that you know? It has been amply demonstrated that people obtain an inexplicably large amount of knowledge from comparatively little information. The typical seventh-grader, for example, assimilates ten to fifteen new words each day. At the same time, studies show, she is likely to be formally exposed to only three new words each day. How does the seventh-grader absorb the meaning of words she has not actually encountered?


Over the years, this mysterious "poverty of stimulus" phenomenon has led philosophers, linguists, biologists, and cognitive scientists on a merry chase. Most famously, it helped Noam Chomsky formulate his notion of an innate and universal human grammar in the 1960s. For Chomsky, the only explanation for our ability to master a system as complex as language is the existence of a hardwired grammar that assists us in making sense of sentences that we haven't encountered before. In this view, now widely accepted, we're born with language and have merely to discover it for ourselves.

In a 1997 article, Landauer challenged the Chomskian approach to Plato's paradox, citing his work with LSA as evidence. LSA, he noted, distinguishes nuances in meaning purely through statistics, comparing the many different contexts in which a word occurs. Thus, when LSA encounters a new word, it does not attempt to determine the role it plays in a sentence; rather, it considers which other words appear nearby. As Landauer points out, humans often seem to behave in a similar way, guessing correctly at the meaning of a word by considering the meaning of nearby words. "More or less correct usage [of a word] often precedes referential knowledge," he wrote. "Many well-read adults know that Buddha sat long under a Banyan Tree (whatever that is) and Tahitian natives lived idyllically (whatever that means) on breadfruit and poi (whatever those are)."

At the same time, he noted that LSA's technique allows it to learn about words not only from the context in which it finds them but from contexts in which it doesn't find them. Again, Landauer argues, people may assimilate language the same way. For example, if a child reads a book about a farm, he'll likely see the words "pig" and "cow." Obviously, then, the child learns that pigs and cows are closely related to the idea of a farm. However, he probably won't find the words "typewriter" and "Corvette" in the farm book. The child therefore implicitly learns that typewriters and Corvettes are not closely related to farms, which, in a subtle way, tells him something about their meanings, too. By looking at word meaning as a function of statistics, LSA offers this solution to Plato's paradox: We amass information quickly because even a word's absence tells us something about its meaning.

For all its pretensions to radicalism, Landauer's theory does bear some resemblance to Chomsky's; in place of Chomsky's innate grammar, Landauer posits an innate computational ability. But Landauer argues that his model is more robust because it's simpler. It doesn't stipulate that our brains come equipped with complex and seemingly arbitrary grammatical rules. Instead, our brains need only be powerful calculating machines, teasing out the meanings of words by quantifying their relationships as we encounter them.

Still, the LSA brain model leaves the question of everyday syntax up for grabs. If semantics are all-important, what is there left for syntax to do? Very little, respond the three psychologists. In a paper they presented at the 1997 meeting of the Cognitive Science Society, they speculated that syntax may merely act as a gatekeeper, signaling to us which sentences are worth our attention and which are pure nonsense. In comparison, the researchers argue, semantic context is so powerful that it can create meaning even when syntax is absent. Landauer takes a sort of rebellious glee in pointing to examples: "If I say to you, 'IBM stock bought Tom,' you know that it means I'm saying, 'Tom bought stock in IBM.' There's no way you'd ever think 'IBM bought stock in Tom.'" In which case, the vast majority of linguists have been largely wasting their time for years and years.

But linguists have some impressive counterexamples of their own. "It's easily demonstrable in five minutes that even children use syntax to determine meaning," says Bill Nagy, a professor of education at Seattle Pacific University. He notes a famous linguistic experiment in which a teacher shows children a doll and calls it either "Dax" or "a Dax." If the teacher uses "Dax," the children will think the doll's name is Dax. In the other case, the children will think the doll is a Dax-type doll. "It never fails," Nagy says.

Possibly Landauer's most penetrating critic is Charles Perfetti, a professor of psychology and linguistics at the University of Pittsburgh. Perfetti knows and respects LSA; indeed, he assisted Foltz with his original essay-grading experiments in Pittsburgh. But in the special issue of Discourse Processes devoted to LSA, he soundly dismissed the trio's grandest claims. LSA, he argued, is nothing more than a very sophisticated keyword searcher, good at identifying "co-occurrences" of words but nothing more. If the results seem magical, it's because LSA can find word associations that are invisible to humans or of which they are unconscious. LSA's weakness is that it mimics only one narrow function of the brain; its strength is that it can perform this one task better than we can.

More damningly, Perfetti noted that LSA need only slip up once to prove it's not a cognitive model. He pointed to the semantic space constructed from the 4.6-million-word encyclopedia. If it were truly a model of the brain, it would know everything an eighth-grader would know. But when asked for a synonym for "physician," it chose "nurse" over "doctor." "Such an error is not expected from a student but should be expected from a system that depends exclusively on the co-occurrence of words," Perfetti wrote, with what sounded like the thud of unimpeachable judgment.

In many respects, the debate over LSA reflects the anxieties faced by traditional academic disciplines as they encounter the computer revolution. If linguists have traditionally emphasized rules of syntax or grammar, it may be partly because these features have offered the most convenient means of analyzing language; humans can't consciously do massive statistical assessments of word patterning. Computers have changed all that. They can do what we can't.


Few scholars will dispute the fact that if you analyze a text carefully enough, you'll discover at least some interesting statistical patterns. And a few of those patterns may be useful in helping us understand how our brains process and produce language. But given LSA's considerable lacunae—its indifference to scrambled word order, its inability to distinguish between synonyms and other similarities among words, its failure to grasp metaphors and similes—it's difficult to believe that LSA is even a rudimentary blueprint of the human brain. As Perfetti suggests, LSA may be best appreciated as an extremely powerful tool. It makes for an excellent assessor of the pedestrian short essay but no more.

That doesn't mean we shouldn't take the software seriously. IEA may not be a threat to linguists and philosophers, but it may nonetheless be a threat to teachers. If a rather obtuse machine can grade essays as well as a human can—and for a fraction of the price—isn't that a rather unflattering commentary on the grader, the essay, and the entire teaching process? Could so much university-level instruction really be this primitive? Rather than turn up their noses at the software's limitations, perhaps professors ought to be alarmed by what its success says about their own.

Clive Thompson is editor at large for Shift magazine and a technology columnist for Newsday. His email address is clive@bway.net
