In billions of words, digital allies find tale
Google, Harvard join in book study
By Carolyn Y. Johnson
Globe Staff / December 17, 2010
Mining the complete text of 4 percent of the
world’s books, Harvard University and Google researchers used a powerful new
tool unveiled yesterday to glean surprising insights into language, culture,
and history.

Books already tell stories, but when their words are combined and analyzed with
computational tools, they tell bigger tales. By studying billions of words that
appeared in books published over the last 200 years, the researchers found that
references to God have been dropping off since about 1830. People are becoming
celebrities earlier in life now than in the past, but their fame is more
fleeting as their names drop out of the lexicon. References to past years are
dropping off more quickly as cultures shift their focus to the present. And
censorship leads to discernible shifts in a person’s or event’s cultural footprint,
as evident in tracking Tiananmen in Chinese books, or the Jewish artist Marc
Chagall in German books from the Nazi era.

The findings, fruit of the ambitious Google project to digitize every book in
existence, were reported yesterday in the journal Science. They are a
tantalizing first glimpse at what researchers think may become a transformative
new tool for humanities researchers.

Google is publicly launching the tool, Google Books Ngram Viewer, to allow scholars or
the simply curious to ask questions, such as when references to “The Great
War,” which peaked between 1915 and 1941, were replaced by “World War I.” The
tool allows people to look up words or phrases of one to five words and see
their occurrences over time: the number of times a word or phrase is
mentioned in a given year divided by the total number of words written that
year.
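
As a rough illustration of that ratio, the sketch below counts a phrase in each year's text and divides by that year's total word count. It is a minimal sketch and not Google's implementation: the function name `relative_frequency`, the whitespace tokenization, and the two-sentence `toy_corpus` are all assumptions invented for this example.

```python
def relative_frequency(phrase, year_texts):
    """Return {year: occurrences of phrase / total words in that year}.

    Hypothetical helper: splits on whitespace and lowercases everything,
    which is far cruder than real tokenization of scanned books.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, text in year_texts.items():
        words = text.lower().split()
        total = len(words)
        if total == 0:
            result[year] = 0.0
            continue
        # Slide a window of n words across the text and count exact matches.
        windows = zip(*(words[i:] for i in range(n)))
        count = sum(1 for window in windows if window == target)
        result[year] = count / total
    return result


# Invented toy data, not drawn from any real book.
toy_corpus = {
    1920: "the great war changed europe and the great war ended empires",
    1950: "world war i was once called the great war",
}
freqs = relative_frequency("great war", toy_corpus)
for year in sorted(freqs):
    print(year, round(freqs[year], 4))
```

Dividing by the yearly total is what makes different years comparable, since far more words were published in later years than in earlier ones.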

“This is really the largest data release in the history of the humanities — a
fantastic wealth of data,” said Jean-Baptiste Michel, a postdoctoral researcher
in the program for evolutionary dynamics at Harvard. “In our paper we present
our initial investigation — we explore this new terrain, we dig a little bit.
It is a very cool feeling to have, but what people will be able to do will far
exceed everything we have done.”

In this analysis, the researchers used the data set to look at changes in grammar
and English, finding that about half the words that appear in books are “dark
matter” that do not appear in dictionaries: words that may be compound
constructions or proper nouns, or are simply undocumented, like “aridification”
or “slenthem.” English, they found, is growing by about 8,500 words a year.

They have also looked at collective memory, and forgetting. Authors are letting the
past go more quickly. The year “1880” had dropped to half its maximum
frequency of references 32 years later, in books written in 1912. But it took
only a decade for “1973” to decline to half its prominence.
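
One way to picture that "half its maximum" measurement is to find the peak of a yearly frequency series and then the first later year at or below half the peak. The sketch below does this under stated assumptions: the function `years_to_half_peak` is hypothetical, the decline is measured from the peak year, and the numbers in both toy series are invented to mirror the figures quoted above rather than taken from the study's data.

```python
def years_to_half_peak(series):
    """Years from the peak to the first later year at or below half the peak.

    `series` maps year -> frequency (any consistent unit). Returns None if
    the frequency never falls to half its maximum within the data.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(series):
        if year > peak_year and series[year] <= half:
            return year - peak_year
    return None


# Invented values in arbitrary units, chosen only to echo the article.
mentions_of_1880 = {1880: 8.0, 1890: 6.5, 1900: 5.2, 1912: 4.0, 1920: 3.1}
mentions_of_1973 = {1973: 10.0, 1978: 7.0, 1983: 5.0, 1990: 3.0}

print(years_to_half_peak(mentions_of_1880))
print(years_to_half_peak(mentions_of_1973))
```

A shorter interval for the later year is what the researchers read as the past being forgotten more quickly.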

Researchers found that use of the word “women” has been rising for 200 years, and began to
eclipse mentions of “men” around the mid-1980s. And the frequencies of
“pizza,” “pasta,” and “ice cream” have soared since the 1970s. Food for
thought.

The study, led by Michel and senior author Erez Lieberman Aiden, who runs the
multidisciplinary Laboratory-at-Large at Harvard’s engineering school, drew on
a wide array of collaborators, not only from Harvard and Google, but also from
Encyclopaedia Britannica and the American Heritage Dictionary.

Michel and Lieberman Aiden had worked together on a 2007 study in the journal Nature
that tracked the evolution of language through a much more painstaking process
— hunting down obscure old books and reading them to discover the linguistic
heritage of modern verbs. They began to notice the growth of Google Books, the
initiative that has now scanned 15 million volumes — more than 10 percent of
all published books, according to Jon Orwant, engineering manager of Google
Books, which has a large presence in Cambridge.

Seeing that their research techniques would soon be antiquated, Michel and Lieberman
Aiden approached Google and began a collaboration.

“As we’ve amassed more and more information that isn’t available elsewhere, I
started to realize we’re sitting on these troves of data that are very
useful,” Orwant said. The value, he said, is not just to Web users searching
for answers to specific questions, but to scholars, too.

The efforts are part of a much broader push to bring the power of analyzing large
data sets to the increasingly digitized world of humanities research.

“If you look at what humanities scholars have studied for hundreds of years, they
tend to study things like books, music. The difference today is those are
digital and you have the potential of searching and ‘reading’ much larger
amounts of this information than you ever could before,” said Brett Bobley,
director of the office of digital humanities at the National Endowment for the
Humanities.

Such tools would not supplant humanities researchers’ current methods, Bobley said.
But they could supplement work and broaden the scope of research questions,
which are limited by how much people can read and remember.

Researchers calculated, for example, that just reading the books from the year 2000 in the
two-century data set used in the Science paper would take 80 years — without
interruption for meals or sleep.

Franco Moretti, co-director at the Stanford Literary Lab, praised the methods and the
findings of the study. Going forward, digital humanities researchers have
increasingly powerful tools, but the challenge will be interpretation — finding
links between quantity and meaning.

“Just as it makes an enormous difference [for paleontologists] whether a bone
fragment belongs to a creature’s tail or neck, so it makes a great difference
whether the word ‘God’ . . . occurs as a self-explaining given, in a discussion
of principle, or as a banal interjection; whether, in a play, it is used more
often in soliloquies, love duets, or public scenes; and so on,” Moretti wrote
in an e-mail.
© Copyright 2010 Globe Newspaper Company.
http://www.boston.com/news/local/massachusetts/articles/2010/12/17/harvard_google_join_in_study_of_books_from_past_200_years/?page=2