Adi News English: New AI Tool Can Search Millions of Historical Newspaper Pages

20201001 Thu
New AI Tool Searches Millions of Historical Newspaper Pages
A new search tool uses machine learning to search millions of
U.S. newspaper pages for historical pictures.
The U.S. Library of Congress recently launched the tool,
called Newspaper Navigator. The online search system is available
for free to the public.
The Library of Congress is the world's largest library. It
offers materials from the creative record of the United States. The
library serves as the main research service for the U.S.
Congress.
Newspaper Navigator currently permits users to search more
than 16 million pages from newspapers across the country, from 1900
to 1963.
The newspaper pages were digitized for another Library of
Congress project, called Chronicling America. This tool also
permits searches across the library's 16 million newspaper pages.
The pages contain more than 1.5 million images.
The Chronicling America system permits users to find and look
at full newspaper pages as digitized images. Users can also search
the collection by keyword, using optical character recognition, or
OCR. OCR is a technology that identifies printed characters in a
scanned image of a page, for searches or to produce text.
Because OCR reads only text, people using the Chronicling America
site had to search through newspaper pages themselves when trying to
find specific images. The new Newspaper Navigator tool offers the
ability to carry out searches based on image-only content in the
collection.
This is where the machine-learning methods come in. The search
system was trained to recognize different kinds of images. For
example, it was designed to tell the difference between photos,
maps, comics, advertisements, etc. It can also identify similar
images and return these in search results.
Benjamin Lee created the system. He is a member of the Library
of Congress' Innovator in Residence Program. The program was
established to sponsor people from different fields to create new
ways to present the library's huge historical collections to the
public.
Lee trained a machine-learning model to identify the visual
content and then ran the model over all 16 million pages in
Chronicling America.
His training data came from another Library of Congress
experiment called Beyond Words. That project invited members of the
public to help identify cartoons, drawings, pictures and
advertisements in newspapers from the World War I era.
Lee said that after he learned of the Beyond Words experiment,
he saw a great possibility to use that information to power his
machine-learning tool. "I began to wonder whether this identified
visual content was the key to throwing open the treasure chest of
visual content, throughout all 16 million pages in Chronicling
America."
Newspaper Navigator works like other search engines. Users
enter a search term in the "keyword" box. They can also choose to
limit search results by location, as well as by date.
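The keyword search with optional limits described above can be
pictured with a short sketch. This is only an illustration, not the
library's actual code: the Clipping record, its fields and the sample
captions are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Clipping:
    """Hypothetical record for one image found on a newspaper page."""
    caption: str
    state: str
    year: int


def search(clippings, keyword, state=None, year_range=None):
    """Return clippings whose caption contains the keyword.

    The state and year_range limits are optional, like the location
    and date limits in the search form.
    """
    keyword = keyword.lower()
    results = []
    for clipping in clippings:
        if keyword not in clipping.caption.lower():
            continue
        if state is not None and clipping.state != state:
            continue
        if year_range is not None:
            first_year, last_year = year_range
            if not (first_year <= clipping.year <= last_year):
                continue
        results.append(clipping)
    return results


# Made-up sample data for the demonstration.
clippings = [
    Clipping("Map of the new railroad line", "Ohio", 1912),
    Clipping("Railroad workers on strike", "Texas", 1922),
    Clipping("County fair baseball game", "Ohio", 1935),
]

hits = search(clippings, "railroad", state="Ohio", year_range=(1900, 1920))
print([c.caption for c in hits])  # ['Map of the new railroad line']
```

Leaving out the state and year_range arguments returns every
clipping that matches the keyword alone.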
But one of the most powerful tools in the system is the
ability to search images by visual similarity. Users of the tool
can save images to a personal "collection." They can then use those
images as a basis for finding other visually similar images across
the library's full collection.
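One common way to find "visually similar" images is to describe each
image as a list of numbers and compare those lists. The article does
not say how Newspaper Navigator does this, so the sketch below is an
assumption: the three-number vectors, the image names and the use of
cosine similarity are all chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return a score near 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(saved_vectors, library, top_k=2):
    """Rank library images by similarity to the average saved image."""
    dims = len(saved_vectors[0])
    query = [
        sum(v[i] for v in saved_vectors) / len(saved_vectors)
        for i in range(dims)
    ]
    ranked = sorted(
        library.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Made-up vectors: imagine each number measures one visual feature.
library = {
    "map_1910": [0.9, 0.1, 0.0],
    "map_1925": [0.8, 0.2, 0.1],
    "comic_1931": [0.1, 0.9, 0.2],
    "photo_1944": [0.0, 0.2, 0.9],
}

# The user's saved "collection" holds two map-like images.
collection = [[1.0, 0.0, 0.0], [0.9, 0.2, 0.0]]

print(most_similar(collection, library))  # ['map_1910', 'map_1925']
```

In this sketch the saved images are averaged into one query, so the
results lean toward whatever the saved images have in common.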
The system even permits users to "retrain" the
machine-learning tool for individual searches. This is done by examining
the images that the search returns. By selecting whether images
found were similar or not similar to the desired result, the user
is "retraining" the system to improve its search performance.
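The idea of improving a search from "similar" and "not similar"
answers can be sketched with a classic method called Rocchio
relevance feedback. This is a stand-in technique, not a description
of what Newspaper Navigator does inside. The function name, the
weights and the two-number vectors are assumptions for the example.

```python
def retrain_query(query, similar, not_similar,
                  alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward liked images and away from disliked ones.

    alpha, beta and gamma are illustrative weights for the old query,
    the "similar" images and the "not similar" images.
    """
    dims = len(query)

    def mean(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors)
                for i in range(dims)]

    liked = mean(similar)
    disliked = mean(not_similar)
    return [
        alpha * q + beta * p - gamma * n
        for q, p, n in zip(query, liked, disliked)
    ]


# Made-up example: the first number stands for "looks like a map".
query = [0.5, 0.5]
updated = retrain_query(
    query,
    similar=[[1.0, 0.0]],      # the user marked this result as similar
    not_similar=[[0.0, 1.0]],  # the user marked this one as not similar
)
print(updated)  # [1.25, 0.25]
```

After one round of answers, the updated query gives more weight to
the feature shared by the images the user accepted.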
A demonstration of the Newspaper Navigator is available to
help users learn more about the tool and how to carry out different
searches. The creators hope the tool can be useful for historians,
reporters, educators, professional researchers or anyone interested
in learning about U.S. history through newspapers.
The Library of Congress notes that all images included in
Newspaper Navigator and Chronicling America are in the public
domain, meaning people are free to use them as they wish.
-----
Words in This Story
page – n. one side of a sheet of paper in a newspaper, book or magazine
digitize – v. to put information into the form of a series of
numbers, usually so that it can be understood by a computer
character – n. a letter, number or other mark or sign used in
writing or printing
comics – n. a series of pictures that tell a story
content – n. information contained in a piece of writing, a
speech, a movie or on the internet
visual – adj. related to seeing
sponsor – v. to pay for someone to do something or for
something to happen
location – n. place where something takes place