A New Research Initiative From the Harvard Law School Library: The Institutional Data Initiative (IDI) Launches Today
From the Launch Announcement:
Today we’re launching the Institutional Data Initiative (IDI), a research initiative at the Harvard Law School Library. IDI is dedicated to supporting our peers as they steward humanity’s knowledge and seek to provide the broadest access to it in the age of AI, just as they’ve done for so much media over centuries, and across the technological revolutions within them.
IDI comprises a growing team of data scientists and community builders, first incubated at the Library Innovation Lab. We’ll collaborate with knowledge institutions—from libraries and universities to cultural groups and government agencies—to help structure, analyze, and publish their collections as data for all uses, including AI. We’ll work to develop AI-driven tools to scale and accelerate this work, evaluations to study its impacts, and best practices to foster responsible data use while affirming institutional stewardship.
Our initial activities include refining a collection of nearly one million public domain books, scanned at Harvard Library; a collaboration with Boston Public Library to make available millions of pages from hard-to-find historical newspapers; and a spring symposium hosted at Harvard Law School to build connections and explore areas of alignment between the institutional and AI communities.
[Clip]
At launch, we have data from nearly one million public domain books, scanned at Harvard Library as part of the Google Books project. Our structuring and analysis of the corpus is complete and we’re working with Google to release this treasure trove far and wide.
We’re also collaborating with Boston Public Library as they scan millions of pages from public domain newspapers. The layouts of newspapers make extracting their text notoriously difficult, so we’re applying new methods to increase accuracy and accessibility. Once extracted, we’ll research the impact this data has on the behavior and information recall of AI models so that other institutions can better understand the potential of their own collections.
Read the Complete Launch Announcement (about 1,500 words)
See Also: Supporting New Open Data Initiatives: Institutional Data Initiative and CORE (via Microsoft)
Media Coverage
- Harvard and Google to Release 1 Million Public-Domain Books as AI Training Dataset (via TechCrunch)
- Harvard is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft (via WIRED)
About Gary Price
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.