HathiTrust, Full-Text Riches, Whatever Your Discipline


Like the elephant its name evokes (hathi is the Hindi word for elephant), HathiTrust is enormous and it never forgets. With over 16 million digitized texts, HathiTrust is a vast archive for research and scholarship of all types. It was created in 2008 as a partnership of major research libraries working together to preserve and make available the published record of human knowledge from around the world, in a wide variety of languages and scripts. It includes public domain, open access and copyrighted materials digitized by Google, the Internet Archive and Microsoft, as well as materials digitized in-house by member libraries, including Boston College. Copyrighted materials are made available in full text to users at the contributing institution.

Discover useful materials through bibliographic (title-level) or full-text searching (see Search Tips). Non-English materials are available for discovery, and you can search in non-Roman scripts, such as those used for Russian, Greek, Hebrew, Chinese, Japanese, Korean and more (BC has made a particular effort to contribute Irish script content). Once you find what you need, read materials online or sign in as a member of the Boston College community to download entire works that are available for full viewing; without signing in, downloads may be limited to one page at a time.

While all works in HathiTrust are available for discovery, full viewing is restricted to open access works released under a Creative Commons license, items in the public domain (U.S. works published before 1923 and government documents), Australian or Canadian works published before 1898 (see Copyright for published materials from other countries) and materials contributed by Boston College Libraries. (FYI, Boston College Libraries staff are involved in a national, systematic effort by libraries to update the public domain status of materials included in HathiTrust.) Boston College users with print disabilities can get access to the full text of materials still in copyright.

As you find materials, consider grouping them into your own collections. Once you create a collection, you can search within it later and share it with others by making it public.

Not only is this corpus available for viewing, but researchers can also use tools provided by the HathiTrust Research Center (HTRC) for computational analysis of vast amounts of publicly available textual data. Researchers who are engaged in non-profit and educational research and need a secure environment for text mining and other non-consumptive uses can create their own secure datasets or draw on those already available. Two such datasets are the "HTRC Extracted Features Dataset Page-level" (with features from 15.7 million volumes) and "Word Frequencies in English-Language Literature, 1700-1922". The HTRC launched as a partnership of Indiana University and the University of Illinois, working with the HathiTrust Digital Library.
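For readers curious what this kind of non-consumptive analysis can look like, here is a minimal Python sketch. It assumes page-level features arrive as one set of word counts per page; the sample data, the layout and the function name are hypothetical illustrations, not the actual HTRC file format or tools.

```python
from collections import Counter

# Hypothetical page-level features: one dictionary of word counts per page.
# This layout is an assumption for illustration, not the real HTRC format.
pages = [
    {"the": 3, "elephant": 1, "never": 1},
    {"the": 2, "elephant": 2, "forgets": 1},
]


def volume_word_frequencies(page_counts):
    """Add up per-page word counts into counts for the whole volume."""
    totals = Counter()
    for counts in page_counts:
        # Counter.update adds the given counts to the running totals.
        totals.update(counts)
    return totals


totals = volume_word_frequencies(pages)
print(totals.most_common(2))
```

The sketch works only with counts, never with the running text itself, which is the general idea behind analyzing works without reading or redistributing them.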

Look to HathiTrust as a rich and constantly expanding source of U.S. Federal Government documents issued by the Government Printing Office (GPO) and other federal agencies. Libraries are partnering to create the United States Government Documents Registry within HathiTrust. The goal of this ongoing effort is to create metadata for, and digitize the full text of, all federal documents from 1789 to the present, so that the complete body of material is accessible. This effort is particularly challenging because no reliable list exists of the government documents that have been issued.

Clearly, HathiTrust is a rich resource in all areas of public knowledge, and one you may want to include in your research efforts, whatever your discipline.