As communication, news reporting and even mundane records move from print to the digital realm, the Library of Congress is preserving much of that material.

Abigail Grotke, head of the library’s web archiving team, and her colleagues have been documenting Internet content since 2000. While creating a record of important and burgeoning online speech, the team is also gathering collections that will help researchers in some future time get a clear idea of life in the early 21st century.

The library’s “recommending officers,” in consultation with subject experts, select tweets, blog posts and other online items. The resulting archives are accessible to users worldwide.

One focus of the collections is American elections. “In print, we would get flyers and pamphlets and things like that,” Grotke said, but after elections are over, “many of the campaign [web] sites disappear.”

Twelve years of Twitter

In 2010, the Library of Congress signed an agreement with Twitter to acquire the texts of all public tweets from 2006 forward. The library says it took this step for the same reason it collects other materials — to preserve “a record of knowledge and creativity.”

As social media exploded, however, the library changed its collection strategy in December 2017, choosing to preserve tweets selectively, around themes and events such as elections and around issues of ongoing national interest such as public policy.

This thematic approach aligns with the way the library archives content from other social media platforms, such as Facebook.

Grotke said the library archives about 30 terabytes of the web each month (a terabyte is roughly the storage capacity of many recent desktop computers). The Web Archive has collected about 1.3 petabytes of data, or 1,300 terabytes, since 2000.

“Just dealing with that amount of data is a big challenge,” Grotke said, “but we’re up to it. It’s exciting.”