Jenny Jansson

The Labour Movement Gone Digital: Preservation of organizational activities in the On-line Era

The Internet has become an increasingly important forum for political activism and social movements (SM), and the actions themselves have moved online. This is a challenge for anyone interested in citizens' democratic practices, because material published online is not systematically archived. Digitalization will therefore make it very hard to do research on today's movements in the future. In this project we aim to take some important first steps towards solving this problem. We will set up an automated script that collects and stores material from the Swedish labour movement's online activities, such as webpages, social media, and blogs. Since this is a pioneering project in Sweden when it comes to storing online activities and creating a searchable database, we will start by collecting material on one movement only. We have chosen the labour movement because it has been the backbone of Swedish civil society, is a well-defined movement, and has excellent traditional archive material that is likely to help us with coding. Once the system works, other movements can easily be added to the database, which is our ambition. The material will be coded to make it easy to search for anyone interested in doing research on the labour movement. Similar databases exist in other countries, so the database can also be used for international comparisons.
Final report

Labour gone digital! (DigiFacket)

DigiFacket aimed to create a web archive that preserves the born-digital material Swedish trade unions post on the Internet. Webpages and social media feeds are now integral and important means of communication, but they are seldom preserved. This project aimed to solve that problem. The project was carried out in cooperation with the Labour Movement's Archive and Library (LMAL) and the TAM Archives. Although the project was initiated by scholars, our main task was to create an archiving system that LMAL and the TAM Archives could run themselves after the development phase. The archives and the trade unions did not have the financial resources needed to develop such a system in 2015 when we started the project, but they were eager to collaborate and share their archive-specific knowledge with us.

DigiFacket is made up of several components. The downloading cycle roughly consists of the following steps: harvesting, storing, indexing, and displaying the material for the user. We tested several software programs before settling on NetarchiveSuite (NAS) and the web crawler Heritrix. To make long-term storage of the material sustainable, the files needed to be compressed. One important advantage of the NAS software package was its support for the Web ARChive (WARC) file format. For each harvesting cycle, NAS gathers all harvested files into a single WARC file, which simplifies transferring, indexing, storing, and other processing of the data. The WARC format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. As an ISO standard (ISO 28500:2017), WARC is recognized by most national library systems as the standard to follow for web archiving and is used by many of the software programs that manipulate harvested data. It is therefore likely to remain compatible with other software for the foreseeable future.
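To illustrate how a WARC file combines several resources with related information, the following is a minimal sketch in Python using only the standard library. It is not the NAS or Heritrix implementation: the function names, the example.org URIs, and the use of simple "resource" records are assumptions made for illustration. Harvested pages are normally stored as "response" records that include the HTTP headers, and a dedicated WARC library would typically be used instead of hand-written parsing.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC 'resource' record: version line, headers, blank line, block."""
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"


def iter_warc_records(data: bytes):
    """Yield (headers, payload) pairs from uncompressed, concatenated WARC records."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        # lines[0] is the version line; the rest are "Name: value" headers.
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the two CRLFs that end the record


# Hypothetical content: two resources aggregated into one compressed archive.
# Each record is compressed as its own gzip member, then the members are concatenated.
archive = b"".join(
    gzip.compress(build_warc_record(uri, body, ctype))
    for uri, body, ctype in [
        ("https://example.org/", b"<html>hello</html>", "text/html"),
        ("https://example.org/a.txt", b"plain text", "text/plain"),
    ]
)

for headers, payload in iter_warc_records(gzip.decompress(archive)):
    print(headers["WARC-Target-URI"], len(payload))
```

Because each record carries its own headers and length, later steps such as indexing can read the aggregate file record by record without knowing how it was harvested.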

Researchers often prefer full-text search, which requires some form of indexing. Although NAS has an indexer of its own, it proved too rigid for us to adapt the system to our needs, especially when it came to adding our own search categories (such as time frame and domain name) to the UI. Instead, after testing, we chose a combination of SolrWayback and Apache Solr; the latter is a search platform written in Java and developed by the Apache Lucene project. Solr can index large amounts of material, and a converter from WARC files is available.
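As a rough illustration of how search categories such as time frame and domain name can be expressed against Solr, the sketch below builds a request URL for Solr's standard select handler, with the free-text query in `q` and each category as a separate `fq` filter. The core name `digifacket`, the field names `content`, `domain` and `crawl_date`, and the host are assumptions for illustration and may not match the schema actually used in DigiFacket. The code only constructs the URL and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def quote_phrase(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and quotes inside it."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_solr_query_url(base_url, text, domain=None, date_from=None, date_to=None, rows=10):
    """Build a select URL: full-text query plus optional domain and date filters.

    Field names (content, domain, crawl_date) are hypothetical.
    """
    params = [("q", f"content:{quote_phrase(text)}"), ("rows", str(rows)), ("wt", "json")]
    if domain:
        params.append(("fq", f"domain:{quote_phrase(domain)}"))
    if date_from or date_to:
        # Solr range syntax; '*' leaves one side of the range open.
        params.append(("fq", f"crawl_date:[{date_from or '*'} TO {date_to or '*'}]"))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical usage: pages mentioning a term, from one domain, crawled during 2018.
url = build_solr_query_url(
    "http://localhost:8983/solr/digifacket",
    "kollektivavtal",
    domain="example.org",
    date_from="2018-01-01T00:00:00Z",
    date_to="2018-12-31T23:59:59Z",
)
print(url)
print(parse_qs(urlsplit(url).query))  # decoded parameters, for inspection
```

Keeping the categories in filter queries rather than in the main query means a UI can add or remove them without rewriting the user's search text.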

DigiFacket has two UIs: one, OpenWayback, for browsing previously archived web material, and one that lets users run full-text search queries on the same data. For the latter, a combination of Apache Solr and SolrWayback is used.

During the project, we encountered various problems that caused delays. For instance, it took longer than we had expected to gather consent from the organizations that are part of DigiFacket. We also had to rethink the collection of social media feeds. There are several reasons for this. First, although we received permission from the trade unions to download their websites and social media feeds, the material may contain sensitive personal information. This became especially complicated when the GDPR came into force in 2016 (two years after we started the project). In addition, social media companies such as Twitter and Facebook have changed their conditions for downloading several times, making it difficult to design a sustainable downloading system. Our solution is to let the organizations download their Facebook and Twitter histories themselves. This is sustainable in the long term, as the social media feeds can easily be integrated into the regular handover of archive material to LMAL and the TAM Archives. Nor did we develop a thesaurus of our own. We started to, but when it turned out that Solr has a very well-developed query syntax that makes it possible to define full-text searches precisely, we stopped developing our own index, as it seemed unnecessary. In 2020, the pandemic forced us to postpone the handover of software and collected material to LMAL for several months. The software is now installed and the collected material has been handed over to LMAL and the TAM Archives, so anyone who wishes to access the infrastructure can do so there.

Throughout the project, the research team developed new ideas about how to use the newly archived information and about which kinds of additional data would be important to archive. For example, one could investigate changes in the collaboration strategies of trade unions or other civil society organizations through network analysis of the archived data.
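One possible starting point for such a network analysis is sketched below, under the assumption that hyperlinks have already been extracted from the archived pages as (source URL, target URL) pairs. The function names and the union-a/union-b/union-c example domains are hypothetical, and links between websites are only one of several possible indicators of collaboration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_domain_edges(links):
    """Count links between distinct domains; keys are (source_domain, target_domain)."""
    edges = Counter()
    for source, target in links:
        src, dst = domain_of(source), domain_of(target)
        if src and dst and src != dst:  # ignore links within one site
            edges[(src, dst)] += 1
    return edges


def in_degree(edges):
    """Number of distinct other domains that link to each domain."""
    degree = Counter()
    for _, dst in edges:
        degree[dst] += 1
    return degree


# Hypothetical links extracted from archived pages of one harvest.
links = [
    ("https://www.union-a.example/news", "https://union-b.example/"),
    ("https://www.union-a.example/about", "https://union-b.example/contact"),
    ("https://union-c.example/", "https://union-b.example/"),
    ("https://union-a.example/", "https://union-a.example/members"),
]
edges = build_domain_edges(links)
print(edges)
print(in_degree(edges))
```

Running the same computation on harvests from different years would make it possible to compare how the link structure between organizations changes over time.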

Web page: www.statsvet.uu.se/digifacket
Twitter: @digifacket

Grant administrator
Uppsala University
Reference number
IN14-0698:1
Amount
SEK 3,614,000
Funding
RJ Infrastructure for research
Subject
Political Science (excluding Public Administration Studies and Globalization Studies)
Year
2014