Lessons from the Library: Extreme Minimalist Scaling at Pirate Ebook Platforms

Martin Paul Eve

Issue 16.2

Minimal Computing

June 2022

archives
access
minimal computing
ethics
digital libraries

Minimalist computing principles in scholarly communications focus on the moderation of digital resource consumption and global social equity ¹ ². Such an application of minimalist principles to computing is designed both to make the global flow of digital information accessible to the widest audiences and to allow the broadest set of people to participate in the creation of such resources. By minimizing the resource constraints required to run computational architectures, minimalist computing principles work to ensure that the world is not falsely separated into a Global expert North and a Global apprentice South, in which those with resources merely export digital expertise outwards from a claimed center to the supposed margins ³. Minimalist computational principles are designed for inclusivity.

By contrast, contemporary models of scholarly communication — the composite systems through which academics and other intellectuals disseminate their findings for global consumption — appear to be the opposite of minimalist. They cover a full and ever-expanding disciplinary range, from natural scientific outputs, to humanistic arguments, to social scientific research. Their social processes and techniques are vast, involving peer review systems, citations, references, and footnotes ⁴ ⁵ ⁶. The technical elements of such systems also tend to proliferate, including platforms, persistent identifiers, XML standards, digital preservation systems, and manuscript management technologies ⁷ ⁸.

The costs involved in these dissemination systems are also maximal. Since the 1980s, rising in line with the expansion of higher education, the cost of subscribing to academic serials has outpaced inflation by approximately 6% every year ⁹, while library budgets have faced cuts or remained flat ¹⁰. Compounded over forty years, this amounts to a several-hundred percent price increase in real terms (roughly a tenfold rise). A discourse of maximalism and growth seems omnipresent in scholarly communications.

These maximalist principles in scholarly communications have resulted in well-known damaging consequences and inequalities. For instance, Thomas Mboa Nkoudou notes the spread of an “epistemic alienation” when researchers at the margins of the Global North’s publication systems must conform to external pressures and norms ¹¹. As another example, peer review is supposed to be a system that distinguishes work based solely on merit rather than any identity characteristic of the author. However, critics of this system note that pre-publication peer review favors those for whom English is their first language ¹² ¹³. Further evidence of biases from peer review includes the extraordinary fact that just 1.5% of economics articles in highly ranked journals were about countries other than the United States ¹⁴ ¹⁵.

The spiraling, maximalist costs of the for-profit subscription scholarly publication ecosystem also price out much of the world’s population ¹⁶ ¹⁷ ¹⁸. It is in part this inequality that has spurred the open access movement (OA) for research, which seeks to abolish paywalls ¹⁹ ²⁰ ²¹ ²² ²³ ²⁴. Yet even this movement has come under fire. Built on the premise that nobody, worldwide, should be excluded from access to the scholarly record by their (in)ability to pay and that the benefits of re-use should be broadly extended to third parties using Creative Commons licenses, the road to OA was paved with noble intentions. As Ulrich Herb notes, open access was originally “embedded in a conceptual ensemble of participation, democratisation, digital commons and equality” ²⁵. However, parts of this vision have died over time. “Nowadays,” Herb writes, “Open access seems to be exclusive: to the extent that commercial players have discovered it as a business model and article fees have become a defining feature of gold open access, open access has increasingly transformed into a distinguishing feature and an exclusive element” ²⁵.

Thus, the contemporary scholarly communications environment has many defects that pertain to inequality and that intersect with its maximalist characteristics. Given the goals of minimalist computing, rooted in equity and diversity, might it be possible for our scholarly communications to learn from such principles? Is there a way in which ideas of “scaling small” ²⁶ might help these systems to contend, as Roopika Risam puts it, “Not only with the colonial hangovers from the cultural record, but also with forces that are actively constructing the medium of the digital cultural record — the Internet — as a hostile environment” ²⁷?

In the remainder of this article I turn to a specific illicit scholarly communications practice that has emerged in recent years: the idea of the “pirate” shadow library. These archives, which violate copyright, work around paywalls and provide access to all comers. Although frequently on the wrong side of the law, shadow library operators conceive of their sites in ethical terms. Framing their banditry in terms of a Robin Hood-esque outfit, such sites believe they are robbing from the rich to give to the scholarly poor ²⁸ (see also ²⁹). Importantly for the themes of this special issue, I believe that such archives can also be understood in certain terms of computational minimalism. As I will go on to detail, various technical design principles of these archives lower social barriers to participation. This article thus sets out a fresh theoretical terrain for understanding what I call minimal-maximal tensions in computational architectures and projects. While this discourse is rooted in older information technology debates about microkernels versus macrokernels ³⁰, through my case study of Library Genesis — the shadow library that is the main focus of this article — I seek to unpack a new framework for thinking about the relationships between minimal components and maximal outcomes. While Library Genesis seeks to shadow an enormous archive, the constraints under which it operates lead to a series of unexpected minimalist design principles.

The Growth and Emergence of Shadow Libraries

Shadow libraries — pirate archives of copyrighted scholarly publications — emerged in response to a frustration at the slow growth of open access ³¹ ³². Nonetheless, open-access advocates remain divided as to whether these “guerilla” libraries are a solution to, or merely a symptom of, the ills of scholarly communication ³³ ³⁴ ²⁸ ³⁵. The two most famous (and interlinked) of these systems for illicit access to pirate scholarship as of 2022 are Sci-Hub and Library Genesis.

The former, Sci-Hub, is a credential-proxying site that bypasses publisher paywalls, primarily for journal articles. Founded by Alexandra Elbakyan in 2011, the site has grown to provide access to at least 68.9% of the 81.6 million scholarly articles registered with Crossref and to 85.1% of all articles published in subscription journals ³⁶. The site is based in Kazakhstan, which introduces complex jurisdictional legal issues. It works by collecting credentials from academic institutions, possibly by conducting phishing attacks ³⁷. These credentials are then used to fetch any article requested by an end-user. In order to reduce the number of requests to publishers’ sites — and thereby evade detection and a ban on its credentials — Sci-Hub caches fetched journal articles in an archive called Library Genesis. Sci-Hub has faced and lost several court cases, particularly in the U.S. where the publisher Elsevier has been awarded millions of dollars in damages ³⁸. Because the site sits outside of U.S. jurisdiction and because Elbakyan has no means of paying such damages, it is unlikely that publishers will see any financial return from these lawsuits. Nonetheless, for much of 2021, Elbakyan paused the ingestion of new articles while awaiting the results of a court case in India, which could rule in Sci-Hub’s favor ³⁹.

Library Genesis, by contrast, is the largest and oldest shadow archive on the Internet. With over thirty-three terabytes of data in its primary book collection (and more than sixty terabytes in its pool of scientific journal articles powered by its aforementioned sister project, Sci-Hub), the project is one of the largest repositories of copyright-violating educational ebooks ever created ⁴⁰. Established in 2008, the goal of Library Genesis is nothing short of a totalizing modern Library of Alexandria, albeit without legal sanction. As one of its administrators wrote, emphasizing its extralegal, yet claimed moral and ethical, status: “Within decades, generations of people everywhere in the world will grow up with access to the best scientific texts of all time…. [T]he quality and accessibility of education to the poor will grow dramatically too. Frankly, I see this as the only way to naturally improve mankind: we need to make all the information available to them at any time” ⁴¹.⁴² Philosophically rooted in the communist principles of its homeland, Russia, and particularly in the Soviet isolationist copyright policies of the 20th century ⁴¹, Library Genesis is a formidable resource and a large-scale threat to conventional academic publishers ³¹.

Shadow libraries differ in their models, usage patterns, and effects. The Z-library system, for instance, charges for specific formats, such as Amazon Kindle conversion ⁴³. Other pirate libraries such as Library Genesis have no charges at all and their funding mechanisms are unknown, leading to accusations of state subterfuge, although they do use advertising ⁴⁴. While the total percentage of the market eaten by piracy is unknown, in the general ebook space some studies have shown that as many as 35.1% of books are downloaded illegally ⁴⁵. What is clear from existing studies of Library Genesis is that it is used by participants worldwide, including in wealthier regions of the Global North ⁴⁶ ⁴⁷. Various studies have also shown a citation advantage to papers that appear in Sci-Hub ⁴⁸. However, considering the prevalence of material in this archive, this effect could simply be due to the difficulty of obtaining the original papers.

Given that Sci-Hub and Library Genesis are shadows of the formalised academic publication system, we might expect them to share the maximalist tendencies of mainstream scholarly communications systems. These libraries are, indeed, enormous. Yet what it means to be a shadow has changed over time. It was not until the 18th century that the shadow became defined as the colorless inverse image of the object itself. Before this point, shadows were represented, in art and heraldry, as a partial transparency and outline. In heraldic terms, shadows represent an outlined shape that reveals a hidden part of the family tree, not a mirror of it ⁴⁹. That is, shadows can be seen as a shameful family link, rather than a precise formal reflection, and such a history reminds us, as Nanna Bonde Thylstrup puts it, of the “inherently unstable form of shadow libraries as a cultural construct” ⁵⁰. In this light, there are several characteristics of shadow libraries that present minimalist design principles and that we can take as refractive and instructive lenses for understanding mainstream practices.

Minimalist Shadows

Just what, then, is minimalist about these shadow libraries? Certainly, they are minimalist in terms of the minimal barriers that they present to their readership for access to scholarly publications. In eliminating paywalls and presenting only a flat search box, with no authentication mechanisms, Sci-Hub and Library Genesis are far simpler than the systems used by formal publishers. By contrast, these archives are not minimal in terms of the “minimal space” that they consume ¹. Nonetheless, I will argue that we can understand shadow libraries in minimal terms along a number of axes: minimalist surface exposure, minimalist metadata design principles, and minimalist distributional principles. I will cover each of these in turn, primarily with respect to Library Genesis.

To begin with the minimal surface exposure of the site, as a highly illegal, copyright-violating initiative that also wants to achieve worldwide transformation of educational potential, Library Genesis finds itself in a minimal-maximal double bind. On one hand, it must remain difficult to access, hidden, and must lie low to evade law enforcement. It must be a resource with a minimal surface. On the other hand, to achieve its stated goals, this resource must be accessible and known to as many people as possible. This double bind is not unique to these libraries. Similar minimal-maximal tensions exist within legal digital social justice projects that seek to criticize powerful governmental norms, such as the Global Detention Project. Such platforms are not illegal like Library Genesis, but they seek to criticize often-hostile regimes for a broad audience while avoiding tyrannical government crackdowns.

Nonetheless, Library Genesis has several technical hurdles to overcome in its quest to retain its minimal surface presence on the Internet. Two of these are the use of the Domain Name System (DNS) and the threat of (distributed) denial-of-service (DDoS) attacks. On the first of these fronts, to provide a memorable location for the archive, Library Genesis uses DNS ⁵¹. This is the system that translates an address such as gen.lib.rus.ec into an Internet Protocol (IP) address (198.167.223.167, for example). This is a useful system because IP addresses are not easily memorable for humans. However, DNS addresses are subject to takedowns and blocking by Internet Service Providers (ISPs). That is, if an allegedly infringed party can persuade a court of law or an ISP that a site’s sole purpose is copyright infringement, the domain name (or IP address) can be blocked for large swathes of users ⁵². It is, of course, possible to circumvent such blocks by a variety of techniques, such as the use of a Virtual Private Network (VPN) that routes traffic through a different ISP in a friendlier jurisdiction. Systems such as The Onion Router (Tor) are another way of evading these blocks ⁵³.
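The name-to-address translation that DNS performs can be illustrated with a minimal Python sketch. The function name `resolve_ipv4` is hypothetical and the hostname is whatever the caller supplies; the sketch only asks the operating system's resolver for IPv4 addresses and is not a description of how any particular site manages its records.

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the IPv4 addresses that the system resolver reports for a name.

    An empty list means the name could not be resolved (for example,
    because the record was removed or the lookup was blocked).
    """
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    addresses = []
    for info in infos:
        ip = info[4][0]  # sockaddr is (address, port) for AF_INET
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

A literal IPv4 address passed to this function is returned unchanged, without a network query, which is why such addresses keep working when a domain name has been taken down.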

Library Genesis plays games of whack-a-mole and hide-and-seek with DNS ⁵⁴. Its mirrors rapidly switch between addresses in an attempt to avoid takedowns ⁵⁵. Given international jurisdictional challenges, it is very hard for countries that do care about copyright infringement to shut down Library Genesis permanently. The distributed nature of DNS acts in the interests of projects such as Library Genesis, as it is simply impossible, at present, to garner the level of international legal compliance that would be necessary to shut down its DNS records permanently. Advances in DNS privacy and encryption are only likely to make this problem more difficult to combat (see, for instance, ⁵⁶). That said, Library Genesis also has a communication problem with respect to DNS. If the goal of DNS is to provide memorable addresses for sites on the Internet, then changing these addresses within a narrow time window frustrates the ultimate purpose of the system. Sites such as Reddit (a contemporary bulletin board system) spread news of new DNS mirrors, but, of course, once these addresses are public, the takedown process can begin anew. It is a war of attrition and, thus far, Library Genesis is winning the war through its guerilla tactics.

The small visible central surface of sites such as Library Genesis, only accessible through circumvention technologies such as VPNs, reflects a phenomenon seen in Alternate Reality Games (ARGs). Such hiddenness is characteristic of ARGs, which often have obscure entry points, conventionally referred to as rabbit holes: points of ingress that lure in new users searching for clues ⁵⁷. For Garcia and Niemeyer, “a ‘good’ rabbit hole is one that, for those not looking for clues, blends into the background and noise of the world” ⁵⁸. An example of how Library Genesis seeds such a trail, rather than blatantly advertising itself, can be seen in its metadata. Searching for Library Genesis on Google yields a link but its description only reads, “No information is available for this page.” Additionally, the site does not feature in the top results of “download scientific books for free.” Instead, one needs to be told about the site and then visit it directly.

This minimal surface principle is also clear in the site’s upload procedures, which intersect with its minimal metadata implementations. Although Library Genesis has a prominent upload link on its homepage, it requires a password to proceed. This gives the impression that uploading is a private activity, conducted only by an elite cadre of individuals who know the magic word.⁵⁹ In reality, though, users who follow the rabbit hole into the forum on the site and then register can readily find the requisite username and password to begin uploading.

There are several reasons why Library Genesis presents such a minimal surface. While the core reason is surely its illegality, this feature also lowers the potential for rights-holders to flood the site with false metadata and uploads. Indeed, the gravest threat to Library Genesis’s operation would be the contamination of its records with inaccurate files, which would amount to a denial-of-service attack against the archive. While the simple answer to this threat is that all uploads must be vetted, it would also be surprising if the password seriously deterred a concerted effort to pollute the library. As we will also see, though, the minimalist metadata principles of Library Genesis actively encourage participation.

In terms of minimal metadata, it is worth first considering the scale of Library Genesis. The main Library Genesis database collection had just short of 1.2 million records (books) in 2014 ⁶⁰. As of January 2020, this figure had more than doubled to 2.5 million books. Clearly, a database of epistemological dissent at this size and with this scale of growth requires a lightweight (or, as we might say, minimal) metadata scheme if it is not to collapse under the strain of its size. Although the ext2 and ext3 filesystems that were commonly used on Linux systems at the time of the database’s inception have a sizable upper limit of 1.3 × 10^20 files per directory, there are known performance issues in handling more than 10,000 ⁶¹. There are also issues of integrity monitoring at this scale. When dealing with 2.5 million records, how can one ensure that physical media degradation (“bit rot”) does not lead to corrupted files on the disk that may not come to light for substantial periods? That said, the scale of Library Genesis pales in comparison to formal archives. For instance, a report from 2009 noted that there are over 9,000 missing volumes in the British Library’s main collection, illustrating that this problem exists in both the legal and illegal spaces ⁶². The British Library is also substantially larger than Library Genesis, with a main collection of approximately 170 million items in total, of which 13.5 million are books, and a digital collection that is over a petabyte in size ⁶³.

Library Genesis is an enormous archive in terms of total size (maximal), but it is composed of files that are relatively small (minimal). The average (mean) file size in the database is 13.90MB, with a significant portion of these files being less than 5MB. The file size distribution varies for the different file types within the database. Despite the accessibility challenges of the format (the Portable Document Format, or PDF, is not the best format for screen readers, for instance), PDFs dominate the archive. For PDFs (n=1,697,927), the tail of the distribution trails off more slowly and the average file size is higher at 16.49MB. By contrast, EPUB (n=179,926; average: 7.32MB) and Mobi (n=23,947; average: 4.33MB) bring the mean file size down, as they can jettison the formatting information inherent in PDF files. Nonetheless, with this volume of small files, addressability is a core performance concern.

The solution that Library Genesis devised for handling such a vast, proliferating archive composed of relatively small files is based on minimalist principles of metadata, distribution, and hashing. Files are distributed over multiple directories (called buckets) while being referenced within a database that includes metadata about the file, as well as a message-digest algorithm v5 (MD5) hash of the file, which serves as a filename. There are 47 metadata fields within the database for each entry, which can be subdivided into categories of file and record information, file and record properties, and external identifiers.⁶⁴ Importantly, though, very few of these fields are required. The subset that is absolutely necessary is kept to a minimum.

The first of these categories, file and record information, pertains to local lookup of the file. For instance, to retrieve a local file for a record, one simply rounds the ID field down to the nearest multiple of 1,000 to get the bucket directory, then retrieves the object with the specified MD5 hash from that directory (e.g., 243000/9e107d9d372bb6826bd81d3542a419d6). To specify the content type, a user can append the Extension field (e.g., pdf or epub) to the delivered file (and could also infer the Multipurpose Internet Mail Extensions, or MIME, type if necessary). In addition to using the Filesize field to check a download, it is also possible to verify that the contents of the file have not been corrupted at any time. This verification is achieved by computing the file’s MD5 hash and then comparing it to the value recorded in the database and the filename. Although MD5 is a very old (and possibly even broken) hashing algorithm, it has a 1.47 × 10^-29 chance of a random collision (that is, of two files sharing the same hash) ⁶⁵. There are no two records within the Library Genesis database that share an MD5 hash as of March 1, 2020.⁶⁶ In any case, the subdivision into bucket directories of 1,000 files further reduces this risk, making it extremely unlikely that two files on disk will share a hash/filename. The verification algorithm means that the detection of corrupt files can be handled automatically rather than on a reporting basis from users (although computing all MD5 hashes in the database is a computationally demanding task given the scale). These file and record information fields are the bare minimum required to retrieve a file from the filestore.
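The lookup and verification steps described above can be sketched in Python as follows. This is a minimal illustration rather than the project's own code: the function names and the filestore root are hypothetical, and the bucket naming (the ID rounded down to a multiple of 1,000) is an assumption inferred from the example path 243000/9e107d9d372bb6826bd81d3542a419d6.

```python
import hashlib
from pathlib import Path

BUCKET_SIZE = 1000  # records per bucket directory, as described in the text


def bucket_for(record_id: int) -> str:
    """Bucket directory name for a record ID.

    Assumption: buckets are named after the ID rounded down to a
    multiple of 1,000 (so ID 243017 lives in "243000").
    """
    return str((record_id // BUCKET_SIZE) * BUCKET_SIZE)


def stored_path(root: Path, record_id: int, md5_hex: str) -> Path:
    """Location of a file on disk: <root>/<bucket>/<md5 hash as filename>."""
    return root / bucket_for(record_id) / md5_hex.lower()


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(root: Path, record_id: int, md5_hex: str, expected_size=None) -> bool:
    """Check a stored file against its recorded size (optional) and MD5 hash."""
    path = stored_path(root, record_id, md5_hex)
    if not path.is_file():
        return False
    if expected_size is not None and path.stat().st_size != expected_size:
        return False
    return md5_of(path) == md5_hex.lower()
```

Running `is_intact` over every record would detect silent corruption automatically, at the cost of reading the whole archive once per pass.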

The second category of metadata field — file and record properties — gives specific metadata for a record, such as the work’s title and authors. Interestingly, for the principles of minimal computational design, the database does not store authors in a structured and linked form, but rather as free-text. Traditionally, if one were designing a relational database where an author could be ascribed to more than one book, one would create a separate database table called author that had properties such as first name, last name, and ORCID ID, and then, in the books table, link to the author. This would create a mechanism to query an author called, say, Joe Bloggs, and to retrieve all books written by that specific Joe Bloggs. Such a schema would provide high-quality structured data. This is not what Library Genesis does. Instead, it simply stores author names as plaintext. Sometimes this means that authors and editors are listed in a long string, and it is not clear whether authors of the same name are the same person.
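The contrast between the two designs can be made concrete with a small SQLite sketch in Python. All table names, column names, and sample rows here are hypothetical illustrations and are not the actual Library Genesis schema; the linked design also uses a join table, on the assumption that a book can have several authors.

```python
import sqlite3

SCHEMA = """
-- Linked design: authors are rows that books point to.
CREATE TABLE author (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, orcid TEXT);
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE book_author (
    book_id INTEGER REFERENCES book(id),
    author_id INTEGER REFERENCES author(id),
    PRIMARY KEY (book_id, author_id)
);
-- Flat design: the author is just a free-text column.
CREATE TABLE flat_book (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding the same books in both designs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Two different people who happen to share a name (made-up sample data).
    conn.executemany("INSERT INTO author VALUES (?, ?, ?, ?)",
                     [(1, "Joe", "Bloggs", None), (2, "Joe", "Bloggs", None)])
    conn.executemany("INSERT INTO book VALUES (?, ?)",
                     [(1, "Book A"), (2, "Book B"), (3, "Book C")])
    conn.executemany("INSERT INTO book_author VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 1)])
    conn.executemany("INSERT INTO flat_book VALUES (?, ?, ?)",
                     [(1, "Book A", "Joe Bloggs"),
                      (2, "Book B", "Joe Bloggs"),
                      (3, "Book C", "Joe Bloggs; Jane Doe (ed.)")])
    return conn


def linked_titles(conn: sqlite3.Connection, author_id: int) -> list:
    """Books by one specific author row (requires joins across tables)."""
    rows = conn.execute(
        "SELECT b.title FROM book b JOIN book_author ba ON ba.book_id = b.id "
        "WHERE ba.author_id = ? ORDER BY b.id", (author_id,))
    return [title for (title,) in rows]


def flat_titles(conn: sqlite3.Connection, name: str) -> list:
    """Books whose free-text author string mentions the given name."""
    rows = conn.execute(
        "SELECT title FROM flat_book WHERE author LIKE ? ORDER BY id",
        (f"%{name}%",))
    return [title for (title,) in rows]
```

In this sketch the linked query can separate the two people named Joe Bloggs, whereas the flat query returns every record whose author string contains the name. The flat table, however, needs only one insert per book and has no cross-table integrity to maintain.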

Such an approach comes with several minimalist computational advantages. The clearest of these are: (1) there is no need to maintain the structural integrity of the database between tables, and (2) the overhead for entering metadata is greatly simplified by flattening the input. In this way, Library Genesis lowers both the computational requirements for maintenance and the barriers for entry/participation by allowing freeform textual input. Given the size of the database, and the fact that a free-text search can take a long time, the single-field, free-text approach to “author” rather than a linked record also allows interested parties to create their own indexes of the database with relative ease.
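As an illustration of how a third party might build such an index, the following Python sketch constructs a simple inverted index over free-text author strings. It assumes the records are available as (ID, author string) pairs; the function names and the tokenization rule are hypothetical choices, not a documented Library Genesis facility.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase a free-text string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_author_index(records) -> dict:
    """Map each token in the author strings to the set of record IDs using it.

    `records` is an iterable of (record_id, author_string) pairs.
    """
    index = {}
    for record_id, author in records:
        for token in tokenize(author):
            index.setdefault(token, set()).add(record_id)
    return index


def search(index: dict, query: str) -> set:
    """Return IDs of records whose author string contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because the index is built once and then consulted by set intersection, a lookup avoids scanning every author string, which is the slow step in a plain free-text search.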

The third and final metadata field type (external identifiers) — also entirely optional — demonstrates further minimalist design principles. Instead of storing all metadata locally, the database points to offsite management of such data, allowing for others, likely specialists, to focus entirely on metadata curation, collection, and provision. While this carries some initial lookup overhead for the pirate entering the data, the labor of maintaining this metadata is then outsourced and can be resolved on demand. This all points towards a minimalist labor approach and an awareness that by lowering the threshold for participating to the minimum, the maximalist goals of the project are more likely to be achieved.

If Library Genesis encodes many of its minimalist functions in its metadata design principles, its distributional characteristics are also emblematic of such operations. The distributional minimalism principle can be seen in discussions about the preservation of the archive and the future of sites such as Library Genesis. On the first front of digital preservation, the challenge is that this archive is maximal, not minimal. A reasonable cost estimate for simply the hard disks to mirror the entire 33TB archive as of March 2020 is $1,200. This does not include either redundancy in terms of drive failure or the server hardware in which the hard drives would be housed. In terms of minimal cost scaling, the sheer size of the archive makes for a difficult environment if the local copy is to be complete and usable.

That said, full preservation in every replica may not be the aim of distribution, and it is of course possible for partial replicas of the database to exist worldwide ⁶⁷ ⁶⁸. Clients such as aria2 also make it possible to download a single file from a torrent swarm, which in theory makes single books within the Library Genesis archive addressable and retrievable in distributed form. However, the torrent swarm is not ideal for Library Genesis’s use case. Anybody seeding on the network will have an exposed IP address, and it will be clear that they are participating, with legal risk. For this reason it is likely that many users in the swarm will be connecting using VPNs or so-called seedboxes (remote high-bandwidth servers) in order not only to protect their identity, but also to ensure the efficient, high-speed distribution of the material.

Indeed, in late 2019, coordinating around a pirate archiving initiative known as The Eye, a group of individuals took it upon themselves to ensure the full torrent availability of the Library Genesis filestore. Under this initiative, “swarm peers increased from 3,000 seeders to 30,000 seeders,” and “speeds increased from about 60KB/s on most torrents to over 100MB/s” ⁶⁹. The users who undertook this illegal initiative viewed their work as charitable, presumably under the rubric of educational advancement, which has long held eleemosynary status in many jurisdictions. The amateur archivists who seeded Library Genesis believe that the “initiative fulfils United Nations/UNESCO world development goals that mandate the removal of restrictions on access to science” and that “[l]imiting and delaying humanity’s access to science isn’t a business, it’s a crime, one with an untold number of victims and preventable deaths” ⁶⁹. Somewhat boldly, two seedbox companies — Seedbox.io and UltraSeedbox.com — offered their servers for this avowedly illegal project, thereby providing the bandwidth to achieve the aforementioned speeds ⁷⁰. That said, the torrent archive is not viewed as a long-term viable mechanism for distribution, according to these individuals: “It obviously isn’t sane to store 33TB long-term, we just want to push this out to archivers” ⁷¹.

A potentially more viable distribution, storage, and retrieval mechanism proposed for Library Genesis could be the InterPlanetary File System (IPFS) protocol ⁷². IPFS presents an addressable, distributed system in which objects are assigned a hash:

IPFS provides a high throughput content-addressed block storage model, with content addressed hyper links [sic]. This forms a generalized Merkle DAG, a data structure upon which one can build versioned file systems, blockchains, and even a Permanent Web. IPFS combines a distributed hashtable [sic], an incentivized block exchange, and a self-certifying namespace. IPFS has no single point of failure, and nodes do not need to trust each other. ⁷³
Further, IPFS has a system called pinning, in which objects are immutable and pulled down to clients, thus ensuring their permanent availability: “This also makes IPFS a Web where links are permanent, and Objects can ensure the survival of others they point to” ⁷³.
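The core idea of content addressing and pinning can be sketched with a toy, in-memory store in Python. This is not the IPFS API: real IPFS uses multihash-based content identifiers, chunking, and a distributed hash table, and its pins are recursive by default. The class and method names below are hypothetical, SHA-256 hex digests stand in for IPFS addresses, and pinning here protects only a single block.

```python
import hashlib
import json


class ContentStore:
    """Toy content-addressed block store with simple, non-recursive pinning."""

    def __init__(self):
        self._blocks = {}     # address -> bytes
        self._pinned = set()  # addresses protected from garbage collection

    def put(self, data: bytes) -> str:
        """Store a block; its address is derived from its content alone."""
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def put_node(self, links: dict) -> str:
        """Store a node of name -> address links.

        Because the node's address depends on its children's addresses,
        changing any child changes the parent: the Merkle DAG property.
        """
        return self.put(json.dumps(links, sort_keys=True).encode("utf-8"))

    def get(self, address: str) -> bytes:
        """Fetch a block and check that it still matches its address."""
        data = self._blocks[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("block content does not match its address")
        return data

    def pin(self, address: str) -> None:
        """Mark one stored block as something this node promises to keep."""
        if address not in self._blocks:
            raise KeyError(address)
        self._pinned.add(address)

    def collect_garbage(self) -> int:
        """Drop every unpinned block; return how many were removed."""
        doomed = [a for a in self._blocks if a not in self._pinned]
        for address in doomed:
            del self._blocks[address]
        return len(doomed)
```

Identical content always yields the same address, so any node holding a block can serve it and any recipient can verify it without trusting the sender.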

IPFS does not address the anonymity of participating nodes, meaning that it would still be possible to locate a serving entity by IP address, which could have legal implications. Such problems could, again, be mitigated by the use of anonymity networks such as Tor. The scalability of IPFS to 33TB of material, spread over some 2.5 million unique entities of small file size, is also unproven. An intermediate address lookup table would be needed to translate between the Library Genesis structure and any IPFS version of the platform. However, IPFS would offer a minimalist design platform for the distributed preservation of a maximalist scholarly archive, in which many actors each share a small proportion of the total output, contributing towards a holistic totality through fragmentary participation. Once again, the fragmentary approach, in which smaller portions of the archive are distributed between many actors rather than held in a centralised store, yields a minimal approach to the function of a maximal archive.
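One way to picture the intermediate lookup table mentioned above is as a mapping from the existing bucket/MD5 path to a content address. The Python sketch below is purely illustrative: the function names are hypothetical, a SHA-256 hex digest stands in for a real IPFS content identifier, and the bucket naming (ID rounded down to a multiple of 1,000) is an assumption inferred from the example path given earlier.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Stand-in for an IPFS content identifier (assumption: SHA-256 hex)."""
    return hashlib.sha256(data).hexdigest()


def build_lookup(files) -> dict:
    """Build a table from 'bucket/md5' paths to content addresses.

    `files` is an iterable of (record_id, md5_hex, data) triples. Each
    file is checked against its recorded MD5 hash before it is mapped,
    so a corrupted file is never given a new address.
    """
    table = {}
    for record_id, md5_hex, data in files:
        md5_hex = md5_hex.lower()
        if hashlib.md5(data).hexdigest() != md5_hex:
            raise ValueError(f"record {record_id} does not match its MD5 hash")
        bucket = (record_id // 1000) * 1000
        table[f"{bucket}/{md5_hex}"] = content_address(data)
    return table
```

A table of this kind would let existing database records keep their IDs and MD5 hashes while the files themselves are fetched by content address.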

Minimum-Maximum

Throughout this Library Genesis case study, I have been proposing a framework for thinking about minimal-maximal tensions. It is tempting to consider minimalist computing paradigms as the opposite to proliferating maximal systems. In reality, instead, minimalist paradigms can work as fragments of a distributed whole; this is one of the lessons that we can learn from the pirate archive.

These debates are not abstract for scholarly communications. At present, various research funders across the globe are embarking on the establishment of their own platforms to replace the journal ecosystem. Spurred by the same motivation as Library Genesis and Sci-Hub (the slower-than-desired growth of open access), the Open Research Europe, Wellcome Open Research, and Gates Open Research platforms represent significant centralizations of research outputs under funder control, as opposed to a distributed journal publishing architecture. These platforms, running on the for-profit F1000 platform owned by Taylor & Francis, bill themselves as megajournals, following the terminology developed by the world’s largest journal, PLOS ONE. There are many competing interests in these systems. Even as such initiatives represent centralizations and maximal architectures, funders insist on open licenses for re-use and re-distribution. Hence, we have centralization efforts mandating licenses that allow for de-centralization and micro/minimal circulation.

Library Genesis also yields various principles of participation and consumption that are minimal in their design, even while they yield access to a maximal resource. By requiring only the minimum of metadata provision, Library Genesis lowers barriers to participation, an aspect also bolstered by the off-shoring and distribution of its identifier systems. In terms of consumption, the single search box that returns all results for everything, ever, represents a maximal centralization. In terms of ease of use and accessibility, this one-stop-shop demonstrates a concision and minimalism that is lacking in most scholarly communications systems.

Finally, Library Genesis demonstrates a minimally exposed surface, mostly due to its illegality. It would be churlish to pretend that there are huge implications for legitimate scholarly communications infrastructures or lessons to be learned from this. The more people have access to scholarship and research, the better. The shadow archive’s surface is minimally exposed precisely, and only, because it must remain hidden. The strait that this archive must navigate, though, is the space between its maximal presence and its minimal entrance, the rabbit holes of the alternate reality game.

Nonetheless, in all, I have attempted in this article to document several features of Library Genesis that serve as an index of minimal-maximal infrastructures. At least some of these features could be adopted by mainstream scholarly communications providers to increase not only the resilience of, but also global equity of access to, our publication ecosystems. While shadows are not equal to mirrors, in these senses I contend that there are lessons to be learned from the (shadow) library.

Sayers, Jentery. “Minimal Definitions.” Minimal Computing Working Group, October 2, 2016. http://go-dh.github.io/mincomp/thoughts/2016/10/02/minimal-definitions/. ↩︎ ↩︎
I would like to extend my thanks in this article to the anonymous readers at DHQ who prompted me to rethink the scope and scale of this work, as well as to the editors of the special issue. Work on this article was funded by a Philip Leverhulme Prize from the Leverhulme Trust. ↩︎
Gil, Alex and Élika Ortega. “Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J. Lane, and Raymond G. Siemens, 22–34. NY: Routledge, 2016. ↩︎
Grafton, Anthony. The Footnote: A Curious History. Cambridge, MA: Harvard University Press, 1999. ↩︎
Fyfe, Aileen. “Journals, Learned Societies and Money: Philosophical Transactions, ca. 1750–1900.” Notes and Records: The Royal Society Journal of the History of Science vol. 69.3 (2015): 277–99. https://doi.org/10.1098/rsnr.2015.0032. ↩︎
Moxham, Noah and Aileen Fyfe. “The Royal Society and the Prehistory of Peer Review, 1665–1965.” The Historical Journal vol. 61.4 (2018): 863–89. https://doi.org/10.1017/S0018246X17000334. ↩︎
Gray, Jonathan. “Infrastructural Experiments and the Politics of Open Access.” In Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access, edited by Martin Paul Eve and Jonathan Gray, 251–64. Cambridge, MA: The MIT Press, 2020. ↩︎
Andrews, Penny C. S. “The Platformization of Open.” In Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access, edited by Martin Paul Eve and Jonathan Gray, 265–76. Cambridge, MA: The MIT Press, 2020. ↩︎
Bosch, Stephen and Kittie Henderson. “Predicting the Future in 3,000 Words and Charts: The Library Journal Serials Pricing Article.” The Serials Librarian vol. 74.1–4 (2018): 224–27. https://doi.org/10.1080/0361526X.2018.1430442. ↩︎
Jurchen, Sarah. “Open Access and the Serials Crisis: The Role of Academic Libraries.” Technical Services Quarterly vol. 37.2 (2020): 160–70. https://doi.org/10.1080/07317131.2020.1728136. ↩︎
Mboa Nkoudou, Thomas Hervé. “Epistemic Alienation in African Scholarly Communications: Open Access as a Pharmakon.” In Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access, edited by Martin Paul Eve and Jonathan Gray, 25–40. Cambridge, MA: The MIT Press, 2020. ↩︎
Moore, Samuel, Cameron Neylon, Martin Paul Eve, Daniel O’Donnell and Damian Pattinson. “Excellence R Us: University Research and the Fetishisation of Excellence.” Palgrave Communications 3 (2017). https://doi.org/10.1057/palcomms.2016.105. ↩︎
Eve, Martin Paul, Cameron Neylon, Daniel O’Donnell, Samuel Moore, Robert Gadie, Victoria Odeniyi and Shahina Parvin. Reading Peer Review: PLOS ONE and Institutional Change in Academia. Cambridge: Cambridge University Press, 2021. ↩︎
Das, Jishnu, Quy-Toan Do, Karen Shaines and Sowmya Srikant. “U.S. and Them: The Geography of Academic Research.” Journal of Development Economics 105 (2013): 112–30. https://doi.org/10.1016/j.jdeveco.2013.07.010. ↩︎
Roh, Charlotte, Harrison W. Inefuku and Emily Drabinski. “Scholarly Communications and Social Justice.” In Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access, edited by Martin Paul Eve and Jonathan Gray, 41–52. Cambridge, MA: The MIT Press, 2020. ↩︎
Kapczynski, Amy and Gaëlle Krikorian, eds. Access to Knowledge in the Age of Intellectual Property. NY: Zone Books, 2010. ↩︎
Andročec, Darko. “Analysis of Sci-Hub Downloads of Computer Science Papers.” Acta Universitatis Sapientiae, Informatica vol. 9.1 (2017): 83–96. https://doi.org/10.1515/ausi-2017-0006. ↩︎
Boudry, Christophe, Patricio Alvarez-Muñoz, Ricardo Arencibia-Jorge, Didier Ayena, Niels J. Brouwer, Zia Chaudhuri, Brenda Chawner, et al. “Worldwide Inequality in Access to Full Text Scientific Articles: The Example of Ophthalmology.” PeerJ 7 (2019): e7850. https://doi.org/10.7717/peerj.7850. ↩︎
Chan, Leslie, Darius Cuplinskas, Michael Eisen, Fred Friend, Yana Genova, Jean-Claude Guédon, Melissa Hagemann, et al. “Budapest Open Access Initiative.” February 14, 2002. https://www.soros.org/openaccess/read.shtml. ↩︎
“Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities.” October 22, 2003. https://oa.mpg.de/lang/en-uk/berlin-prozess/berliner-erklarung. ↩︎
Suber, Peter, Patrick O. Brown, Diane Cabell, Aravinda Chakravarti, Barbara Cohen, Tony Delamothe, Michael Eisen, et al. “Bethesda Statement on Open Access Publishing.” 2013. http://dash.harvard.edu/handle/1/4725199. ↩︎
Fitzpatrick, Kathleen. Planned Obsolescence: Publishing, Technology, and the Future of the Academy. NY: New York University Press, 2011. ↩︎
Suber, Peter. Open Access. Cambridge, MA: The MIT Press, 2012. http://bit.ly/oa-book. ↩︎
Eve, Martin Paul. Open Access and the Humanities: Contexts, Controversies and the Future. Cambridge: Cambridge University Press, 2014. https://doi.org/10.1017/CBO9781316161012. ↩︎
Herb, Ulrich. “Open Access And Symbolic Gift Giving.” In Open Divide: Critical Studies on Open Access, edited by Joachim Schöpfel and Ulrich Herb, 69–81. Sacramento, CA: Library Juice Press, 2018. https://doi.org/10.5281/zenodo.1206377. ↩︎ ↩︎
Adema, Janneke and Samuel A. Moore. “Scaling Small; Or How to Envision New Relationalities for Knowledge Production.” Westminster Papers in Communication and Culture vol. 16.1 (2021). https://doi.org/10.16997/wpcc.918. ↩︎
Risam, Roopika. New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy. Evanston, IL: Northwestern University Press, 2018. ↩︎
Faust, Jeremy S. “Sci-Hub: A Solution to the Problem of Paywalls, or Merely a Diagnosis of a Broken System?” Annals of Emergency Medicine vol. 68.1 (2016): A15–17. https://doi.org/10.1016/j.annemergmed.2016.05.010. ↩︎ ↩︎
Hobsbawm, Eric. Bandits. NY: Pantheon Books, 1981. ↩︎
Tanenbaum, Andrew S. and Linus Torvalds. “Appendix A: The Tanenbaum-Torvalds Debate.” In Open Sources: Voices from the Open Source Revolution, edited by Chris DiBona, Sam Ockman, and Mark Stone, 221–51. Sebastopol: O’Reilly Media, Inc., 2008. http://public.ebookcentral.proquest.com/choice/publicfullrecord.aspx?p=443191. ↩︎
Green, Toby. “We’ve Failed: Pirate Black Open Access Is Trumping Green and Gold and We Must Change Our Approach.” Learned Publishing vol. 30.4 (2017): 325–29. https://doi.org/10.1002/leap.1116. ↩︎ ↩︎
Brembs, Björn. “The Ultimate Open Access Timeline.” Bjoern.Brembs.Blog, March 3, 2020. https://bjoern.brembs.net/2020/03/the-ultimate-open-access-timeline/. ↩︎
Swartz, Aaron. “Guerilla Open Access Manifesto.” In The Boy Who Could Change the World, 26–27. London: Verso, 2015. ↩︎
Hockenberry, Benjamin. “The Guerilla Open Access Manifesto: Aaron Swartz, Open Access and the Sharing Imperative.” Lavery Library Faculty/Staff Publications, November 2013, 1–7. ↩︎
Machin-Mastromatteo, Juan D, Alejandro Uribe-Tirado and Maria E Romero-Ortiz. “Piracy of Scientific Papers in Latin America: An Analysis of Sci-Hub Usage Data.” Information Development vol. 32.5 (2016): 1806–14. https://doi.org/10.1177/0266666916671080. ↩︎
Himmelstein, Daniel S, Ariel Rodriguez Romero, Jacob G Levernier, Thomas Anthony Munro, Stephen Reid McLaughlin, Bastian Greshake Tzovaras and Casey S Greene. “Sci-Hub Provides Access to Nearly All Scholarly Literature.” ELife 7 (February 2018): e32822. https://doi.org/10.7554/eLife.32822. ↩︎
Russell, Carrie and Ed Sanchez. “Sci-Hub Unmasked: Piracy, Information Policy, and Your Library.” College & Research Libraries News vol. 77.3 (2017): 122–25. https://doi.org/10.5860/crln.77.3.9457. ↩︎
Schiermeier, Quirin. “US Court Grants Elsevier Millions in Damages from Sci-Hub.” Nature, June 2017. https://doi.org/10.1038/nature.2017.22196. ↩︎
Reddy, Hrishikesh and Shivang Mishra. “Sci-Hub Case: Legally Removing the Barriers in the Way of Science.” NLUJ Law Review, April 29, 2021. http://www.nlujlawreview.in/sci-hub-case-legally-removing-the-barriers-in-the-way-of-science/. ↩︎
Bodó, Balázs, Dániel Antal and Zoltán Puha. “Can Scholarly Pirate Libraries Bridge the Knowledge Access Gap? An Empirical Study on the Structural Conditions of Book Piracy in Global and European Academia.” PLOS ONE vol. 15.12 (2020): e0242509. https://doi.org/10.1371/journal.pone.0242509. ↩︎
Bodó, Balázs. “The Genesis of Library Genesis: The Birth of a Global Scholarly Shadow Library.” In Shadow Libraries: Access to Educational Materials in Global Higher Education, edited by Joe Karaganis, 25–52. Cambridge, MA: The MIT Press, 2018. ↩︎ ↩︎
It is worth noting that some of the secondary accounts of Library Genesis can tip over into hagiographic commentary, an angle that I seek to avoid here. ↩︎
Dulong de Rosnay, Melanie. “Open Access Models, Pirate Libraries and Advocacy Repertoires: Policy Options for Academics to Construct and Govern Knowledge Commons.” Westminster Papers in Communication and Culture vol. 16.1 (2021). https://doi.org/10.16997/wpcc.913. ↩︎
Harris, Shane and Devlin Barrett. “Justice Department Investigates Sci-Hub Founder on Suspicion of Working for Russian Intelligence.” The Washington Post , December 19, 2019. https://www.washingtonpost.com/national-security/justice-department-investigates-sci-hub-founder-on-suspicion-of-working-for-russian-intelligence/2019/12/19/9dbcb6e6-2277-11ea-a153-dce4b94e4249_story.html. ↩︎
Camarero, Carmen, Carmen Antón and Javier Rodríguez. “Technological and Ethical Antecedents of E-Book Piracy and Price Acceptance: Evidence from the Spanish Case.” The Electronic Library vol. 32.4 (2014): 542–66. https://doi.org/10.1108/EL-11-2012-0149. ↩︎
Bohannon, John. “Who’s Downloading Pirated Papers? Everyone.” Science, April 25, 2016. https://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone. ↩︎
Till, Brian M., Niclas Rudolfson, Saurabh Saluja, Jesudian Gnanaraj, Lubna Samad, David Ljungman and Mark Shrime. “Who Is Pirating Medical Literature? A Bibliometric Review of 28 Million Sci-Hub Downloads.” The Lancet Global Health vol. 7.1 (2019): e30–31. https://doi.org/10.1016/S2214-109X(18)30388-7. ↩︎
Correa, Juan C., Henry Laverde-Rojas, Julian Tejada and Fernando Marmolejo-Ramos. “The Sci-Hub Effect on Papers’ Citations.” Scientometrics, January 2021. https://doi.org/10.1007/s11192-020-03806-w. ↩︎
Pastoureau, Michel. “L’incolore n’existe Pas.” In Points de Vue: Pour Philippe Junod, edited by Danielle Chaperon and Philippe Kaenel, 21–36. Champs Visuels. Paris: L’Harmattan, 2003. ↩︎
Thylstrup, Nanna Bonde. The Politics of Mass Digitization. Cambridge, MA: The MIT Press, 2018. ↩︎
Mockapetris, P. V. “Domain Names - Concepts and Facilities.” Internet Engineering Task Force. November 1987. https://tools.ietf.org/html/rfc1034. ↩︎
Bambauer, Derek E. “Orwell’s Armchair.” The University of Chicago Law Review vol. 79.3 (2012): 863–944. ↩︎
Farnan, Oliver, Alexander Darer and Joss Wright. “Analysing Censorship Circumvention with VPNs via DNS Cache Snooping.” ArXiv:1907.04023 [Cs], July 2019. https://arxiv.org/abs/1907.04023. ↩︎
Schiermeier, Quirin. “Pirate Research-Paper Sites Play Hide-and-Seek with Publishers.” Nature, December 2015. https://doi.org/10.1038/nature.2015.18876. ↩︎
Sar, Ernesto Van Der. “Sci-Hub, BookFi and LibGen Resurface After Being Shut Down.” TorrentFreak, November 21, 2015. https://torrentfreak.com/sci-hub-and-libgen-resurface-after-being-shut-down-151121/. ↩︎
Schmitt, Paul, Anne Edmundson, and Nick Feamster. “Oblivious DNS: Practical Privacy for DNS Queries.” ArXiv:1806.00276 [Cs], December 2018. http://arxiv.org/abs/1806.00276. ↩︎
Szulborski, Dave. This Is Not A Game: A Guide To Alternate Reality Gaming. Macungie, PA: New-Fiction Pub, 2005. ↩︎
Garcia, Antero and Greg Niemeyer. “Introduction.” In Alternate Reality Games and the Cusp of Digital Gameplay, edited by Antero Garcia and Greg Niemeyer, 1–28. NY: Bloomsbury Academic, 2017. ↩︎
For more on passwords, see ⁷⁴. ↩︎
Bodó, Balázs. “Library Genesis in Numbers: Mapping the Underground Flow of Knowledge.” In Shadow Libraries: Access to Educational Materials in Global Higher Education, edited by Joe Karaganis, 53–78. Cambridge, MA: The MIT Press, 2018. ↩︎
Poirier, Dave. “The Second Extended File System.” Savannah. 2001. https://www.nongnu.org/ext2-doc/ext2.html. ↩︎
Dawar, Anil and Maev Kennedy. “British Library Mislays 9,000 Books.” The Guardian, March 17, 2009. https://www.theguardian.com/uk/2009/mar/17/british-library-books-mein-kampf. ↩︎
“Facts and Figures of the British Library.” The British Library. Accessed September 23, 2021. https://www.bl.uk/about-us/our-story/facts-and-figures-of-the-british-library. ↩︎
The metadata fields are ID, Title, VolumeInfo, Series, Periodical, Author, Year, Edition, Publisher, City, Pages, PagesInFile, Language, Topic, Library, Issue, Identifier, ISSN, ASIN, UDC, LBC, DDC, LCC, Doi, Googlebookid, OpenLibraryID, Commentary, DPI, Color, Cleaned, Orientation, Paginated, Scanned, Bookmarked, Searchable, Filesize, Extension, MD5, Generic, Visible, Locator, Local, TimeAdded, TimeLastModified, Coverurl, Tags, IdentifierWODash. ↩︎
Ramirez, Gorka. “MD5: The Broken Algorithm.” Avira Blog, July 28, 2015. https://blog.avira.com/md5-the-broken-algorithm/. ↩︎
This was verified by the MySQL query: SELECT MD5, COUNT(MD5) FROM updated GROUP BY MD5 HAVING COUNT(MD5) > 1;. ↩︎
Menasché, Daniel S., Antonio A. de A. Rocha, Bin Li, Don Towsley and Arun Venkataramani. “Content Availability and Bundling in Swarming Systems.” IEEE/ACM Transactions on Networking vol. 21.2 (2013): 580–93. https://doi.org/10.1109/TNET.2012.2212205. ↩︎
Neglia, G., G. Reina, H. Zhang, D. Towsley, A. Venkataramani and J. Danaher. “Availability in BitTorrent Systems.” In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, 2007, 2216–24. Anchorage, AK, USA: IEEE. https://doi.org/10.1109/INFCOM.2007.256. ↩︎
u/shrine. “Library Genesis Project Update: 2.5 Million Books Seeded with the World, 80 Million Scientific Articles Next.” Reddit - r/DataHoarder. Accessed September 28, 2021. https://www.reddit.com/r/DataHoarder/comments/ed9byj/library_genesis_project_update_25_million_books/. ↩︎ ↩︎
Maxwell, Andy. “Meet the Guy Behind the Libgen Torrent Seeding Movement.” TorrentFreak, December 5, 2019. https://torrentfreak.com/meet-the-guy-behind-the-libgen-torrent-seeding-movement-191205/. ↩︎
u/shrine. “Charitable Seeding Update: 10 Terabytes and 900,000 Scientific Books in a Week with Seedbox.Io and UltraSeedbox.” Reddit - r/Seedboxes. Accessed September 28, 2021. https://www.reddit.com/r/seedboxes/comments/e3yl23/charitable_seeding_update_10_terabytes_and_900000/. ↩︎
Rahalkar, Chaitanya, and Dhaval Gujar. “Content Addressed P2P File System for the Web with Blockchain-Based Meta-Data Integrity.” ArXiv:1912.10298 [Cs], January 2020. http://arxiv.org/abs/1912.10298. ↩︎
Benet, Juan. “IPFS - Content Addressed, Versioned, P2P File System.” ArXiv:1407.3561 [Cs], July 2014. https://arxiv.org/abs/1407.3561. ↩︎ ↩︎
Eve, Martin Paul. Password. NY: Bloomsbury, 2016. ↩︎