ancuuiqter

joined 2 years ago
[–] ancuuiqter 5 points 10 months ago* (last edited 10 months ago) (1 children)

Maybe you're thinking of Sci-Hub and its founder, Alexandra Asanovna Elbakyan?

I could not find a location on Anna's Archive's wiki page.

[–] ancuuiqter 4 points 10 months ago* (last edited 10 months ago)

The official Anna's Archive Reddit account, AnnaArchivist, has responded to an r/Annas_Archive post linking the same Torrent Freak article:

Thanks! We're not making any public statements about this lawsuit but rest assured we're fine.

[–] ancuuiqter 8 points 10 months ago (3 children)

Would you be able to share where you learned that Anna's Archive is based in Kazakhstan?

[–] ancuuiqter 3 points 10 months ago* (last edited 10 months ago)

Regarding the operating location(s) of Anna's Archive, OCLC is alleging the following (pages 7-9):

C. Defendants Rely on Sophisticated Technology and Online Practices to Conceal their Identities.

Defendants understand that their pirate library enterprise and related activities, here, hacking and harvesting OCLC’s WorldCat® records, are illegal. Defendants admit that they are engaging in and facilitating mass copyright infringement, stating, “[w]e deliberately violate the copyright law in most countries.” In another blog post, Defendants noted that their activities could lead to arrest and “decades of prison time.” Defendants have also recognized that their hacking and distribution of OCLC’s data is improper, acknowledging that WorldCat® is a “proprietary database,” that OCLC’s “business model requires protecting their database,” and that Defendants are “giving it all away. :-).”

Because Defendants understand their actions infringe on copyright laws, amongst others, Defendants go to great lengths to remain anonymous to ensure both that Anna’s Archive’s domains are not taken down and to avoid the legal consequences of their actions, including civil lawsuits where parties like OCLC seek to vindicate their rights, as well as criminal and regulatory enforcement actions undertaken by government entities. None of Anna’s Archive’s domains or its online blog provide a business address, business contact, or other contact information that would be found on a legitimate entity’s website.

Defendants have explained in a blog post that they are “being very careful not to leave any trace [of their online activities], and having strong operational security.” For instance, Anna’s Archive utilizes a VPN with “[a]ctual court-tested no-log policies with long track records of protecting privacy.” Each of the Anna’s Archive domains are registered using foreign hosts, registrars, and registrants in order to conceal the identity of the site operators. Additionally, Defendants rely on multiple proxy servers to maintain anonymity. Defendants also use a free version of Cloudflare, a top-level hosting provider, so that they do not have to provide any payment or other identifying information. Defendants selected Cloudflare because they claim Cloudflare has resisted requests to take down websites for copyright infringement. The individuals behind Anna’s Archive also use usernames as pseudonyms to mask their identities online.

Through the work of a cyber security and digital forensic investigation firm, OCLC was able to identify one of the individuals behind Anna’s Archive by name and locate a United States address, Defendant Maria Dolores Anasztasia Matienzo. However, the physical address and contact information of Anna’s Archive and the identities and contact information of the John Does remain unknown. It is highly likely that Anna’s Archive is a non-domestic, foreign entity, based on the findings from OCLC’s investigator, as set forth below.

OCLC explained the above in their Motion To Serve Defendant Anna’s Archive By Email, as justification for why they seek "permission to serve Anna’s Archive by alternative means, here, email, pursuant to Federal Rule of Civil Procedure 4(h)(2) and (f)(3)."

[–] ancuuiqter 11 points 10 months ago (3 children)

As to how Anna's Archive accomplished their data scraping, this is what OCLC claims (see pages 62-63):

  1. These attacks were accomplished with bots (automated software applications) that “scraped” and harvested data from WorldCat.org and other WorldCat®-based research sites and that called or pinged the server directly. These bots were initially masked to appear as legitimate search engine bots from Bing or Google.

  2. To scrape or harvest the data on WorldCat.org, the bots searched WorldCat.org results, running a script based on OCN for individual JavaScript Object Notation, or “JSON,” records. As a result, WorldCat® data including freely accessible and enriched data, such as OCNs, were scraped from individual results on WorldCat.org.

  3. The bots also harvested data from WorldCat.org by pretending to be an internet browser, directly calling or “pinging” OCLC’s servers, and bypassing the search, or user interface, of WorldCat.org. More robust WorldCat® data was harvested directly from OCLC’s servers, including enriched data not available through the WorldCat.org user interface.

  4. Finally, WorldCat® data was harvested from a member’s website incorporating WorldCat® Discovery Services, a subscription-based variation of WorldCat.org that is available only to a member’s patrons. Again, the hacker pinged OCLC’s servers to harvest WorldCat® records directly from the servers. To do this through WorldCat® Discovery Services/FirstSearch, the hacker obtained and used the member’s credentials to authenticate the requests to the server as a member library.

  5. From WorldCat® Discovery Services, hackers harvested 2 million richer WorldCat® records that included data not available in WorldCat.org. This hacking method resulted in the harvesting of some of OCLC’s most proprietary fields of WorldCat® data.

  6. These hacking attacks materially affected OCLC’s production systems and servers, requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers. To respond to these ongoing attacks, OCLC spent over 1.4 million dollars on its systems’ infrastructure and devoted nearly 10,000 employee hours to the same.

  7. Despite OCLC’s best efforts, OCLC’s customers experienced many significant disruptions in paid services during the aforementioned period as a result of the attacks on WorldCat.org, requiring OCLC to create system workarounds to ensure services functioned.

  8. During this time, customers threatened and likely did cancel their products and services with OCLC due to these disruptions.

  9. Because OCLC had to combat these persistent hacking attacks, OCLC was forced to divert existing personnel and resources from OCLC’s other products and services. As a result, OCLC’s development and improvements to other products and services were delayed and limited.

  10. OCLC has devoted, at various times, ten or more employees to respond to and mitigate the harm from these attacks from October 2022 to present.
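Mechanically, the per-OCN JSON harvesting described in points 1-2 would look something like the sketch below. The base URL, query parameters, and header value are hypothetical stand-ins, not OCLC's actual interface; the point is just the shape of the technique (one request per OCN, with a spoofed crawler User-Agent):

```python
# Hypothetical sketch of OCN-keyed JSON scraping as alleged in the filing.
# The endpoint and query parameters below are invented for illustration.
from urllib.parse import urlencode

BASE_URL = "https://catalog.example.org/records"  # stand-in, not a real endpoint

# The filing says the bots were "masked to appear as legitimate search
# engine bots from Bing or Google" -- i.e. a spoofed User-Agent header.
CRAWLER_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

def json_record_request(ocn: int) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for fetching one OCN's JSON record."""
    query = urlencode({"oclcNumber": ocn, "format": "json"})
    return f"{BASE_URL}?{query}", {"User-Agent": CRAWLER_UA}

url, headers = json_record_request(1039085)
print(url)  # https://catalog.example.org/records?oclcNumber=1039085&format=json
```

Running a script like this over a list of OCNs, record by record, matches the filing's description of "running a script based on OCN for individual ... JSON records."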

[–] ancuuiqter 15 points 10 months ago (5 children)

Here are the court filings if anyone would like to read them:

https://archive.org/details/gov.uscourts.ohsd.287709/

The following is a link to the docket (which the above link draws from), so people can follow the progress of the lawsuit:

https://www.courtlistener.com/docket/68157923/oclc-online-computer-library-center-inc-v-annas-archive/

 

Today we are forced to share some sad news - yesterday many of our domains were seized again. We should highlight that the majority of the seized domains were not mirrors of the Z-Library website. Instead, they were separate sub-projects, containing only books in rare languages of the world, and their blocking is perplexing. For instance, these domains included books in Tamil, Mongolian, Catalan, Urdu, Pashto, and other languages:

afrikaans-books.org

bengali-books.org

urdu-books.org

marathi-books.org

chamorro-books.org

Over the 15 years of the project's existence, we've managed to collect an impressive collection of rare texts in many uncommon languages. These domains featured many unique texts that can't be found anywhere else, including rare books, documents, and manuscripts. All of this is a priceless heritage, contributing to the preservation and study of world cultures, and serving as important material for researchers in linguistics, anthropology, and history.

Z-Library also states in the blog post that they did not lose the files, just the domains.

[–] ancuuiqter 1 points 1 year ago (1 children)

Do you mind elaborating? Is there something you could share that provides more context?

[–] ancuuiqter 12 points 1 year ago (3 children)

What if the community shifted to an already-existing one?

https://lemmy.ml/c/libgen

[–] ancuuiqter 3 points 1 year ago (1 children)

Thanks for your follow-up. Were there any works you've started reading? Anything you found particularly interesting?

[–] ancuuiqter 30 points 1 year ago (1 children)

Mentioning this since the project Anna's Archive compiles several datasets and their corresponding torrents.

Anna's Archive, whose stated aim is to "archive all the books in the world, and make them widely accessible," pulls from a number of shadow-library sources. The project publishes its own torrents (available via Tor) for Library Genesis, Z-Library, the Internet Archive, and others, and also links to the torrents the source libraries release themselves (as Library Genesis does, for example). In the datasets linked below, you can click on a given source to find its onion site or those first-party torrents.

Anna's Archive datasets

...almost all files shown on Anna’s Archive are available through torrents. Below is a list of the different data sources that we use, with links to their torrents. Our own torrents are available on Tor.

Sources include

  • Internet Archive Digital Lending Library
  • Libgen.li comics
  • Z-Library scrape
  • ISBNdb scrape
  • Libgen auxiliary data
  • Libgen.rs
  • Libgen.li (includes Sci-Hub)
 

cross-posted from: https://lemmy.world/post/1330512

Below are direct quotes from the filings.

OpenAI

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Meta

Bibliotik is one of a number of notorious “shadow library” websites that also includes Library Genesis (aka LibGen), Z-Library (aka B-ok), and Sci-Hub. The books and other materials aggregated by these websites have also been available in bulk via torrent systems. These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal.

This article from Ars Technica covers a few more details. Filings are viewable at the law firm's site here.

 


[–] ancuuiqter 1 points 2 years ago* (last edited 2 years ago)

You could try d-fi to download Spotify playlists; the tool pulls the tracks from Deezer, another streaming platform. You have to set up a config file with an ARL token, which is tied to a Deezer subscription. Depending on the subscription plan behind the ARL, you can download up to FLAC quality (320 kbps MP3 and 128 kbps MP3 are also selectable). Also contingent on the subscription plan is whether you'll be able to download tracks marked as explicit by Deezer.

d-fi: https://github.com/d-fi/releases/releases

You can grab an ARL from here: https://rentry.org/firehawk52#deezer-arls

Once you run the program, paste the link to the Spotify playlist into the search interface and it will match each entry to a song in Deezer's library. You can download all tracks in the playlist or pick only the songs you want. After downloading, it also creates a .m3u8 playlist file mirroring the playlist you searched for.
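For reference, the ARL setup amounts to a small config entry along these lines. The filename, key names, and quality value here are assumptions for illustration, not d-fi's documented format, so check the project's README for the actual schema:

```json
{
  "arl": "<paste-your-deezer-arl-token-here>",
  "quality": "FLAC"
}
```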

[–] ancuuiqter 2 points 2 years ago (1 children)

Yeah, it's not clear. The following explanation goes into a bit more detail on finding what you want to download, for anyone else who has trouble identifying books in the torrent.

Try browsing through this link on the publisher's site for any subjects or books you'd be interested in. If you find something, copy the ISBN (it appears in the work's direct URL, or you can open the work's page, which lists both the eBook and hardcover ISBNs; you need the eBook ISBN as far as I understand). If you open the torrent and let the file contents load, depending on your client you can browse the structure, search all files for the ISBN, and locate the book that way. Files all seem to be named by their DOI, with the / replaced by _. I gather the ISBN forms part of the DOI naming convention.
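To make the lookup concrete, here is a small sketch of that matching step. The sample DOIs are made up, but the filename convention (DOI with / replaced by _) is the one described above:

```python
# Match a torrent's DOI-style filenames against an eBook ISBN.
# The sample DOIs below are invented for illustration.

def doi_to_filename(doi: str, ext: str = "pdf") -> str:
    """Apply the torrent's naming convention: '/' in the DOI becomes '_'."""
    return doi.replace("/", "_") + "." + ext

def find_by_isbn(filenames: list[str], isbn: str) -> list[str]:
    """Return filenames containing the ISBN's digits (hyphens ignored)."""
    digits = isbn.replace("-", "")
    return [name for name in filenames if digits in name]

files = [
    doi_to_filename("10.1515/9783110212129"),
    doi_to_filename("10.1515/9783110660890"),
]
print(find_by_isbn(files, "978-3-11-021212-9"))  # ['10.1515_9783110212129.pdf']
```

Most clients also let you do this interactively with their built-in file filter; the code is just the same substring search spelled out.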

It doesn't look like any work tagged as Ahead of Publication will be in this torrent, but works from 2023 and prior appear to be.

14
DeGruyter Collection (self.leftpiracy)
submitted 2 years ago* (last edited 2 years ago) by ancuuiqter to c/[email protected]
 

cross-posted from: https://lemmy.world/post/274818

tldr: a huge torrent of books straight from the academic publisher De Gruyter

cross-posted from: https://teddit.net/r/DataHoarder/comments/1463ah3/degruyter_collection/

Trying again because of reddit's filters; let's see if I get it right this time.

After months of scraping, I finally finished downloading almost every single De Gruyter book to which I have access, which is a lot of them. And so I created a torrent.

magnet:?xt=urn:btih:76f573241a0126fb1ab0aa5540cc7493c045ae74&dn=Degruyter%20Imprints%20v2%20%5b09-06-23%5d&tr=http%3a%2f%2fatrack.pow7.com%2fannounce&tr=udp%3a%2f%2fopen.stealth.si%3a80%2fannounce&tr=udp%3a%2f%2ftracker.cyberia.is%3a6969%2fannounce&tr=udp%3a%2f%2fretracker.lanta-net.ru%3a2710%2fannounce&tr=udp%3a%2f%2ftracker.moeking.me%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.torrent.eu.org%3a451%2fannounce&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=http%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2fopentra
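If you want to sanity-check the link before handing it to a client, the magnet format is easy to pick apart with the standard library. This generic sketch uses a shortened copy of the link above (only one tracker kept):

```python
# Pull the info-hash, display name, and tracker list out of a magnet link.
from urllib.parse import urlparse, parse_qs

magnet = (
    "magnet:?xt=urn:btih:76f573241a0126fb1ab0aa5540cc7493c045ae74"
    "&dn=Degruyter%20Imprints%20v2%20%5b09-06-23%5d"
    "&tr=udp%3a%2f%2ftracker.torrent.eu.org%3a451%2fannounce"
)

params = parse_qs(urlparse(magnet).query)       # values arrive URL-decoded
info_hash = params["xt"][0].rsplit(":", 1)[-1]  # "urn:btih:<hash>" -> hash
name = params["dn"][0]
trackers = params["tr"]                         # one entry per &tr=

print(info_hash)  # 76f573241a0126fb1ab0aa5540cc7493c045ae74
print(name)       # Degruyter Imprints v2 [09-06-23]
```

The info-hash is the part that uniquely identifies the torrent; the trackers are only discovery hints, so a truncated tracker list (as at the end of the link above) doesn't invalidate the magnet.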

The content of the torrent is pretty much these https://www.degruyter.com/search?query=*&startItem=0&pageSize=10&sortBy=mostrecent&documentVisibility=all&documentTypeFacet=book&publisherFacet=De+Gruyter%7EDe+Gruyter+Oldenbourg%7EDe+Gruyter+Saur%7EDe+Gruyter+Mouton

Neither the files nor the series have been renamed. They all bear their original filenames, which is much better since each can easily be keyed to the book's unique page.

Note: this torrent includes only the abovementioned De Gruyter imprints. In the upcoming weeks I will create a second torrent with the De Gruyter partner publishers: about 100k books from those publishers

https://i.imgur.com/mSKrLto.png And in a few months, the last torrent, which will include the De Gruyter journals.

This endeavour would not have been possible were it not for all the people who granted me academic access, wrote scripts for me, helped me ensure the integrity of the files, and so on. Sadly, many files, especially epubs, are corrupted or downright missing at the source. There are also some dupes that, being part of multiple series and subseries, were downloaded twice. Total torrent size is about 2 TB.

https://i.imgur.com/BjqsUqJ.png of course everything is actively being shared with nexus/annas/libgen as well as private groups and friends and the classicist discord channel. I did not waste months so I could jerkoff on the big numbah, therefore I want the files to be shared and reshared as much as possible in order to grant indirect academic access to all those students (me being one) whose universities cannot afford degruyter subscriptions.

Last note: all the files are retail untouched (according to BIB standards, if anyone here is a member). So if some epub has shitty formatting blame the publisher.

edit: fixed magnet url

10
DeGruyter Collection (self.datahoarder)
submitted 2 years ago* (last edited 2 years ago) by ancuuiqter to c/[email protected]
 


 

The product of a decade of research, this landmark collection is the first of four volumes in the Women Writing Africa Project, which seeks to document and map the extraordinary and diverse landscape of African women’s oral and written literatures. Presenting voices rarely heard outside Africa, some recorded as early as the mid-nineteenth century, as well as rediscovered gems by such well-known authors as Bessie Head and Doris Lessing, this volume reveals a living cultural legacy that will revolutionize the understanding of African women’s literary and cultural production.

Each text is accompanied by a scholarly headnote that provides detailed historical background. An introduction by the editors sets the broader historical stage and explores the many issues involved in collecting and combining orature and literature from diverse cultures in one volume. Unprecedented in its scope and achievement, this volume will be an essential resource for anyone interested in women’s history, culture, and literature in Africa, and worldwide.
