this post was submitted on 19 Jul 2023
Comradeship // Freechat
So far, no data or source code has been provided alongside the paper on its arXiv page. I'd expect Jupyter notebooks, or something like that at least, for reproducibility. Perhaps they will be added later, or they are provided at a URL within the paper that I have not yet read.
In any case, the screenshot is of Table 11, and it is found in Appendix D, Domain Analysis:
Describing foreignpolicy.com as left-wing is an example of miscategorization by the authors, as is calling redsails.org a "Chinese far-left platform." Neither of these is an accurate statement, and both undercut trust that the authors are correctly and thoroughly labeling and interpreting their data. There are other glaring oversights in Table 12, which purports that domains like "redditsave.com," "ko-fi.com," "twimg.com," and "archive.is" are "representative domains of tankies" specifically and supposedly not heavily found in other similar far-left communities (as per the authors' description of the TF-IDF algorithm and their motivation for its use). Taken together, there is a compelling case that the authors (1) do not themselves possess a sufficient level of understanding of left-wing ideology -- much less Marxist-Leninist ideology -- to label it accurately, and (2) may have been sloppy with their data analysis (though this can't be definitively known without access to the underlying datasets and analytics source code).
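To illustrate why those Table 12 entries look odd, here is a minimal sketch of plain TF-IDF over domains, treating each community as one "document." This is not the authors' code (none has been published as far as I can see), their exact TF-IDF variant may differ, and the community names, domains, and counts below are made up for illustration.

```python
import math
from collections import Counter


def tfidf_by_community(domain_counts):
    """Score each domain per community with plain TF-IDF.

    domain_counts maps a community name to a Counter of
    {domain: number of links to that domain}.
    """
    n_communities = len(domain_counts)

    # Document frequency: in how many communities does each domain appear?
    df = Counter()
    for counts in domain_counts.values():
        df.update(set(counts))

    scores = {}
    for community, counts in domain_counts.items():
        total_links = sum(counts.values())
        scores[community] = {
            # term frequency * inverse document frequency
            domain: (n / total_links) * math.log(n_communities / df[domain])
            for domain, n in counts.items()
        }
    return scores


# Hypothetical toy data: a generic image host linked everywhere,
# plus one domain unique to each community.
toy = {
    "community_a": Counter({"twimg.com": 50, "example-a.org": 5}),
    "community_b": Counter({"twimg.com": 40, "example-b.org": 8}),
    "community_c": Counter({"twimg.com": 30, "example-c.org": 2}),
}
scores = tfidf_by_community(toy)
```

In this toy setup, twimg.com appears in every community, so its IDF term is log(3/3) = 0 and its score is zero everywhere, no matter how often it is linked. Under this plain formulation, a generic utility domain can only rank as "representative" of one community if the comparison communities barely link to it, which is the kind of thing that would be worth checking against the real data.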
Majestic is described on the cited URL as: "The million domains we find with the most referring subnets." Basically, of the 7,049 different domains contained in the 146,078 URLs the authors found in their crawl, they remove any that are found in the top 1,000 domains as defined by Majestic. That excludes domains like google.com, facebook.com, and reddit.com. (Whether or not the authors recognize the potential problem with excluding that last one from the table is unknown at this point; I have not finished reviewing the paper.)
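For anyone who wants to picture that filtering step, here is a rough sketch of how I read it. Again, this is my own guess at the procedure, not the authors' code; the function names, the three-entry "top list," and the sample URLs are all placeholders, and the real list would be the first 1,000 rows of Majestic's ranking.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_popular(urls, top_domains):
    """Group URLs by domain, dropping any domain found in top_domains."""
    kept = {}
    for url in urls:
        domain = domain_of(url)
        if domain and domain not in top_domains:
            kept.setdefault(domain, []).append(url)
    return kept


# Placeholder stand-in for the top-1,000 list.
top_domains = {"google.com", "facebook.com", "reddit.com"}

# Placeholder URLs, not taken from the paper's crawl.
urls = [
    "https://www.reddit.com/r/example",
    "https://redsails.org/some-article",
    "https://foreignpolicy.com/some-article",
]
kept = filter_popular(urls, top_domains)
```

Note that this matches on the exact hostname, so a subdomain such as old.reddit.com would slip through unless it is first reduced to its registered domain. How the authors handled that detail is another thing that can't be checked without their code.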
Thanks