this post was submitted on 25 Jun 2023

165 points (100.0% liked)

What kind of data do Lemmy instances store about their users? (self.nostupidquestions)

submitted 2 years ago by klappscheinwerfer to c/nostupidquestions

[–] chagall 91 points 2 years ago* (last edited 2 years ago) (1 children)

Disclaimer: this applies only to lemmy.world.

I actually read the privacy policy. There are basically 3 segments of data:

1. Data from the one time you signed up.
2. Data from every time you log in after you've signed up.
3. User-generated data.

For part one: They store your username and the IP address used when you create the account. They store a hashed version of your password, not the actual password. They'll store that info for as long as you have an account with lemmy.world (although they reserve the right to keep it for up to 12 months after you've deleted your account). They store the hashed password so you can log into your account.
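To illustrate what "a hashed version of your password" means in practice, here is a minimal sketch using PBKDF2 from the Python standard library. The algorithm, the iteration count and the function names are assumptions for illustration only; the policy summarized above does not say which scheme lemmy.world actually uses.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor, not taken from any policy


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); only these two values would be stored."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-hash the login attempt with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

The server can check a login attempt this way without ever keeping the password itself.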

For part two: They keep a log of the times you sign in, the device you signed in from (iOS, Android, web) and the IP address you do it from. They delete this data on a rolling basis, every 90 days from the date the login data was created (from the time you logged in).
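A rolling 90-day deletion like the one described could look roughly like the sketch below. The `login_log` table, its columns and the `purge_old_logins` helper are hypothetical names made up for this example, not anything taken from Lemmy or from the policy.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention window from the policy summary above


def purge_old_logins(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete login records older than the retention window; return how many were removed."""
    cutoff = int((now - RETENTION).timestamp())
    with conn:  # commits on success
        cur = conn.execute("DELETE FROM login_log WHERE logged_in_at < ?", (cutoff,))
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_log (user_id INTEGER, device TEXT, ip TEXT, logged_in_at INTEGER)"
)
now = datetime(2023, 6, 25, tzinfo=timezone.utc)
rows = [
    (1, "web", "203.0.113.7", int((now - timedelta(days=120)).timestamp())),
    (1, "android", "203.0.113.7", int((now - timedelta(days=10)).timestamp())),
]
conn.executemany("INSERT INTO login_log VALUES (?, ?, ?, ?)", rows)

deleted = purge_old_logins(conn, now)
remaining = conn.execute("SELECT device FROM login_log").fetchall()
print(deleted, remaining)  # 1 [('android',)]
```

Run on a schedule, a job like this means each login record disappears about 90 days after it was created.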

For part three: These are your posts, comments, upvotes, downvotes, etc. This is stored until you delete your comment/post or undo your upvote/downvote. When you delete your account, if you haven't deleted your data, the connection (the association) between your account and the data itself is severed. This means that the comment will remain but the username value will be null.
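The "severed association" can be pictured with a small sketch like this one. The two tables and the `delete_account` helper are hypothetical and only show the idea of a comment outliving its author; they are not Lemmy's real schema.

```python
import sqlite3


def delete_account(conn: sqlite3.Connection, user_id: int) -> None:
    """Detach the user's comments, then remove the account row."""
    with conn:  # both statements commit together
        conn.execute("UPDATE comments SET author_id = NULL WHERE author_id = ?", (user_id,))
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO comments VALUES (10, 1, 'They keep very little data.')")
conn.commit()

delete_account(conn, 1)
rows = conn.execute(
    "SELECT c.body, u.name FROM comments AS c LEFT JOIN users AS u ON u.id = c.author_id"
).fetchall()
print(rows)  # [('They keep very little data.', None)]
```

The comment text is still there, but nothing in the row points back to the account any more.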

tl;dr: I'm no expert but I think they keep a very small amount of data. They probably do this to keep their costs as low as possible (but that is just my speculation).

If you're really worried about data mining and data logging, you can always go back to reddit /s

[–] [email protected] 1 points 2 years ago

If you're really worried about data mining and data logging, you can always go back to reddit /s

Or just run your own lemmy instance.

[–] [email protected] 18 points 2 years ago (1 children)

DB schema for user info:

https://github.com/LemmyNet/lemmy/blob/63d3759c481ff2d7594d391ae86e881e2aeca56d/crates/db_schema/src/schema.rs#L380

https://github.com/LemmyNet/lemmy/blob/63d3759c481ff2d7594d391ae86e881e2aeca56d/crates/db_schema/src/schema.rs#L545

Just what you'd expect really (settings, profile data, posts/comments), not even user agent (what browser you use) is stored. But keep in mind any instance you sign up to could be using a forked version that inserts Google analytics or FB pixel or any other sort of tracking tech.

[–] Nioxic 2 points 2 years ago* (last edited 2 years ago) (1 children)

    deleted -> Bool,

does this indicate the user isn't gonna be deleted, just marked as "deleted" while still actually existing?

this is common practice, at least.

I'm a bit too lazy to look through all the code. maybe deleted users aren't shown, and their comments' "content" is still there?

though technically my comments could be considered personal data and thus break GDPR

[–] Aux 17 points 2 years ago (2 children)

Data should never be actually deleted from the database, that breaks all the best practices. It can be overwritten with garbage though. But it should always be present.

For example, say you create a new account with an email, username and password and get assigned some id like 42. Then after a while you want to delete your account. The account row should stay intact and id number 42 should still be occupied, but your email, username and password should be replaced by null values.
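As a rough sketch of that id-42 example: the row stays, a `deleted` flag is set, and the personal fields are overwritten with NULL. The table layout and the `anonymize_account` helper are hypothetical, chosen only to mirror the description above.

```python
import sqlite3


def anonymize_account(conn: sqlite3.Connection, user_id: int) -> None:
    """Keep the row and its id, but blank out the personal fields."""
    with conn:
        conn.execute(
            "UPDATE users SET email = NULL, username = NULL, password_hash = NULL, "
            "deleted = 1 WHERE id = ?",
            (user_id,),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, username TEXT, "
    "password_hash TEXT, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "INSERT INTO users (id, email, username, password_hash) "
    "VALUES (42, 'alice@example.org', 'alice', 'not-a-real-hash')"
)

anonymize_account(conn, 42)
row = conn.execute(
    "SELECT id, email, username, password_hash, deleted FROM users WHERE id = 42"
).fetchone()
print(row)  # (42, None, None, None, 1)
```

Anything that referenced id 42 keeps working, while the row no longer says who id 42 was.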

[–] PutangInaMo 1 points 2 years ago

Reference avro schema changes

[–] Nioxic -2 points 2 years ago (1 children)

so the last line in my comment would still apply

[–] Aux 6 points 2 years ago

No, they will be anonymised. That's compliant.

[–] [email protected] 17 points 2 years ago (2 children)

All I gave them is a private relay email address. So I am assuming there isn’t much else in terms of private data.

I am sure your ip address is logged somewhere for security. But beyond that I am not sure what else is there to store.

[–] danc4498 14 points 2 years ago (1 children)

There's tons more data than that that can be picked up. To start with, what posts you interact with and how. I'm sure there's loads of other data points that can be tracked.

[–] frostphunk 9 points 2 years ago (1 children)

Thankfully the code is open source so a good answer can be had. Unfortunately I'm not familiar with the code base.

Ofc instances themselves may modify the code to track certain things, which highlights the importance of using instances you trust.

I wonder if there is a checksum or something to verify what version of code an instance is using

[–] danc4498 8 points 2 years ago

True, open source is always better for the sake of transparency. Worth noting, though, that doesn't mean the server you are browsing on uses the exact build they say they are.
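On the checksum idea: a hash can confirm that a file you have in hand matches a published release, but it cannot prove what a remote server is actually running, because the server can report anything it likes. A minimal sketch of the first half, with made-up bytes standing in for a release artifact:

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_published(data: bytes, published_hex: str) -> bool:
    """Compare a local artifact against a digest published by the project."""
    return hmac.compare_digest(sha256_hex(data), published_hex.lower())


# Hypothetical stand-ins: a "release" and a copy with something extra added.
release = b"pretend this is the official build"
published_digest = sha256_hex(release)
modified = release + b" plus a tracking script"

print(matches_published(release, published_digest))   # True
print(matches_published(modified, published_digest))  # False
```

This helps if you self-host and want to check your own download. For someone else's instance, it still comes down to trust.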

[–] Nioxic 2 points 2 years ago

Depends what you comment throughout your time here

[–] sauron 7 points 2 years ago

I have yet to go through it all myself but from what I've seen of the Lemmy code it seems pretty straightforward. I doubt anything is being tracked other than what is required. Obviously the server sees your IP address, since it needs it to send traffic back to you. Beyond that there's your username and all the info you put on your profile or in posts, the list of liked/disliked posts, subscribed or blocked communities and people, perhaps metadata of any photos or videos you upload, the package name for whatever mobile app you use, etc.

All the code is available on GitHub for you to check out if you'd like, 80% of it is written in Rust. But I am looking through it myself to see what kind of privacy I can expect from Lemmy. It's already ahead of Reddit though, where I couldn't view the source code and just had to trust what the company said.

[–] solrize 4 points 2 years ago* (last edited 2 years ago) (4 children)

I installed Jerboa and noticed that it grayed the titles of posts that I had already viewed even though I had viewed them on the web. That told me (unless I am somehow confused) that the server side tracks what posts you have read.

From my perspective that seems like a terrible invasion. I can understand some benefit to showing the post status in the UI, but if it is stored at all, the storage should be exclusively on the client side. I mentioned this also in the "issues" thread and got no reaction, so maybe I'm missing something or in error.

[–] scottywh 13 points 2 years ago (1 children)

Reddit tracked posts viewed as well.

[–] solrize 4 points 2 years ago (1 children)

Thanks, that is good to know, but that is a type of evil where I would hope Lemmy doesn't follow Reddit. I sometimes posted to Reddit but I more often read passively without logging in, partly to avoid some of the tracking.

[–] scottywh 4 points 2 years ago (1 children)

Sure. I think it's good to be aware of for sure and I agree that it would be nice if Lemmy isn't tracking as much. I also recognize that I've accepted a certain amount of tracking in my life over the years at this point.

[–] solrize -1 points 2 years ago* (last edited 2 years ago) (2 children)

Tracking posts is understandable. Tracking up and down votes is iffy. Tracking reading is inappropriate and invasive.

[–] scottywh 4 points 2 years ago (1 children)

I don't disagree really... Just pointing out that as long as you're logged in Reddit has always tracked posts viewed as far as I'm aware... Facebook similarly tracks all activities and always has.

These are obviously not models to aspire to but I think that it's helpful to be aware of what we've dealt with up until this point.

[–] solrize 2 points 2 years ago

Facebook was notoriously evil and I actually have all their domains that I know of blocked from my computer at the DNS level in order to avoid their spying. That Reddit tracks posts viewed is new to me but I guess not that surprising. Usenet never tracked posts viewed and basically couldn't. Wikipedia emphatically doesn't track that, though it does track view counts. Arxiv.org doesn't track (or at least publish) view counts for individual papers (see here) though they do publish stats about the entire site. A real privacy focused site would avoid publishing any data about what viewers are doing. There is a whole topic in cryptography called private information retrieval about how to run a server in which the clients can verify that the server can't know what they are reading. This is what Lemmy should aspire to, imho. (Aspirations aren't meant to be achieved literally, but only to provide guidance).
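For a feel of what private information retrieval looks like, here is a toy version of a simple two-server XOR scheme. It assumes two servers that hold identical copies of the data and do not collude, and records of equal length. All names are made up for the example, and it is nowhere near a practical design for something like Lemmy.

```python
import secrets
from typing import List, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n: int, wanted: int) -> Tuple[List[int], List[int]]:
    """Client side: a random bit vector, and the same vector with the wanted index flipped."""
    query_a = [secrets.randbelow(2) for _ in range(n)]
    query_b = list(query_a)
    query_b[wanted] ^= 1
    return query_a, query_b


def server_answer(database: List[bytes], selection: List[int]) -> bytes:
    """Server side: XOR together every record whose selection bit is set."""
    answer = bytes(len(database[0]))
    for record, bit in zip(database, selection):
        if bit:
            answer = xor_bytes(answer, record)
    return answer


database = [b"post-000", b"post-001", b"post-002", b"post-003"]
query_a, query_b = make_queries(len(database), wanted=2)

# Each server sees only one uniformly random-looking bit vector.
answer_a = server_answer(database, query_a)
answer_b = server_answer(database, query_b)

# Every record selected by both queries cancels out, leaving only the wanted one.
record = xor_bytes(answer_a, answer_b)
print(record)  # b'post-002'
```

Neither server alone learns which post was fetched, but each one has to touch the whole database for every query, which is part of why this stays an aspiration.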

I may open a thread in /c/[email protected] asking about this, but the answer might be to launch my own Lemmy instance and retrieve all of the Lemmy posts so I can browse the ones that interest me without leaking any info. I'm sort of in a position to do that, but most people unfortunately aren't.

[–] aski3252 2 points 2 years ago (1 children)

What do you mean with "tracking" exactly? The way I understand it, tracking is analysing and using user data, for example for marketing purposes.

Posts and content need to be saved on the instance as far as I understand, I don't see any other way. And posts and comments are essentially public information, anyone can see the posts that your username posts and comments, that's kinda the entire point of posting and commenting.

Up and down votes too, otherwise I don't see how the concept of up and down votes could work. The server needs to know which comments or posts you upvote, otherwise it doesn't register it. And theoretically, the server admin could track that information and make statistics based on it, although this is potentially where legal issues come in if it's not properly explained what is done with your data.

Same with metadata stuff and data such as which posts you access/read. The server has to know that information, when you click on a post you want to read, you are essentially asking the server to provide you with that post, so the server has to know which post you want to read and this is generally logged on the server for a certain time.

The question is does the server keep and archive this information and/or is this information used and analysed by somebody.

According to the admin, data is not sold or used for marketing purposes.

[–] solrize 1 points 2 years ago* (last edited 2 years ago) (1 children)

Tracking of reads = when you read someone's post, there is a permanent record made, e.g. in a db row associated with the user, that @aski3252 read that post. That is somewhat different from normal httpd access logs that associate only with IP addresses and which typically get distilled down to aggregate data, and preferably discarded after a short period. Where I worked, we kept logs around for 30 days for stuff like abuse investigations but deleted them after that. In fact with a little careful design of the log data, or if the query is sent by HTTP POST instead of GET, the parameters that identify what you were reading will usually not be logged at all.
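The GET versus POST point can be sketched as follows. The function builds a line in the style of a common access log, which records the request line (method, path and query string) but not the request body. The exact format, the fixed timestamp and the `/post` path are assumptions for illustration, and a server can of course be configured to log more.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def access_log_line(
    ip: str,
    method: str,
    path: str,
    query: Optional[Mapping[str, object]] = None,
    body: Optional[Mapping[str, object]] = None,
) -> str:
    """Build a log line from the request line only; `body` is accepted but never written."""
    target = path + ("?" + urlencode(query) if query else "")
    return f'{ip} - - [25/Jun/2023:12:00:00 +0000] "{method} {target} HTTP/1.1" 200 512'


get_line = access_log_line("203.0.113.7", "GET", "/post", query={"id": 1234})
post_line = access_log_line("203.0.113.7", "POST", "/post", body={"id": 1234})

print(get_line)   # the post id appears in the logged URL
print(post_line)  # the post id travelled in the body, so the log does not show it
```

In both cases the server still knows which post was requested while handling it; the difference is only in what ends up sitting in the log file afterwards.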

It's not mostly an issue of selling data for marketing purposes. The data could also be extracted by cyber attackers, seized by law enforcement, subpoenaed in a lawsuit, or whatever. The only way to stop that from happening is to not retain the data in the first place. "Marketing purposes" is a smoke screen anyway. E.g. if you are a regular lurker on a community about workplace organizing or job hunting, that info will be more valuable to your boss than it will be to some advertiser or marketer. So the real customers of internet usage data (and phone records etc.) are far less benign than "marketing" organizations.

It is not necessary to record voting data except to prevent you from voting twice on a particular topic. So if voting closes (say on a poll), all the data about who voted in it can be deleted. There is also no need to remember HOW anyone voted. It's enough to remember that you voted on a particular topic, and increment the relevant vote counter. That is also how real-world elections work. See also the topic of "receipt-free voting" in cryptography.

I agree with you that if you actually publish something on the site, there is a certain amount of disclosure unavoidably associated with that.

[–] aski3252 1 points 2 years ago (1 children)

First of all, just to be clear, I'm not at all an expert on this topic for those who haven't noticed. My questions are mostly because I want to learn how it works, not because I want to tell you that you are wrong or anything like that. You seem to know a lot more than me anyway.

Tracking of reads = when you read someone’s post, there is a permanent log record made

When you read someone's post, you first need to access that information from the server. In order to do that, your client tells the server which post you want to see and the server sends you that post. Those interactions are most likely logged on the server, along with which IP address requested that information, etc. There is no absolutely sure way to make sure that the admin does not use those logs to extract that information; at the end of the day, it comes down to whether you trust the admin.

But there is also a "show read posts" option which seems to hide read posts overall, which does indeed suggest that read posts are saved and used and which seems to work independent of client.

It’s not mostly an issue of selling data for marketing purposes. The data could also be extracted by cyber attackers, seized by law enforcement, subpoenaed in a lawsuit, or whatever.

Sure, I do get the issue to some extent, but I don't see how it is quite as bad as you seem to imply. For example, I worry more about personal data, such as my e-mail address being leaked, which is why I generally use a throwaway email. I don't really see why I, or some attacker, should care about which posts I have "read", but maybe I don't understand the full implications of someone getting this information.

“Marketing purposes” is a smoke screen anyway.

Of course it is, but I don't think there are any lemmy instances that use lemmy data for marketing purposes. Data seems to be used only to improve the user experience, at least that's how it's intended.

It is not necessary to record voting data except to prevent you from voting twice on a particular topic.

If it wasn't logged or only logged client side you could upvote/downvote infinitely, no?

There is also no need to remember HOW anyone voted. It’s enough to remember that you voted on a particular topic, and increment the relevant vote counter. That is also how real-world elections work. See also the topic of “receipt-free voting” in cryptography.

That does seem to be a good point.

[–] solrize 1 points 2 years ago* (last edited 2 years ago) (1 children)

Yes, I understand how web servers work (I have implemented them) ;-). I've also been involved in abuse investigations that involved crunching of 100s of GB of raw logs. If I wanted to figure out what posts you had read based on raw http logs, it would be a big pain in the neck involving matching your user ID with IP addresses, and trying to match HTTP queries with posts, where the relevant log entries were scattered through billions of similar entries from other people. Last time I did something like that, the analysis took about 15 hours on a quite big server, though that particular task also had to find groups of queries corresponding to login sessions. While if there's a database table that identifies every post that has been read by every user, all I have to do is type some SQL and the info comes up immediately.
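To make the "type some SQL" point concrete, here is a sketch with a hypothetical per-user read table. The table and column names are invented for the example and are not claimed to match Lemmy's schema.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post_read (person_id INTEGER, post_id INTEGER)")
conn.executemany(
    "INSERT INTO post_read VALUES (?, ?)",
    [(7, 101), (7, 205), (8, 101), (7, 309)],
)
conn.commit()


def posts_read_by(conn: sqlite3.Connection, person_id: int) -> List[int]:
    """One indexed lookup replaces hours of correlating raw access logs."""
    rows = conn.execute(
        "SELECT post_id FROM post_read WHERE person_id = ? ORDER BY post_id",
        (person_id,),
    ).fetchall()
    return [post_id for (post_id,) in rows]


print(posts_read_by(conn, 7))  # [101, 205, 309]
```

Whoever gets hold of such a table, by admin access, breach or legal demand, gets the full reading history with a single query.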

As for the invasiveness of that info, don't you have any private life at all? Are you pro-XYZ about some political question while your boss is rabidly anti-XYZ? You probably don't want him to know what you're reading. Same if you're getting sued by someone trying to dig up dirt on you, or if you are running for some kind of office (look at all the NSFW content aski3252 reads on Lemmy! Sinner!!!, etc). Or say you are in a country where some dictator gets into power and decides to round up all the Star Trek fans. You suspected something like that was coming, so you carefully avoided posting in the Star Trek communities, but unfortunately you were reading them and now you have been found out. Just use your imagination ;).

Re voting, let's say there is a poll "Is Spez an idiot? Vote yes/no, poll closes on July 1", and you vote in it. To stop you from voting twice, the server must remember until July 1 that you voted, but not how you voted. After July 1, it is impossible to vote again, so the info that you voted at all can be deleted. What currently happens instead seems to be that "aski3252 voted yes" is retained forever. There are some minor UI benefits to that, so I described it as iffy rather than outright evil. If it were up to me though, I would minimize the amount of info kept.
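The poll described here could be sketched like this: the server keeps a tally and a set of who has voted, never how, and throws the set away when the poll closes. The `Poll` class is a hypothetical illustration and not the cryptographic "receipt-free voting" mentioned earlier.

```python
from typing import Dict, Iterable, Set


class Poll:
    """Counts votes while remembering only THAT a user voted, not HOW."""

    def __init__(self, choices: Iterable[str]) -> None:
        self.tally: Dict[str, int] = {choice: 0 for choice in choices}
        self.voters: Set[str] = set()
        self.closed = False

    def vote(self, user: str, choice: str) -> bool:
        """Record a vote; return False if the poll is closed or the user already voted."""
        if choice not in self.tally:
            raise ValueError(f"unknown choice: {choice!r}")
        if self.closed or user in self.voters:
            return False
        self.tally[choice] += 1
        self.voters.add(user)
        return True

    def close(self) -> None:
        """After closing nobody can vote again, so the voter list is no longer needed."""
        self.closed = True
        self.voters.clear()


poll = Poll(["yes", "no"])
first = poll.vote("aski3252", "yes")   # True
second = poll.vote("aski3252", "no")   # False, already voted
third = poll.vote("solrize", "yes")    # True
poll.close()
print(poll.tally, poll.voters)  # {'yes': 2, 'no': 0} set()
```

The trade-off is that a vote cannot be changed or undone later, since the server no longer knows which way it went; that is one of the UI benefits the current approach buys.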

[–] faltuuser 1 points 2 years ago

Very Informative 🧵

[–] aski3252 6 points 2 years ago (2 children)

I noticed previously that stuff I read in the browser does not show up as read in the mobile app. I also just tested it with different browsers and as far as I can see, read posts are marked as unread when I use another browser.

So are you actually sure about your claim? This is very easily testable, so I hope you have actually confirmed this before you accuse lemmy of participating in a "terrible invasion" of privacy..

[–] solrize 1 points 2 years ago* (last edited 2 years ago) (1 children)

I have said several times that I am not completely sure. I will see if I can do some better tests. It is something that I noticed when I installed Jerboa, so I asked about it, and people seemed to confirm that there was server side tracking.

Anyway, even if it's confirmed, unless there is deception involved (which I have no reason to suspect), there's not much of an "accusation" to be made. I would say, in the event that individual post views really are saved on the server, that Lemmy's designers made a policy choice that I don't agree with. I'd call that a description rather than an accusation. I'd try to open a discussion about getting the decision changed. If that didn't succeed, I'd look for technical workarounds and/or limit my reading on the site.

[–] aski3252 5 points 2 years ago (1 children)

Thanks for your clarification.

[–] solrize 1 points 2 years ago* (last edited 2 years ago)

No prob, and I'll go a little further, from having seen this kind of thing many times before. Lots of times these info leaks happen because it was technically convenient or somehow useful to do X, without thinking through the privacy implications. Security vulnerabilities happen the same way. People just want to get their thing done with minimum fuss, rather than dithering around weighing complicated tradeoffs. So X is not explicitly a policy decision at all, but instead is a technical decision that turns out to have policy implications.

I'm a security developer so I have to be attuned to this kind of thing, but I miss stuff too, as does everyone. Most of the time nobody is being "bad". They are just trying to ship product in a complicated environment full of subtle interactions, and it is easy to overlook stuff, especially if you haven't already spent a lot of time dealing with those same issues.

[–] AFKBRBChocolate 0 points 2 years ago

Since there's a setting in your user profile to show or not show read posts, it's clearly something tracked.

[–] [email protected] 4 points 2 years ago (1 children)

I can't say that the backends don't track that for sure because I haven't looked at the source or anything. But keeping a history is something very commonly done in the client. Just like Web browsers.

[–] solrize 1 points 2 years ago* (last edited 2 years ago)

Right, what I saw (unless I'm mistaken which is possible) was reading posts on one client (Firefox browser on my laptop computer) and then seeing the read posts marked on a completely different client (Jerboa on my phone). That means the info must have somehow been communicated between the two clients. Suspicion points to the server. I will ask on /c/[email protected] about this and/or look at the code base.

[–] Myriadblue 2 points 2 years ago (1 children)

Isn't that a browser thing, not a lemmy thing? Iirc, your jerboa history shows up in your default browser

[–] solrize 4 points 2 years ago (1 children)

I browsed and posted on Lemmy for a while through a desktop browser on my laptop, then installed Jerboa on my phone and started playing with it, and immediately noticed that posts I had previously read through the browser were marked in Jerboa. The only ways Jerboa could have gotten that info are: 1) the server recorded the info from the browser and relayed it to Jerboa, or 2) I was confused somehow and had also read those posts through Jerboa.

#2 above is something of a possibility, but that still leaves #1 as a suspicion that hasn't been dispelled. I was hoping that someone familiar with the implementation would comment.

[–] Myriadblue 1 points 2 years ago

Interesting! I've only used lemmy on mobile, so haven't had that experience.

[–] zeppo 4 points 2 years ago

The unfortunate thing is that eventually, like how it happens in mobile apps, owners of popular lemmy instances will be contacted by marketers/data harvesters with offers like "hey, we'll give you $50,000 to install this data harvesting code on your site", which is difficult to turn down for many people.