this post was submitted on 31 Aug 2024
19 points (74.4% liked)

Open Source


cross-posted from: https://discuss.tchncs.de/post/21298994

I'm trying to feel more comfortable using random GitHub projects, basically.

top 11 comments
[–] Static_Rocket 10 points 2 months ago (2 children)

You would first need to define malicious code within the context of that repo. To some people, telemetry is malicious.

[–] [email protected] 1 points 2 months ago

@Static_Rocket
@unknowing8343

Under the GDPR, any data processing must be proportional to its goal, the goal must be transparent and justified, and the processing must be limited to that goal. Telemetry is perfectly fine if you keep to those rules and malicious if you don't. It's that simple. And no, this can't be judged by looking at the repo; it's the deployment that matters. Nonetheless, some code is malicious no matter how it's deployed, and some code should only be deployed with care. It would be good to scan for those.

[–] [email protected] 0 points 2 months ago

Yes, of course, the idea would be something like passing the AI a repo link and a prompt like "this repo is supposed to be used for X, tell me if you find anything weird that doesn't fit that purpose".
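To make the idea concrete, here's a rough sketch of how you might bundle a repo's files into a review prompt like that. Everything here is hypothetical (the function name, the prompt wording, the size budget); it assumes a local checkout of the repo and only builds the prompt string, without calling any particular LLM API:

```python
import os

def build_review_prompt(repo_dir, stated_purpose, max_chars=12000):
    """Collect source files from a local repo checkout and wrap them
    in a review prompt for an LLM. Stops adding files once a rough
    character budget (a stand-in for the model's context limit) is hit."""
    parts = [
        f"This repository is supposed to: {stated_purpose}.",
        "Flag anything that does not fit that purpose "
        "(unexpected network calls, file access, obfuscated strings, etc.).",
    ]
    used = sum(len(p) for p in parts)
    for root, _dirs, files in os.walk(repo_dir):
        for name in sorted(files):
            if not name.endswith(".py"):  # illustrative filter only
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            snippet = f"\n--- {path} ---\n{text}"
            if used + len(snippet) > max_chars:
                return "\n".join(parts)  # budget exhausted, stop here
            parts.append(snippet)
            used += len(snippet)
    return "\n".join(parts)
```

The budget cutoff is exactly where the token-limit objection raised below bites: on any non-trivial project, most of the codebase won't fit.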

[–] [email protected] 10 points 2 months ago (1 children)

Huh. That's actually kind of a clever use case. I hadn't considered that. I presume the main obstacle would be the token limit of whatever LLM is being used (presuming an LLM is used at all). Analyzing an entire codebase would, depending on the project, likely require an enormous number of tokens that an LLM couldn't handle, or it would just be prohibitively expensive. To be clear, I'm not saying such an LLM doesn't exist (one very well could), but if it doesn't, then that's the rationale I would currently stand behind.

[–] [email protected] 3 points 2 months ago

I understand, but I wouldn't be surprised to see a solution out there that could feed the AI chunks of code without full context... It may still be able to detect: "hey, you told me this software is supposed to do X, and here it seems to be doing Y".

I guess we'll have to wait a couple of years for these tools to be accessible and affordable.
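The chunking idea above can be sketched very simply. This is a minimal, assumed approach (line-based chunks with a small overlap so context isn't lost at the boundaries), not how any particular tool actually does it:

```python
def chunk_source(text, chunk_lines=120, overlap=20):
    """Split source code into overlapping line-based chunks so each
    piece fits a small context window; the overlap carries a little
    context across chunk boundaries."""
    lines = text.splitlines()
    step = chunk_lines - overlap
    chunks = []
    for start in range(0, max(len(lines), 1), step):
        chunks.append("\n".join(lines[start:start + chunk_lines]))
        if start + chunk_lines >= len(lines):
            break  # last chunk reached the end of the file
    return chunks
```

Each chunk would then be sent to the model separately, together with the stated purpose of the repo, trading whole-program context for fitting the window.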

[–] [email protected] 3 points 2 months ago (1 children)

Probably not. Obfuscation works, and the malicious behaviour might even depend on remote code downloaded at build time or run time.

There are a lot of heuristics you can use (e.g. disallowing certain functions or modules) to check a codebase, but those already exist, no AI required. Unless you call static analysis "AI", who knows.
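The "disallowing some functions" heuristic is easy to illustrate. Here's a tiny static-analysis pass using Python's standard `ast` module; the deny list is purely illustrative (real tools like Bandit or Semgrep use far richer rule sets):

```python
import ast

# Illustrative deny list; real SAST rule sets are much larger.
DISALLOWED = {"eval", "exec", "__import__"}

def flag_disallowed_calls(source):
    """Walk the AST of some Python source and report calls to names
    on the deny list, together with their line numbers."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id        # plain call: eval(...)
            elif isinstance(func, ast.Attribute):
                name = func.attr      # method call: builtins.eval(...)
            else:
                continue
            if name in DISALLOWED:
                findings.append((name, node.lineno))
    return findings
```

For example, `flag_disallowed_calls("x = eval('1+1')")` flags the `eval` on line 1. This is exactly the kind of check that works today with no AI involved, and also exactly what obfuscation or runtime-downloaded code sidesteps.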

[–] [email protected] 1 points 2 months ago (2 children)

But an AI can "realise" the code might be downloading something it doesn't need to. That's the point.

AI is "smart" in the sense that it understands you told it the library is supposed to do something specific, and it can look for things that don't seem correlated with the purpose of the repo.

[–] [email protected] 4 points 2 months ago (1 children)

If you're one of those people who think every product is better with "AI" on the box, then sure. What you're describing is static analysis, though; it is not new.

[–] [email protected] 1 points 2 months ago (1 children)
[–] [email protected] 2 points 2 months ago

GitLab has a SAST tool.
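For reference, GitLab's built-in SAST scanning is enabled by including its CI template in the project's `.gitlab-ci.yml` (template path as documented by GitLab; check the docs for your GitLab version):

```yaml
# .gitlab-ci.yml: pulls in GitLab's SAST scanning jobs
include:
  - template: Security/SAST.gitlab-ci.yml
```

With that in place, SAST jobs run in the pipeline and report findings on merge requests.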

[–] [email protected] 2 points 2 months ago

It's got a dataset of billions of tokens; you're better off running the stock market as an antivirus.

Instead, if you care, use programs specifically curated for the task, like antiviruses.