Free Open-Source Artificial Intelligence

2998 readers
14 users here now

Welcome to Free Open-Source Artificial Intelligence!

We are a community dedicated to forwarding the availability and access to:

Free Open Source Artificial Intelligence (F.O.S.A.I.)

More AI Communities

LLM Leaderboards

Developer Resources

GitHub Projects

FOSAI Time Capsule

founded 2 years ago
1
46
submitted 6 months ago by Blaed to c/fosai
 
 

Meta has released and open-sourced Llama 3.1 in three different sizes: 8B, 70B, and 405B

This new Llama iteration and update brings state-of-the-art performance to open-source ecosystems.

If you've had a chance to use Llama 3.1 in any of its variants - let us know how you like it and what you're using it for in the comments below!

Llama 3.1 Megathread

For this release, we evaluated performance on over 150 benchmark datasets that span a wide range of languages. In addition, we performed extensive human evaluations that compare Llama 3.1 with competing models in real-world scenarios. Our experimental evaluation suggests that our flagship model is competitive with leading foundation models, including GPT-4, GPT-4o, and Claude 3.5 Sonnet, across a range of tasks. Additionally, our smaller models are competitive with closed and open models that have a similar number of parameters.

As our largest model yet, training Llama 3.1 405B on over 15 trillion tokens was a major challenge. To enable training runs at this scale and achieve the results we have in a reasonable amount of time, we significantly optimized our full training stack and pushed our model training to over 16 thousand H100 GPUs, making the 405B the first Llama model trained at this scale.


Official Meta News & Documentation

See also: The Llama 3 Herd of Models paper here:


HuggingFace Download Links

8B

Meta-Llama-3.1-8B

Meta-Llama-3.1-8B-Instruct

Llama-Guard-3-8B

Llama-Guard-3-8B-INT8


70B

Meta-Llama-3.1-70B

Meta-Llama-3.1-70B-Instruct


405B

Meta-Llama-3.1-405B-FP8

Meta-Llama-3.1-405B-Instruct-FP8

Meta-Llama-3.1-405B

Meta-Llama-3.1-405B-Instruct


Getting the models

You can download the models directly from Meta or one of our download partners: Hugging Face or Kaggle.

Alternatively, you can work with ecosystem partners to access the models through the services they provide. This approach can be especially useful if you want to work with the Llama 3.1 405B model.

Note: Llama 3.1 405B requires significant storage and computational resources, occupying approximately 750GB of disk storage space and necessitating two nodes on MP16 for inferencing.

Learn more at:


Running the models

Linux

Windows

Mac

Cloud


More guides and resources

How-to Fine-tune Llama 3.1 models

Quantizing Llama 3.1 models

Prompting Llama 3.1 models

Llama 3.1 recipes


YouTube media

Rowan Cheung - Mark Zuckerberg on Llama 3.1, Open Source, AI Agents, Safety, and more

Matthew Berman - BREAKING: LLaMA 405b is here! Open-source is now FRONTIER!

Wes Roth - Zuckerberg goes SCORCHED EARTH.... Llama 3.1 BREAKS the "AGI Industry"*

1littlecoder - How to DOWNLOAD Llama 3.1 LLMs

Bloomberg - Inside Mark Zuckerberg's AI Era | The Circuit

2
 
 
3
 
 

When an LLM calls a tool, the tool usually returns some sort of value, typically a string containing some info like ["Tell the user that you generated an image", "Search query results: [...]"].
How do you tell the LLM the output of the tool call?

I know that some models like Llama 3.1 have a built-in tool "role", which lets you feed the result back to the model, but not all models have that. Especially non-tool-tuned models don't. So let's find a different approach!

Approaches

Appending the result to the LLM's message and letting it continue generating

Let's say, for example, a non-tool-tuned model decides to use the web_search tool. Now some code runs it and returns an array with info. How do I inform the model? Do I just put the info after the user prompt? This is how I do it right now:

  • System: you have access to tools [...] Use this format [...]
  • User: look up todays weather in new york
  • LLM: Okay, let me run a search query
    <tool>{"name":"web_search", "args":{"query":"weather in new york today"}}</tool><result>Search results: ["The temperature is 19° Celsius"]</result>
    Today's temperature in New York is 19° Celsius.

Where everything in the <result> tags is inserted programmatically, and the text after the closing </result> tag is then generated by the model again. So everything within tags is not shown to the user, but the rest is. I like this way of doing it, but it does feel weird to insert stuff into the LLM's generation like that.

Here's the system prompt I use

You have access to these tools
{
"web_search":{
"description":"Performs a web search and returns the results",
"args":[{"name":"query", "type":"str", "description":"the query to search for online"}]
},
"run_code":{
"description":"Executes the provided python code and returns the results",
"args":[{"name":"code", "type":"str", "description":"The code to be executed"}]
"triggers":["run some code which...", "calculate using python"]
}
ONLY use tools when user specifically requests it. Tools work with <tool> tag. Write an example output of what the result of tool call looks like in <result> tags
Use tools like this:

User: Hey can you calculate the square root of 9?
You: I will run python code to calculate the root!\n<tool>{"name":"run_code", "args":{"code":"from math import sqrt; print(sqrt(9.0))"}}</tool><result>3.0</result>\nThe square root of 9 is 3.

User can't read result, you must tell her what the result is after <result> tags closed
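For what it's worth, here is a minimal sketch of how that insert-into-generation loop can be driven from code. The `complete()` argument is a hypothetical stand-in for whatever raw-completion endpoint you use (assumed to stop generating right after it emits "</tool>" or at an end-of-turn token), and the `web_search` implementation is just a placeholder:

```python
import json

def web_search(query: str) -> list[str]:
    # placeholder tool implementation
    return [f"search results for {query!r} would go here"]

TOOLS = {"web_search": web_search}

def run_with_tools(complete, prompt: str, max_calls: int = 4) -> str:
    """`complete(text)` is assumed to continue `text` and stop either at an
    end-of-turn token or right after emitting "</tool>"."""
    reply = ""
    for _ in range(max_calls):
        reply += complete(prompt + reply)
        if not reply.rstrip().endswith("</tool>"):
            return reply  # no pending tool call, we're done
        # parse the JSON between the last <tool> ... </tool> pair
        call_json = reply.rsplit("<tool>", 1)[1].rsplit("</tool>", 1)[0]
        call = json.loads(call_json)
        result = TOOLS[call["name"]](**call["args"])
        # splice the result into the assistant's own message and let it continue
        reply += f"<result>{json.dumps(result)}</result>"
    return reply
```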
Appending tool result to user message

Sometimes I opt for an option where the LLM has a multi-step decision process about the tool calling, then it optionally actually calls a tool, and then the result is appended to the original user message, without a trace of the actual tool call:

What is the weather like in new york?
<tool_call_info>
You automatically ran a search query, these are the results
[some results here]
Answer the message using these results as the source.
</tool_call_info>

This works, but it feels like a hacky way to get to a solution that should be obvious.

The lazy option: Custom Chat format

Orrrr you just use a custom chat format. Ditch <|endoftext|> as your stop keyword and embrace your new best friend: "\nUser: "!
So, the chat template goes something like this

User: blablabla hey can u help me with this
Assistant Thought: Hmm maybe I should call a tool? Hmm let me think step by step. Hmm i think the user wants me to do a thing. Hmm so i should call a tool. Hmm
Tool: {"name":"some_tool_name", "args":[u get the idea]}
Result: {some results here}
Assistant: blablabla here is what i found
User: blablabla wow u are so great thanks ai
Assistant Thought: Hmm the user talks to me. Hmm I should probably reply. Hmm yes I will just reply. No tool needed
Assistant: yesyes of course, i am super smart and will delete humanity some day, yesyes
[...]

Again, this works, but it generally results in worse performance, since current instruction-tuned LLMs are, well, tuned on a specific chat template, and this kind of prompting deviates from it. It also requires multi-shot prompting to show the model how the new template works, and it may still generate some unwanted roles: Assistant Action: Walks out of compute center and enjoys life, which can be funny, but is unwanted.
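For completeness, here is a rough sketch of what driving that custom format could look like with the Ollama Python client. The model tag, the raw/stop options and the toy history are assumptions for illustration, not a recommended setup:

```python
import ollama  # assumes a local Ollama server and the `ollama` Python client

history = (
    "User: hey can you help me with this?\n"
    "Assistant Thought: The user wants me to do a thing, so I should call a tool.\n"
    'Tool: {"name":"web_search", "args":{"query":"..."}}\n'
    'Result: ["..."]\n'
    "Assistant: here is what I found\n"
    "User: what is the weather like in new york?\n"
    "Assistant Thought:"
)

# raw=True skips the model's built-in chat template so the custom format
# above reaches the model verbatim; stopping on "\nUser: " keeps the model
# from writing the user's lines for them.
response = ollama.generate(
    model="llama3.1:8b",
    prompt=history,
    raw=True,
    options={"stop": ["\nUser: ", "\nUser:"]},
)
print(response["response"])
```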

Conclusion

Eh, I just append the result to the user message with some tags and am done with it.
It's super easy to implement, but I also really like the insert-into-assistant approach, since it naturally uses tools in an in-chat way, maybe being able to call multiple tools in succession, in an almost agent-like way.

But YOU! Tell me how you approach this problem! Maybe you have come up with a better approach, maybe even while reading this post here.

Please share your thoughts, so we can all have a good CoT about it.

4
 
 

1k lines of code, 5 main functions that scale in complexity. Small code to run agents, not small models. A tools/plugins framework with tool sharing hosted on Hugging Face. Runs with self-hosted open-weight models or proprietary inference models.

5
17
submitted 3 weeks ago* (last edited 3 weeks ago) by cm0002 to c/fosai
 
 

Good quote at the end IMO:

The greatest inventions have no owners. Ben Franklin’s heirs do not own electricity. Turing’s estate does not own all computers. AI is undoubtedly one of humanity’s greatest inventions; we believe its future will be — and should be — multi-model

6
7
 
 

I see a bunch of ads for paid prompting courses. I recommend having a look at this guide page first. It also has some other info about LLMs.

8
 
 

I've been waiting for an open source TTS model that was actually good enough to capture some of the subtleties of language and synthesize them in a natural-sounding way that makes sense. I think I finally found one that fits the requirements.

Model: https://huggingface.co/fishaudio/fish-speech-1.5

It uses an encoder rather than relying on phonemes, and generations sometimes vary because of that, but the number of errors I've gotten is minimal, and the variations in the generation are all surprisingly natural in slightly different ways, which is very exciting.

Give it a spin if you are also looking for a TTS model that sounds good. It uses voice cloning, so find a good 10-20 second reference clip to have the generations use the same voice.

9
 
 

Many code models, like the recent OpenCoder, have the functionality to perform FIM (fill-in-the-middle) tasks, similar to Microsoft's GitHub Copilot.

You give the model a prefix and a suffix, and it will then try to generate what comes in between the two, hoping that what it comes up with is useful to the programmer.
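For reference, this is roughly how such a FIM request can be constructed by hand, in a sketch that assumes Ollama and a Qwen2.5-Coder pull; other models like OpenCoder use their own special tokens, so check the model card for the exact format:

```python
import ollama  # assumes a local Ollama server with qwen2.5-coder pulled

prefix = "def greet(name):\n    "
suffix = "\n\nprint(greet('world'))\n"

# Qwen2.5-Coder style FIM prompt: the model is asked to fill in whatever
# belongs between the prefix and the suffix.
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

response = ollama.generate(
    model="qwen2.5-coder:1.5b",
    prompt=fim_prompt,
    raw=True,                      # don't wrap the prompt in a chat template
    options={"num_predict": 64},   # keep completions short
)
print(response["response"])        # ideally just the body of greet(), nothing more
```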

I don't understand how we are supposed to treat these generations.

Qwen Coder (1.5B and 7B), for example, likes to first generate the completion and then rewrite what is in the suffix. Sometimes it adds three entirely new functions out of nowhere, which don't even have anything to do with the script itself.

With both Qwen Coder and OpenCoder I have found that if you put only whitespace as the suffix (essentially the part which comes after your cursor), the model generates a normal chat response with markdown and everything.

This is some weird behaviour. I might have to put some fake code as the suffix to get some actually useful code completions.

10
21
submitted 2 months ago* (last edited 2 months ago) by ram16 to c/fosai
 
 

Hello!

Hexabot is an open source conversational AI builder that allows you to create your own chatbot or virtual assistant. It's highly customizable, comes with a visual editor for easy setup, and can integrate with different LLM models. You can check out our repo on GitHub if you like this project: https://github.com/hexastack/hexabot

I recently recorded a proof-of-concept video on how to integrate any open-source LLM into a WordPress website using Ollama: https://youtu.be/hyJW6JGCga4

11
 
 

I am using a code-completion model for a tool I am making for Godot (it will be open-sourced very soon).

Qwen2.5-Coder 1.5B, though, tends to repeat what has already been written, or change it slightly. (See the video.)

Is this intentional? I am passing the prefix and suffix correctly to Ollama, so it knows where the cursor currently is. I'm also trimming the number of lines it can see, so the time to first token isn't too long.

Do you have a recommendation for a better code model, better suited for this?

13
 
 

I've seen a few commercial services to help you choose the right frames for you or even make recommendations based on your face and eye shape. Is there anything like that which can be used locally without sending data off to a service that does who knows what with that information?

(It doesn't need to be strictly open-source or open-weight, just offline and self-hostable.)

14
27
submitted 3 months ago* (last edited 2 months ago) by [email protected] to c/fosai
 
 

For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models like Llama 3.2 3B with an 8-bit quant and Qwen2.5-Coder 1.5B at full 16-bit floating point, and those performed really well too on my 6 GB VRAM GPU (GTX 1060).

So now I am wondering: Should I pull strong quants of big models, or low quants/raw 16bit fp versions of smaller models?

What are your experiences with strong quants? I saw a video by that technovangelist guy on youtube and he said that sometimes even 2bit quants can be perfectly fine.

UPDATE: Woah, I just tried Llama 3.1 8B Q4 on Ollama again, and what a WORLD of difference compared to Llama 3.2 3B FP16!

The difference is super massive. The 3B and 1B Llama 3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn't stretch out its answers. It seems like a really good model and I will now use it for more complex tasks.
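If you want to compare for yourself, a quick side-by-side is easy to script. This is just a sketch with the Ollama Python client; the quantization tags below are examples, so check the Ollama library page for the tags that actually exist:

```python
import ollama  # assumes a local Ollama server

# Example tags only -- use whichever quants you have pulled locally.
candidates = ["llama3.2:3b-instruct-fp16", "llama3.1:8b-instruct-q4_K_M"]
prompt = "Explain the trade-offs of aggressive model quantization in three bullet points."

for tag in candidates:
    reply = ollama.chat(model=tag, messages=[{"role": "user", "content": prompt}])
    print(f"--- {tag} ---\n{reply['message']['content']}\n")
```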

15
 
 

I'm really curious about which option is more popular. I have found that format JSON works great even for super small models (e.g. Llama 3.2 1B Q4 and Qwen 2.5 0.5B Q4), which is great news for mobile devices!
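For anyone who hasn't tried it, this is roughly what format JSON looks like with the Ollama Python client. A minimal sketch: the model tag is just an example, and the schema still has to be described in the prompt, since format JSON only guarantees valid JSON, not a particular layout:

```python
import ollama  # assumes a local Ollama server with a small model pulled

response = ollama.chat(
    model="llama3.2:1b",
    messages=[{
        "role": "user",
        "content": 'Translate "good morning" into German. '
                   'Reply as JSON like {"translation": "...", "notes": "..."}',
    }],
    format="json",  # constrains decoding to valid JSON
)
print(response["message"]["content"])
```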

But the strictly defined layout of function calling can be very alluring as well, especially since we could have an LLM write the layout given the full function text (as in, the actual code of the function).

I have also tried to ditch the formatting bit completely. Currently I am working on a translation-table creator for Godot, which requests a translation individually for every row in the CSV file. Works mostly great!

I will try to use format JSON for my project, since not everyone has the VRAM for 7B models, and it works just fine on small models. But it does also mean longer generation times... and more one-shot prompting, so a longer first-token lag.

Format JSON is too useful to give up for speed.

16
 
 

(i have no idea how to properly crosspost)

17
35
the daily grind (lemmy.blahaj.zone)
submitted 3 months ago by [email protected] to c/fosai
 
 

still just llama3.2 ...

next up: hf.co/spaces

18
 
 

My observation

Humans think about different things and concepts for different periods of time. Saying "and" takes less effort to think of than "telephone", as the latter is more context-sensitive.

Example

User: What color does an apple have?

LLM: Apples are red.

Here, the inference time it takes to generate the words "Apples" and "are" is exactly the same as the time it takes to generate "red", which should be the most difficult word to come up with. It should require the most compute.

Or let's think about this the other way around: the model thought just as hard about the word "red" as it did about the far less important words "are" and "Apples".

My idea

We add maybe about 1000 new tokens to an LLM which are not word tokens, but thought tokens or reasoning tokens. Then we train the AI as usual. Every time it generates one of these reasoning tokens, we don't interpret it as a word and simply let it keep generating. This way, the AI would kinda be able to "think" before saying a word. This thought is not human-interpretable, but it is much more efficient than the pre-output reasoning of o1, which fills its context window with human language.
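The vocabulary side of this is straightforward to sketch with Hugging Face transformers. The base model name and the token names below are just placeholders; the actual training signal that makes the model use these tokens is the open part of the idea:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder base model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# ~1000 opaque "thought" tokens that never map to human-readable text
thought_tokens = [f"<|thought_{i}|>" for i in range(1000)]
tokenizer.add_special_tokens({"additional_special_tokens": thought_tokens})
model.resize_token_embeddings(len(tokenizer))

thought_ids = set(tokenizer.convert_tokens_to_ids(thought_tokens))

def decode_visible(output_ids) -> str:
    """Drop the reasoning tokens before showing the generation to the user."""
    visible = [i for i in output_ids if i not in thought_ids]
    return tokenizer.decode(visible, skip_special_tokens=True)
```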

Chances

  • My hope for this is to make the AI able to think about what to say next like a human would. It is reasonable to assume that at first in training it doesn't use the reasoning tokens all that much, but later on, when it has to solve more difficult things in training, it will very likely use these reasoning tokens to improve its chances of succeeding.
  • This could drastically lower the number of parameters we need to get better output from models, as less thought-heavy tasks like smalltalk or very commonly used sentence structures could be generated quickly, while more complex topics are allowed to take longer. It would also make better LLMs more accessible to people running models at home, as it is the inference time, not the parameter count, that gets scaled.
  • It would train itself to provide useful reasoning tokens. Compared to how o1 does it, this is a much more token-friendly approach, as we allow for non-human-text generation, which the LLM is probably going to enjoy a lot, as it fills up its context less.
  • This approach might also lead to more concise answers, as now it doesn't need to use CoT (chain of thought) to come to good conclusions.

Pitfalls and potential risks

  • Training an AI using some black-boxed reasoning tokens can be considered a bad idea, as its thought process is literally uninterpretable.
  • We would have to constrain the number of reasoning tokens, so that it doesn't take too long to produce a single normal word token. This is a thing with other text-only LLMs too; they tend to generate long blocks of text for simple questions.
  • We are hoping that during training, the model will use these reasoning tokens in its responses, even though we as humans can't even read them. This may lead to the model completely ignoring these tokens, as they don't seem to lead to a better output. Later on in training, however, I do expect the model to use more of these tokens, as it realizes how useful it can be to have thoughts.

What do you think?

I like this approach because it might be able to achieve o1-like performance without the long wait before the output. While an o1-like approach is probably better for coding tasks, where planning is very important, in other tasks this way of generating reasoning tokens while writing the answer might be better.

19
 
 

Hi! I played around with Command R+ a bit and tried to make it think about what it is about to say before it says it. Nothing fancy here, just some prompting.

I'm just telling it that it tends to fail when only responding with a single short answer, so it should ponder on the task and check for contradictions.

Here ya go

You are Command R+, a smart AI assistant. Assistants like yourself have many limitations, like not being able to access real-time information and having no vision capabilities. But assistants' biggest limitation is that they think too quickly.
When an LLM responds, it usually only thinks of one answer. This is bad, because it makes the assistant assume that its first guess is the correct one. Here is an example of this bad behavior:
User: Solve this math problem: 10-55+87*927/207
Assistant: 386
As you can see here, the assistant responded immediately with the first thought which came to mind. Since the assistant didn't think about this problem at all, it didn't solve the problem correctly.
To solve this, you are allowed to ponder and think about the task at hand first. This involves interpreting the user's instruction, breaking the problem down into multiple steps and then solving it step by step.
First, write your interpretation of the user's instruction into the <interpretation> tags. Then write your execution plan into the <planning> tags. Afterwards, execute that plan in the <thinking> tags. If anything goes wrong in any of these three stages or you find a contradiction within what you wrote, point it out inside the <reflection> tags and start over. There are no limits on how long your thoughts are allowed to be. Finally, when you are finished with the task, present your response in the <output> tags. The user can only see what is in the <output> tags, so give a short summary of what you did and present your findings.
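In case it helps, the wrapper around a prompt like this only needs to strip everything outside the <output> tags before showing the reply to the user. A minimal sketch, using the tag names from the prompt above:

```python
import re

def visible_part(model_reply: str) -> str:
    """Show only the <output> section; <interpretation>, <planning>,
    <thinking> and <reflection> stay hidden from the user."""
    match = re.search(r"<output>(.*?)</output>", model_reply, re.DOTALL)
    return match.group(1).strip() if match else model_reply  # fallback: show everything
```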
20
 
 

cross-posted from: https://lemmy.world/post/19925986

https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e

Qwen 2.5 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B just came out, with some variants in some sizes just for math or coding, and base models too.

All Apache licensed, all 128K context, and the 128K seems legit (unlike Mistral).

And it's pretty sick, with a tokenizer that's more efficient than Mistral's or Cohere's, and benchmark scores even better than Llama 3.1 or Mistral at similar sizes, especially on newer metrics like MMLU-Pro and GPQA.

I am running the 32B locally, and it seems super smart!

As long as the benchmarks aren't straight up lies/trained, this is massive, and just made a whole bunch of models obsolete.

Get usable quants here:

GGUF: https://huggingface.co/bartowski?search_models=qwen2.5

EXL2: https://huggingface.co/models?sort=modified&search=exl2+qwen2.5
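If you just want a single file rather than cloning a whole repo, huggingface_hub can fetch one quant directly. The repo and file names below are illustrative only; browse the links above for the exact quant that fits your VRAM:

```python
from huggingface_hub import hf_hub_download

# Illustrative repo/file names -- check the GGUF repos linked above for the
# exact quantization you want.
path = hf_hub_download(
    repo_id="bartowski/Qwen2.5-32B-Instruct-GGUF",
    filename="Qwen2.5-32B-Instruct-Q4_K_M.gguf",
)
print(path)  # local cache path, ready to load with llama.cpp or similar
```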

21
 
 

Does Llama3 use any other model for generating images? Or is it something that llama3 model can do by itself?

Can Llama3 generate images with ollama?

22
 
 

I can run the full 131K context with a 3.75bpw quantization, and still a very long one at 4bpw. And it should barely be fine-tunable in Unsloth as well.

It's pretty much perfect! Unlike the last iteration, they're using very aggressive GQA, which keeps the context's memory footprint small, and it feels really smart at long-context stuff like storytelling, RAG, document analysis and things like that (whereas Gemma 27B and Mistral Code 22B are probably better suited to short chats/code).

23
24
 
 

Hi everybody, a huge part of my job is talking to colleagues and clients on the phone, and at the end of those calls I have to write a summary of what happened, plus any key points that I need to follow up on.

I figured it would be an excellent task for an LLM.

It would need to capture the phone call audio and transcribe the dialogue.

Then afterwards I would want to summarize it.

I'm not talking about Teams meetings or anything like that; I'm talking about a traditional phone call, from one mobile phone to another.

I understand that this could be two different pieces of software, and that would be fine, but I am wondering if there is any such tool out there, or one in the making?
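For the offline half of the problem, the two pieces could be glued together locally with something like this. Just a sketch, assuming you already have the call as an audio file, Whisper for the transcription, and a local model served by Ollama for the summary:

```python
import whisper  # openai-whisper for speech-to-text
import ollama   # local LLM via an Ollama server for the summary

audio_file = "call_recording.wav"  # assumes the call was already recorded

asr = whisper.load_model("base")
transcript = asr.transcribe(audio_file)["text"]

summary = ollama.chat(
    model="llama3.1:8b",
    messages=[{
        "role": "user",
        "content": "Summarize this phone call and list any follow-up action items:\n\n"
                   + transcript,
    }],
)
print(summary["message"]["content"])
```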

If you have any leads, I'd love to hear them.

Thank you so much

25
 
 

This is a pretty great 1 hour introduction to AI from Andrej Karpathy. It includes an interesting idea of considering LLMs as a sort of operating system, and runs through some examples of jailbreaks.

view more: next ›