AI / Machine Learning

https://duckai.org/

cross-posted from: https://lemmy.intai.tech/post/134262

DuckAI is an open and scalable academic lab and open-source community working on various Machine Learning projects. Our team consists of researchers from the Georgia Institute of Technology and beyond, driven by our passion for investigating large language models and multimodal systems.

Our current work focuses on building and analyzing a variety of dataset projects, with the aim of understanding the depth and performance of these models across diverse domains.

Our objective is to welcome people with a variety of backgrounds to cutting-edge ML projects and rapidly scale up our community to make an impact on the ML landscape.

We are particularly devoted to open-sourcing datasets that can turn into an important infrastructure for the community and exploring various ways to improve the design of foundation models.

cross-posted from: https://lemmy.intai.tech/post/133548

https://arxiv.org/pdf/1706.03762.pdf

Attention Is All You Need

By Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Word count: 4221

Estimated read time: 17 minutes

Summary: This paper proposes a new neural network architecture called the Transformer that is based solely on attention mechanisms, without using sequence-aligned RNNs or convolutions. The Transformer achieves state-of-the-art results in machine translation while being more parallelizable and requiring significantly less time to train. Key contributions:

Proposes multi-head self-attention as a replacement for recurrence and convolutions in encoder-decoder architectures. Self-attention connects all positions with a constant number of sequentially executed operations, whereas recurrent layers require O(n) sequential operations.

Introduces scaled dot-product attention, in which the dot products are scaled by 1/√d_k. Without this scaling, dot-product attention underperforms additive attention for large key dimensions because the softmax saturates and gradients become very small; with it, dot-product attention is as accurate while being faster and more space-efficient.
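
As a rough illustration of what the summary is describing, here is a minimal NumPy sketch (not the paper's reference implementation; masking and the multi-head split are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # attention-weighted values

# Toy self-attention example: 4 tokens, d_k = d_v = 8, so Q = K = V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)    # (4, 8)
```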

Employs positional encodings instead of recurrence to enable the model to make use of sequence order. Shows that learned positional embeddings can replace sinusoids with negligible loss in quality.
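
The sinusoidal encodings referred to here follow the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a small illustrative sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions get sine, odd dimensions get cosine, with wavelengths
    # forming a geometric progression from 2*pi to 10000*2*pi.
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # (d_model / 2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16) -- added to the token embeddings before the first layer
```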

Achieves state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translation at a fraction of the training cost of previous models, and generalizes well to English constituency parsing, outperforming most previously reported models even with limited training data.

The Transformer's reliance on attention and positional encodings rather than recurrence makes it very promising for parallelization and scaling to longer sequences. The results demonstrate the potential of attention-based models to supplant RNNs and CNNs in sequence transduction tasks.

Evaluation: The Transformer architecture presents several advantages for using large language models and generative adversarial networks:

The Transformer is highly parallelizable since it does away with sequence-aligned RNNs. This makes it very suitable for scaling up with more parameters and data.

The multi-head self-attention provides a way to jointly attend to information from different representation subspaces at different positions, allowing modeling of dependencies regardless of distance. This is useful for long-range dependencies in large contexts.

Positional encodings allow the model to make use of sequence order without recurrence. This can enable generating coherent, ordered outputs in GANs and large LMs.

The Transformer achieves excellent results with limited training data, suggesting its representations transfer well. This is promising for few-shot learning and fine-tuning large LMs.

The paper provides useful analysis into the roles different attention heads learn, which can inform work on interpretable attention-based representations.

Overall, the Transformer architecture seems very promising as a foundation for large scale language modeling and GAN training. The representations it learns appear powerful yet transparent. The results on parsing suggest it can capture linguistic phenomena well. The parallelizability enables scaling. Much follow-on work has already adapted and refined the Transformer, making it very relevant today.

cross-posted from: https://lemmy.intai.tech/post/124795

Large Language Models as Tool Makers

Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

Word count: 4579 words

Estimated read time: 12 minutes

Source code: https://github.com/ctlllll/LLM-ToolMaker

Summary:

This paper proposes a framework called LLMs As Tool Makers (LATM) that enables large language models (LLMs) to create and utilize their own tools for solving complex reasoning tasks. The key idea is to separate the process into two stages - tool making and tool using. In the tool making stage, a powerful yet expensive LLM acts as the "tool maker" to generate reusable Python functions for solving demonstrations of a task. In the tool using stage, a lightweight and cost-effective LLM acts as the "tool user" to call these tools to solve new instances of the task.
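
As a rough sketch of this two-stage split (illustrative only, not the code from the linked repo; `call_llm`, the prompt wording, and the model names are assumptions):

```python
# Illustrative LATM sketch. `call_llm` is a hypothetical helper that sends a
# prompt to a named model and returns its text completion.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def make_tool(task_demonstrations: list[str]) -> str:
    """Tool making: a strong (expensive) model writes a reusable Python function."""
    prompt = (
        "Write a self-contained Python function solve(instance) that solves "
        "tasks like these examples:\n" + "\n".join(task_demonstrations)
    )
    return call_llm(model="gpt-4", prompt=prompt)      # returns Python source code

def use_tool(tool_source: str, new_instance: str) -> str:
    """Tool using: a lightweight model (or plain Python) calls the generated tool."""
    namespace: dict = {}
    exec(tool_source, namespace)                       # review generated code before exec!
    return namespace["solve"](new_instance)
```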

Experiments on tasks such as logical deduction, tracking shuffled objects, and Dyck language parsing show that with tools made by GPT-4, GPT-3.5 Turbo as the tool user can match or exceed the performance of GPT-4 at lower cost. The authors also introduce a "dispatcher" LLM to handle streaming tasks by identifying when to reuse existing tools or request new ones.
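
The dispatcher can be pictured as a thin routing layer in front of a tool cache. A hypothetical sketch building on the functions above (the paper uses an LLM for this decision; here it is reduced to a dictionary lookup):

```python
tool_cache: dict[str, str] = {}   # task type -> generated Python source

def dispatch(task_type: str, instance: str, demonstrations: list[str]) -> str:
    # Reuse an existing tool when one matches the incoming task; otherwise
    # fall back to the (expensive) tool-making stage, once per task type.
    if task_type not in tool_cache:
        tool_cache[task_type] = make_tool(demonstrations)
    return use_tool(tool_cache[task_type], instance)
```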

Overall, this work demonstrates a promising approach to enabling LLMs to create their own tools, reducing reliance on human-crafted tools. The division of labor also allows using smaller models for most of the inferences, improving cost-efficiency. This technique could significantly expand the capabilities of LLMs in a scalable manner.

The proposed LATM framework demonstrates an interesting and promising approach to improving the reasoning and problem-solving capabilities of large language models in a cost-effective manner. Here are some thoughts on its applicability:

The ability for LLMs to create their own tools could be very useful for building practical applications. For any recurring task, the model could generate a reusable tool instead of solving from scratch each time. This could make applications more efficient and scalable.

The staged approach allows combining different sized models optimally - a powerful model makes tools, while lightweight models use the tools. This cost-effectiveness is attractive for real-world applications with budget constraints.

Because the tools are Python functions, they integrate easily into application codebases. The dispatcher model also provides flexibility to handle new tasks.

The method's applicability does seem more geared towards logical reasoning, procedural and algorithmic tasks right now. Further research may be needed to extend it to other domains.

There are still open challenges around rigorously testing and validating the quality and safety of automatically generated tools. Methods to provide human oversight would be important.

Overall, the LATM paradigm does appear promising for augmenting LLMs and enabling them to participate more actively in their own learning and tooling. With further research to broaden its scope, it could become a general framework for efficiently enhancing LLM capabilities.

So in summary, LATM seems quite promising as a technique for unlocking more of the potential of LLMs for practical applications requiring complex reasoning in a scalable and cost-efficient manner. More research is still needed, but the principles demonstrated align well with enabling wider usage of LLMs and GANs in applications.

An intriguing video discussing Falcon 40B, another LLM that performs remarkably well, especially given that it is much smaller than models like GPT-4.

cross-posted from: https://lemmy.intai.tech/post/41936

  • repo
  • [tweet](https://twitter.com/saten_work/status/1674856415977181184)

Reviews:

BabyCommandAGI, which is based on @yoheinakajima's BabyAGI, can now automatically create apps just by being given feedback.

The following example is for creating a Reversi game Flutter app.

Set the following OBJECTIVE and INITIAL_TASK, then wait for about 30 minutes.

OBJECTIVE: "Please install the Flutter environment via git, implement a Flutter app to play Reversi with black and white stones, and make the Flutter app you created accessible from outside the container by running 'flutter run -d web-server --web-port 8080 --web-hostname 0.0.0.0'."

INITIAL_TASK: "Develop a task list"

The AI will install the Flutter environment on the Linux container and create a Flutter app.

Once 'flutter run -d web-server --web-port 8080 --web-hostname 0.0.0.0' is executed, access 'http://localhost:8080/' from your browser. The first Flutter app I created was an empty app named Reversi.

There are still many problems, but BabyCommandAGI has successfully installed Flutter on the Linux container in Docker and created a Flutter project. (However, I have not yet succeeded in getting it to implement the Reversi app in Flutter.)

Here are the main current steps of BabyCommandAGI.

  1. Try to download Flutter with curl -> There's no curl command
  2. Install the curl command with apt-get
  3. Download Flutter with the curl command
  4. Set the environment path and run flutter doctor -> Error: Unable to find git in your PATH.
  5. Install the git command with apt-get
  6. Delete the original flutter folder
  7. Clone flutter with git
  8. Run flutter doctor -> There's no unzip tool
  9. Install unzip with apt-get
  10. Successfully create a project with flutter create. (This is where I am now.)

To give feedback to the AI, enter "f". After a short wait, it will pause and wait for your feedback.

"The Reversi board with black and white stones is not displayed and I can't play. Please make it possible to play Reversi."

I gave this feedback. Then wait again until 'flutter run -d web-server --web-port 8080 --web-hostname 0.0.0.0' is executed. Once it is executed, access 'http://localhost:8080/' again.

This time the board was displayed, but it seems that the stones cannot be placed. (If nothing changes, please give feedback again)

"I can't place stones. Reversi is a game where you place stones and the stones in between become the same color when sandwiched by the same color. Please make it possible to play Reversi."

I gave this feedback.

The stones were displayed, but this time the stones in between did not flip even when sandwiched by stones. (Although the white stones are hard to see, they are there.)

"The sandwiched stones do not become the same color as the sandwiching stones. Reversi is a game where the sandwiched stones become the same color as the sandwiching stones. Please make it possible to play Reversi."

I gave this feedback.

The sandwiched stones started to flip. However, the white stones are hard to see and the stones are small, so

"The white stones are hard to see, so please add a black border to make them easier to see. Also, the stones are small, so please make them a little bigger."

I gave this feedback.

The white stones now have a border and are easier to see, and the stones are a little bigger. However, the black stones are now displayed as strange squares.

"The black stones have become a strange square display. Please make the black stones display as black circles."

I gave this feedback.

The strange display of the black stones has also changed to black round stones, and Reversi is now playable.

With BabyCommandAGI, you don't have to program: just give it feedback and it will automatically program a simple app for you.

Also, while BabyCommandAGI is good at setting up environments and engineering tasks, it is not designed specifically for them, so there may be other use cases as well.

LMYield enables you to guide OpenAI's Chat API generations into arbitrary output patterns, and is specifically designed to enhance chain of thought prompting for agents.

The motivating concept behind LMYield is that for a given context, an agentic entity will spawn some number of ordered, related chains of thought, and these should be yielded as a subscribable stream.

Features:

  • Simple, intuitive syntax based on Handlebars templating.
  • Rich output structure, with speculative caching and multiple generations to ensure the desired output structure.
  • Designed specifically for agentic chain of thought.
  • TypeScript, not Python.

One-2-3-45 (lemmy.intai.tech)

cross-posted from: https://lemmy.intai.tech/post/41706

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, Hao Su

Summary: This paper presents a method to reconstruct 3D shapes from a single image in an end-to-end manner without time-consuming optimization. Their approach consists of three main parts:

Multi-view synthesis: They leverage a view-conditioned 2D diffusion model, Zero123, to generate multi-view images of the input object.

Pose estimation: They estimate the elevation angle of the input image to determine the camera poses of the multi-view images.

3D reconstruction: They employ a neural surface reconstruction method based on signed distance fields to reconstruct a 3D textured mesh from the multi-view images in a single feed-forward pass.
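
Put together, the three parts form a single feed-forward pipeline. The sketch below is purely illustrative: the three stage functions are hypothetical stand-ins for the components described above, not the authors' APIs.

```python
# Hypothetical stand-ins for the three stages; each would be backed by the
# corresponding model described in the paper.
def synthesize_multiview(image, num_views):       # Zero123-style view-conditioned diffusion
    raise NotImplementedError

def estimate_elevation(image):                    # elevation-angle estimation module
    raise NotImplementedError

def reconstruct_sdf_mesh(views, camera_poses):    # generalizable SDF-based surface reconstruction
    raise NotImplementedError

def image_to_mesh(image):
    views = synthesize_multiview(image, num_views=8)      # 1) multi-view synthesis
    elevation = estimate_elevation(image)                 # 2) pose estimation
    camera_poses = [(elevation, view_id) for view_id in range(len(views))]
    return reconstruct_sdf_mesh(views, camera_poses)      # 3) single feed-forward reconstruction
```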

Their key contributions are:

  • Reconstruction in just 45 seconds without per-shape optimization
  • Higher-quality geometry due to the use of an SDF representation
  • More 3D-consistent results thanks to the multi-view synthesis module
  • Better adherence to the input image compared to existing methods

They evaluate their approach on synthetic data and real images, demonstrating superior performance in terms of both mesh quality and runtime compared to existing zero-shot single-image 3D reconstruction approaches.

Evaluation: This approach has strong potential for applications in 3D content creation and augmented/virtual reality. The key benefits are:

Fast inference time of 45 seconds, which is orders of magnitude faster than optimization-based approaches. This makes it suitable for production environments with low latency requirements.

Ability to reconstruct 3D shapes from a single image of any object, not restricted to specific object categories. This enables a wide range of applications.

Good adherence to the input image, producing realistic 3D shapes that match the given input. This is important for applications where fidelity to the input is critical.

The ability to extend to text-to-3D tasks by integrating with text-to-image diffusion models, providing an unrestricted input domain.

The main limitation is the dependence on the Zero123 diffusion model for multi-view synthesis, which occasionally produces inconsistent predictions that can impact reconstruction quality. However, the overall results demonstrate strong potential for real-world applications. With further improvements to the multi-view synthesis module and additional regularizations, this approach could enable a wide range of novel applications that require reconstructing realistic 3D shapes from a single image in near real-time.

cross-posted from: https://lemmy.intai.tech/post/40699

Archive:

@Yampeleg: The first model to beat 100% of ChatGPT-3.5. Available on Huggingface.

🔥 OpenChat_8192

🔥 105.7% of ChatGPT (Vicuna GPT-4 Benchmark)

Less than a month ago the world witnessed as ORCA [1] became the first model to ever outpace ChatGPT on Vicuna's benchmark.

Today, the race to replicate these results open-source comes to an end.

Minutes ago OpenChat scored 105.7% of ChatGPT.

But wait! There is more!

Not only did OpenChat beat Vicuna's benchmark, it did so while pulling off a LIMA [2] move!

Training was done using 6K GPT-4 conversations out of the ~90K ShareGPT conversations.

The model comes in three versions: the basic OpenChat model, OpenChat-8192 and OpenCoderPlus (Code generation: 102.5% ChatGPT)

This is a significant achievement considering that it's the first (released) open-source model to surpass the Vicuna benchmark. 🎉🎉

Congratulations to the authors!!


[1] Orca: The first model to cross 100% of ChatGPT: https://arxiv.org/pdf/2306.02707.pdf
[2] LIMA: Less Is More for Alignment (TL;DR: a small number of very high-quality samples, 1,000 in the paper, can be as powerful as much larger datasets): https://arxiv.org/pdf/2305.11206

cross-posted from: https://lemmy.intai.tech/post/40583

NTK-Aware Scaled RoPE allows LLaMA models to have an extended (8k+) context size without any fine-tuning and with minimal perplexity degradation.


I've seen the posts about SuperHOT and, just recently, the paper from Meta which uses RoPE interpolation, and I've noticed an immediate improvement that can be brought to this method. Basically, if you apply Neural Tangent Kernel (NTK) theory to this problem, it becomes clear that simply interpolating the RoPE's Fourier space "linearly" is very sub-optimal, as it prevents the network from distinguishing the order and positions of tokens that are very close by. Borrowing from the NTK literature, scaling down the Fourier features too much will eventually even prevent successful fine-tunes (this is corroborated by the recent paper from Meta, which suggests an upper bound of ~600x).

Instead of the simple linear interpolation scheme, I've tried to design a nonlinear interpolation scheme using tools from the NTK literature. Basically, this interpolation scheme changes the base of the RoPE instead of the scale, which intuitively changes the "spinning" speed of each of the RoPE's dimension vectors relative to the next. Because it does not scale the Fourier features directly, all the positions remain perfectly distinguishable from each other, even when taken to the extreme (e.g., stretched 1 million times, which is effectively a context size of 2 billion).
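
For concreteness, here is a minimal NumPy sketch of the idea as I understand it from the description above: linear (SuperHOT-style) interpolation divides the positions by a constant scale factor, while the NTK-aware variant instead enlarges the RoPE base as a function of alpha, so the highest frequencies are left nearly untouched and nearby tokens stay distinguishable. The exact base formula below is my reading of "changing the base instead of the scale" and may differ in detail from the linked notebook.

```python
import numpy as np

def rope_inverse_frequencies(dim, base=10000.0, alpha=1.0, linear_scale=1.0):
    # Standard RoPE: inv_freq_i = base ** (-2i / dim), for i = 0 .. dim/2 - 1.
    # NTK-aware interpolation: grow the base with alpha so low frequencies are
    # stretched a lot while the highest frequencies barely change.
    ntk_base = base * alpha ** (dim / (dim - 2))
    inv_freq = 1.0 / (ntk_base ** (np.arange(0, dim, 2) / dim))
    # Linear interpolation (for comparison) would instead divide all positions,
    # or equivalently all frequencies, by a constant factor.
    return inv_freq / linear_scale

# e.g. LLaMA attention heads use dim = 128; alpha = 8 is the value used later in this post.
print(rope_inverse_frequencies(dim=128, alpha=8.0)[:4])
```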

To my surprise, this method works extremely well, so much so that you don't even need to fine-tune the LLaMA 7B model for a 4096 context size! The perplexity degradation is minimal. I'm sure with fine-tuning this would become even better.

Enough explanations, here are some empirical results. All the perplexity measurements were done on LLaMA 7B with the tau/scrolls dataset on Hugging Face (I only used a subset of gov_report).

Here's a graph showing the average perplexity of LLaMA 7b on a set of 40 very long prompts (12k+ context size). Compared to changing the scale (from SuperHOT, which was set to 4), we change a factor alpha, which when equal to 8 provides the same context size increase but with much less perplexity degradation. All without any finetuning!

Code can be found in a Colab notebook with a test example: NTKAwareScaledRotaryEmbedding.ipynb (Google Colaboratory).

Again, the change to the RoPE code is only 3 lines.

Also, just for fun, I've pasted the entirety of the Meta paper, [2306.15595] Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org), with the abstract and conclusions removed, and asked OASST-sft-6-llama-30b some questions. I think it fared pretty well given it hasn't been fine-tuned at all for 6000+ context sizes. I used an alpha of 8 for this prompt.

The entire paper is about 6200 tokens. These prompts can be found in the colab notebook linked above. The following questions are simply appended at the end of the prompt which contains the entire research paper.

QUESTION

Please give me a brief summary of this research paper in a few bullet points.

ANSWER

  • Large Language models (LLMs) often have a pre-defined context window size, but sometimes they need to operate on longer contexts, e.g., in conversation systems or summarization tasks.

  • Extending the context window of an existing pre-trained LLM is challenging because direct extrapolation of positional encodings can lead to poor performance.

  • Researchers propose "Position Interpolation" to extend the context window of LLMs by downscaling the positional encodings.

  • They prove that interpolated positional encodings have a much smaller upper bound than extrapolated ones, allowing the model to adapt to longer contexts more efficiently.

  • They experimentally validate the effectiveness of Position Interpolation on LLaMA models, reducing the perplexity gap between the original and extended contexts.

  • They compare Position Interpolation with direct extrapolation and find that Position Interpolation works better.

  • They apply Position Interpolation to LLaMa models and fine-tune them on the Pile dataset, finding that the models can effectively utilize longer contexts.

  • They also evaluate the models' abilities to attend to longer contexts and perform well on benchmarks within the original context size.

  • They apply Position Interpolation to long document summarization and obtain competitive results.

  • They provide insights into the interpolation/extrapolation tradeoff and discuss potential improvements.

  • They plan to explore more advanced methods like data augmentation and multi-task learning to improve the interpolation.

QUESTION

Please write me the abstract for this paper.

ANSWER

This paper presents a technique called "Position Interpolation" for extending the context window size of pre-trained Large Language Models (LLMs) without modifying their architecture. Previous methods like ALiBi and LeX were unable to efficiently extend the context window size of LLMs, leading to poor performance. The authors propose Position Interpolation, which involves downscaling the positional encoding of the LLM's attention mechanism to match the desired longer context window size. They prove mathematically that interpolated attention scores are more stable than extrapolated ones, allowing the model to adapt to longer contexts with just a few hundred fine-tuning steps. Experiments showed that Position Interpolation enables LLMs to perform well on longer contexts, reducing perplexity by up to 0.8 points on average, and generating strong summarizations. Extending the context window size to 32768 led to slight degradation on the original benchmarks, but the models adapted to longer texts. The authors conclude that interpolating positional encodings allows the model to attend to relevant evidence in the longer contexts, and the stability of the interpolated attention score helps the model avoid catastrophic behaviors.

QUESTION

How many steps was the model fine tuned for the final results? Give a short answer.

ANSWER

The authors of the paper "Extending Context Windows of Large Language Models via Position Interpolation" fine-tuned the model for 1000 batches for the experiments presented in the paper.

QUESTION

How big is the interpolation bound compared to the extrapolation bound? Give a short answer.

ANSWER

The interpolation bound is at least 2 - 29.7 * 10^(-6) times smaller than the extrapolation bound.

Sorry for the large dump of information on reddit, I'm too lazy to write a blog post for this. I might give a more detailed explanation on how I derived the formula used to calculate the base if enough people are interested.

I did not test fine-tuning performance as I do not have the resources or the time to fine tune an LLM, I just derived this formula during lunch and experimented with it. However, I think that this method will do even better with fine tuning. Also thanks to the people behind the SuperHOT blogpost, it was their hard work that inspired me and allowed me to make this contribution for everyone!

Finally, I really hope this post will inspire others to start experimenting on ways to improve LLMs. There's so much to learn and so much left to discover! What a time to be alive!

cross-posted from: https://lemmy.intai.tech/post/33969

This site is made by the same team that just released the Orca dataset.

What the heck is this?

Discord: https://discord.gg/ad27GQgc7K
