this post was submitted on 02 Aug 2023

Technology

Tech experts are starting to doubt that ChatGPT and A.I. ‘hallucinations’ will ever go away: ‘This isn’t fixable’::Experts are starting to doubt it, and even OpenAI CEO Sam Altman is a bit stumped.

[–] Womble 1 points 1 year ago (1 children)

The key vector for John might effectively say “I am: a noun describing a male person.” The network would detect that these two vectors match and move information about the vector for John into the vector for his.

This is the bit you are missing: the attention network actively changes the token vectors depending on context, which transfers new information into the meaning of that word.

[–] Zeth0s 1 points 1 year ago* (last edited 1 year ago) (1 children)

The network doesn't detect matches, but the model definitely works on similarities. Words are mapped into a high-dimensional space, with the idea that this space can mathematically retain conceptual similarity as spatial proximity.

Words are transformed into a mathematical representation that is able (or at least tries) to retain their semantic information.

But the different meanings of a word belong to the word itself and are defined by the language; the model cannot modify them.
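
To make "similarity as spatial proximity" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real embeddings are learned and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: not taken from any real model).
toy_embeddings = {
    "chianti": [0.9, 0.8, 0.1],
    "merlot": [0.8, 0.9, 0.2],
    "keyboard": [0.1, 0.0, 0.9],
}

# Two wines point in nearly the same direction; a wine and a keyboard do not.
wine_pair = cosine_similarity(toy_embeddings["chianti"], toy_embeddings["merlot"])
odd_pair = cosine_similarity(toy_embeddings["chianti"], toy_embeddings["keyboard"])
print(wine_pair > odd_pair)
```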

Anyway, we are talking about details here. We could bore the audience to death.

Edit: I asked GPT-4 to summarize the concepts. I believe it did a decent job. I hope it helps:

  1. Embedding Space:

    • Initially, every token is mapped to a point (or vector) in a high-dimensional space via embeddings. This space is typically called the "embedding space."
    • The dimensionality of this space is determined by the size of the embeddings. For many Transformer models, this is often several hundred dimensions, e.g., 768 for some versions of GPT and BERT.
  2. Positional Encodings:

    • These are vectors added to the embeddings to provide positional context. They share the same dimensionality as the embedding vectors, so they exist within the same high-dimensional space.
  3. Transformations Through Layers:

    • As tokens' representations (vectors) pass through Transformer layers, they undergo a series of linear and non-linear transformations. These include matrix multiplications, additions, and the application of functions like softmax.
    • At each layer, the vectors are "moved" within this high-dimensional space. When we say "moved," we mean they are transformed, resulting in a change in their coordinates in the vector space.
    • The self-attention mechanism allows a token's representation to be influenced by other tokens' representations, effectively "pulling" or "pushing" it in various directions in the space based on the context.
  4. Nature of the Vector Space:

    • This space is abstract and high-dimensional, making it hard to visualize directly. However, in this space, the "distance" and "direction" between vectors can have semantic meaning. Vectors close to each other can be seen as semantically similar or related.
    • The exact nature and structure of this space are learned during training. The model adjusts the parameters (like weights in the attention mechanisms and feed-forward networks) to ensure that semantically or syntactically related concepts are positioned appropriately relative to each other in this space.
  5. Output Space:

    • The final layer of the model transforms the token representations into an output space corresponding to the vocabulary size. This is a probability distribution over all possible tokens for the next word prediction.
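
As a rough illustration of the output step, the final layer produces one score (a logit) per vocabulary entry, and a softmax turns those scores into a probability distribution. The three-word vocabulary and the logit values below are made up for the example.

```python
import math

def next_token_distribution(logits, vocabulary):
    """Convert raw scores into probabilities with a numerically stable softmax."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return {token: e / total for token, e in zip(vocabulary, exps)}

# Hypothetical vocabulary and logits (assumption: illustrative values only).
vocabulary = ["chianti", "merlot", "keyboard"]
logits = [2.0, 1.5, -1.0]

probs = next_token_distribution(logits, vocabulary)
most_likely = max(probs, key=probs.get)
print(most_likely)
```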

In essence, the entire process of token representation within the Transformer model can be seen as continuous transformations within a vector space. The space itself can be considered a learned representation where relative positions and directions hold semantic and syntactic significance. The model's training process essentially shapes this space in a way that facilitates accurate and coherent language understanding and generation.
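
The "pulling" of one token's vector by the others can be sketched with a single attention head. This is a simplified sketch, not how any particular model is implemented: it assumes identity query/key/value projections (real models learn these matrices) and leaves out multiple heads, residual connections, and feed-forward layers. The token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with identity Q/K/V (an assumption)."""
    dim = len(vectors[0])
    updated = []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New vector: weighted average of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        updated.append(mixed)
    return updated

tokens = ["John", "wants", "his"]
vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]  # toy values
contextual = self_attention(vectors)
print(contextual[2])  # the vector for "his" after mixing in context
```

Each output row is a weighted average of the input rows, so the vector for "his" ends up carrying information from the other tokens in the sequence.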

[–] Womble 1 points 1 year ago (1 children)

Yes, of course it works on similarities; I haven't disputed that. My point was that the transformations of the token vectors are a transfer of information, and that this information is not lost as things move out of the context length. That information may slowly decohere over time if it is not reinforced, but the model does not immediately forget things as they move out of context, as you originally claimed.

[–] Zeth0s 1 points 1 year ago* (last edited 1 year ago) (1 children)

It does, as the model only works with a well-defined chunk of tokens of a given length. Everything before that is lost. Clearly, part of the information from the previous context is in that chunk.

But let's say that I am talking about wine, and at some point I mention Chianti. The chatbot and I go on discussing for over 4k words (I am using ChatGPT as an example) without mentioning Chianti again. After that, the chatbot will know we are discussing wine, but it won't know we covered the topic of Chianti.

This is what I meant.
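
The scenario above can be sketched as a fixed-size window over the conversation. This assumes plain truncation of the oldest tokens, which is only one way a chat front-end might handle overflow; `visible_context` and the window size of 8 are hypothetical and chosen to keep the example small.

```python
def visible_context(tokens, window_size):
    """Return only the most recent `window_size` tokens; older ones are dropped."""
    if window_size <= 0:
        return []
    return tokens[-window_size:]

# "chianti" is mentioned once early on; "wine" keeps coming up afterwards.
conversation = ["wine", "chianti"] + ["wine", "talk"] * 5 + ["which", "wine", "?"]

window = visible_context(conversation, 8)
print("wine" in window)     # the general topic is still visible
print("chianti" in window)  # the early mention has fallen out of the window
```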

[–] Womble 1 points 1 year ago (1 children)

I'm only going to reply this time, then I'm done here, as we are going round in circles. I'm saying that is not what happens, as the attention network would link Chianti and wine together in that case and move information between them. So even after Chianti has gone out of the context window, the model is more likely to pick Chianti than Merlot when it needs a type of wine.

[–] Zeth0s 1 points 1 year ago

Good call, it doesn't look like we are convincing each other ;)