How does ‘self-attention’ work in transformer models?

I’m currently diving into the world of machine learning and transformers, and I’m trying to wrap my head around the concept of “attention” in transformer models. I’ve been reading papers and documentation, but I’m still struggling to fully grasp it.

**My Struggle:**

I get that attention involves taking the dot product of “query” and “key” vectors to score how relevant each word in a sequence is to the others, but I don’t quite understand why this dot product gives a meaningful measure of importance.
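
To make my current understanding concrete, here’s a rough NumPy sketch of the scaled dot-product attention I think the papers describe (the toy shapes, the `softmax` helper, and the weight matrices are just mine for illustration, not anyone’s actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query/key/value vectors."""
    d_k = Q.shape[-1]
    # scores[i, j] is the dot product of query i with key j:
    # large when the two vectors point in a similar direction.
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)   # (seq_len, seq_len)
    # Each output vector is a weighted average of the value vectors.
    return weights @ V                   # (seq_len, d_k)

# Toy example: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))              # token embeddings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                         # (3, 4)
```

The line I’m stuck on is `scores = Q @ K.T`: why is that raw dot product a meaningful importance score in the first place?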

**What I’m looking for:**

I’m comfortable with a moderate level of technical detail, but I’d like deeper insight into the inner workings and the rationale behind these mechanisms. Please share any insights, analogies, or technical details that can shed light on this concept.

Thanks a bunch!

Anonymous

You may have more luck on an ML-specific subreddit. This is an extremely technical question that requires specialist knowledge even to understand what you’re asking.