Commit fdb8c9c: Update README.md
1 parent 6d7ef06

README.md: 1 file changed, 11 additions, 2 deletions
@@ -13,6 +13,8 @@ To understand this article, Now we need to know:
- Remember, the length and width of the square patches are fixed during training of the model; they do not depend on the size of the image.
- Now the VLM identifies the color and shape that each patch contains; a code sketch follows the image below.

![Screenshot 2025-05-08 222811](https://github.com/user-attachments/assets/c140f3e6-b898-4637-be4f-72816daeeb1d)
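
To make the fixed-patch idea concrete, here is a minimal NumPy sketch (illustrative code, not from this repository; the 16x16 patch size follows common ViT practice, and the helper name is an assumption):

```python
import numpy as np

# Split an image into fixed-size square patches. The patch size (16x16 here)
# is fixed; a larger image simply yields MORE patches, not bigger ones.
def split_into_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "pad or resize first"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size, patch_size, c)
    )
    return patches  # shape: (num_patches, patch_size, patch_size, channels)

image = np.random.rand(224, 224, 3)   # a dummy 224x224 RGB image
patches = split_into_patches(image)
print(patches.shape)                  # (196, 16, 16, 3)
```

Note that a 448x448 image would simply produce 784 patches of the same 16x16 size.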

### Do pixels have a similar use case to sub-tokens?
- If you remember, one question should come to your mind: in LLMs we learned that the concept of sub-tokens is used when the LLM needs to understand the meaning of a complex or unseen word.
- But in a VLM, the concept of pixels is not used to understand the structure or color of a complex image.
@@ -31,7 +33,9 @@ To understand this article, Now we need to know:
- The location of an object is determined using mathematical functions like sine and cosine.
- Every position and color of an object in an image is different, so the vector numbers are always unique (a projection sketch follows the image below).

![Linear Projection Layer_ - visual selection](https://github.com/user-attachments/assets/de77e3f8-78db-48c4-8899-1d3e6cc7ef3b)
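
As a rough sketch of how each patch becomes a unique vector, here is an assumed linear projection layer in PyTorch (the sizes are illustrative and follow standard ViT practice; this is not code from this repository):

```python
import torch
import torch.nn as nn

# Each 16x16x3 patch is flattened to a 768-dim vector and projected
# to the model's embedding size by a learned linear layer.
patch_dim = 16 * 16 * 3      # flattened patch length
embed_dim = 512              # illustrative embedding size
projection = nn.Linear(patch_dim, embed_dim)

patches = torch.rand(196, patch_dim)   # 196 flattened patches from a 224x224 image
embeddings = projection(patches)       # one unique vector per patch
print(embeddings.shape)                # torch.Size([196, 512])
```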

### CLS Token:
In this step, when a user uploads an image to the VLM, a CLS token is created in which all the information about the image and its patches is stored in vector form, such as which patch contains which shapes or colors.
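
A minimal sketch of the CLS-token idea, assuming the common BERT/ViT convention of prepending one learnable vector to the patch sequence (illustrative, not this repository's code):

```python
import torch

# A learnable CLS vector is prepended to the patch embeddings so it can
# aggregate information about the whole image as it flows through the model.
embed_dim = 512
cls_token = torch.zeros(1, embed_dim, requires_grad=True)  # learnable in practice

patch_embeddings = torch.rand(196, embed_dim)
sequence = torch.cat([cls_token, patch_embeddings], dim=0)
print(sequence.shape)  # torch.Size([197, 512]): CLS token + 196 patches
```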
@@ -42,18 +46,23 @@ which shapes or color etc.
- The Self-Attention Layer helps the VLM compare the patches and identify which patches belong together as parts of the same object.
- The Feed-Forward Layer helps the self-attention layer dig deeper into the information stored in the CLS token.
- Patches whose vector numbers differ less are considered highly related to each other; in this way the transformer learns the relations between patches. A simplified sketch follows the image below.

![Transformer (Neural Network)_ - visual selection](https://github.com/user-attachments/assets/001e3381-744a-4734-9275-d65392c44e02)
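
Here is a minimal sketch of the patch comparison self-attention performs (simplified: the separate query/key/value projections of a real transformer are omitted, and the sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Scaled dot-product self-attention: every patch embedding is compared with
# every other, and more similar patches receive higher attention weights.
def self_attention(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # pairwise patch similarity
    weights = F.softmax(scores, dim=-1)           # related patches -> higher weight
    return weights @ x                            # mix information across patches

sequence = torch.rand(197, 512)   # CLS token + 196 patch embeddings
out = self_attention(sequence)
print(out.shape)                  # torch.Size([197, 512])
```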

## 4. Positional Encoding:
- Images are 2D, so the position of each patch matters.
- If you think this information is also stored in the CLS token, you are mistaken.
- Positional Encoding automatically identifies the position of each patch, as it uses mathematical functions like sine and cosine.
- Remember, the vector number is generated according to the position of the patch, so with the help of this vector number Positional Encoding can automatically detect the patch's position (a sketch follows the image below).

![Positional Encoding_ - visual selection](https://github.com/user-attachments/assets/101c549a-9eec-4c27-9e1a-0fbdedb4edc9)
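
A minimal sketch of the sine/cosine idea, using the standard sinusoidal positional-encoding formula from "Attention Is All You Need" (the sizes are illustrative assumptions):

```python
import numpy as np

# Sinusoidal positional encoding: each position gets a unique vector built
# from sine and cosine waves of different frequencies.
def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]                  # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)   # even dimensions: sine
    pe[:, 1::2] = np.cos(positions * freqs)   # odd dimensions: cosine
    return pe                                 # unique vector per position

pe = positional_encoding(num_positions=197, dim=512)
print(pe.shape)   # (197, 512); added to the patch embeddings
```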

## 5. Binary Conversion:
- Now the vector numbers are converted into binary numbers, where values are represented as 0s and 1s.
- Each 0 and 1 represents 1 bit; a small example follows the image below.

![Binary Conversion_ - visual selection](https://github.com/user-attachments/assets/27b0a730-b912-4bd3-ac6c-d14a89981bcf)
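
To see the bits concretely, here is a small illustrative example showing that one 32-bit float from an embedding vector is stored as 32 zeros and ones (the value is arbitrary):

```python
import struct

# Reinterpret a float32 as its raw 32-bit pattern and print the bits.
value = 0.7134
bits = format(struct.unpack('>I', struct.pack('>f', value))[0], '032b')
print(bits)        # the 32 zeros and ones that actually sit in memory
print(len(bits))   # 32 bits = 4 bytes per float32 number
```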

### Do you know why we need more memory to train this type of AI model?
- Your answer will probably be: because we train the AI model on various datasets, it takes a lot of memory.
- OK, that answer is very common; have you tried to explore more deeply?
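
One deeper angle, as a rough back-of-envelope sketch (all numbers are illustrative assumptions, not this model's real size): memory scales with the number of parameters times the bytes each one occupies, and training multiplies that further.

```python
# Every parameter stored in float32 costs 32 bits = 4 bytes.
num_parameters = 86_000_000   # assumption: roughly ViT-Base scale
bytes_per_param = 4           # float32 = 32 bits
weights_gb = num_parameters * bytes_per_param / 1e9
print(f"{weights_gb:.2f} GB just for the weights")  # ~0.34 GB

# Training also keeps gradients and optimizer state for every parameter,
# so the real training footprint is several times the weights alone.
```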
