

Matt Shipman

Vision transformers (ViTs) are powerful artificial intelligence (AI) technologies that can identify or categorize objects in images – however, there are significant challenges related to both computing power requirements and decision-making transparency. Researchers have now developed a new methodology that addresses both challenges, while also improving the ViT's ability to identify, classify and segment objects in images.

Transformers are among the most powerful existing AI models. For example, ChatGPT is an AI that uses transformer architecture, but the inputs used to train it are language. ViTs are transformer-based AI that are trained using visual inputs. For example, ViTs could be used to detect and categorize objects in an image, such as identifying all of the cars or all of the pedestrians in an image.

However, ViTs face two challenges. First, transformer models are very complex. Relative to the amount of data being plugged into the AI, transformer models require a significant amount of computational power and use a large amount of memory. This is particularly problematic for ViTs, because images contain so much data.

Second, it is difficult for users to understand exactly how ViTs make decisions. For example, you might have trained a ViT to identify dogs in an image. But it's not entirely clear how the ViT is determining what is a dog and what is not. Depending on the application, understanding the ViT's decision-making process, also known as its model interpretability, can be very important.

The new ViT methodology, called "Patch-to-Cluster attention" (PaCa), addresses both challenges.
