GSF decomposes the input tensor using grouped spatial gating and then fuses the decomposed tensors through channel weighting. Integrated into 2D CNNs, GSF enables efficient and effective spatio-temporal feature extraction at negligible cost in parameters and computation. We analyze GSF in detail using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
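To make the gating-and-fusion idea concrete, below is a minimal PyTorch sketch of grouped spatial gating followed by channel-weighted fusion. The module and variable names are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of grouped spatial gating plus channel-weighted fusion,
# in the spirit of GSF. Names are illustrative, not the authors' code.
import torch
import torch.nn as nn

class GroupedGateFuse(nn.Module):
    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        self.groups = groups
        # One spatial gate map per group.
        self.gate_conv = nn.Conv2d(channels, groups, kernel_size=3,
                                   padding=1, bias=True)
        # Squeeze-and-excitation style channel weighting for fusion.
        self.fuse_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        gates = torch.sigmoid(self.gate_conv(x))            # (N, G, H, W)
        xg = x.view(n, self.groups, c // self.groups, h, w)
        gated = xg * gates.unsqueeze(2)                     # decomposed part
        residual = xg - gated                               # complement part
        # Channel weights decide how much of each branch to keep.
        wts = self.fuse_fc(x.mean(dim=(2, 3)))
        wts = wts.view(n, self.groups, c // self.groups, 1, 1)
        fused = wts * gated + (1.0 - wts) * residual        # fuse
        return fused.view(n, c, h, w)
```

The block is drop-in for a 2D CNN stage: it preserves the tensor shape, so it can be inserted after any convolutional block at small parameter cost.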
Deploying embedded machine learning models for edge inference requires navigating complex trade-offs between resource metrics, such as energy use and memory footprint, and performance metrics, such as latency and predictive accuracy. This paper explores Tsetlin Machines (TM), an emerging machine-learning algorithm that uses learning automata to build propositional logic rules for classification, as an alternative to neural networks. Using algorithm-hardware co-design, we propose a novel methodology, REDRESS, for TM training and inference. REDRESS comprises independent Tsetlin Machine training and inference techniques designed to shrink the memory footprint of the resulting automata, making them well suited to low-power and ultra-low-power applications. The learned information is held in a binary Tsetlin Automata (TA) array, where 0 denotes exclude and 1 denotes include. REDRESS's include-encoding, a lossless TA compression method, achieves over 99% compression by storing only the include elements. To further boost the accuracy and sparsity of TAs, a computationally minimal training procedure called Tsetlin Automata Re-profiling reduces the number of includes and hence the memory footprint. Finally, REDRESS's intrinsically bit-parallel inference algorithm operates directly on the compressed TA, so no decompression is needed at runtime, yielding substantial speedups over state-of-the-art Binary Neural Network (BNN) models. With REDRESS, the TM model outperforms BNN models on all design metrics across five benchmark datasets: MNIST, CIFAR2, KWS6, Fashion-MNIST, and Kuzushiji-MNIST. On the STM32F746G-DISCO microcontroller, REDRESS delivers speedups and energy savings of 5x to 5700x over various BNN models.
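The include-encoding scheme can be illustrated with a short sketch: because the TA array is binary and sparse, storing only the indices of the include (1) entries is lossless, and a conjunctive clause can be evaluated directly on the compressed form. The function names here are hypothetical, not the REDRESS API.

```python
# A minimal sketch of include-encoding: lossless compression of a binary
# Tsetlin Automata (TA) action array by storing only the positions of the
# "include" (1) entries. Names are illustrative, not the REDRESS API.
from typing import List

def include_encode(ta_actions: List[int]) -> List[int]:
    """Compress a 0/1 TA action array to a sorted list of include indices."""
    return [i for i, action in enumerate(ta_actions) if action == 1]

def include_decode(indices: List[int], length: int) -> List[int]:
    """Reconstruct the original 0/1 array (exact, hence lossless)."""
    out = [0] * length
    for i in indices:
        out[i] = 1
    return out

def sparse_clause_output(include_idx: List[int], literals: List[int]) -> int:
    """Evaluate one conjunctive clause directly on the compressed form:
    the clause fires iff every included literal is 1, so no decompression
    is needed at inference time."""
    return int(all(literals[i] for i in include_idx))

# Usage: a sparse TA array compresses to a handful of integers.
ta = [0] * 100
ta[3] = ta[42] = 1
enc = include_encode(ta)                      # [3, 42]
assert include_decode(enc, len(ta)) == ta
print(sparse_clause_output(enc, [1] * 100))   # 1: all included literals true
```

The sparser the TA array (which Re-profiling encourages), the fewer indices need storing, which is why the compression ratio climbs with sparsity.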
Deep learning-based fusion methods have shown encouraging results on image fusion tasks, largely because the network architecture plays a crucial role in the fusion process. However, designing a robust fusion architecture remains difficult, so fusion network design is still an art rather than a science. To address this, we formulate the fusion task mathematically and establish the connection between its optimal solution and the network architecture that can implement it. Building on this insight, the paper describes a novel method for constructing a lightweight fusion network, offering an alternative to time-consuming trial-and-error network design. Specifically, we adopt a learnable representation for fusion, where the optimization algorithm that trains the learnable model shapes the architecture of the fusion network. Our learnable model is built on the low-rank representation (LRR) objective. The matrix multiplications at the core of its solution are recast as convolutional operations, and the iterative optimization process is replaced by a dedicated feed-forward network. Based on this design, a lightweight end-to-end fusion network is constructed to merge infrared and visible light imagery. Its successful training relies on a detail-to-semantic information loss function crafted to preserve image details and enhance the salient characteristics of the source images. Our experiments show that the proposed fusion network performs fusion more effectively than existing state-of-the-art methods on public datasets. Notably, our network requires fewer training parameters than other existing methods.
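A minimal sketch of the unrolling idea, in the LISTA style: one iterative shrinkage update of an LRR-like objective is repeated a fixed number of times, with learnable convolutions standing in for the matrix multiplications. Layer names and the update rule are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal PyTorch sketch of unrolling an iterative low-rank-representation
# style update into a feed-forward block: the matrix products of the solver
# step become learnable convolutions, and the shrinkage operator becomes a
# learnable soft-threshold. Illustrative, not the authors' exact network.
import torch
import torch.nn as nn

def soft_threshold(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * torch.relu(torch.abs(x) - theta)

class UnrolledLRRBlock(nn.Module):
    def __init__(self, channels: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        # Convolutions standing in for the matrix products of one update.
        self.W = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.S = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.theta = nn.Parameter(torch.full((1, channels, 1, 1), 0.01))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.steps):
            # z <- shrink(W*x + S*z): one unrolled optimization step.
            z = soft_threshold(self.W(x) + self.S(z), self.theta)
        return z
```

Because the number of steps is fixed and every step is a convolution, the resulting block is an ordinary feed-forward network whose depth and connectivity are dictated by the optimizer it replaces.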
Learning from long-tailed data is a significant hurdle for deep visual recognition models, which must be trained on large numbers of images that follow such a distribution. Over the past decade, deep learning has emerged as a powerful recognition model, enabling the learning of high-quality image representations and driving remarkable progress in general visual recognition. However, severe class imbalance, a recurring issue in practical visual recognition tasks, often limits the effectiveness of deep recognition models in real applications: they can become strongly biased toward the dominant classes and struggle with the less prevalent ones. In response, a substantial body of research has emerged in recent years, yielding encouraging progress in deep long-tailed learning. Given these significant strides, this paper presents a comprehensive survey of recent advances in deep long-tailed learning. Specifically, we organize existing studies into three broad categories, namely class re-balancing, information augmentation, and module improvement, and methodically review the approaches under this taxonomy. We then empirically analyze several state-of-the-art methods, assessing how they handle class imbalance via a newly proposed metric called relative accuracy. We conclude the survey by highlighting practical applications of deep long-tailed learning and identifying promising directions for future research.
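As a rough illustration of the proposed metric, the sketch below assumes relative accuracy is the ratio of a method's accuracy to an upper reference accuracy, for example that of a vanilla model trained on a balanced version of the data; the survey's precise definition may differ in detail.

```python
# A hedged sketch of the "relative accuracy" idea: how much of an upper
# reference accuracy a long-tailed method recovers. The reference here is
# assumed to come from a model trained on balanced data; the survey's exact
# definition may differ.
def relative_accuracy(method_acc: float, upper_reference_acc: float) -> float:
    """Fraction of the achievable (reference) accuracy a method recovers."""
    if upper_reference_acc <= 0:
        raise ValueError("reference accuracy must be positive")
    return method_acc / upper_reference_acc

# Example: 52.3% vs. a 64.0% reference gives ~0.82 relative accuracy.
print(round(relative_accuracy(0.523, 0.640), 3))
```

Normalizing by a reference in this way separates genuine imbalance handling from gains that a method would achieve even on balanced data.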
Objects in a single visual scene are related to one another to varying degrees, and only a subset of these relations is significant. Inspired by the Detection Transformer's success in object detection, we frame scene graph generation as a set prediction problem. In this paper, we propose Relation Transformer (RelTR), an end-to-end scene graph generation model with an encoder-decoder architecture. The encoder reasons about the visual feature context, while the decoder uses several types of attention mechanisms, with coupled subject and object queries, to infer a fixed-size set of subject-predicate-object triplets. For end-to-end training, we devise a set prediction loss that matches predicted triplets against ground-truth triplets. Unlike most existing scene graph generation approaches, RelTR is a single-stage method that predicts sparse scene graphs directly from visual appearance alone, without combining entities or labeling every possible predicate. Extensive experiments on the Visual Genome, Open Images V6, and VRD datasets show that our model achieves fast inference with superior performance.
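The set prediction loss rests on a one-to-one assignment between predicted and ground-truth triplets. The sketch below shows DETR-style Hungarian matching on a toy classification cost; the cost terms and weights are illustrative, not RelTR's exact formulation.

```python
# A minimal sketch of set-prediction matching for triplets: predicted
# subject-predicate-object triplets are assigned one-to-one to ground-truth
# triplets by Hungarian matching on a cost matrix, as in DETR-style training.
# Cost terms are illustrative, not RelTR's exact loss.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_triplets(pred_probs: np.ndarray, gt_labels: np.ndarray):
    """pred_probs: (num_queries, 3, num_classes) softmax scores for the
    subject, predicate, and object slots; gt_labels: (num_gt, 3) class ids.
    Returns (pred_idx, gt_idx) of the minimum-cost assignment."""
    num_q = pred_probs.shape[0]
    num_gt = gt_labels.shape[0]
    cost = np.zeros((num_q, num_gt))
    for q in range(num_q):
        for g in range(num_gt):
            # Summed negative log-likelihood of the ground-truth classes
            # over the subject, predicate, and object slots.
            cost[q, g] = -sum(np.log(pred_probs[q, k, gt_labels[g, k]] + 1e-9)
                              for k in range(3))
    return linear_sum_assignment(cost)

# Usage: 4 queries, 2 ground-truth triplets over a 5-class toy vocabulary.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=(4, 3))
gts = np.array([[0, 1, 2], [3, 0, 4]])
print(match_triplets(probs, gts))
```

Only the matched query-truth pairs contribute classification and box terms to the loss; unmatched queries are supervised toward a "no relation" class, which is what lets the model emit a sparse graph.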
Local features are widely used across visual applications and answer pressing needs in industrial and commercial settings. Large-scale applications place high demands on both the accuracy and the speed of local features. Existing local feature learning research tends to focus on describing individual keypoints while neglecting the relationships among keypoints established by global spatial context. This paper introduces AWDesc with a consistent attention mechanism (CoAM), which lets local descriptors perceive image-level spatial context during both training and matching. For local feature detection, we combine local feature detection with a feature pyramid to obtain more accurate and stable keypoint localization. For local feature description, we provide two versions of AWDesc to accommodate different accuracy and runtime requirements. Through Context Augmentation, non-local contextual information is injected to overcome the inherent locality of convolutional neural networks, giving local descriptors a wider field of view for better description. Specifically, we propose the Adaptive Global Context Augmented Module (AGCA) and the Diverse Surrounding Context Augmented Module (DSCA), which incorporate contextual information from global to immediate surrounding regions to build robust local descriptors. In addition, we engineer a streamlined backbone network, paired with our knowledge distillation strategy, to strike the best balance between speed and accuracy. Finally, detailed experiments on image matching, homography estimation, visual localization, and 3D reconstruction show that our approach outperforms existing state-of-the-art local descriptors. The AWDesc code is available on GitHub: https://github.com/vignywang/AWDesc.
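The speed/accuracy trade via distillation can be sketched as follows: a lightweight student backbone is trained to reproduce a frozen teacher's descriptors at the same keypoints. The loss below is a plausible stand-in, not necessarily AWDesc's exact objective.

```python
# A minimal sketch of descriptor-level knowledge distillation: the student
# is pulled toward the (frozen) teacher's L2-normalized descriptors. This
# illustrates the idea only; AWDesc's actual distillation may differ.
import torch
import torch.nn.functional as F

def descriptor_distill_loss(student_desc: torch.Tensor,
                            teacher_desc: torch.Tensor) -> torch.Tensor:
    """Both tensors are (num_keypoints, dim). The teacher is detached so
    gradients flow only into the student."""
    s = F.normalize(student_desc, dim=1)
    t = F.normalize(teacher_desc.detach(), dim=1)
    # 1 - cosine similarity, averaged over keypoints.
    return (1.0 - (s * t).sum(dim=1)).mean()

# Usage with random stand-in descriptors:
loss = descriptor_distill_loss(torch.randn(512, 128), torch.randn(512, 128))
print(float(loss))
```

Because the supervision signal is dense (one target per keypoint), a much smaller backbone can approach the teacher's matching quality at a fraction of the inference cost.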
Accurately matching points between point clouds is essential for tasks such as 3D registration and recognition. This paper introduces a mutual voting approach for ranking 3D correspondences. The key to obtaining reliable correspondence scores in a mutual voting scheme is to refine both the voters and the candidates. First, a graph is built over the initial correspondence set under the pairwise compatibility constraint. Second, nodal clustering coefficients are used to preemptively remove a portion of the outliers and speed up the subsequent voting. Third, we model nodes in the graph as candidates and edges as voters, and perform mutual voting on the graph to score the correspondences. Finally, correspondences are ranked by their vote totals, and those with the highest scores are identified as inliers.
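The pipeline above can be sketched end to end: build the compatibility graph, prune nodes with low clustering coefficients, then let nodes (candidates) and edges (voters) reinforce each other over a few rounds. The threshold and the update rule below are illustrative assumptions, not the paper's exact scheme.

```python
# A minimal NumPy sketch of mutual voting for correspondence ranking:
# compatibility graph -> clustering-coefficient pruning -> iterative
# node/edge reinforcement. Thresholds and updates are illustrative.
import numpy as np

def mutual_voting_scores(compat: np.ndarray, cc_thresh: float = 0.3,
                         iters: int = 3) -> np.ndarray:
    """compat: (n, n) symmetric 0/1 pairwise-compatibility adjacency."""
    n = compat.shape[0]
    deg = compat.sum(axis=1)
    # Local clustering coefficient: triangles through a node over the
    # number of possible neighbor pairs.
    tri = np.einsum('ij,jk,ki->i', compat, compat, compat) / 2.0
    denom = np.maximum(deg * (deg - 1) / 2.0, 1.0)
    cc = tri / denom
    keep = cc >= cc_thresh                        # prune likely outliers
    A = compat * np.outer(keep, keep)
    node = np.ones(n) * keep                      # candidate scores
    for _ in range(iters):
        edge = np.minimum.outer(node, node) * A   # voters trust both ends
        node = edge.sum(axis=1)                   # candidates collect votes
        if node.max() > 0:
            node = node / node.max()              # normalize each round
    return node

# Usage: a clique of 4 inliers plus 2 weakly connected outliers.
A = np.zeros((6, 6), int)
A[:4, :4] = 1 - np.eye(4, dtype=int)
A[4, 0] = A[0, 4] = 1
scores = mutual_voting_scores(A)
print(np.argsort(-scores))                        # inliers rank first
```

Mutually compatible inliers form a dense subgraph, so their scores reinforce each other round after round, while outliers, which lack triangles and consistent neighbors, are either pruned early or starve of votes.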