Abstract: Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we ...
Abstract: In previous work, powerful convolutional neural network (CNN) backbones have typically been used in one- or two-stage detectors to facilitate multi-category object classification. Unfortunately ...