• Robin Ssenyonga
  • 06.11.2017
  • 23.05.2018

In recent research on acoustic scene analysis for distant-talking voice communication interfaces, the classification of sounds has gained significant importance. In this thesis, a system shall be developed that classifies a set of typical sounds in domestic environments, such as music, speech, vacuum cleaner noise, keyboard typing, breaking glass, and clattering tableware, and estimates the position of the sound source.
Based on given data sets, including Google’s AudioSet, a multiclass classifier using Mel-filterbank features and a Convolutional Neural Network (CNN) architecture should be designed and evaluated. With AudioSet used to train the original classifier, transfer learning algorithms shall be implemented to adapt the classifier to a living-room dataset. Beamforming should then be used to further improve the classification results: a localization algorithm will be implemented that provides the beamformer with an estimate of the source position. Data-independent beamformers will then be steered toward the estimated source position to extract a source signal that is enhanced relative to a single-microphone recording.
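As a rough illustration of the localization-then-beamforming step described above, the following Python sketch estimates inter-microphone delays and feeds them to a delay-and-sum beamformer, one simple example of a data-independent beamformer. It is not the thesis implementation: it assumes integer-sample delays, uses a plain time-domain cross-correlation in place of a more robust localization method, and runs on a simulated three-microphone recording. All function names, the noise level, and the delays are hypothetical.

```python
import random


def delay(signal, d):
    """Delay a signal by d >= 0 samples, zero-padding the start."""
    return [0.0] * d + signal[:len(signal) - d]


def estimate_delay(ref, sig, max_lag):
    """Return the integer lag at which sig best matches ref.

    A positive lag means sig is a delayed copy of ref. This is a plain
    cross-correlation peak search (an assumption made for this sketch).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(sig):
                score += ref[n] * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average the aligned channels."""
    length = len(channels[0])
    out = [0.0] * length
    for ch, d in zip(channels, delays):
        for n in range(length):
            m = n + d
            if 0 <= m < length:
                out[n] += ch[m]
    return [v / len(channels) for v in out]


def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Simulated scene: one noise-like source, three microphones (hypothetical).
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(400)]
true_delays = [0, 3, 5]  # propagation delays in samples, assumed
channels = [
    [s + rng.gauss(0.0, 0.5) for s in delay(source, d)]  # independent sensor noise
    for d in true_delays
]

# "Localization": delays of every channel relative to microphone 0.
estimated = [estimate_delay(channels[0], ch, max_lag=8) for ch in channels]

# Beamforming: steer toward the source using the estimated delays.
enhanced = delay_and_sum(channels, estimated)

print("estimated delays:", estimated)
print("single-microphone MSE:", mse(channels[0], source))
print("beamformed MSE:", mse(enhanced, source))
```

Because the sensor noise is independent across microphones while the aligned source components add coherently, the averaged output should be closer to the source than any single channel. A realistic system would need fractional delays, a known array geometry to map delays to a source position, and handling of reverberation, none of which this sketch covers.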
The thesis requires a thorough literature review in the areas of sound classification and, to a lesser extent, acoustic source localization and data-independent beamforming. Proper handling of the data and thoughtful design of the classifier and the classification experiments are essential for the conclusions to be drawn. Adequate parametrization of the localization and beamforming algorithms will determine the expected benefit of multichannel recordings. The documentation of the experiments and the accompanying software is expected to meet the requirements for reproducible research.