Detecting incongruity in the expression of emotions in short videos using a multimodal approach
People face uncertainty every day; it has become an integral part of their lives. Uncertainty creates risks for companies of all kinds: in the financial sector in particular, human errors can lead to losses. To reduce this uncertainty, people turn to experts with specialized knowledge. It has been established that an expert who resorts to incongruent manipulation techniques is unreliable. In this article we propose a method for estimating congruence. We also put forward and test the hypothesis that a person delivering a prepared speech and a person speaking spontaneously exhibit different levels of congruence. Our solution determines congruence by evaluating the similarity of the emotional states conveyed by the verbal and nonverbal channels. Convolutional neural networks (CNNs) assess the speaker's emotional state from video and audio, speech-to-text extracts the transcript of the speech, and a pre-trained BERT model then analyzes its emotional coloring. Tests have shown that the proposed approach can not only detect a person's incongruence, but also indicate when the incongruence is of unnatural origin (distinguishing a merely incongruent person from a deepfake).
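To make the congruence-estimation step concrete, the following is a minimal sketch of comparing the two channels' outputs. The emotion label set, the example probability vectors, and the use of cosine similarity as the similarity measure are illustrative assumptions, not the paper's exact configuration; the per-channel recognizers (the video/audio CNN and the BERT text classifier) are assumed to produce probability distributions over a shared set of emotion classes.

```python
import numpy as np

# Assumed shared label set for both channels (illustrative choice).
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def congruence(p_nonverbal: np.ndarray, p_verbal: np.ndarray) -> float:
    """Cosine similarity between the emotion distributions of the
    nonverbal channel (video/audio CNN) and the verbal channel
    (speech-to-text transcript classified by BERT). Values near 1.0
    indicate congruent expression; low values flag incongruence."""
    p_nonverbal = p_nonverbal / p_nonverbal.sum()  # ensure valid distributions
    p_verbal = p_verbal / p_verbal.sum()
    return float(
        p_nonverbal @ p_verbal
        / (np.linalg.norm(p_nonverbal) * np.linalg.norm(p_verbal))
    )

# Hypothetical per-channel outputs over EMOTIONS for one video segment:
p_face_audio = np.array([0.05, 0.02, 0.03, 0.70, 0.10, 0.05, 0.05])  # CNN on frames/audio
p_text       = np.array([0.40, 0.05, 0.10, 0.05, 0.20, 0.15, 0.05])  # BERT on transcript

print(f"congruence = {congruence(p_face_audio, p_text):.2f}")  # low score -> incongruent
```

In this example the nonverbal channel indicates happiness while the transcript reads as angry, so the similarity score is low and the segment would be flagged as incongruent; a threshold on this score (or a classifier over per-segment scores) is one plausible way to turn the similarity into a congruent/incongruent decision.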