A Multi-Stream Sequence Learning Framework for Human Interaction Recognition
- Authors
- Haroon, Umair; Ullah, Amin; Hussain, Tanveer; Ullah, Waseem; Sajjad, Muhammad; Muhammad, Khan; Lee, Mi Young; Baik, Sung Wook
- Issue Date
- Jun-2022
- Publisher
- IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Keywords
- Feature extraction; Skeleton; Pose estimation; Computer architecture; Computational modeling; Logic gates; Optical imaging; Bidirectional LSTM (BD-LSTM); three-dimensional (3-D) convolutional neural network (CNN); 1-D CNN; human interaction recognition (HIR); human pose estimation; skeleton joint key points; multistream network
- Citation
- IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, v.52, no.3, pp.435 - 444
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS
- Volume
- 52
- Number
- 3
- Start Page
- 435
- End Page
- 444
- URI
- https://scholarx.skku.edu/handle/2021.sw.skku/96232
- DOI
- 10.1109/THMS.2021.3138708
- ISSN
- 2168-2291
- Abstract
- Human interaction recognition (HIR) is challenging because multiple people appear in a single frame and their movements produce mutual interactions. Mainstream approaches are based on three-dimensional (3-D) convolutional neural networks (CNNs) that process only visual frames, even though human joint data play a vital role in accurate interaction recognition. Therefore, this article proposes a multistream network for HIR that learns from both skeleton key points and spatiotemporal visual representations. The first stream localizes the joints of the human body using a pose estimation model and passes them to a 1-D CNN and a bidirectional long short-term memory (BD-LSTM) network to extract features of the dynamic movements of each human skeleton. The second stream feeds the sequence of visual frames to a 3-D CNN to extract discriminative spatiotemporal features. Finally, the outputs of both streams are integrated via fully connected layers that classify the ongoing interaction between humans. To validate the proposed network, we conducted a comprehensive set of experiments on two benchmark datasets, UT-Interaction and TV Human Interaction, and observed accuracy improvements of 1.15% and 10.0%, respectively.
- Appears in Collections
- Computing and Informatics > Convergence > 1. Journal Articles
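
The fusion step described in the abstract, where the outputs of the skeleton stream and the visual stream are integrated via fully connected layers, might be sketched roughly as follows. This is only an illustration of late fusion by concatenation, not the authors' implementation: the feature sizes, the class count, the random weights, and the helper names (`dense`, `softmax`, `fuse_and_classify`) are all hypothetical, and the two feature vectors stand in for what the 1-D CNN + BD-LSTM stream and the 3-D CNN stream would produce.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse_and_classify(skeleton_feat, visual_feat, weights, bias):
    """Late fusion: concatenate both stream outputs, then classify."""
    fused = skeleton_feat + visual_feat  # list concatenation
    return softmax(dense(fused, weights, bias))


# Hypothetical sizes, chosen only for illustration (not from the paper).
SKEL_DIM, VIS_DIM, NUM_CLASSES = 8, 16, 6

rng = random.Random(0)
# Placeholder features standing in for the two streams' outputs.
skeleton_feat = [rng.uniform(-1, 1) for _ in range(SKEL_DIM)]
visual_feat = [rng.uniform(-1, 1) for _ in range(VIS_DIM)]
# Untrained, randomly initialized classifier weights.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(SKEL_DIM + VIS_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(skeleton_feat, visual_feat, weights, bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because the weights here are random, `predicted` carries no meaning; the sketch only shows how two independently extracted feature vectors can be joined before a shared classifier.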