
Cited 16 times in Web of Science; cited 15 times in Scopus

A Multi-Stream Sequence Learning Framework for Human Interaction Recognition

Authors
Haroon, Umair; Ullah, Amin; Hussain, Tanveer; Ullah, Waseem; Sajjad, Muhammad; Muhammad, Khan; Lee, Mi Young; Baik, Sung Wook
Issue Date
Jun-2022
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Feature extraction; Skeleton; Pose estimation; Computer architecture; Computational modeling; Logic gates; Optical imaging; Bidirectional LSTM (BD-LSTM); three-dimensional (3-D) convolutional neural network (CNN); 1-D CNN; human interaction recognition (HIR); human pose estimation; skeleton joint key points; multistream network
Citation
IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, v.52, no.3, pp.435 - 444
Indexed
SCIE
SCOPUS
Journal Title
IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS
Volume
52
Number
3
Start Page
435
End Page
444
URI
https://scholarx.skku.edu/handle/2021.sw.skku/96232
DOI
10.1109/THMS.2021.3138708
ISSN
2168-2291
Abstract
Human interaction recognition (HIR) is challenging because multiple people appear in a single frame and their movements give rise to mutual interactions. Mainstream literature is based on three-dimensional (3-D) convolutional neural networks (CNNs) that process only visual frames, even though human joint data play a vital role in accurate interaction recognition. Therefore, this article proposes a multistream network for HIR that learns from both skeleton key points and spatiotemporal visual representations. The first stream localises the joints of the human body using a pose estimation model and passes them to a 1-D CNN and a bidirectional long short-term memory (BD-LSTM) network to extract features of the dynamic movements of each human skeleton. The second stream feeds the series of visual frames to a 3-D CNN to extract discriminative spatiotemporal features. Finally, the outputs of both streams are integrated via fully connected layers that classify the ongoing interactions between humans. To validate the performance of the proposed network, we conducted a comprehensive set of experiments on two benchmark datasets, UT-Interaction and TV Human Interaction, and found 1.15% and 10.0% improvements in accuracy.
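
The two-stream layout described in the abstract can be illustrated by tracing feature shapes through each stage. The sketch below is not the authors' implementation: every hyperparameter in `Config` (clip length, joint count, filter counts, hidden size, class count) and the use of global average pooling at the end of the visual stream are assumptions chosen only to make the arithmetic concrete. The names `Config`, `conv_out`, and `trace_shapes` are hypothetical.

```python
from dataclasses import dataclass


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass(frozen=True)
class Config:
    # All values are illustrative assumptions, not taken from the paper.
    frames: int = 16          # frames per clip
    persons: int = 2          # interacting people per frame
    joints: int = 18          # key points per skeleton
    coords: int = 2           # (x, y) per key point
    height: int = 112         # frame height fed to the 3-D CNN
    width: int = 112          # frame width fed to the 3-D CNN
    conv1d_filters: int = 64
    conv1d_kernel: int = 3
    lstm_hidden: int = 128    # hidden units per LSTM direction
    conv3d_filters: int = 32
    conv3d_kernel: int = 3
    pool: int = 2             # pooling factor after the 3-D convolution
    num_classes: int = 6      # illustrative number of interaction classes


def trace_shapes(cfg: Config = Config()) -> dict:
    """Return the feature shape after each stage of the two streams."""
    shapes = {}

    # Stream 1: pose key points -> 1-D CNN over time -> bidirectional LSTM.
    pose_channels = cfg.persons * cfg.joints * cfg.coords
    shapes["pose_input"] = (cfg.frames, pose_channels)
    steps = conv_out(cfg.frames, cfg.conv1d_kernel)
    shapes["conv1d"] = (steps, cfg.conv1d_filters)
    # Final forward and backward hidden states are concatenated.
    shapes["skeleton_features"] = (2 * cfg.lstm_hidden,)

    # Stream 2: RGB clip -> 3-D convolution -> pooling -> global average pool.
    shapes["clip_input"] = (3, cfg.frames, cfg.height, cfg.width)
    t = conv_out(cfg.frames, cfg.conv3d_kernel) // cfg.pool
    h = conv_out(cfg.height, cfg.conv3d_kernel) // cfg.pool
    w = conv_out(cfg.width, cfg.conv3d_kernel) // cfg.pool
    shapes["conv3d_pool"] = (cfg.conv3d_filters, t, h, w)
    shapes["visual_features"] = (cfg.conv3d_filters,)

    # Fusion: concatenate both feature vectors, then fully connected layers.
    fused = shapes["skeleton_features"][0] + shapes["visual_features"][0]
    shapes["fused"] = (fused,)
    shapes["logits"] = (cfg.num_classes,)
    return shapes


if __name__ == "__main__":
    for stage, shape in trace_shapes().items():
        print(f"{stage:18s} {shape}")
```

Under these assumed settings, the fused vector is simply the concatenation of the skeleton and visual feature vectors, which is what lets the fully connected layers weigh both sources jointly.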
Appears in Collections
Computing and Informatics > Convergence > 1. Journal Articles


