
Computer Optics, 2023, Volume 47, Issue 4, Pages 637–649 (Mi co1165)

Mutual modality learning for video action classification
S. A. Komkov, M. D. Dzabraev, A. A. Petiushko

References

1. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T, “HMDB: a large video database for human motion recognition”, 2011 Int Conf on Computer Vision, 2011, 2556–2563
2. UCF101 – Action recognition data set, 2021 https://www.crcv.ucf.edu/research/data-sets/ucf101/
3. Kinetics, 2021 https://www.deepmind.com/open-source/kinetics
4. Goyal R, Kahou SE, Michalski V, et al., “The “something something” video database for learning and evaluating visual common sense”, 2017 IEEE Int Conf on Computer Vision (ICCV), 2017, 5842–5850
5. Miech A, Zhukov D, Alayrac J-B, Tapaswi M, Laptev I, Sivic J, “HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips”, 2019 IEEE/CVF Int Conf on Computer Vision (ICCV), 2019, 2630–2640
6. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M, “Learning spatiotemporal features with 3D convolutional networks”, 2015 IEEE Int Conf on Computer Vision (ICCV), 2015, 4489–4497
7. Feichtenhofer C, Fan H, Malik J, He K, “SlowFast networks for video recognition”, 2019 IEEE/CVF Int Conf on Computer Vision (ICCV), 2019, 6202–6211
8. Carreira J, Zisserman A, “Quo vadis, action recognition? A new model and the kinetics dataset”, 2017 IEEE Conf on Computer Vision and Pattern Recognition (CVPR), 2017, 6299–6308
9. Lin J, Gan C, Han S, “TSM: Temporal shift module for efficient video understanding”, 2019 IEEE/CVF Int Conf on Computer Vision (ICCV), 2019, 7083–7093
10. Simonyan K, Zisserman A, “Two-stream convolutional networks for action recognition in videos”, NIPS'14: Proc 27th Int Conf on Neural Information Processing Systems, 1 (2014), 568–576
11. Fan L, Huang W, Gan C, Ermon S, Gong B, Huang J, “End-to-end learning of motion representation for video understanding”, 2018 IEEE/CVF Conf on Computer Vision and Pattern Recognition, 2018, 6016–6025
12. Crasto N, Weinzaepfel P, Alahari K, Schmid C, “MARS: Motion-augmented RGB stream for action recognition”, 2019 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2019, 7882–7891
13. Piergiovanni AJ, Ryoo MS, “Representation flow for action recognition”, 2019 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2019, 9945–9953
14. Stroud JC, Ross DA, Sun C, Deng J, Sukthankar R, “D3D: Distilled 3D networks for video action recognition”, 2020 IEEE Winter Conf on Applications of Computer Vision (WACV), 2020, 625–634
15. Zhang Y, Xiang T, Hospedales TM, Lu H, “Deep mutual learning”, 2018 IEEE/CVF Conf on Computer Vision and Pattern Recognition, 2018, 4320–4328
16. Wang X, Girshick R, Gupta A, He K, “Non-local neural networks”, 2018 IEEE/CVF Conf on Computer Vision and Pattern Recognition, 2018, 7794–7803
17. Xie S, Sun C, Huang J, Tu Z, Murphy K, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification”, Computer Vision – ECCV 2018 (15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV), eds. Ferrari V, Hebert M, Sminchisescu C, Weiss Y, Springer Nature Switzerland AG, Cham, Switzerland, 2018, 305–321
18. Zolfaghari M, Singh K, Brox T, “ECO: Efficient convolutional network for online video understanding”, Computer Vision – ECCV 2018 (15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II), eds. Ferrari V, Hebert M, Sminchisescu C, Weiss Y, Springer Nature Switzerland AG, Cham, Switzerland, 2018, 695–712
19. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M, “A closer look at spatiotemporal convolutions for action recognition”, 2018 IEEE/CVF Conf on Computer Vision and Pattern Recognition, 2018, 6450–6459
20. Yang C, Xu Y, Shi J, Dai B, Zhou B, “Temporal pyramid network for action recognition”, 2020 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2020, 591–600
21. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L, “Temporal segment networks: Towards good practices for deep action recognition”, Computer vision – ECCV 2016 (14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII), eds. Leibe B, Matas J, Sebe N, Welling M, Springer Nature Switzerland AG, Cham, Switzerland, 2016, 20–36
22. He K, Zhang X, Ren S, Sun J, “Deep residual learning for image recognition”, 2016 IEEE Conf on Computer Vision and Pattern Recognition (CVPR), 2016, 770–778
23. Shao H, Qian S, Liu Y, “Temporal interlacing network”, Proc AAAI Conf on Artificial Intelligence, 34:7 (2020), 11966–11973
24. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, “Attention is all you need”, 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017, 1–11
25. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N, “An image is worth 16x16 words: Transformers for image recognition at scale”, International Conference on Learning Representations (ICLR 2021), 2021, 1–21
26. Fan H, Xiong B, Mangalam K, Li Y, Yan Z, Malik J, Feichtenhofer C, “Multiscale vision transformers”, 2021 IEEE/CVF Int Conf on Computer Vision (ICCV), 2021, 6804–6815
27. Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, Hu H, “Video swin transformer”, 2022 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2022, 3202–3211
28. Jiang B, Wang M, Gan W, Wu W, Yan J, “STM: Spatiotemporal and motion encoding for action recognition”, 2019 IEEE/CVF Int Conf on Computer Vision (ICCV), 2019, 2000–2009
29. Hinton G, Vinyals O, Dean J, “Distilling the knowledge in a neural network”, 2015, arXiv: 1503.02531
30. Furlanello T, Lipton Z, Tschannen M, Itti L, Anandkumar A, “Born-again neural networks”, Proc 35th Int Conf on Machine Learning, 2018, 1607–1616
31. Zhang B, Wang L, Wang Z, Qiao Y, Wang H, “Real-time action recognition with enhanced motion vector CNNs”, 2016 IEEE Conf on Computer Vision and Pattern Recognition (CVPR), 2016, 2718–2726
32. Wang W, Tran D, Feiszli M, “What makes training multi-modal classification networks hard?”, 2020 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2020, 12695–12705
33. Deng J, Dong W, Socher R, Li L-J, Li K, Li F-F, “ImageNet: A large-scale hierarchical image database”, 2009 IEEE Conf on Computer Vision and Pattern Recognition, 2009, 248–255
34. Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A, “Hollywood in homes: Crowdsourcing data collection for activity understanding”, Computer Vision – ECCV 2016 (14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I), eds. Leibe B, Matas J, Sebe N, Welling M, Springer Nature Switzerland AG, Cham, Switzerland, 2016, 510–526
35. Zach C, Pock T, Bischof H, “A duality based approach for realtime TV-L1 optical flow”, Pattern recognition (29th DAGM Symposium, Heidelberg, Germany, September 12-14, 2007, Proceedings), eds. Hamprecht FA, Schnörr C, Jähne B, Springer-Verlag, Berlin, Heidelberg, 2007, 214–223
36. Gehrig D, Gehrig M, Hidalgo-Carrió J, Scaramuzza D, “Video to events: Recycling video datasets for event cameras”, 2020 IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR), 2020, 3586–3595
37. Fan Q, Chen C-FR, Kuehne H, Pistoia M, Cox D, “More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation”, Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, 2264–2273
38. Perez-Rua J-M, Martinez B, Zhu X, Toisoul A, Escorcia V, Xiang T, “Knowing what, where and when to look: Efficient video action modeling with attention”, 2020, arXiv: 2004.01278


© МИАН, 2025