Abstract:
Social networks are no longer a mere medium for entertainment;
they have become an information distribution channel that is
replacing classical mass media. In this article we describe a
scalable trend detection system implemented for the social network
OK. The actors of a social network (users and communities)
collectively form a broad agenda.
The content of social networks has specific features:
user-generated content (UGC) is difficult to process;
actors generate multilingual text, which classical media
analysis could handle only by hiring a large number of highly paid
professionals;
modern social networks form a highly connected society
that reacts to events quickly, so the system must work in real
time;
social networks are used by spammers as a platform
for promotion and obtrusive advertising, so the system
must be able to filter spam content.
Applying standard methods of media analysis
to such content seems infeasible, which creates a natural demand
for software that detects and analyzes textual trends. The
academic literature offers two main approaches to trend detection:
topic modeling (followed by analysis of topic evolution) and
distributive models based on frequency-like properties of
individual terms. We reviewed scientific papers on both
approaches, taking into account the specific features of social
networks, and as a result chose distributive models as the basis
for the system. OK is one of the largest social networks in Russia
and the CIS countries. Its actors generate over 100M characters of
text every day, so even basic processing is a serious technical
problem, which forces us to use Big Data approaches throughout
development. We introduce a lambda architecture based on three
main components:
daily-batch processing component, based on Apache
Spark;
streaming processing component, based on Apache
Samza;
mini-batch processing component, based on Spark
Streaming.
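As a rough illustration of the lambda-architecture idea behind these components, the sketch below merges a periodically recomputed batch view with an incrementally updated real-time view at query time. It is a minimal pure-Python stand-in, not the system's implementation: it uses no Spark or Samza APIs, it omits the mini-batch layer, and the names `batch_view`, `SpeedView` and `query` are hypothetical.

```python
from collections import Counter


def batch_view(archived_posts):
    """Batch layer (assumed): recompute term counts over the whole archive."""
    counts = Counter()
    for post in archived_posts:
        counts.update(post.lower().split())
    return counts


class SpeedView:
    """Speed layer (assumed): count terms seen since the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_post(self, post):
        # Update the real-time view one post at a time.
        self.counts.update(post.lower().split())

    def reset(self):
        # Called once a fresh batch view covers the same posts.
        self.counts.clear()


def query(term, batch, speed):
    """Serving layer (assumed): merge precomputed and real-time counts."""
    return batch[term] + speed.counts[term]


if __name__ == "__main__":
    batch = batch_view(["Match tonight", "match result"])
    speed = SpeedView()
    speed.on_post("Match highlights")
    print(query("match", batch, speed))
```

The design point is that the batch view can be slow but complete, while the speed view only has to cover the short interval since the last batch run.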
The article describes in detail the architecture and
technical features of each component. In conclusion, we present
the results of operating the system and discuss directions for
further research and development. Refs 13. Figs 7. Table 1.
Keywords: natural language processing, trend detection, big data.
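To make the notion of a distributive model based on frequency-like properties of terms more concrete, here is a minimal sketch that flags terms whose relative frequency in a current window of posts is much higher than in a baseline window. The scoring rule (a smoothed frequency ratio), the whitespace tokenization, and the names `term_frequencies`, `trending_terms`, `min_ratio` and `smoothing` are assumptions made for illustration; the abstract does not specify the model actually used.

```python
from collections import Counter


def term_frequencies(posts):
    """Relative frequency of each lowercased, whitespace-separated term."""
    counts = Counter()
    for post in posts:
        counts.update(post.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def trending_terms(baseline_posts, current_posts, min_ratio=2.0, smoothing=1e-3):
    """Return (term, ratio) pairs whose frequency grew by at least min_ratio.

    The smoothing constant (an assumed value) keeps terms that are absent
    from the baseline from causing a division by zero.
    """
    baseline = term_frequencies(baseline_posts)
    current = term_frequencies(current_posts)
    scored = []
    for term, freq in current.items():
        ratio = (freq + smoothing) / (baseline.get(term, 0.0) + smoothing)
        if ratio >= min_ratio:
            scored.append((term, ratio))
    # Strongest bursts first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    baseline = ["good morning", "good evening"]
    current = ["eclipse today", "eclipse photos", "good eclipse"]
    for term, ratio in trending_terms(baseline, current):
        print(term, round(ratio, 1))
```

A real deployment would also need language-aware tokenization and spam filtering before counting, as the abstract notes.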