Research on Topics Trends Based on Weighted K-means

Hongzhi Tao, Jianfeng Li and Tao Luo
Beijing Key Laboratory of Network System Architecture and Convergence
Beijing University of Posts and Telecommunications
Beijing 100876, China
{taohongzhi & lijf & tluo}@bupt.edu.cn

Cong Wang
Department of Software
Beijing University of Posts and Telecommunications
Beijing 100876, China
wangc@bupt.edu.cn

The method in [7] finally classified again within each cluster to obtain accurate clusters. Fu [8] used a graph partitioning method based on the minimum-maximum principle to discover user-related page sets and calculated the authority value of pages with the HITS algorithm, which can effectively find related themes. Topic clustering algorithms are simple and effective, but they ignore semantic information, which can cause large deviations in a specific field. On the other hand, keyword combinations can fully express the main information of a text and are widely used in the field of text classification [9] [10] [11]. Word2Vec [12] [13] [14] [15] is an efficient deep learning language model; it produces vectors that carry strong semantic information, which makes similarity measurement convenient. Word2Vec has been widely used in many Natural Language Processing tasks [16]. Considering the respective advantages of keyword combinations and Word2Vec in topic expression and semantic expression, this paper presents a weighted K-means topic discovery algorithm for tobacco control. In the algorithm, a correlation weight is set up to measure the correlation between keywords and the topic.

Abstract—To capture the trends of concerned topics in a specific field, people often use topic discovery methods. Traditional topic discovery algorithms are generally divided into two types: text clustering algorithms and text topic models. The former pays little attention to semantic information, and the latter often ignores the relativity of the topic.
These problems affect topic discovery and topic trends. Therefore, combining keyword combinations with the Word2Vec model to strengthen the expression of semantic information in topic clustering, this article proposes a weighted K-means algorithm for topic discovery. The results show that our weighted K-means algorithm achieves a good clustering effect, improves semantic expression and topic relativity, and can also be used to filter noisy data. In particular, the algorithm performs better in a specific topic field.

Keywords-topic trend; topic discovery; topic clustering; Word2Vec; keywords combination; weighted K-means

I. INTRODUCTION

In 2016, Google's AlphaGo defeated Go master Lee Sedol, triggering a boom in artificial intelligence; machine learning and Natural Language Processing are prospering again. The topic trend is a hot research subject recently. It is generally found through topic discovery methods [1]. Traditional topic discovery algorithms are usually divided into two types: text clustering algorithms and text topic models.

The topic model is a method of modeling hidden topics; it overcomes the limitations of traditional similarity calculation and can solve ambiguity and polysemy problems. Among such work, Wu [2] used the LSA latent semantic model, whose basic idea is to map a sparse high-dimensional space to a low-dimensional semantic space and form a word-document matrix; LSA performs singular value decomposition on the matrix, computes the similarity matrix and the similarity between documents, and clusters the documents. Ruan [3] used the LDA probabilistic topic model and the How-Net knowledge database to study network comments.

II. TOPICS TRENDS MODEL

The model in this paper consists of data obtaining, data preprocessing, keyword extraction, Word2Vec (training, obtaining vectors and calculating similarity), weighted K-means (keyword combination, cluster analysis, topic recognition), and drawing topic trends, as shown in Figure 1.
Various clustering algorithms [4] [5] can be used for topic discovery [6]; they mainly rely on mapping unstructured text data and on distance relations to realize the topic partition. Lin [7] used an SVM classifier to process labeled training data and then used the K-means++ clustering algorithm to obtain clustering results.

978-1-5090--/1/$31.00 ©201 IEEE

Figure 1. The Structure of System Model

A. Data resource and preprocessing

In order to study the relationship between Chinese public policy on tobacco control and the topics of the WHO "World No Tobacco Day" [21], this paper uses Chinese media reports about tobacco control. The news data related to tobacco control come from the Internet, covering the interval from 2011 to 2015; every month of every year has a number of news reports, and each report is stored in one file. In total we have 27663 files (127 MB).

Because the data are acquired from the network, their format is messy. Therefore, data preprocessing is needed; its purpose is to form a unified encoding, font format and font size, and to remove special characters and recognition errors.

III. WEIGHTED K-MEANS ALGORITHM

Clustering is unsupervised learning. At present, clustering algorithms can be divided into hierarchical clustering, partitioning clustering, and density- and grid-based clustering [4] [5]. This paper builds on the traditional K-means algorithm [20], modifying the input data and setting relevant weights to improve the clustering effect. We utilize the advantages of keyword combinations and Word2Vec to improve topic relevance and semantic expression. The resulting weighted K-means algorithm, described below, operates on the series of keywords extracted from each document D.

B. Keywords Extraction

As a basic operation of Natural Language Processing, keyword extraction has many effective algorithms [17].
They are divided into three categories: 1) methods based on statistics, such as TF and TF-IDF; 2) methods based on word co-occurrence, such as Keyword; 3) methods based on word networks, such as SWN and BC. In this paper the data have a graph structure, so we use TextRank [18], which belongs to the third category and is based on Google's PageRank [19]; it has good performance and acceptable complexity. The PageRank algorithm was initially used to calculate the importance of web pages: all pages are viewed as a directed graph in which each node is a web page. TextRank reuses the idea of PageRank: each word in the word set is a node, and undirected, unweighted edges are added between words (in TextRank, between words that co-occur within a window). We can then calculate the importance of each word node.

C. Word2Vec

Deep learning is a hot research subject in machine learning, and Word2Vec [12] [13] [14] [15] is a simple, efficient deep learning language model. It converts the word space into a vector space, providing the capacity to calculate with words. Recently there has been a lot of pioneering work in natural language processing based on it, proving that the algorithm has a significant effect.

Each document D has a series of keywords:

w_1, w_2, ..., w_{i-1}, w_i    (1)

Each word has a vector:

V_1, V_2, ..., V_{i-1}, V_i    (2)

For news data on a specific topic, in order to determine the correlation between each keyword and the topic, we must first determine a reference word. This article uses tobacco control data, so the reference word is 'tobacco control'; we obtain the vector of 'tobacco control' from the Word2Vec model and denote it V_T. We use the Euclidean distance to represent the relevance between a keyword and the topic, and define a variable called the relevant weight:

C_i = 1 - (||V_i - V_T||^2 / n) / (||V_T||^2 / n) = 1 - ||V_i - V_T||^2 / ||V_T||^2    (3)

where n is the dimension of the vectors. When V_i = V_T, C_i = 1 and the relevant weight is largest; when V_i = (0, ..., 0), C_i = 0 and the weight is smallest.
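A minimal sketch of the relevant weight in formulation (3), assuming the form C_i = 1 - ||V_i - V_T||^2 / ||V_T||^2, which satisfies the two boundary conditions stated above; the function name is illustrative, not the authors' code.

```python
def relevant_weight(v_i, v_t):
    """Relevant weight C_i = 1 - ||v_i - v_t||^2 / ||v_t||^2 (assumed form).

    Equals 1 when v_i == v_t (keyword identical to the topic word) and
    0 when v_i is the zero vector, matching the boundary conditions
    given for formulation (3)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(v_i, v_t))
    norm_sq = sum(b * b for b in v_t)
    return 1.0 - dist_sq / norm_sq
```

For example, a keyword whose vector coincides with the 'tobacco control' vector receives weight 1, while an all-zero vector receives weight 0.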
Therefore, the relevant weight is a good way to express the relativity between a keyword and a particular topic. With the relevant weight defined above, we can discuss the high-dimensional representation: for each article D, keyword extraction produces a series of keywords, from which we form V_D, a high-dimensional vector representing the document:

V_D = (C_1 V_1, C_2 V_2, ..., C_{i-1} V_{i-1}, C_i V_i)    (4)

That is, each keyword vector is multiplied by its relevant weight, and all weighted vectors are concatenated to strengthen or weaken each keyword's representation ability: a highly correlated keyword is moved closer to the topic and a weakly correlated keyword further from it. Overall, this operation enhances the capacity to express the topic and benefits topic clustering.

In natural language processing, the fundamental task is to represent words. There are many representation algorithms; the simplest, most intuitive and most commonly used is the One-hot Representation, in which every word becomes a vector whose dimension equals the vocabulary size: the vast majority of elements are zero, and a single element is 1 in the dimension representing the current word. One-hot representation has a serious problem, semantic deletion: every word is isolated and has no semantic connection to other words.

The word representation commonly used in deep learning is not the One-hot Representation but the Distributed Representation, a low-dimensional real vector. Word2Vec contains two training models, CBOW and Skip-gram, and several efficient computing methods, such as hierarchical softmax and negative sampling. This paper adopts the Skip-gram training model with hierarchical softmax.

Based on the document vectors above, we present our weighted K-means algorithm:

• Input: cluster number k, the topic word T, and the vector set.
• Output: k clusters meeting the minimum variance standard.
1) Calculate the relevant weight of every word.
2) Form the high-dimensional vector for each article.
3) Randomly choose k cluster centers from the n samples.
4) For each remaining sample, measure its distance to each cluster center and assign it to the nearest one.
5) Recalculate the center of each cluster.
6) Repeat steps 4-5 until the distance between the new centers and the original centers is equal to or less than a specified threshold; then the algorithm ends.

The distance interval of the traditional K-means algorithm is wide, while that of the weighted K-means algorithm is narrow. The specific statistics are shown in Tables I and II.

TABLE I. THE STATISTICAL DATA OF NUMBERS

                     C1      C2      C3      C4      C5
K-means             5315    5089    5401    6747    5111
Weighted K-means    1074   13575     404    7946    4664

Total samples: 27663

TABLE II. THE STATISTICAL DATA OF DISTANCE

                     C1        C2       C3        C4       C5
K-means            [42,70]  [35,65]  [40,70]  [14,70]  [16,68]
Weighted K-means   [7,31]   [1,12]   [18,31]  [6,18]   [9,21]

(Entries are the distance intervals to the cluster center.)

It can be seen from these statistics that the weighted K-means algorithm changes the distances to the cluster centers: samples with high correlation get smaller distances and samples with low correlation get bigger distances, and the changed distances in turn change the clustering result. In particular, for the distance intervals, the widths in the K-means algorithm were 32, 30, 30, 56, 52, while in the weighted K-means algorithm they were 24, 11, 13, 12, 12. The latter clustering result is better than the former: samples with the same attributes are more concentrated, samples with different attributes are more dispersed, and the boundaries between clusters are more obvious. The cluster sizes in the K-means algorithm were 5315, 5401, 6747, 5089, 5111, while in the weighted K-means algorithm they were 1074, 13575, 404, 7946, 4664: the former distribution is balanced, the latter dispersed. From the viewpoint of topic trends, a balanced distribution is not conducive to trend analysis; the cluster sizes under K-means are indistinctive, while those under weighted K-means are striking. So the latter is more suitable for topic trends.
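The algorithm steps above can be sketched in pure Python. Here relevant_weight follows the assumed form of formulation (3), document_vector implements formulation (4), and weighted_kmeans is a standard Lloyd iteration; all names are ours, and this is a minimal sketch rather than the authors' implementation.

```python
import random

def relevant_weight(v_i, v_t):
    """Step 1: relevant weight C_i = 1 - ||v_i - v_t||^2 / ||v_t||^2."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(v_i, v_t))
    return 1.0 - dist_sq / sum(b * b for b in v_t)

def document_vector(keyword_vectors, topic_vector):
    """Step 2: scale each keyword vector by its relevant weight and
    concatenate the results into one high-dimensional document vector."""
    doc = []
    for v in keyword_vectors:
        c = relevant_weight(v, topic_vector)
        doc.extend(c * x for x in v)
    return doc

def weighted_kmeans(doc_vectors, k, tol=1e-4, max_iter=100, seed=0):
    """Steps 3-6: Lloyd's K-means iteration on the weighted vectors."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(doc_vectors, k)]  # step 3
    for _ in range(max_iter):
        # Step 4: assign every sample to its nearest center.
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(v, centers[j])))
                  for v in doc_vectors]
        # Step 5: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, l in zip(doc_vectors, labels) if l == j]
            if not members:
                new_centers.append(centers[j])  # keep an empty cluster's center
                continue
            new_centers.append([sum(col) / len(members)
                                for col in zip(*members)])
        # Step 6: stop when no center moved more than the threshold.
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1)) ** 0.5
                    for c0, c1 in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers
```

In practice the document vectors would come from the trained Word2Vec model; the sketch only fixes the control flow of the six steps.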
From another point of view, samples around cluster boundaries are hard to recognize; the weighted K-means algorithm introduces more semantic information and topic relevance to assist the cluster division.

In general, the weighted K-means algorithm first gets a series of keywords, then obtains the vector of each keyword from the trained Word2Vec model, calculates the relevant weights with respect to the specific topic, forms a high-dimensional vector from the keyword vectors and relevant weights to express the information of the document, and finally clusters the high-dimensional text vectors with K-means to obtain the clustering result.

IV. EXPERIMENT

This paper designs two experiments to obtain topic trends: one based on the traditional K-means algorithm and one based on the weighted K-means algorithm. The two experiments share the same process (data collection, data preprocessing, keyword extraction, Word2Vec modeling, topic clustering and trend expression), with only one difference: the former simply joins the keyword vectors into a high-dimensional vector, while the latter forms the high-dimensional vector using the relevant weights. Because the experiments are unsupervised and use real data, there is no direct method to evaluate the performance, so we take the WHO World No Tobacco Day topics [21], regarded as a bellwether of social media, as reference information. We then take the K-means experiment as a benchmark and compare it with the model proposed in this paper. In both experiments, 10 keywords are extracted from each article, word vectors have 100 dimensions, and the number of clusters is 5, corresponding to five years of WHO World No Tobacco Day themes [21].

V. RESULTS

First of all, we extract the main keywords of each cluster, which are shown in Table III.

The final clustering results are shown in Figure 2 below.

Figure 2. Topics Trends
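Per-cluster statistics of the kind reported in Tables I and II (sample counts and [min, max] distance intervals to the cluster center) can be collected from any clustering output. This sketch assumes labels and centers in the shapes returned by a standard K-means routine; the function name is ours.

```python
def cluster_stats(doc_vectors, labels, centers):
    """Per-cluster sample counts and [min, max] distance-to-center
    intervals, i.e. the kind of statistics shown in Tables I and II."""
    counts = [0] * len(centers)
    intervals = [[float("inf"), float("-inf")] for _ in centers]
    for v, l in zip(doc_vectors, labels):
        # Euclidean distance from the sample to its own cluster center.
        d = sum((a - b) ** 2 for a, b in zip(v, centers[l])) ** 0.5
        counts[l] += 1
        lo, hi = intervals[l]
        intervals[l] = [min(lo, d), max(hi, d)]
    return counts, intervals
```

The width of each interval (max minus min) is the "distance interval" compared between the two algorithms in the discussion above.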
TABLE III. FREQUENT KEYWORDS IN EVERY CLUSTER (5 KEYWORDS PER CLUSTER)

Cluster  K-means                                                     Weighted K-means
1        tobacco control/public place/penalty/no smoking/citizen     public place/penalty/ban/harm/office
2        campaign/administration/conference/rules/laws               fake/company/sale/illegal/produce
3        advertisement/sponsor/sales promotion/tobacco company/packing   law/rules/ban/news/conference
4        disease/patient/lung cancer/increase/cigarette addiction    campaign/advertisement/packing/sponsor/promote
5        produce/illegal/law case/sin/fake                           check/hospital/disease/cancer/cigarette addiction

In the two pictures above, the horizontal axis has no practical significance and only marks the different clusters; the vertical axis represents the distance from the cluster center. As shown in the figure, the distance interval of the traditional K-means algorithm is wide, while that of the weighted K-means algorithm is narrow.

On the basis of this key information, we can summarize five topics: tobacco control and policy system, tobacco control and illegal trade, tobacco control and health care, tobacco control and social activities, and tobacco control and industry information. The WHO World No Tobacco Day topics [21] are shown in Table IV.

TABLE IV. TOPICS OF WHO NO TOBACCO DAY (ONE TOPIC IN EVERY YEAR)

Year   Topic
2011   the framework convention on tobacco control
2012   the tobacco industry interference
2013   bans on tobacco industry (advertising, promotion and sponsorship)
2014   raising tobacco taxes
2015   banning tobacco illegal trade

By comparing the information in the two tables above, the two experiments identified the same five topics. We then draw the topic trends relying on the number of samples in each cluster.

ACKNOWLEDGMENT

This work is supported by the National Key Research and Development Program of China (2016YFF0201003).
The two pictures in Figure 3 show the results of our two experiments.

(T1_1: social activities; T1_2: policy system; T1_3: industry information; T1_4: health care; T1_5: illegal trade. T2_1: social activities; T2_2: illegal trade; T2_3: policy system; T2_4: industry information; T2_5: health care)

Figure 3. Topics Trends

The pictures show that the K-means algorithm can extract the topic trends, while the weighted K-means algorithm not only figures out the topic trends but also displays recently concerned topics prominently and makes the topics easy to distinguish. In detail, the topic lines of the K-means algorithm are gathered in a narrow range ([10%, 30%]) and important topics are not obvious, while the trends of the weighted K-means algorithm are relatively dispersed ([0%, 60%]) and important topics show obviously. The three important topics of the WHO World No Tobacco Day in the recent five years are significantly displayed (approximately 50% for tobacco control and industry information, about 30% for tobacco control and illegal trade).

In summary, both weighted K-means and K-means can correctly identify the topics; the difference is that the weighted K-means algorithm takes advantage of the keyword combination and the Word2Vec model to enhance the ability to cluster samples around cluster boundaries, obtaining better clustering results and more reasonable trends.

REFERENCES

[1] Xv T T. Review of Weibo topic discovery methods. Inner Mongolia Science Technology and Economy, 2015(19):81-83.
[2] Wu H, Geng H T. Research on BBS topic discovery algorithm based on latent semantic analysis. Computer Knowledge and Technology, 2008, 4(29):185-187.
[3] Ruan G S. Research on topic discovery of Web reviews based on LDA. Journal of Information, 2014(3):161-164.
[4] Sun J G, Liu J, Zhao L Y. Clustering algorithm research. Journal of Software, 2008, 19(1):48-61.
[5] R Xu. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3):645-678.
[6] Guo J Y, Cai Y, Zhen Y X, et al. Topic discovery based on text clustering. Computer Engineering and Design, 2008, 29(6):1426-1428.
[7] Lin X L, Hu K K, Hu Q. Research on micro-blog user relationship mining based on Python. Journal of Intelligence, 2014(6):144-148.
[8] Fu X H, Ma Z F, He M, Feng B Q. A personalized topic extraction and hierarchical discovery algorithm. Journal of Xi'an Jiaotong University, 2005, 39(2):119-122.
[9] Huang C, Zhao P Q, Li J. The quantitative analysis of China's science and technology innovation policy change, based on the analysis of common words. 2015(9).
[10] Zhao H Y. Automatic text classification based on a keyword-combination vector model. Market Modernization, 2008(26):20-21.
[11] Jiang C J, Peng H, Chen J C, et al. Keyword extraction algorithm based on the combination of words and synonym sets. Application Research of Computers, 2010, 27(8):2853-2856.
[12] T Mikolov, I Sutskever, K Chen, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
[13] T Mikolov, K Chen, G Corrado, et al. Efficient estimation of word representations in vector space. Computer Science, 2013.
[14] T Mikolov, W T Yih, G Zweig. Linguistic regularities in continuous space word representations. HLT-NAACL, 2013.
[15] Y Goldberg, O Levy. Word2Vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. Eprint Arxiv, 2014.
[16] Zhou L. The working principle and application of Word2Vec. Sci-Tech Information Development & Economy, 2015(2):145-148.
[17] Stuart Rose, Dave Engel, Nick Cramer. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 2002, 9(3):17-22.
[18] R Mihalcea, P Tarau. TextRank: bringing order into texts. Conference on Empirical Methods in Natural Language Processing, 2004:404-411.
[19] L Page. The PageRank citation ranking: bringing order to the Web. Stanford InfoLab, 1998, 9(1):1-14.
[20] J A Hartigan, M A Wong. A K-means clustering algorithm. Applied Statistics, 1979, 28(1):100-108.
[21] WHO Website: http://www.who.int/tobacco/wntd/previous/en/.
