Research on Topics Trends
Based on Weighted K-means
Hongzhi Tao, Jianfeng Li and Tao Luo
Cong Wang
Beijing Key Laboratory of Network System Architecture
and Convergence
Beijing University of Posts and Telecommunications
Beijing 100876, China
{taohongzhi & lijf & tluo}
Department of Software
Beijing University of Posts and Telecommunications
Beijing 100876, China
finally re-classified each cluster to obtain accurate clusters. Fu [8] used a graph-partitioning method based on the minimum-maximum principle to discover user-related page sets and calculated the authority value of pages with the HITS algorithm, which can effectively find related themes. Topic clustering algorithms are simple and effective, but they ignore semantic information, which can produce large deviations in a specific field. On the other hand, keyword combinations can fully express the main information of a text and are widely used in text classification [9] [10] [11]. Word2Vec [12] [13] [14] [15] is an efficient deep-learning language model that produces vectors carrying strong semantic information, which makes similarity measurement convenient. Word2Vec has been widely used in many Natural Language Processing tasks [16]. Considering the respective advantages of keyword combination and Word2Vec in topic expression and semantic expression, this paper presents a weighted K-means topic discovery algorithm for tobacco control, in which a correlation weight is defined to measure the correlation between each keyword and the topic.
Abstract—To capture the trends of topics of concern in a specific field, people often use topic discovery methods. Traditional topic discovery algorithms are generally divided into two types: text clustering algorithms and text topic models. The former pays little attention to semantic information, and the latter tends to ignore the relativity of the topic; both limitations affect topic discovery and topic trends. Therefore, combining keyword combination with the Word2Vec model to strengthen the expression of semantic information in topic clustering, this article proposes a weighted K-means algorithm for topic discovery. The results show that our weighted K-means algorithm achieves a good clustering effect, improves semantic expression and topic relativity, and can also be used to filter noisy data. In particular, the algorithm performs better in a specific topic field.
Keywords—topic trend; topic discovery; topic clustering; Word2Vec; keywords combination; weighted K-means
In 2016, Google's AlphaGo defeated Go master Lee Sedol, fueling a boom in artificial intelligence; machine learning and Natural Language Processing are prospering again. The topic trend is a hot research subject recently, and it is generally found through topic discovery methods [1]. Traditional topic discovery algorithms are usually divided into two types: text clustering algorithms and text topic models.
The model in this paper consists of data acquisition, data preprocessing, keyword extraction, Word2Vec (training, obtaining vectors, and calculating similarity), weighted K-means (keyword combination, cluster analysis, and topic recognition), and drawing topic trends, as shown in Figure 1.
The topic model is a method of modeling hidden topics; it overcomes the limitations of traditional similarity calculation and can handle ambiguity and polysemy. Among such work, Wu [2] used the LSA latent semantic model: its basic idea is to map a sparse high-dimensional space to a low-dimensional semantic space and form a word-document matrix; LSA then performs singular value decomposition on the matrix, computes the similarity matrix and the similarity between documents, and clusters the documents. Ruan [3] used the LDA probabilistic topic model and the How-Net knowledge database to research network
Various clustering algorithms [4] [5] can be used for topic discovery [6]; they mainly map unstructured text data and use distance relations to realize the topic partition. Lin [7] used an SVM classifier to process labeled training data, then used the K-means++ clustering algorithm to get clustering results,
Figure 1. The Structure of the System Model

978-1-5090--/1/$31.00 ©2017 IEEE
Data resource and preprocessing
In order to study the relationship between Chinese public policy on tobacco control and the themes of the WHO "World No Tobacco Day" [21], this paper uses Chinese media reports about tobacco control. The news data related to tobacco control come from the Internet, covering the interval from 2011 to 2015, with a number of news reports in every month of every year. Each news report is stored in one file; in total we have 27663 files, 127 MB.
Clustering algorithms are unsupervised learning. At present, clustering algorithms can be divided into hierarchical clustering, partitioning clustering, and density- and grid-based clustering [4] [5]. This paper builds on the traditional K-means algorithm [20], modifying the input data and setting relevant weights to improve the clustering effect. We utilize the advantages of keyword combination and Word2Vec to improve topic relevance and semantic expression. This article presents the weighted K-means algorithm, which is described as follows.
Acquired from the network, the data format is messy. Therefore, data preprocessing is needed; its purpose is to produce a unified encoding, font format and font size, free of special characters and recognition errors.
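The cleaning step described above (unified encoding, no special characters or recognition errors) can be sketched roughly as follows; the helper name, the GBK fallback, and the exact character whitelist are illustrative assumptions, not the paper's implementation:

```python
import re
import unicodedata

def preprocess(raw_bytes: bytes) -> str:
    """Normalize one raw news file: unify the encoding and drop
    special characters left over from web extraction (a sketch)."""
    # Unify encoding: try UTF-8 first, fall back to GBK for Chinese news pages.
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_bytes.decode("gbk", errors="ignore")
    # Unify full-width/half-width forms and compatibility characters.
    text = unicodedata.normalize("NFKC", text)
    # Keep CJK characters, letters, digits, whitespace and basic punctuation.
    text = re.sub(r"[^\w\u4e00-\u9fff\s,.;:!?\u3001\u3002]", "", text)
    # Collapse runs of whitespace left over from HTML extraction.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("  烟草  控制<<<###2015 ok!!".encode("utf-8")))
```

In a real pipeline this function would be applied to each of the 27663 files before keyword extraction.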
Each document D has a series of keywords, as follows:
Keywords Extraction
As a basic operation of Natural Language Processing, keyword extraction has many effective algorithms [17]. They are divided into three categories: 1) methods based on statistics, such as TF and TF-IDF; 2) methods based on word co-occurrence, such as Keyword; 3) methods based on word networks, such as SWN and BC. In this paper the data has a graph structure, so we use TextRank [18], which belongs to the third category and is based on Google's PageRank [19]; it offers good performance and ideal complexity. The PageRank algorithm was initially used to calculate the importance of web pages: all pages can be viewed as a directed graph in which each node is a web page. TextRank uses the ideas of PageRank: each word in the word set is a node, and there are undirected, unweighted edges between any two words that co-occur. Then we can calculate the importance of each word node.
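As a rough illustration of the TextRank idea just described, the sketch below builds an undirected, unweighted co-occurrence graph and iterates the PageRank update. The window size, damping factor d = 0.85, and iteration count are common defaults assumed here; the paper does not give these details:

```python
def textrank(words, window=2, d=0.85, iters=30):
    """Score words by TextRank: build an undirected, unweighted
    co-occurrence graph over the word sequence, then iterate the
    PageRank update until the scores stabilise."""
    # Undirected edges between any two words co-occurring within `window`.
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        # PageRank-style update: each word shares its score evenly
        # among its neighbours.
        score = {
            w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                 for u in neighbors[w] if neighbors[u])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

words = ["smoke", "ban", "law", "ban", "tax", "law", "ban"]
print(textrank(words)[:3])  # most important co-occurring words first
```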
w_1, w_2, ..., w_{i-1}, w_i
Each word has a vector, as follows:
V_1, V_2, ..., V_{i-1}, V_i
For a series of news data on a specific topic, in order to determine the correlation between each keyword and the topic, we must first determine a reference word. For example, this article uses tobacco-control data, so the reference word is 'tobacco control'; we obtain the vector of 'tobacco control' from the Word2Vec model and denote it V_T. We use the Euclidean distance to represent the relevance between a keyword and the topic, and define a variable called the relevant weight, C_i.
Deep learning is a hot research subject in machine learning recently, and Word2Vec [12] [13] [14] [15] is a simple, efficient deep-learning language model; at the same time, it converts the word space into a vector space, providing the capacity to calculate. Recently there has been a lot of pioneering work in natural language processing based on it, proving that the algorithm has a significant effect.
C_i = 1 - (||V_i - V_T||^2 / n) / (||V_T||^2 / n) = 1 - ||V_i - V_T||^2 / ||V_T||^2
where n is the dimension of the vectors. When V_i = V_T, C_i = 1 and the relevant weight is largest; when V_i = (0, ..., 0), C_i = 0 and the weight is smallest. The relevant weight is therefore a good way to express the relativity between a keyword and a particular topic.
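The relevant-weight formulation above can be written directly in plain Python; this is a minimal sketch (note that, as defined, C_i can drop below 0 for a keyword much farther from V_T than the origin is; the paper only discusses the [0, 1] boundary cases):

```python
def relevant_weight(vi, vt):
    """Relevant weight C_i = 1 - ||V_i - V_T||^2 / ||V_T||^2.
    The 1/n factors in the paper's formulation cancel out."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(vi, vt))  # ||V_i - V_T||^2
    norm_sq = sum(b ** 2 for b in vt)                    # ||V_T||^2
    return 1.0 - dist_sq / norm_sq

vt = [1.0, 2.0, 2.0]            # hypothetical vector of 'tobacco control'
assert relevant_weight(vt, vt) == 1.0          # V_i = V_T  -> largest weight
assert relevant_weight([0.0] * 3, vt) == 0.0   # zero vector -> smallest weight
print(relevant_weight([1.0, 2.0, 1.0], vt))    # a partially related keyword
```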
With the relevant weight defined above, we can further discuss the high-dimensional representation: each article D produces a series of keywords, which are formed into V_D, a high-dimensional vector representing the document; the specific formulation is the following:
In natural language processing, the fundamental work is the representation algorithm used to represent words. At present there are many kinds of representation algorithms; the simplest, most intuitive and most commonly used is the One-hot Representation, in which every word is encoded as a vector. The dimension of the vector equals the vocabulary size; the vast majority of elements are zero, and only one element is 1, whose position represents the current word. There is a serious problem with the One-hot representation: semantic deletion, which means that every word is isolated and has no semantic connections to other words.
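A tiny example makes the semantic-deletion problem concrete; the vocabulary and words below are invented for illustration:

```python
def one_hot(vocab, word):
    """One-hot Representation: a |vocab|-dimensional vector that is
    all zeros except for a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["tobacco", "control", "smoke", "tax"]
print(one_hot(vocab, "smoke"))  # [0, 0, 1, 0]

# The problem: every pair of distinct one-hot vectors is orthogonal,
# so 'tobacco' is no closer to 'smoke' than to 'tax' -- semantics are lost.
dot = sum(a * b for a, b in zip(one_hot(vocab, "tobacco"), one_hot(vocab, "smoke")))
print(dot)  # 0
```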
V_D = (C_1 V_1, C_2 V_2, ..., C_{i-1} V_{i-1}, C_i V_i)
The formulation above tells us that each keyword vector is multiplied by its relevant weight, and then all the vectors are concatenated to strengthen or weaken each keyword's representation ability: highly correlated keywords move closer to the topic and weakly correlated keywords move farther from it. Overall, this operation enhances the capacity to express the topic, which brings a good effect to topic clustering. So we present our weighted K-means algorithm:
The word representation commonly used in deep learning is not the One-hot Representation but the Distributed Representation, which uses low-dimensional real vectors. Word2Vec contains two kinds of training models, CBOW and Skip-gram, and several efficient computing methods, such as hierarchical softmax and negative sampling. This paper adopts the Skip-gram training model with hierarchical softmax.
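The Skip-gram model predicts the context words surrounding each centre word. The sketch below shows only the (centre, context) training pairs such a model is fed, using an invented toy sentence; it is the data preparation, not the network or the hierarchical softmax itself:

```python
def skipgram_pairs(sentence, window=2):
    """Skip-gram training pairs: each centre word is paired with
    every context word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skipgram_pairs(["no", "smoking", "in", "public"], window=1))
```

In practice the training itself is usually delegated to a library; for example, gensim's `Word2Vec(sentences, vector_size=100, sg=1, hs=1)` trains Skip-gram with hierarchical softmax (parameter names from gensim's current API, not from the paper).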
Input: cluster number k, the topic word T, and the keyword vectors
Output: k clusters meeting the minimum standard
1) calculate the relevant weight of every word;
2) form a high-dimensional vector for each article;
3) randomly choose k cluster centers from the n samples;
4) measure the distance from each remaining sample to each cluster center, and assign the sample to the nearest one;
5) recalculate the center of each cluster;
6) repeat steps 4-5 until the distance between each new center and its original center is less than or equal to a specified threshold; then the algorithm ends.
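The steps above can be sketched in plain Python; the helper names, toy dimensions and convergence details are assumptions for illustration, not the paper's code:

```python
import random

def weighted_kmeans(docs_keyword_vecs, weights, k, threshold=1e-4, seed=0):
    """Sketch of the weighted K-means steps above: form one
    high-dimensional vector per article by concatenating C_i * V_i,
    then run standard K-means on those vectors."""
    # Steps 1-2: weight each keyword vector and concatenate per article.
    samples = [
        [c * x for c, vec in zip(ws, vecs) for x in vec]
        for vecs, ws in zip(docs_keyword_vecs, weights)
    ]
    # Step 3: randomly choose k initial centers.
    rng = random.Random(seed)
    centers = rng.sample(samples, k)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    while True:
        # Step 4: assign every sample to its nearest center.
        labels = [min(range(k), key=lambda c: dist(s, centers[c])) for s in samples]
        # Step 5: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [s for s, l in zip(samples, labels) if l == c] or [centers[c]]
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        # Step 6: stop once no center moves more than the threshold.
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            return labels, centers

# Two toy articles, each with two 2-d keyword vectors and their weights.
docs = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
w = [[1.0, 0.8], [1.0, 0.8]]
labels, _ = weighted_kmeans(docs, w, k=2)
print(labels)  # the two articles land in different clusters
```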
wide, while that of the weighted K-means is narrow. The specific statistical data are shown in Table 1 and Table 2.
Table 1 and Table 2. Distance intervals and cluster sizes for K-means and weighted K-means.
It can be seen from the statistical results that the weighted K-means algorithm changes the distances from the cluster centers: samples with high correlation have smaller distances and samples with low correlation have bigger distances, and these changed distances in turn change the clustering results. In particular, the distance-interval values in the K-means algorithm were 32, 30, 30, 56, 52, while in the weighted K-means algorithm they were 24, 11, 13, 12, 12; the latter clustering result is better than the former: samples with the same attributes are more concentrated, samples with different attributes are more dispersed, and the boundaries between clusters are more obvious. The cluster sizes in the K-means algorithm were 5315, 5401, 6747, 5089, 5111, while those of the weighted K-means algorithm were 1074, 13575, 404, 7946, 4664; the former distribution is balanced, the latter dispersed. From the view of topic trends, a balanced distribution is not conducive to showing a trend: the cluster sizes in K-means are insignificant, while those in weighted K-means are striking, so the latter is more suitable for topic trends. From another view, samples around the cluster boundaries are hard to recognize; the weighted K-means algorithm introduces more semantic information and topic relevance to assist the cluster division.
This paper designs two experiments to obtain topic trends: one based on the traditional K-means algorithm and another based on the weighted K-means algorithm. The two experiments have the same process (data collection, data preprocessing, keyword extraction, Word2Vec model, topic clustering and trend expression) with only one difference: the former simply joins the keyword vectors into a high-dimensional vector, while the latter forms the high-dimensional vector based on the relevant weights.
Because the experiments belong to unsupervised learning and we use real data, there is no ready-made method to evaluate the results, so we use the WHO World No Tobacco Day themes as reference information [21], regarded as a bellwether of social media. Then we take the K-means experiment as a benchmark and compare it with the model proposed in this paper. In both experiments, 10 keywords are extracted from each article, each word vector has 100 dimensions, and the number of clusters is 5, corresponding to the five yearly WHO themes of World No Tobacco Day [21].
Total Samples: 27663
In general, the weighted K-means algorithm first obtains a series of keywords, then obtains the vector of each keyword using the trained Word2Vec model, calculates the relevant weight with respect to the specific topic, and forms a high-dimensional vector from the keyword vectors and relevant weights to express the information of the document; finally, the series of high-dimensional text vectors is clustered by K-means to obtain the clustering result.
First of all, we extract the main keywords of each cluster, which are shown in Table 3.
The final results are shown in Figure 2 below:
Figure 2. Topics Trends.
Table 3. 5 keywords in every cluster (weighted K-means):
- tobacco control/public place/penalty/no smoking/citizen
- promotion/tobacco company/packing
- cancer/increase/cigarette addiction
- produce/illegal/law case/sin/fake
In the two pictures above, the horizontal axis has no practical significance and merely marks the different clusters, while the vertical axis represents the distance from the cluster center. As shown in the figure, the distance interval of the traditional K-means algorithm is
On the basis of the key information, we can sum up five topics: tobacco control and policy system, tobacco control and illegal trade, tobacco control and health care, tobacco control and social activities, and tobacco control and industry information. The corresponding WHO themes [21] are shown in Table 4.
around the cluster boundaries, obtaining better clustering results and more reasonable trends.
- the Framework Convention on Tobacco Control
- tobacco industry interference
- bans on tobacco advertising, promotion and sponsorship
- raising tobacco taxes
- banning tobacco illegal trade
This work is supported by the National Key Research and
Development Program of China (2016YFF0201003).
Table 4. One topic in every year:
- tobacco control and policy system
- tobacco control and industry information
- tobacco control and illegal trade
By comparing the information in the two tables above, the two experiments identified the same five topics. Then we draw the topic trends based on the number of samples in each cluster. The two pictures in Figure 3 show the results of the two experiments.
(T1_1: social activities; T1_2: policy system; T1_3: industry information; T1_4: health care; T1_5: illegal trade. T2_1: social activities; T2_2: illegal trade; T2_3: policy system; T2_4: industry information; T2_5: health care)
Figure 3. Topics Trends
The pictures show that the K-means algorithm can extract the topic trends, while the weighted K-means algorithm not only figures out the topic trends but also displays recently concerning topics prominently and makes the topics easy to distinguish. In detail, the topic lines of the K-means algorithm are gathered in a narrow range ([10%, 30%]), so important topics are not obvious, while the trends of the weighted K-means algorithm are relatively dispersed ([0%, 60%]) and important topics show clearly. The three important topics of the WHO World No Tobacco Day in the recent five years are significantly displayed (approximately 50% of the proportion for tobacco control and industry information, and about 30% for tobacco control and illegal trade).
In summary, both weighted K-means and K-means can correctly identify topics; the difference is that the weighted K-means algorithm takes advantage of keyword combination and the Word2Vec model to enhance its ability to cluster samples
[1] Xv T T. Review of Weibo topic discovery methods. Inner Mongolia Science Technology and Economy, 2015(19):81-83.
[2] Wu H, Geng H T. Research on BBS topic discovery algorithm based on latent semantic analysis. Computer Knowledge and Technology, 2008, 4(29):185-187.
[3] Ruan G S. Research on topic discovery of Web reviews based on LDA. Journal of Information, 2014(3):161-164.
[4] Sun J G, Liu J, Zhao L Y. Clustering algorithm research. Journal of Software, 2008, 19(1):48-61.
[5] R Xu. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3):645-678.
[6] Guo J Y, Cai Y, Zhen Y X, et al. Topic discovery based on text clustering. Computer Engineering and Design, 2008, 29(6):1426-1428.
[7] Lin X L, Hu K K, Hu Q. Research on micro-blog user relationship mining based on Python. Journal of Intelligence, 2014(6):144-148.
[8] Fu X H, Ma Z F, He M, Feng B Q. A personalized topic extraction and hierarchical discovery algorithm. Journal of Xi'an Jiaotong University, 2005, 39(2):119-122.
[9] Huang C, Zhao P Q, Li J. Based on the analysis of the common words: the quantitative analysis of China's science and technology innovation policy change. 2015(9).
[10] Zhao H Y. Automatic text classification based on a keyword-combination vector model. Market Modernization, 2008(26):20-21.
[11] Jiang C J, Peng H, Chen J C, et al. Keyword extraction algorithm based on word combination and synonym sets. Application Research of Computers, 2010, 27(8):2853-2856.
[12] T Mikolov, I Sutskever, K Chen, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
[13] T Mikolov, K Chen, G Corrado, et al. Efficient estimation of word representations in vector space. Computer Science.
[14] T Mikolov, W T Yih, G Zweig. Linguistic regularities in continuous space word representations. HLT-NAACL, 2013.
[15] Y Goldberg, O Levy. Word2Vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv, 2014.
[16] Zhou L. The working principle and application of Word2Vec. Sci-Tech Information Development & Economy, 2015(2):145-148.
[17] S Rose, D Engel, N Cramer. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 2002, 9(3):17-22.
[18] R Mihalcea, P Tarau. TextRank: bringing order into texts. Conference on Empirical Methods in Natural Language Processing, 2004:404-411.
[19] L Page. The PageRank citation ranking: bringing order to the Web. Stanford InfoLab, 1998, 9(1):1-14.
[20] J A Hartigan, M A Wong. A K-means clustering algorithm. Applied Statistics, 1979, 28(1):100-108.
[21] WHO Website:
2017, 8076604, iceiec