JIEBA https://pypi.python.org/pypi/jieba/ language: Python

Word segmentation is also a deep field of study in its own right; if you are interested, look into CRF, HMM and other models. I will not go into it here.

2. Filter out high-frequency words based on TF-IDF (look up what TF-IDF is on your own)
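A minimal sketch of TF-IDF filtering, assuming the keywords have already been segmented into tokens. The sample keywords, the function name `tf_idf` and the plain `tf * log(N / df)` weighting are illustrative assumptions, not a description of any particular library.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenised, non-empty document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical keywords, already split into tokens by a segmenter.
docs = [
    ["beijing", "hot springs", "where", "good"],
    ["shanghai", "hotel", "where", "good"],
    ["beijing", "hotel", "where", "cheap"],
]
scores = tf_idf(docs)

# A term that occurs in every keyword scores 0 and can be filtered out.
filtered = [[t for t in doc if scores[i][t] > 0] for i, doc in enumerate(docs)]
print(filtered[0])
```

In this toy data "where" occurs in every keyword, so its score is zero and it is dropped, while rarer terms such as "hot springs" keep the highest scores. In practice you would pick a threshold that suits your own thesaurus instead of filtering only at zero.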



It has been a long time since the last article in this SEO data analysis series. Today a friend asked me online how to maintain a keyword thesaurus, so I will take this opportunity to talk about it. After collecting a large number of words, we must first process them. From practical work, I have summed up the following things that I either do or feel need to be done.

CRF++ http://crfpp.sourceforge.net/ language: C++



Word segmentation: academia has studied a great many Chinese word segmentation algorithms, but in actual use the differences between them are very small. Here I just recommend a few; pick one according to the programming language you use.
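To give a feel for what a segmenter does, here is a minimal sketch of forward maximum matching, one of the simplest dictionary-based approaches. It is not how any of the recommended tools work internally. The toy dictionary and the Chinese query (an assumed rendering of a "where are the good hot springs in Beijing" style keyword) are hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one holds many thousands of entries.
VOCAB = {"北京", "哪里", "温泉", "好"}
tokens = forward_max_match("北京哪里温泉好", VOCAB)
print(tokens)
```

For real work, one of the libraries listed in this article will handle ambiguity and unknown words far better than this greedy approach.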

Extract the entity (put simply, find the key words within the keywords)

1. Remove stop words and symbols according to their part of speech (delete the unimportant ones)
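A minimal sketch of this step, assuming a segmenter has already produced (word, part-of-speech flag) pairs. The flag names loosely follow ICTCLAS-style tags (for example `r` for pronouns and `w` or `x` for punctuation), but the exact tag set depends on the tool you use, so treat `STOP_FLAGS`, `STOP_WORDS` and the sample data as assumptions.

```python
# Part-of-speech flags to discard: pronouns and question words, prepositions,
# conjunctions, auxiliaries, modal particles, interjections, punctuation.
STOP_FLAGS = {"r", "p", "c", "u", "y", "e", "x", "w"}

# Hypothetical extra stop words to drop regardless of their flag.
STOP_WORDS = {"of", "the"}


def remove_stop_words(tagged_tokens, stop_flags=STOP_FLAGS, stop_words=STOP_WORDS):
    """Keep only the words whose flag and surface form are not in the stop lists."""
    return [
        word
        for word, flag in tagged_tokens
        if flag not in stop_flags and word not in stop_words
    ]


# Hypothetical tagger output for the "hot springs in Beijing" keyword.
tagged = [("Beijing", "ns"), ("where", "r"), ("hot springs", "n"), ("good", "a"), ("?", "w")]
kept = remove_stop_words(tagged)
print(kept)
```

The question word and the punctuation are dropped, leaving the place name and the nouns that actually describe the topic.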




SCWS http://www.xunsearch.com/scws/ language: PHP


You can take a closer look at the difference between the two. There are many ways to do this. From an SEO point of view, our requirements on recall and accuracy are generally low: the effort it takes to get from 0% to 80% may well be less than the effort it takes to get from 80% to 100%. Different industries also call for slightly different approaches. So I use the following two methods:

Entity extraction means finding the key words within a keyword. Take "where are the good hot springs in Beijing": the two words "Beijing" and "hot springs" are the key, while "where" is a question word that contributes relatively little to describing the topic. So we need some technical means to process the keywords and pull out the important core terms (the entities).


ICTCLAS http://ictclas.nlpir.org/downloads language: Java, C#

