[Study Notes] NLTK: Natural Language Processing with Python

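These notes mainly cover the NLP material; Python language basics are skipped.

Tested with Python 3 on 64-bit Windows 7.

First install PyYAML.

Then install NLTK.

Python 3 version:

nltk3-alpha on GitHub

python setup.py install

Other dependencies include NumPy, Matplotlib, etc.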

Getting Started with NLTK

>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> c
Data Server:
  - URL: <http://nltk.github.com/nltk_data/>
  - 3 Package Collections Available
  - 78 Individual Packages Available
Local Machine:
  - Data directory: C:\Users\Zoey\AppData\Roaming\nltk_data
---------------------------------------------------------------------------
    s) Show Config   u) Set Server URL   d) Set Data Dir   m) Main Menu
---------------------------------------------------------------------------
Config> d
  New Directory> D:\nltk_data
---------------------------------------------------------------------------
    s) Show Config   u) Set Server URL   d) Set Data Dir   m) Main Menu
---------------------------------------------------------------------------
Config> m
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> book
      Downloading collection 'book'
        |
        | ...
        |
      Done downloading collection book

If the download fails, download the data manually and place it in the appropriate folder.

Searched in:
  - 'C:/\\Users\\Zoey/nltk_data'
  - 'C:\\nltk_data'
  - 'D:\\nltk_data'
  - 'E:\\nltk_data'
  - 'C:\\Python33\\nltk_data'
  - 'C:\\Python33\\lib\\nltk_data'
  - 'C:\\Users\\Zoey\\AppData\\Roaming\\nltk_data'

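It seems that not just any folder will do: the data has to sit in one of the directories listed above. When I first put it on drive F:, NLTK reported that it could not find it.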
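The list above looks like NLTK's data search path. As far as I know, that list is exposed as `nltk.data.path` and can be extended at run time (setting the `NLTK_DATA` environment variable is another option), which should make a folder on F: usable. A minimal sketch of the idea, using a plain list as a stand-in so that it runs without NLTK installed; `add_data_dir` is a hypothetical helper name, not part of NLTK:

```python
def add_data_dir(search_path, new_dir):
    """Append new_dir to a search-path list unless it is already there."""
    if new_dir not in search_path:
        search_path.append(new_dir)
    return search_path

# Stand-in for nltk.data.path (an ordinary Python list of directories).
search_path = [r"C:\nltk_data", r"D:\nltk_data", r"E:\nltk_data"]
add_data_dir(search_path, r"F:\nltk_data")
print(search_path)

# With NLTK installed, the equivalent would presumably be:
#   import nltk
#   nltk.data.path.append(r"F:\nltk_data")
```

This only affects the current Python session; a permanent setup would need the environment variable or one of the default directories.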

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>

Searching Text

Concordance

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
...
>>> text2.concordance("monstrous")
Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and
 Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company
 usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei
>>> text3.concordance("lived")
Building index...
Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
ughters : And all the days that Adam lived were nine hundred and thirty yea and
...

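A concordance shows each occurrence of a word together with its context. For example, the contexts of "monstrous" found in text1 above include "most_size" and "that_bulk" (the word before and the word after, joined by an underscore).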
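To make the idea concrete, here is a rough pure-Python sketch of a concordance view. It is not NLTK's implementation; the function name, the `width` parameter, and the case-insensitive matching are my own assumptions for illustration.

```python
def concordance(tokens, word, width=30):
    """Return one line per match: left context, the word, right context."""
    lines = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Keep only the last/first `width` characters of each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}} {tok} {right}")
    return lines

tokens = "I am monstrous glad of it , she is a monstrous lucky girl".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Right-aligning the left context is what lines the matched word up in a column, as in the output above.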

To find other words that appear in similar contexts, use the similar method:

>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving

>>> text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast
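A simplified sketch of what "similar context" could mean: treat the (previous word, next word) pair as a word's context and rank other words by how many contexts they share with the target. The scoring used here is an assumption for illustration and may differ from what NLTK actually computes.

```python
from collections import Counter, defaultdict

def similar(tokens, word, num=5):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    contexts = defaultdict(set)  # word -> set of (prev, next) pairs
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur.lower()].add((prev.lower(), nxt.lower()))
    target = contexts.get(word.lower(), set())
    scores = Counter()
    for other, ctxs in contexts.items():
        if other != word.lower():
            shared = len(ctxs & target)
            if shared:
                scores[other] = shared
    return [w for w, _ in scores.most_common(num)]

tokens = ("she is a monstrous lucky girl and he is a very lucky boy "
          "and I am monstrous glad and I am very glad").split()
print(similar(tokens, "monstrous"))
```

In this toy text "very" shares both contexts of "monstrous" (a_lucky and am_glad), so it is the only word returned.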

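The common_contexts method lets us examine the contexts shared by two or more words.

Take "monstrous" and "very": "very" is one of the words that text2.similar("monstrous") returned above as appearing in similar contexts, so we get the following result: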

>>> text2.common_contexts(["monstrous", "very"])
a_lucky a_pretty am_glad be_glad is_pretty
>>> text2.common_contexts(["monstrous", "he"])
No common contexts were found

"he", by contrast, is not among the results of text2.similar("monstrous"), so NLTK reports that no common contexts were found.

Try another pair of words:

>>> text2.common_contexts(["glad", "happy"])
be_to so_to very_to was_to
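The "prev_next" strings in the output can be reproduced with a small sketch: collect the contexts of each word and intersect them. `contexts_of` and this `common_contexts` are illustrative helpers of my own, not NLTK's code, and they match case-sensitively for simplicity.

```python
def contexts_of(tokens, word):
    """Set of 'prev_next' strings for every interior occurrence of `word`."""
    return {
        f"{prev}_{nxt}"
        for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:])
        if cur == word
    }

def common_contexts(tokens, words):
    """Contexts shared by all of `words`; expects at least one word."""
    shared = set.intersection(*(contexts_of(tokens, w) for w in words))
    return sorted(shared)

tokens = ("she is a monstrous lucky girl and a very lucky boy ; "
          "I am monstrous glad , I am very glad .").split()
print(common_contexts(tokens, ["monstrous", "very"]))
print(common_contexts(tokens, ["monstrous", "girl"]))
```

An empty list in the second call corresponds to the "No common contexts were found" message.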

A dispersion plot shows where a word occurs across a text:

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Dispersion plot of words in the US presidential inaugural addresses: it can be used to study how language use changes over time.

>>> text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby"])  # ex6
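The data behind such a plot is just the token offset of every occurrence of each word. A minimal sketch that computes those offsets without drawing anything; `dispersion_offsets` is a hypothetical helper, and the actual plotting (done by NLTK through Matplotlib) is left out:

```python
def dispersion_offsets(tokens, words):
    """Map each target word to the list of token offsets where it occurs."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for i, tok in enumerate(tokens):
        if tok in targets:
            offsets[tok].append(i)
    return offsets

tokens = "the citizens of America value freedom and the citizens cherish freedom".split()
offsets = dispersion_offsets(tokens, ["citizens", "freedom", "duties"])
print(offsets)
```

Each word would become one row of the plot, with a tick at every offset; a word that never occurs (here "duties") gets an empty row.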

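Generating random text. (I don't fully understand how this works. It seems to stitch together short phrases picked at random; in any case, the output is not a contiguous passage copied from the text.)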

>>> text3.generate()
Building ngram index...
In the mount Gilead . And Pharaoh said unto the land of Egy And there
was none that could interpret them unto his father and his brother is
dead , and duke Amalek : these are the sons of Shem , after their
tongues , in the field before it was good , and go , and called all
his sons with him , that he may bless me . And it came to pass , as
the LORD ; and let her be burnt . When the chief butler and of cattle
after their kind , and be clean
>>> text3.generate()
In the six hundredth and first year , and smite me , He that is
uncircumcised ; for the Egyptians came unto him , I lay yesternight
with my sister : that it was dark , behold , also his blood ? Come ,
and I saw him afar off . And , behold , seven other years . And in thy
seed shall all the plain , and , lo , here is seed for ever . And they
dwelt from Havilah unto Shur , and shalt serve with thee in the way
side ? And his mother '
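The "Building ngram index..." message hints that an n-gram model is involved. The sketch below assumes a plain trigram model without smoothing: each new word is drawn from the words that followed the previous two words somewhere in the text. Whether this matches what text3.generate() does internally is an assumption; the function names are mine.

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each pair of consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model

def generate(tokens, length=20, seed=None):
    """Random walk: repeatedly pick a word that followed the last two words."""
    if len(tokens) < 3:
        return list(tokens)
    rng = random.Random(seed)
    model = build_trigram_model(tokens)
    out = list(tokens[:2])  # start from the first two tokens
    while len(out) < length:
        followers = model.get((out[-2], out[-1]))
        if not followers:  # dead end: this pair only occurs at the very end
            break
        out.append(rng.choice(followers))
    return out

tokens = ("And God said , Let there be light : and there was light . "
          "And God saw the light , that it was good .").split()
print(" ".join(generate(tokens, length=15, seed=0)))
```

This would explain the observation above: every three-word window of the output occurs somewhere in the source, but longer stretches are recombined, so the result reads like stitched-together phrases.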

Counting Vocabulary

>>> len(text2)  # number of words (tokens) in text2
141576
>>> len(set(text2))  # number of distinct words
6833
>>> len(text5)
45010
>>> len(set(text5))
6066
>>> len(text5) / len(set(text5))
7.420046158918563
>>> text5.count("lol")
704
>>> 100 * text5.count("lol") / len(text5)
1.5640968673628082
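The ratio len(text5) / len(set(text5)) above is the average number of times each distinct token is used. Wrapped as a small function (the name `lexical_diversity` is just a label chosen here), it works on any list of tokens:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used (needs a non-empty list)."""
    return len(tokens) / len(set(tokens))

tokens = "lol hi lol hey lol hi".split()
print(lexical_diversity(tokens))  # 6 tokens / 3 distinct tokens = 2.0
```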

Frequency Distributions

>>> fdist = FreqDist(text2)
>>> fdist
<FreqDist with 6833 samples and 141576 outcomes>
>>> vocabulary = fdist.keys()
>>> vocabulary
<map object at 0x000000001FFA3DD8>
>>> vocabulary = list(vocabulary)
>>> vocabulary[:50]
[',', 'to', '.', 'the', 'of', 'and', 'her', 'a', 'I', 'in', 'was', 'it', '"', ';', 'she', 'be', 'that', 'for', 'not', 'as', 'you', 'with', 'had', 'his', 'he', "'", 'have', 'at', 'by', 'is', '."', 's', 'Elinor', 'on', 'all', 'him', 'so', 'but', 'which', 'could', 'Marianne', 'my', 'Mrs', 'from', 'would', 'very', 'no', 'their', 'them', '--']
>>> fdist['Marianne']
566
>>> fdist.plot(50, cumulative=True)
# cumulative frequency plot of the 50 most common words
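The session above comes from the nltk3-alpha build, where keys() came back ordered by frequency. As far as I know, in released NLTK 3 versions FreqDist is a subclass of collections.Counter, so the portable way to get the top words there would be most_common(n). A sketch with a plain Counter, which also computes the numbers a cumulative plot would show:

```python
from collections import Counter
from itertools import accumulate

tokens = "to be , or not to be , that is the question".split()
counts = Counter(tokens)

# The three most frequent tokens with their counts.
top = counts.most_common(3)
print(top)

# Running share of all tokens covered by the k most frequent words.
total = sum(counts.values())
cumulative = [c / total for c in accumulate(n for _, n in counts.most_common())]
print(cumulative)
```

The last value of `cumulative` is always 1.0, since all words together cover the whole text.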

Words that occur only once (hapaxes):

>>> fdist.hapaxes()
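What hapaxes() returns can be illustrated with a plain Counter: keep the words whose count is exactly 1. This is a sketch of the concept, not NLTK's code.

```python
from collections import Counter

def hapaxes(tokens):
    """Words that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    return [w for w, n in counts.items() if n == 1]

tokens = "the cat saw the dog and the bird".split()
print(hapaxes(tokens))
```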

Finding words characteristic of a text

Words in the chat corpus that are longer than 7 characters and occur more than 7 times:

>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']

Collocations and Bigrams

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text4.collocations()
Building collocations list
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text5.collocations()  # ex7
Building collocations list
wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART;
cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys
wanna; song lasts; last night; ACTION sits; -...)...- S.M.R.; Lime
Player; Player 12%; dont know; lez gurls; long time
>>> text8.collocations()
Building collocations list
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
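Bigrams are simply adjacent pairs, which zip can produce directly. The second function below is a deliberately crude stand-in for collocations(): it only counts how often each pair occurs, whereas NLTK ranks pairs with a statistical association measure and filters out some words, so the results will not match. Both function names are illustrative.

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(['more', 'is', 'said', 'than', 'done']))

def frequent_bigrams(tokens, min_count=2):
    """Bigrams seen at least `min_count` times, most frequent first."""
    counts = Counter(bigrams(tokens))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

tokens = "United States and the United States of America".split()
print(frequent_bigrams(tokens))
```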

FreqDist

>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist.keys()
<map object at 0x00000000034F5DA0>
>>> list(fdist)
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]

ex14: indices at which "the" occurs in sent3

>>> [i for i,j in enumerate(sent3) if j=='the']
[1, 5, 8]

ex15: words in text5 that start with the letter b, in alphabetical order

>>> sorted(set([w for w in text5 if w.startswith('b')]))

ex16: language basics

>>> range(10)
range(0, 10)
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(10, 20))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> list(range(10, 20, 2))
[10, 12, 14, 16, 18]
>>> list(range(20, 10, -2))
[20, 18, 16, 14, 12]

ex17

>>> sunsetlist = [i for i,j in enumerate(text9) if j=='sunset']
# [629, 642, 1432, 1650, 13335, 13381, 16313, 27014, 49340, 52092, 60857, 60862, 64721, 64736]
>>> punclist = [i for i, j in enumerate(text9) if j in ['.', '!', '?']]
>>> se = set()
>>> for i in sunsetlist:
        for k, v in enumerate(punclist):  # rescanning from the start every time feels wasteful
            if v > i and i > punclist[k-1]:
                se.add((punclist[k-1]+1, v))
                break
>>> sorted(se)
[(613, 643), (1411, 1433), (1628, 1656), (13325, 13336), (13352, 13384), (16304, 16328), (27010, 27021), (49311, 49341), (52086, 52108), (60786, 60864), (64698, 64737)]
>>> for s in sorted(se):
    print(' '.join(text9[s[0]:s[1]]) + text9[s[1]])
CHAPTER I THE TWO POETS OF SAFFRON PARK THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset.
This particular evening , if it is remembered for nothing else , will be remembered in that place for its strange sunset.
For a long time the red - haired revolutionary had reigned without a rival ; it was upon the night of the sunset that his solitude suddenly ended.
He walked on the Embankment once under a dark red sunset.
The sky , indeed , was so swarthy , and the light on the river relatively so lurid , that the water almost seemed of fiercer flame than the sunset it mirrored.
Every trace of the passionate plumage of the cloudy sunset had been swept away , and a naked moon stood in a naked sky.
The sealed and sullen sunset behind the dark dome of St.
Nevertheless , the ride had been a long one , and by the time they reached the real town the west was warming with the colour and quality of sunset.
Up this side street the last sunset light shone as sharp and narrow as the shaft of artificial light at the theatre.
His silk hat was broken over his nose by a swinging bough , his coat - tails were torn to the shoulder by arresting thorns , the clay of England was splashed up to his collar ; but he still carried his yellow beard forward with a silent and furious determination , and his eyes were still fixed on that floating ball of gas , which in the full flush of sunset seemed coloured like a sunset cloud.
Then his carriage took a turn of the path , and he saw suddenly and quietly , like a long , low , sunset cloud , a long , low house , mellow in the mild light of sunset.
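One possible way around rescanning the punctuation list for every hit: the list of terminator indices is already sorted, so a binary search (bisect) finds the enclosing sentence directly. This is a sketch of an alternative, not the original solution; `sentence_spans` is a name chosen here. It also covers a word in the very first sentence, which the loop above would miss because punclist[k-1] wraps around to the last element when k is 0.

```python
from bisect import bisect_left

def sentence_spans(tokens, word, enders=(".", "!", "?")):
    """(start, end) spans of sentences containing `word`; end is the terminator's index."""
    ends = [i for i, tok in enumerate(tokens) if tok in enders]  # sorted by construction
    spans = set()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        k = bisect_left(ends, i)  # first terminator after position i
        if k == len(ends):  # no terminator after the word
            continue
        start = ends[k - 1] + 1 if k > 0 else 0
        spans.add((start, ends[k]))
    return sorted(spans)

tokens = "The sunset was red . We left at dawn . What a sunset !".split()
spans = sentence_spans(tokens, "sunset")
for start, end in spans:
    print(" ".join(tokens[start:end + 1]))
```

As the "St." line in the output above shows, splitting on "." alone cuts sentences at abbreviations; this sketch has the same limitation.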

ex28

>>> def percent(word, text):
    pc = 100 * text.count(word) / len(text)
    return str(pc) + "%"
>>> percent("the", text1)
'5.260736372733581%'
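The result above carries fifteen decimal places. A variant that rounds through a format specification might read better; the `digits` parameter and the handling of an empty text are additions of mine, not part of the exercise.

```python
def percent(word, tokens, digits=2):
    """Share of `tokens` equal to `word`, as a percentage string."""
    # Assumption: an empty text is reported as 0 percent instead of raising.
    share = 100 * tokens.count(word) / len(tokens) if tokens else 0.0
    return f"{share:.{digits}f}%"

tokens = "the cat and the dog".split()
print(percent("the", tokens))
```

This version is written for plain lists; an NLTK Text object also provides count and len, so it would presumably work there too.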

ex29

>>> s1 = ['a', 'b']
>>> s2 = ['a', 'b', 'c']
>>> set(s1) < set(s2)
True
>>> set(s2) < set(s1)
False
>>> set(s2) - set(s1)
{'c'}

References