A Python Wrapper for the Stanford NLP Toolkit (Stanford CoreNLP)

Python Wrappers

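The Stanford CoreNLP website already lists many wrappers. All of these packages work the same way: they start a Stanford CoreNLP server and then send requests to it. However, some of them stopped being updated long ago and are incompatible with the current version (3.9.2), some are complicated to use, and some lack key features such as sentence splitting.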
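To make the "start a server, then send requests" idea concrete, here is a minimal sketch of the kind of HTTP request such a wrapper sends. It assumes a CoreNLP server is already listening on http://localhost:9000 (the default port in the CoreNLP documentation); the function names build_corenlp_request and annotate are hypothetical and are not part of any of the wrappers mentioned above.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(text, annotators="tokenize,ssplit",
                          url="http://localhost:9000"):
    """Build (but do not send) a POST request for a CoreNLP server.

    The server reads its settings from a JSON string in the `properties`
    query parameter and the text to annotate from the request body.
    """
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return urllib.request.Request(
        url + "/?" + query,
        data=text.encode("utf-8"),
        method="POST",
    )


def annotate(text, **kwargs):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_corenlp_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling annotate('Hello world.') only works while a server is running; build_corenlp_request can be inspected without one.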

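To address these problems, I took over maintenance of the Lynten/stanford-corenlp code and fixed several bugs. Compared with the original, the modified code adds:

Compatibility with the latest Stanford CoreNLP release (v3.9.2)
A working word_tokenize() method (the original version had a bug)
A new sentence-splitting method, sent_split()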

The code is available at: https://github.com/styxjedi/stanford-corenlp

If you run into problems while using it, open an issue in that repository and I will fix it as soon as I can.

Installation

Prerequisites

Java 1.8+ (check the version with: java -version) (download page)
Stanford CoreNLP (download page)
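The Java check can also be done from Python before going further. The following is only a sketch: the helper name java_version_banner is hypothetical, and it simply reports the first line that java -version prints (Java writes this banner to stderr) rather than parsing the version number.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the first line printed by `java -version`, or None.

    None means that no `java` executable was found on PATH, or that it
    printed nothing at all.
    """
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # The version banner normally goes to stderr; fall back to stdout.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


print(java_version_banner())
```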

Install


git clone https://github.com/styxjedi/stanford-corenlp.git
cd stanford-corenlp
python setup.py install

Usage

English (the default)

from stanfordcorenlp import StanfordCoreNLP

# Replace with the path to your own CoreNLP directory
nlp = StanfordCoreNLP(r'G:\JavaLibraries\stanford-corenlp-full-2018-02-27')

sentence = "Guangdong University of Foreign Studies is located in Guangzhou. It's a very beautiful university."
print('Sentence Split:', nlp.sent_split(sentence))
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))

nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
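The dependency output is easier to read once the indices are mapped back to words. The sketch below assumes that dependency_parse() returns (relation, head index, dependent index) triples with 1-based token indices and 0 for the artificial root, as the upstream Lynten wrapper does; check this against your installed version. The helper describe_dependencies is hypothetical, the sample data is hand-written for illustration rather than actual CoreNLP output, and the sketch handles a single sentence only.

```python
def describe_dependencies(tokens, triples):
    """Turn (relation, head, dependent) index triples into readable strings."""
    # Index 0 stands for the artificial root, so real tokens start at 1.
    words = ["ROOT"] + list(tokens)
    return [f"{rel}({words[head]}, {words[dep]})" for rel, head, dep in triples]


# Hand-written illustrative data, NOT actual CoreNLP output.
tokens = ["I", "like", "cats"]
triples = [("ROOT", 0, 2), ("nsubj", 2, 1), ("dobj", 2, 3)]

for line in describe_dependencies(tokens, triples):
    print(line)
```

With real data, tokens would come from nlp.word_tokenize(...) and triples from nlp.dependency_parse(...) for the same single sentence.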

Chinese

# -*- coding: utf-8 -*-
from stanfordcorenlp import StanfordCoreNLP

sentence = '清华大学位于北京。北京是中华人民共和国的首都。'
with StanfordCoreNLP('path/to/your/model', lang='zh') as nlp:
    print(nlp.sent_split(sentence))
    print(nlp.word_tokenize(sentence))
    print(nlp.pos_tag(sentence))
    print(nlp.ner(sentence))
    print(nlp.parse(sentence))
    print(nlp.dependency_parse(sentence))

Introduction to Stanford CoreNLP

Note: this section is quoted from http://fancyerii.github.io/books/stanfordnlp/ .

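Stanford CoreNLP is a set of NLP tools that the Stanford NLP Group developed from its research. The group's members come from the linguistics and computer science departments, and the group is part of the Stanford AI Lab. Note that Stanford has also recently released Stanford NLP, a purely deep-learning toolkit written in Python. Its current version is still 0.2.0, which is a fairly early release, and unfortunately it does not support Simplified Chinese (only Traditional Chinese).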

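Stanford CoreNLP provides a range of tools for processing human language. It can reduce words to their base forms and tag their parts of speech. It can recognize named entities such as person names, place names, dates, and times, and normalize them. It can run constituency parsing and dependency parsing on sentences. It also includes coreference resolution, sentiment analysis, and relation extraction.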

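Its main features are:

An NLP toolkit that integrates many tools.
Fast and stable; after more than ten years of iteration, the current version is 3.9.2.
Uses recent techniques, and the overall results are very good.
Supports multiple human languages (including Chinese).
Supports multiple programming languages (through a web service).
Can run standalone as a web service.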
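Running CoreNLP as a standalone web service comes down to one Java command, executed inside the unpacked CoreNLP directory. The sketch below builds that command following the form given in the CoreNLP server documentation; the helper names corenlp_server_command and start_server and the default values are assumptions chosen for illustration.

```python
import subprocess


def corenlp_server_command(port=9000, memory="4g", timeout_ms=15000):
    """Return the argument list that launches the CoreNLP web server."""
    return [
        "java", f"-mx{memory}",
        "-cp", "*",  # Java expands "*" to every jar in the working directory
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]


def start_server(corenlp_dir, **kwargs):
    """Launch the server as a child process and return the Popen handle.

    corenlp_dir must be the unpacked CoreNLP directory so that the "*"
    classpath picks up its jars. Call .terminate() on the result when done.
    """
    return subprocess.Popen(corenlp_server_command(**kwargs), cwd=corenlp_dir)


print(" ".join(corenlp_server_command()))
```

Once the server is up, clients written in any language can send it HTTP requests, which is how the multi-language support listed above works.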