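Introduction
The libraries introduced below all extract the wanted content from a web page automatically, which removes the heavy workload of writing regular expressions or XPath queries for every individual site.
1. lassie: a human-friendly web page content retrieval library
Installation: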
pip3 install lassie
Usage:
import lassie

lassie.fetch('http://www.thepipefittings.com/compression-fittings.html')
Output:
{'images': [{'src': 'http://www.thepipefittings.com/favicon.ico', 'type': 'favicon'}], 'videos': [], 'url': 'http://www.thepipefittings.com/compression-fittings.html', 'title': 'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers', 'status_code': 200}
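The result shown above is a plain dictionary, so the fields of interest can be picked out with ordinary dict access. The following is a minimal sketch that assumes only the keys visible in that output (images, videos, url, title, status_code); the helper names favicon_urls and describe are hypothetical and not part of lassie.

```python
def favicon_urls(result):
    """Return the src of every image entry whose type is 'favicon'."""
    return [
        image["src"]
        for image in result.get("images", [])
        if image.get("type") == "favicon"
    ]


def describe(result):
    """Reduce a lassie-style result dict to a few commonly used fields."""
    return {
        "title": result.get("title"),
        "status_code": result.get("status_code"),
        "favicons": favicon_urls(result),
        "video_count": len(result.get("videos", [])),
    }


# The dictionary printed in the example above, copied verbatim.
sample = {
    'images': [{'src': 'http://www.thepipefittings.com/favicon.ico', 'type': 'favicon'}],
    'videos': [],
    'url': 'http://www.thepipefittings.com/compression-fittings.html',
    'title': 'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers',
    'status_code': 200,
}

summary = describe(sample)
print(summary)
```

Using .get() with defaults keeps the helper from raising if a page yields no images or videos.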
2. newspaper: a package dedicated to scraping news content
Installation:
pip3 install newspaper3k
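Install newspaper3k rather than newspaper: newspaper is the Python 2 package, and pip install newspaper does not install correctly on Python 3. Use pip install newspaper3k, the Python 3 package, instead.
Usage:
from newspaper import Article

# article.nlp() relies on NLTK's 'punkt' tokenizer; uncomment these on first use
# import nltk
# nltk.download('punkt')

url = 'http://www.thepipefittings.com/compression-fittings.html'
article = Article(url)  # for a Chinese page, pass language='zh'
article.download()
article.parse()
article.nlp()
print(article.text)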
3. goose3: an HTML content/article extractor (Python 3)
Installation:
pip3 install goose3
Usage:
from goose3 import Goose

url = 'http://www.thepipefittings.com/compression-fittings.html'
g = Goose()
article = g.extract(url=url)
article.title  # in an interactive session this displays the title
# article.meta_description
# article.cleaned_text[:]
Output:
'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers'
4. python-readability: a fast Python port of arc90's readability tool
Installation:
pip3 install readability-lxml
Usage:
import requests
from readability import Document

url = 'https://www.pipingengineer.org/piping-materials-buttweld-fittings/'
html = requests.get(url).content
doc = Document(html)
print('title:', doc.title())
print('content:', doc.summary(html_partial=True))
Output:
title: Not Acceptable!
content: <div><body id="readabilityBody"><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></div>
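In the output above, the server's Mod_Security layer rejected the request, so readability extracted the error page rather than the article. A minimal sketch of one way to catch this early is to check the HTTP status before handing the HTML to Document. The helper name fetch_html and the User-Agent string are assumptions for illustration, and a browser-like header is not guaranteed to get past any particular server's rules.

```python
# Hypothetical header; some servers reject clients that send no browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-fetcher/0.1)"}


def fetch_html(url, session=None, timeout=10):
    """Download a page and raise instead of returning an error page.

    `session` is anything with a requests-style .get(); when omitted,
    a requests.Session is created (imported lazily).
    """
    if session is None:
        import requests
        session = requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raises for 4xx/5xx responses, so an error page never reaches the extractor.
    response.raise_for_status()
    return response.content
```

With this in place, the example becomes doc = Document(fetch_html(url)), and a blocked request surfaces as an exception rather than as a "Not Acceptable!" title.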
5. textract: extracts text from documents in almost any format, including Word, PowerPoint, and PDF
Installation:
pip3 install textract
Usage:
import textract

text = textract.process("xxx.pdf")  # replace with the path to your own local PDF
print(text.decode('utf-8'))
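Because textract.process takes a file path and returns bytes, the same call can be applied to a whole folder of documents. The sketch below is one possible way to do that; the helper name extract_folder, the suffix list, and the injectable process parameter are assumptions made for illustration, not part of textract.

```python
from pathlib import Path


def extract_folder(folder, process=None, suffixes=(".pdf", ".docx", ".pptx")):
    """Return {file name: extracted text} for matching files in `folder`.

    `process` defaults to textract.process (imported lazily); it must take
    a path string and return UTF-8 encoded bytes.
    """
    if process is None:
        import textract
        process = textract.process
    texts = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subdirectories and files with other extensions.
        if path.is_file() and path.suffix.lower() in suffixes:
            texts[path.name] = process(str(path)).decode("utf-8")
    return texts
```

For example, extract_folder("docs") would return the text of each PDF, Word, and PowerPoint file in a local docs directory (a hypothetical path).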