Introduction

The libraries introduced below all parse the content you want out of a web page automatically, which removes the heavy workload of writing regular expressions or XPath queries for every individual site.

1. lassie: a human-friendly web content retrieval library

Install:

    pip3 install lassie

Usage:

    import lassie
    lassie.fetch('http://www.thepipefittings.com/compression-fittings.html')

Output:

    {'images': [{'src': 'http://www.thepipefittings.com/favicon.ico', 'type': 'favicon'}],
     'videos': [],
     'url': 'http://www.thepipefittings.com/compression-fittings.html',
     'title': 'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression Fittings Suppliers',
     'status_code': 200}

2. newspaper: a package dedicated to crawling news content

Install:

    pip3 install newspaper3k

The package to install is newspaper3k, not newspaper. The newspaper package targets Python 2, so pip install newspaper does not install properly; under Python 3, use pip install newspaper3k instead.

Usage:

    from newspaper import Article

    # Uncomment on first run: article.nlp() needs the NLTK 'punkt' tokenizer data
    # import nltk
    # nltk.download('punkt')

    url = 'http://www.thepipefittings.com/compression-fittings.html'
    article = Article(url)  # for Chinese pages: Article(url, language='zh')
    article.download()      # fetch the HTML
    article.parse()         # extract title, text, and so on
    article.nlp()           # compute keywords and summary
    print(article.text)

3. goose3: HTML content / article extractor (Python 3)

Install:

    pip3 install goose3

Usage:

    from goose3 import Goose

    url = 'http://www.thepipefittings.com/compression-fittings.html'
    g = Goose()
    article = g.extract(url=url)
    article.title
    # article.meta_description
    # article.cleaned_text[:]

Output:

    'Compression Fittings,Manipulative Compression Fittings,Brass Compression Fittings,Compression …
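To make concrete what "parsing the content automatically" saves you from, here is a minimal sketch of the kind of metadata extraction these libraries do for you, written with only the Python standard library. It is an illustration, not how lassie, newspaper, or goose3 are actually implemented. The names MetaExtractor and extract_metadata, and the SAMPLE_HTML snippet, are hypothetical and made up for this example; they are not taken from the site used above.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title>, meta description, and favicon link from HTML.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.favicon = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "link" and "icon" in (attrs.get("rel") or "").lower().split():
            # Matches rel="icon" as well as rel="shortcut icon"
            self.favicon = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text nodes inside <title> may arrive in several chunks
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a small dict of page metadata parsed from an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "favicon": parser.favicon,
    }


# Made-up sample page, so the example runs without network access
SAMPLE_HTML = """
<html><head>
<title>Compression Fittings</title>
<meta name="description" content="Brass compression fittings supplier">
<link rel="shortcut icon" href="/favicon.ico">
</head><body><p>Hello</p></body></html>
"""

print(extract_metadata(SAMPLE_HTML))
```

Even this small version has to handle several tag types and edge cases by hand, and it does nothing about locating the main article body, which is the harder part that newspaper and goose3 take care of.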