【python】六个常见爬虫案例【附源码】_业界新闻

发布时间:2024-07-18 02:58

阅读量:3

大家好，我是博主英杰，整理了几个常见的爬虫案例，分享给大家，适合小白学习

欢迎来到英杰社区https://bbs.csdn.net/topics/617804998

一、爬取豆瓣电影排行榜Top250存储到Excel文件

近年来，Python在数据爬取和处理方面的应用越来越广泛。本文将介绍一个基于Python的爬虫程序，用于抓取豆瓣电影Top250的相关信息，并将其保存为Excel文件。

获取网页数据的函数，包括以下步骤：
1. 循环10次，依次爬取不同页面的信息；
2. 使用`urllib`获取html页面；
3. 使用`BeautifulSoup`解析页面；
4. 遍历每个div标签，即每一部电影；
5. 对每个电影信息进行匹配，使用正则表达式提取需要的信息并保存到一个列表中；
6. 将每个电影信息的列表保存到总列表中。

效果展示：

源代码：

from bs4 import BeautifulSoup import  re  #正则表达式，进行文字匹配 import urllib.request,urllib.error #指定URL，获取网页数据 import xlwt  #进行excel操作     def main():     baseurl = "https://movie.douban.com/top250?start="     datalist= getdata(baseurl)     savepath = ".\\豆瓣电影top250.xls"     savedata(datalist,savepath)   #compile返回的是匹配到的模式对象 findLink = re.compile(r'<a href="(.*?)">')  # 正则表达式模式的匹配，影片详情 findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # re.S让换行符包含在字符中,图片信息 findTitle = re.compile(r'<span class="title">(.*)</span>')  # 影片片名 findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # 找到评分 findJudge = re.compile(r'<span>(\d*)人评价</span>')  # 找到评价人数 #\d表示数字 findInq = re.compile(r'<span class="inq">(.*)</span>')  # 找到概况 findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # 找到影片的相关内容，如导演，演员等       ##获取网页数据 def  getdata(baseurl):     datalist=[]     for i in range(0,10):         url = baseurl+str(i*25)     ##豆瓣页面上一共有十页信息，一页爬取完成后继续下一页         html = geturl(url)         soup = BeautifulSoup(html,"html.parser") #构建了一个BeautifulSoup类型的对象soup，是解析html的         for item in soup.find_all("div",class_='item'): ##find_all返回的是一个列表             data=[]  #保存HTML中一部电影的所有信息             item = str(item) ##需要先转换为字符串findall才能进行搜索             link = re.findall(findLink,item)[0]  ##findall返回的是列表，索引只将值赋值             data.append(link)               imgSrc = re.findall(findImgSrc, item)[0]             data.append(imgSrc)               titles=re.findall(findTitle,item)  ##有的影片只有一个中文名，有的有中文和英文             if(len(titles)==2):                 onetitle = titles[0]                 data.append(onetitle)                 twotitle = titles[1].replace("/","")#去掉无关的符号                 data.append(twotitle)             else:                 data.append(titles)                 data.append(" ")  ##将下一个值空出来               rating = re.findall(findRating, item)[0]  # 添加评分             data.append(rating)               judgeNum = re.findall(findJudge, item)[0]  # 添加评价人数             data.append(judgeNum)               inq = re.findall(findInq, item)  # 添加概述             if len(inq) != 0:                 inq = inq[0].replace("。", "")                 data.append(inq)             else:                 data.append(" ")               bd = re.findall(findBd, item)[0]             bd = re.sub('<br(\s+)?/>(\s+)?', " ", bd)             bd = re.sub('/', " ", bd)             data.append(bd.strip())  # 去掉前后的空格             datalist.append(data)     return  datalist   ##保存数据 def  savedata(datalist,savepath):     workbook = xlwt.Workbook(encoding="utf-8",style_compression=0) ##style_compression=0不压缩     worksheet = workbook.add_sheet("豆瓣电影top250",cell_overwrite_ok=True) #cell_overwrite_ok=True再次写入数据覆盖     column = ("电影详情链接", "图片链接", "影片中文名", "影片外国名", "评分", "评价数", "概况", "相关信息")  ##execl项目栏     for i in range(0,8):         worksheet.write(0,i,column[i]) #将column[i]的内容保存在第0行，第i列     for i in range(0,250):         data = datalist[i]         for j in range(0,8):             worksheet.write(i+1,j,data[j])     workbook.save(savepath)     ##爬取网页 def geturl(url):     head = {         "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "                       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"     }     req = urllib.request.Request(url,headers=head)     try:   ##异常检测      response = urllib.request.urlopen(req)      html = response.read().decode("utf-8")     except urllib.error.URLError as e:         if hasattr(e,"code"):    ##如果错误中有这个属性的话             print(e.code)         if hasattr(e,"reason"):             print(e.reason)     return html   if __name__ == '__main__':     main()     print("爬取成功！！！")

二、爬取百度热搜排行榜Top50+可视化

2.1 代码思路：

导入所需的库：

import requests from bs4 import BeautifulSoup import openpyxl

requests 库用于发送HTTP请求获取网页内容。

BeautifulSoup 库用于解析HTML页面的内容。

openpyxl 库用于创建和操作Excel文件。

2.发起HTTP请求获取百度热搜页面内容：

url = 'https://top.baidu.com/board?tab=realtime' response = requests.get(url) html = response.content

这里使用了 requests.get() 方法发送GET请求，并将响应的内容赋值给变量 html。

3.使用BeautifulSoup解析页面内容：

soup = BeautifulSoup(html, 'html.parser')

创建一个 BeautifulSoup 对象，并传入要解析的HTML内容和解析器类型。

4.提取热搜数据：

hot_searches = [] for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):     hot_searches.append(item.text)

这段代码通过调用 soup.find_all() 方法找到所有 <div> 标签，并且指定 class 属性为 'c-single-text-ellipsis' 的元素。

然后，将每个元素的文本内容添加到 hot_searches 列表中。

5.保存热搜数据到Excel：

workbook = openpyxl.Workbook() sheet = workbook.active sheet.title = 'Baidu Hot Searches'

使用 openpyxl.Workbook() 创建一个新的工作簿对象。

调用 active 属性获取当前活动的工作表对象，并将其赋值给变量 sheet。

使用 title 属性给工作表命名为 'Baidu Hot Searches'。

6.设置标题：

sheet.cell(row=1, column=1, value='百度热搜排行榜—博主:Yan-英杰')

使用 cell() 方法选择要操作的单元格，其中 row 和 column 参数分别表示行和列的索引。

将标题字符串 '百度热搜排行榜—博主:Yan-英杰' 写入选定的单元格。

7.写入热搜数据：

for i in range(len(hot_searches)):     sheet.cell(row=i+2, column=1, value=hot_searches[i])

使用 range() 函数生成一个包含索引的范围，循环遍历 hot_searches 列表。

对于每个索引 i，使用 cell() 方法将对应的热搜词写入Excel文件中。

8.保存Excel文件：

workbook.save('百度热搜.xlsx')

使用 save() 方法将工作簿保存到指定的文件名 '百度热搜.xlsx'。

9.输出提示信息：

print('热搜数据已保存到 百度热搜.xlsx')

在控制台输出保存成功的提示信息。

效果展示：

源代码：

 import requests from bs4 import BeautifulSoup import openpyxl  # 发起HTTP请求获取百度热搜页面内容 url = 'https://top.baidu.com/board?tab=realtime' response = requests.get(url) html = response.content  # 使用BeautifulSoup解析页面内容 soup = BeautifulSoup(html, 'html.parser')  # 提取热搜数据 hot_searches = [] for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):     hot_searches.append(item.text)  # 保存热搜数据到Excel workbook = openpyxl.Workbook() sheet = workbook.active sheet.title = 'Baidu Hot Searches'  # 设置标题 sheet.cell(row=1, column=1, value='百度热搜排行榜—博主:Yan-英杰')  # 写入热搜数据 for i in range(len(hot_searches)):     sheet.cell(row=i+2, column=1, value=hot_searches[i])  workbook.save('百度热搜.xlsx') print('热搜数据已保存到 百度热搜.xlsx')

可视化代码：

import requests from bs4 import BeautifulSoup import matplotlib.pyplot as plt   # 发起HTTP请求获取百度热搜页面内容 url = 'https://top.baidu.com/board?tab=realtime' response = requests.get(url) html = response.content   # 使用BeautifulSoup解析页面内容 soup = BeautifulSoup(html, 'html.parser')   # 提取热搜数据 hot_searches = [] for item in soup.find_all('div', {'class': 'c-single-text-ellipsis'}):     hot_searches.append(item.text)   # 设置中文字体 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False   # 绘制条形图 plt.figure(figsize=(15, 10)) x = range(len(hot_searches)) y = list(reversed(range(1, len(hot_searches)+1))) plt.barh(x, y, tick_label=hot_searches, height=0.8)  # 调整条形图的高度   # 添加标题和标签 plt.title('百度热搜排行榜') plt.xlabel('排名') plt.ylabel('关键词')   # 调整坐标轴刻度 plt.xticks(range(1, len(hot_searches)+1))   # 调整条形图之间的间隔 plt.subplots_adjust(hspace=0.8, wspace=0.5)   # 显示图形 plt.tight_layout() plt.show()

三、爬取斗鱼直播照片保存到本地目录

效果展示：

源代码：

  #导入了必要的模块requests和os import requests import os     # 定义了一个函数get_html(url)， # 用于发送GET请求获取指定URL的响应数据。函数中设置了请求头部信息， # 以模拟浏览器的请求。函数返回响应数据的JSON格式内容 def get_html(url):     header = {         'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'     }     response = requests.get(url=url, headers=header)     # print(response.json())     html = response.json()     return html     # 定义了一个函数parse_html(html)， # 用于解析响应数据中的图片信息。通过分析响应数据的结构， # 提取出每个图片的URL和标题，并将其存储在一个字典中，然后将所有字典组成的列表返回 def parse_html(html):     rl_list = html['data']['rl']     # print(rl_list)     img_info_list = []     for rl in rl_list:         img_info = {}         img_info['img_url'] = rl['rs1']         img_info['title'] = rl['nn']         # print(img_url)         # exit()         img_info_list.append(img_info)     # print(img_info_list)     return img_info_list     # 定义了一个函数save_to_images(img_info_list)，用于保存图片到本地。 # 首先创建一个目录"directory"，如果目录不存在的话。然后遍历图片信息列表， # 依次下载每个图片并保存到目录中，图片的文件名为标题加上".jpg"后缀。 def save_to_images(img_info_list):     dir_path = 'directory'     if not os.path.exists(dir_path):         os.makedirs(dir_path)     for img_info in img_info_list:         img_path = os.path.join(dir_path, img_info['title'] + '.jpg')         res = requests.get(img_info['img_url'])         res_img = res.content         with open(img_path, 'wb') as f:             f.write(res_img)         # exit()   #在主程序中，设置了要爬取的URL，并调用前面定义的函数来执行爬取、解析和保存操作。 if __name__ == '__main__':     url = 'https://www.douyu.com/gapi/rknc/directory/yzRec/1'     html = get_html(url)     img_info_list = parse_html(html)     save_to_images(img_info_list)

四、爬取酷狗音乐Top500排行榜

从酷狗音乐排行榜中提取歌曲的排名、歌名、歌手和时长等信息

代码思路：

效果展示：

源码：

import requests  # 发送网络请求，获取 HTML 等信息 from bs4 import BeautifulSoup  # 解析 HTML 信息，提取需要的信息 import time  # 控制爬虫速度，防止过快被封IP     headers = {     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"     # 添加浏览器头部信息，模拟请求 }   def get_info(url):     # 参数 url ：要爬取的网页地址     web_data = requests.get(url, headers=headers)  # 发送网络请求，获取 HTML 等信息     soup = BeautifulSoup(web_data.text, 'lxml')  # 解析 HTML 信息，提取需要的信息       # 通过 CSS 选择器定位到需要的信息     ranks = soup.select('span.pc_temp_num')     titles = soup.select('div.pc_temp_songlist > ul > li > a')     times = soup.select('span.pc_temp_tips_r > span')          # for 循环遍历每个信息，并将其存储到字典中     for rank, title, time in zip(ranks, titles, times):         data = {             "rank": rank.get_text().strip(),  # 歌曲排名             "singer": title.get_text().replace("\n", "").replace("\t", "").split('-')[1],  # 歌手名             "song": title.get_text().replace("\n", "").replace("\t", "").split('-')[0],  # 歌曲名             "time": time.get_text().strip()  # 歌曲时长         }         print(data)  # 打印获取到的信息   if __name__ == '__main__':     urls = ["https://www.kugou.com/yy/rank/home/{}-8888.html".format(str(i)) for i in range(1, 24)]     # 构造要爬取的页面地址列表     for url in urls:         get_info(url)  # 调用函数，获取页面信息         time.sleep(1)  # 控制爬虫速度，防止过快被封IP

五、爬取链家二手房数据做数据分析

在数据分析和挖掘领域中，网络爬虫是一种常见的工具，用于从网页上收集数据。介绍如何使用 Python 编写简单的网络爬虫程序，从链家网上海二手房页面获取房屋信息，并将数据保存到 Excel 文件中。

效果图：

代码思路：

首先，我们定义了一个函数 fetch_data(page_number)，用于获取指定页面的房屋信息数据。这个函数会构建对应页数的 URL，并发送 GET 请求获取页面内容。然后，使用 BeautifulSoup 解析页面内容，并提取每个房屋信息的相关数据，如区域、房型、关注人数、单价和总价。最终将提取的数据以字典形式存储在列表中，并返回该列表。

接下来，我们定义了主函数 main()，该函数控制整个爬取和保存数据的流程。在主函数中，我们循环爬取前 10 页的数据，调用 fetch_data(page_number) 函数获取每一页的数据，并将数据追加到列表中。然后，将所有爬取的数据存储在 DataFrame 中，并使用 df.to_excel('lianjia_data.xlsx', index=False) 将数据保存到 Excel 文件中。

最后，在程序的入口处，通过 if __name__ == "__main__": 来执行主函数 main()。

源代码：

import requests   from bs4 import BeautifulSoup   import pandas as pd     # 收集单页数据 xpanx.com   def fetch_data(page_number):     url = f"https://sh.lianjia.com/ershoufang/pg{page_number}/"       response = requests.get(url)       if response.status_code != 200:         print("请求失败")           return []       soup = BeautifulSoup(response.text, 'html.parser')       rows = []       for house_info in soup.find_all("li", {"class": "clear LOGVIEWDATA LOGCLICKDATA"}):         row = {}           # 使用您提供的类名来获取数据 xpanx.com           row['区域'] = house_info.find("div", {"class": "positionInfo"}).get_text() if house_info.find("div", {             "class": "positionInfo"}) else None           row['房型'] = house_info.find("div", {"class": "houseInfo"}).get_text() if house_info.find("div", {             "class": "houseInfo"}) else None           row['关注'] = house_info.find("div", {"class": "followInfo"}).get_text() if house_info.find("div", {             "class": "followInfo"}) else None           row['单价'] = house_info.find("div", {"class": "unitPrice"}).get_text() if house_info.find("div", {             "class": "unitPrice"}) else None           row['总价'] = house_info.find("div", {"class": "priceInfo"}).get_text() if house_info.find("div", {             "class": "priceInfo"}) else None           rows.append(row)       return rows     # 主函数   def main():     all_data = []       for i in range(1, 11):  # 爬取前10页数据作为示例           print(f"正在爬取第{i}页...")           all_data += fetch_data(i)       # 保存数据到Excel xpanx.com       df = pd.DataFrame(all_data)       df.to_excel('lianjia_data.xlsx', index=False)       print("数据已保存到 'lianjia_data.xlsx'")     if __name__ == "__main__":     main()

六、爬取豆瓣电影排行榜TOP250存储到CSV文件中

代码思路：

        首先，我们导入了需要用到的三个Python模块：requests、lxml和csv。

        然后，我们定义了豆瓣电影TOP250页面的URL地址，并使用getSource(url)函数获取网页源码。

        接着，我们定义了一个getEveryItem(source)函数，它使用XPath表达式从HTML源码中提取出每部电影的标题、URL、评分和引言，并将这些信息存储到一个字典中，最后将所有电影的字典存储到一个列表中并返回。

        然后，我们定义了一个writeData(movieList)函数，它使用csv库的DictWriter类创建一个CSV写入对象，然后将电影信息列表逐行写入CSV文件。

        最后，在if __name__ == '__main__'语句块中，我们定义了一个空的电影信息列表movieList，然后循环遍历前10页豆瓣电影TOP250页面，分别抓取每一页的网页源码，并使用getEveryItem()函数解析出电影信息并存储到movieList中，最后使用writeData()函数将电影信息写入CSV文件。

效果图：

源代码：

私信博主进入交流群，一起学习探讨，如果对CSDN周边以及有偿返现活动感兴趣： 可添加博主：Yan--yingjie 如果想免费获取图书，也可添加博主微信，每周免费送数十本     #代码首先导入了需要使用的模块：requests、lxml和csv。 import requests from lxml import etree import csv   # doubanUrl = 'https://movie.douban.com/top250?start={}&filter='     # 然后定义了豆瓣电影TOP250页面的URL地址，并实现了一个函数getSource(url)来获取网页的源码。该函数发送HTTP请求，添加了请求头信息以防止被网站识别为爬虫，并通过requests.get()方法获取网页源码。 def getSource(url):     # 反爬 填写headers请求头     headers = {         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'     }       response = requests.get(url, headers=headers)     # 防止出现乱码     response.encoding = 'utf-8'     # print(response.text)     return response.text     # 定义了一个函数getEveryItem(source)来解析每个电影的信息。首先，使用lxml库的etree模块将源码转换为HTML元素对象。然后，使用XPath表达式定位到包含电影信息的每个HTML元素。通过对每个元素进行XPath查询，提取出电影的标题、副标题、URL、评分和引言等信息。最后，将这些信息存储在一个字典中，并将所有电影的字典存储在一个列表中。 def getEveryItem(source):     html_element = etree.HTML(source)       movieItemList = html_element.xpath('//div[@class="info"]')       # 定义一个空的列表     movieList = []       for eachMoive in movieItemList:           # 创建一个字典 像列表中存储数据[{电影一},{电影二}......]         movieDict = {}           title = eachMoive.xpath('div[@class="hd"]/a/span[@class="title"]/text()')  # 标题         otherTitle = eachMoive.xpath('div[@class="hd"]/a/span[@class="other"]/text()')  # 副标题         link = eachMoive.xpath('div[@class="hd"]/a/@href')[0]  # url         star = eachMoive.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]  # 评分         quote = eachMoive.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')  # 引言（名句）           if quote:             quote = quote[0]         else:             quote = ''         # 保存数据         movieDict['title'] = ''.join(title + otherTitle)         movieDict['url'] = link         movieDict['star'] = star         movieDict['quote'] = quote           movieList.append(movieDict)           print(movieList)     return movieList     # 保存数据 def writeData(movieList):     with open('douban.csv', 'w', encoding='utf-8', newline='') as f:         writer = csv.DictWriter(f, fieldnames=['title', 'star', 'quote', 'url'])           writer.writeheader()  # 写入表头           for each in movieList:             writer.writerow(each)     if __name__ == '__main__':     movieList = []       # 一共有10页       for i in range(10):         pageLink = doubanUrl.format(i * 25)           source = getSource(pageLink)           movieList += getEveryItem(source)       writeData(movieList)

支持

资讯

【python】六个常见爬虫案例【附源码】

一、爬取豆瓣电影排行榜Top250存储到Excel文件

效果展示：

源代码：

二、爬取百度热搜排行榜Top50+可视化

2.1 代码思路：

效果展示：

源代码：

可视化代码：

三、爬取斗鱼直播照片保存到本地目录

效果展示：

源代码：

四、爬取酷狗音乐Top500排行榜

代码思路：

效果展示：

源码：

五、爬取链家二手房数据做数据分析

效果图：

代码思路：

源代码：

六、爬取豆瓣电影排行榜TOP250存储到CSV文件中

代码思路：

效果图：

源代码：

相关阅读

广告一刻