A Gentle Introduction to Web Scraping: HTML and CSS Structure

Published: 2024-08-03 06:08

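HTML Structure

HTML (HyperText Markup Language) is the basic language for building web pages. It uses tags to define a page's structure and content. A basic HTML document looks like this: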

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <a href="http://example.com">This is a link</a>
  </body>
</html>

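Basic tags:

<!DOCTYPE html>: Declares the document type and tells the browser that this is an HTML5 document.
<html>: The root element, which wraps the entire HTML document.
<head>: Holds the page's metadata, such as the encoding, title, and styles.
<meta charset="UTF-8">: Declares the document's character encoding as UTF-8.
<title>: Sets the page title shown on the browser tab.
<body>: Holds the visible content of the page.
<h1>: A level-one heading, used for the most important heading.
<p>: A paragraph of text.
<a>: A hyperlink; its href attribute holds the target address.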

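Other common tags:

<h1> - <h6>: Heading tags; <h1> is the highest level and <h6> the lowest.
<div>: A block-level element that defines a division or section of the document.
<span>: An inline element that groups part of the text within a line.
<ul>: An unordered (bulleted) list.
<ol>: An ordered (numbered) list.
<li>: A list item inside <ul> or <ol>.
<img>: Embeds an image.
<table>: Defines a table.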
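
Because a scraper works with the nesting of these tags rather than with the rendered page, it can help to see that nesting printed out. The following is a minimal sketch that uses Python's standard-library html.parser instead of BeautifulSoup; the names OutlineParser and VOID_TAGS are made up for this illustration, and the void-tag list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not increase the depth.
# Illustrative subset only, not the full list from the HTML standard.
VOID_TAGS = {"meta", "img", "br", "hr", "input", "link"}


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


html_doc = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>"""

parser = OutlineParser()
parser.feed(html_doc)

# Print one tag per line, indented by its depth in the document tree.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The sketch assumes well-formed input in which every non-void tag is closed; real pages often are not, which is one reason a more tolerant library is usually preferred for actual scraping.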

CSS Selectors

CSS (Cascading Style Sheets) controls the styling of an HTML document. A selector is the pattern CSS uses to pick out elements. Common CSS selectors include:

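Tag selector: Selects all elements with the given tag name.

p {
  color: blue;
}

Class selector: Selects all elements with the given class attribute. The class name is prefixed with a dot (.).

.example {
  font-size: 16px;
}

ID selector: Selects the element with the given id attribute. The ID is prefixed with a hash (#).

#header {
  background-color: gray;
}

Attribute selector: Selects elements that have the given attribute, optionally with a specific value.

[type="text"] {
  border: 1px solid black;
}

Descendant selector: Selects all matching elements nested anywhere inside another element, at any depth, not only its direct children.

div p {
  color: red;
}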

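Other common selectors:

Group selector: Applies one rule to every element matched by any of the comma-separated selectors.

h1, h2, h3 {
  font-family: Arial, sans-serif;
}

Child selector: Selects only the matching elements that are direct children of another element.

ul > li {
  list-style-type: square;
}

Pseudo-class selector: Selects elements that are in a particular state.

a:hover {
  color: green;
}

Pseudo-element selector: Selects a specific part of an element's content.

p::first-line {
  font-weight: bold;
}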
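
For scraping, the same selector syntax is useful for locating elements rather than styling them. As a rough sketch, assuming the beautifulsoup4 package is installed, the selectors above can be passed to BeautifulSoup's select() and select_one() methods. The HTML snippet and variable names here are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet written only to exercise each selector type.
html_doc = """
<div id="header"><p class="example">Intro</p></div>
<ul>
  <li>One</li>
  <li>Two</li>
</ul>
<input type="text" name="q">
<h1>Title</h1>
<h2>Subtitle</h2>
"""

soup = BeautifulSoup(html_doc, "html.parser")

by_tag = soup.select("p")                # tag selector
by_class = soup.select(".example")       # class selector
by_id = soup.select_one("#header")       # ID selector, first match only
by_attr = soup.select('[type="text"]')   # attribute selector
descendants = soup.select("div p")       # descendant selector
children = soup.select("ul > li")        # child selector
group = soup.select("h1, h2, h3")        # group selector

print([li.get_text() for li in children])  # ['One', 'Two']
print([h.name for h in group])             # ['h1', 'h2']
print(by_attr[0]["name"])                  # q
```

State-dependent selectors such as a:hover and pseudo-elements such as p::first-line describe how a browser renders the page, so they are generally of little use when parsing static HTML.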

Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It makes extracting data from web pages straightforward.

Step 1: Install BeautifulSoup

pip install beautifulsoup4

Step 2: Write the code that parses the HTML

from bs4 import BeautifulSoup

# Sample HTML
html_doc = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Example Page</title>
</head>
<body>
    <h1>Example Header</h1>
    <p class="description">This is a description paragraph.</p>
    <a href="http://example.com" id="example-link">Example Link</a>
</body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the page title
title = soup.title.string
print(f"Page title: {title}")

# Extract the level-one heading
header = soup.h1.string
print(f"Heading: {header}")

# Extract the paragraph text
description = soup.find('p', class_='description').string
print(f"Description: {description}")

# Extract the link address
link = soup.find('a', id='example-link')['href']
print(f"Link address: {link}")

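Code explanation:

Create the BeautifulSoup object: Parse the HTML document with Python's built-in html.parser.

soup = BeautifulSoup(html_doc, 'html.parser')

Extract the title: soup.title.string returns the text of the <title> element.

title = soup.title.string
print(f"Page title: {title}")

Extract the level-one heading: soup.h1.string returns the text of the first <h1> element.

header = soup.h1.string
print(f"Heading: {header}")

Extract the paragraph text: soup.find combines the tag name with a class name to locate the paragraph. The keyword is class_ because class is a reserved word in Python.

description = soup.find('p', class_='description').string
print(f"Description: {description}")

Extract the link: soup.find combines the tag name with an ID to locate the link, and ['href'] reads its href attribute.

link = soup.find('a', id='example-link')['href']
print(f"Link address: {link}")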
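
Two details of the calls above are easy to trip over on real pages: .string is None when an element contains more than one child node, and find() returns None when nothing matches. The sketch below, written against a made-up snippet, shows one cautious way to handle both.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the paragraph mixes plain text with a nested <b> tag.
html_doc = '<p class="description">This is <b>bold</b> text.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

paragraph = soup.find("p", class_="description")
print(paragraph.string)      # None, because the <p> has several children
print(paragraph.get_text())  # This is bold text.

# find() returns None when no element matches, so check before indexing.
missing = soup.find("a", id="example-link")
link = missing["href"] if missing is not None else None
print(link)                  # None
```

In the earlier example .string works because each element holds a single piece of text; get_text() is the more forgiving choice when the markup is not under your control.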

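Common BeautifulSoup Methods

find(): Returns the first matching element, or None if nothing matches.

soup.find('a')

find_all(): Returns a list of all matching elements.

soup.find_all('a')

select(): Selects elements with a CSS selector and returns them as a list.

soup.select('a[href]')

get_text(): Returns the text content of an element together with that of all its descendants.

soup.get_text()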
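
To see how the four methods differ, here is a small side-by-side sketch run against a made-up snippet. The markup and variable names are assumptions for the illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three links, one of them without an href attribute.
html_doc = """
<div class="links">
  <a href="http://example1.com">Link 1</a>
  <a href="http://example2.com">Link 2</a>
  <a>No address</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a")                 # a single Tag, or None if nothing matches
all_links = soup.find_all("a")         # every <a>, in document order
with_href = soup.select("a[href]")     # only <a> elements that carry an href
text = soup.get_text(" ", strip=True)  # all text, joined by single spaces

print(first["href"])                   # http://example1.com
print(len(all_links), len(with_href))  # 3 2
print(text)                            # Link 1 Link 2 No address
```

Passing a separator and strip=True to get_text() is optional; without them the result keeps the original whitespace and line breaks between elements.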

Example: Parsing More Complex HTML

Suppose we have a more complex HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample Page</title>
  </head>
  <body>
    <div id="content">
      <h1>Sample Header</h1>
      <p class="description">This is a sample description.</p>
      <div class="links">
        <a href="http://example1.com" class="external">Link 1</a>
        <a href="http://example2.com" class="external">Link 2</a>
      </div>
    </div>
  </body>
</html>

We can parse this document with the following code:

from bs4 import BeautifulSoup

# Sample HTML
html_doc = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample Page</title>
</head>
<body>
    <div id="content">
        <h1>Sample Header</h1>
        <p class="description">This is a sample description.</p>
        <div class="links">
            <a href="http://example1.com" class="external">Link 1</a>
            <a href="http://example2.com" class="external">Link 2</a>
        </div>
    </div>
</body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the page title
title = soup.title.string
print(f"Page title: {title}")

# Extract the level-one heading
header = soup.find('h1').string
print(f"Heading: {header}")

# Extract the paragraph text
description = soup.find('p', class_='description').string
print(f"Description: {description}")

# Extract every link with the "external" class
links = soup.find_all('a', class_='external')
for link in links:
    href = link['href']
    text = link.string
    print(f"Link text: {text}, link address: {href}")

Code explanation: