Scikit-Learn Basics Tutorial

Published: 2024-08-03 04:16

1. Installing Scikit-Learn

First, make sure you have a working Python environment. You can then install scikit-learn with pip or conda:

pip install -U scikit-learn

Or, if you use the Anaconda distribution, run:

conda install scikit-learn
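
To confirm that the installation worked, one quick check is to import the package and print its version. This is only a minimal sketch; the version string you see depends on your environment.

```python
import sklearn

# Importing without an error means the package is installed;
# the version string tells you which release you have.
print("scikit-learn version:", sklearn.__version__)
```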

2. Importing the Libraries

Once installation is complete, import the libraries used in this tutorial:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

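3. Loading the Data

Scikit-Learn ships with several built-in datasets, such as the Iris dataset. The following example loads it and prints the first few samples:

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Inspect the first few rows
print("Features:", X[:5])
print("Labels:", y[:5])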
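
If you prefer to inspect the data with column names, load_iris can also return pandas objects. The sketch below assumes a scikit-learn version that supports the as_frame option (0.23 or later); the variable names iris_frame and df are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_frame = load_iris(as_frame=True)
df = iris_frame.frame  # four feature columns plus a "target" column

print(df.head())
# Number of samples per class
print(df["target"].value_counts())
```

Viewing the data as a DataFrame makes it easier to spot which column is which before modeling.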

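4. Splitting the Data

To evaluate model performance, we split the data into a training set and a test set. Here 20% of the samples are held out for testing, and random_state=42 makes the split reproducible: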

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
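
For classification problems, it can help to keep the class proportions identical in both subsets. The following sketch is a variation on the split above, not part of the original walkthrough: it passes stratify=y to train_test_split and prints the class counts of each subset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```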

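5. Preprocessing the Data

Data preprocessing is an important step in machine learning. It may include handling missing values, feature scaling, and so on. Here we standardize the features, fitting the scaler on the training set only and reusing it on the test set:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test set
X_test_scaled = scaler.transform(X_test)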
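
The Iris dataset has no missing values, so the code above only covers scaling. As a rough illustration of missing-value handling, the sketch below uses SimpleImputer on a small made-up matrix (X_missing is hypothetical data, not from the tutorial) and fills each gap with the column mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with two missing entries (np.nan)
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)
```

As with the scaler, an imputer would normally be fitted on the training set and then applied to the test set with transform.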

6. Training the Model

Choose a model and train it. Here we use logistic regression as an example:

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

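7. Evaluating the Model

Evaluate the model's performance on the test set:

y_pred = model.predict(X_test_scaled)
# Fraction of test samples that were classified correctly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)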
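
Accuracy is a single number and hides which classes get confused with each other. As a complementary sketch (not part of the original walkthrough), the block below repeats the earlier steps so it runs on its own, then prints a confusion matrix and a per-class report.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Repeat the earlier steps so this block is self-contained
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```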

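8. Tuning the Model

We can use grid search or random search to tune the model's hyperparameters and improve performance. The example below uses grid search with 5-fold cross-validation. Because the default lbfgs solver does not support the l1 penalty, the grid pins the solver to saga, which supports both penalties:

from sklearn.model_selection import GridSearchCV

# saga supports both l1 and l2; the default lbfgs solver would fail on l1
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['saga']}
# max_iter is raised so that saga has room to converge
grid_search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

best_params = grid_search.best_params_
print("Best Parameters:", best_params)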
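
The text mentions random search as an alternative to grid search. A minimal sketch with RandomizedSearchCV follows; the search range for C, the number of iterations, and the use of scipy.stats.loguniform are assumptions chosen for illustration, not values from the tutorial.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Repeat the earlier steps so this block is self-contained
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Sample C from a log-uniform distribution instead of a fixed list
param_distributions = {"C": loguniform(1e-2, 1e2)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled parameter settings
    cv=5,
    random_state=42,  # makes the sampled settings reproducible
)
random_search.fit(X_train_scaled, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)
```

Random search tries a fixed number of sampled settings, which tends to scale better than a full grid when there are many hyperparameters.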

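9. Applying the Model

Finally, train the final model with the best parameters and use it to predict on new data:

# max_iter is raised so that slower solvers have room to converge
final_model = LogisticRegression(max_iter=5000, **best_params)
final_model.fit(X_train_scaled, y_train)

new_data = [[5.1, 3.5, 1.4, 0.2]]  # example new data point
# Apply the scaler fitted on the training data before predicting
prediction = final_model.predict(scaler.transform(new_data))
print("Prediction:", prediction)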
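
Remembering to scale new data by hand is easy to forget. One possible alternative, shown here as a sketch rather than as the tutorial's own method, is to bundle the scaler and the classifier with make_pipeline so that scaling is applied automatically at prediction time. The default LogisticRegression settings are used here for simplicity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler and the model together on raw training data
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# New data is passed unscaled; the pipeline scales it internally
new_data = [[5.1, 3.5, 1.4, 0.2]]
prediction = pipeline.predict(new_data)
print("Prediction:", prediction)
print("Test accuracy:", pipeline.score(X_test, y_test))
```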