[sklearn in Practice] An Introduction to the datasets Module

Published: 2024-07-28 20:53

1 Datasets in sklearn.datasets

sklearn.datasets covers four main categories of datasets.

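1.1 Toy datasets

These are small standard datasets bundled with scikit-learn. They require no download from any external site and are loaded with sklearn.datasets.load_<name>(), for example the iris, wine, and diabetes datasets. The Boston house prices dataset, which many older tutorials cite, was removed in scikit-learn 1.2, so load_boston() is no longer available in current releases.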
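As a minimal sketch of the load_<name> pattern, the snippet below loads one classification dataset (wine) and one regression dataset (diabetes). The choice of these two loaders is only for illustration; any other toy loader follows the same pattern.

```python
from sklearn.datasets import load_diabetes, load_wine

# Classification toy dataset: 178 samples, 13 features, 3 classes
wine = load_wine()
print(wine.data.shape, wine.target.shape)
print(wine.target_names)

# Regression toy dataset: 442 samples, 10 features, continuous target
diabetes = load_diabetes()
print(diabetes.data.shape, diabetes.target.shape)
```

Both calls read files shipped inside the scikit-learn package, so they work offline.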

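1.2 Real world datasets

These datasets are usually downloaded from the network with the sklearn.datasets.fetch_<name> functions. They contain data collected from real sources and suit more complex machine learning tasks. One example is the 20 Newsgroups dataset, a large dataset for text classification.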
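The following sketch shows how a fetcher might be called. The helper name fetch_newsgroups_sample and the two chosen categories are assumptions made for illustration, and the actual download is left commented out because it needs network access. Fetchers keep their downloads in a cache directory ("data home"), which is created here in a temporary location.

```python
import tempfile

from sklearn.datasets import fetch_20newsgroups, get_data_home


def fetch_newsgroups_sample(cache_dir):
    """Hypothetical helper: download (or reuse a cached copy of) two categories."""
    return fetch_20newsgroups(
        data_home=cache_dir,                   # where the download is cached
        subset="train",                        # "train", "test", or "all"
        categories=["sci.space", "rec.autos"], # restrict to two newsgroups
    )


# get_data_home creates the directory if needed and returns its path
cache_dir = get_data_home(data_home=tempfile.mkdtemp())
print(cache_dir)

# Uncomment to download (requires network access):
# news = fetch_newsgroups_sample(cache_dir)
# print(len(news.data), news.target_names)
```

Without the data_home argument, fetchers fall back to the default cache directory, so a dataset is downloaded only once per machine.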

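1.3 Generated datasets

sklearn.datasets also provides a family of functions that generate artificial datasets, such as make_classification and make_regression. They build datasets for classification, regression, and similar tasks from parameters supplied by the user.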
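As a small illustration of how the parameters shape the output, the sketch below generates a binary classification problem. The specific values (200 samples, 5 features, 3 informative) are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification

# 200 samples, 5 features of which 3 carry signal, 2 classes
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    random_state=0,  # fixed seed so the data is reproducible
)
print(X.shape, y.shape)
print(sorted(set(y)))  # the class labels present in y
```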

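1.4 Loading other datasets

sklearn.datasets also offers ways to load other kinds of data, for example:

Sample images: load_sample_images() and load_sample_image() return a couple of JPEG images bundled with scikit-learn for image processing experiments. (The Olivetti faces dataset is not a sample image; it is downloaded with fetch_olivetti_faces and belongs to the real world datasets.)
Datasets in SVMLight or LibSVM format, which is common in machine learning competitions and research, can be loaded with load_svmlight_file().
Downloading from OpenML: OpenML is a public repository for machine learning data and experiments. The sklearn.datasets.fetch_openml() function downloads datasets from it.
Loading datasets from external sources:
- kaggle: https://www.kaggle.com
- Tianchi: https://tianchi.aliyun.com/dataset
- PaddlePaddle AI Studio: https://aistudio.baidu.com/aistudio/datasetoverview
- iFLYTEK: http://challenge.xfyun.cn/
- Sogou Labs: http://www.sogou.com/labs/resource/list_pingce.php
- DC competitions: https://www.pkbigdata.com/common/cmptIndex.html
- DF competitions: https://www.datafountain.cn/dataset
- Google Dataset Search: https://toolbox.google.com/datasetsearch
- Microsoft Research Open Data: https://msropendata.com/
- Kesci: https://www.kesci.com/home/dataset
- COCO: a large dataset for object detection, segmentation, and captioning.
- ImageNet: about 1,500,000 images in total, each with multiple bounding boxes and corresponding class labels. Size: about 150 GB.
- Yelp Reviews: millions of user reviews, business types, and more than 200,000 photos from several large cities; a widely used NLP challenge dataset. Size: 2.66 GB JSON, 2.9 GB SQL, and 7.5 GB photos (all compressed). Contents: 5,200,000 reviews, 174,000 business types, 200,000 photos, and 11 large cities.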
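To make the SVMLight/LibSVM option above concrete, here is a minimal round-trip sketch: it writes a tiny made-up matrix to a temporary file with dump_svmlight_file and reads it back with load_svmlight_file. The data and the file name toy.svmlight are invented for the example.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny made-up dataset: 2 samples, 3 features
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, 0])

# Write it in SVMLight format, then load it back
path = os.path.join(tempfile.mkdtemp(), "toy.svmlight")
dump_svmlight_file(X, y, path, zero_based=True)
X_loaded, y_loaded = load_svmlight_file(path, n_features=3, zero_based=True)

# X_loaded is a SciPy sparse matrix; convert to dense only for small data
print(X_loaded.toarray())
print(y_loaded)
```

The format stores only nonzero entries, which is why the loader returns a sparse matrix rather than a dense array.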

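Apart from the toy and generated datasets, it is recommended to download the data from the web and load it with pandas.

For example, loading an iris file:

    import pandas as pd
    import seaborn as sns  # plotting library built on matplotlib and pandas
    import matplotlib.pyplot as plt

    # Step 1: switch the sans-serif font to one with Chinese glyphs (set before plotting)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # Step 2 (optional): fix the rendering of minus signs on the axes
    # plt.rcParams['axes.unicode_minus'] = False

    # The author renamed the columns in Chinese, hence the gbk encoding.
    # The column names passed to relplot must match the header of your own file.
    data = pd.read_csv("/path/to/iris.csv", encoding='gbk')
    sns.relplot(x='petal_width', y='sepal_length', hue="species", data=data)  # seaborn is not covered in detail here
    plt.show()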

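Note that the datasets in sklearn.datasets exist mainly for teaching and getting started. Real applications may call for larger and more complex datasets to train a model. The library also adds, changes, and removes datasets over time, so consult the latest official documentation for accurate information.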

2 Return types

Both loader and fetcher functions return a Bunch object holding at least two items: an array of shape n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the target values, with key target. The datasets also contain a full description in their DESCR attribute and some contain feature_names and target_names.

data: the feature array, a two-dimensional numpy.ndarray of shape (n_samples, n_features)
target: the label array, a one-dimensional numpy.ndarray of length n_samples
DESCR: the dataset description
feature_names: the feature names; only some datasets provide them (raw text data such as 20 newsgroups does not), so check the keys of the returned object
target_names: the label names; generally absent from regression datasets

For example:

    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris.keys())  # list the keys (attributes)
    # dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
    print(iris.data[:5])    # first five rows of feature values
    print(iris.target[:5])  # first five target values
    print(iris.DESCR)       # dataset description
    print(iris.data.shape, iris.target.shape)  # shapes: (150, 4) (150,)
    print(iris.feature_names)  # the 4 features:
    # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
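The frame key listed in the output above is only populated on request. As a short sketch, assuming pandas is installed, passing as_frame=True makes the loader return the data as a pandas DataFrame with the target appended as an extra column:

```python
from sklearn.datasets import load_iris

# as_frame=True fills the `frame` key with a pandas DataFrame
iris_df = load_iris(as_frame=True)
df = iris_df.frame

print(df.shape)             # 4 feature columns plus the target column
print(df.columns.tolist())
print(df.head())
```

This is convenient when the next step is exploration or plotting with pandas-based tools.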

It’s also possible for almost all of these functions to constrain the output to be a tuple containing only the data and the target, by setting the return_X_y parameter to True.

For example:

    from sklearn.datasets import load_iris

    data, target = load_iris(return_X_y=True)

The dataset generation functions return a tuple (X, y) consisting of an n_samples * n_features numpy array X and an array of length n_samples containing the targets y.
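The (X, y) return convention can be checked with a quick sketch. The sample counts and feature counts below are arbitrary values picked for the example.

```python
from sklearn.datasets import make_blobs, make_regression

# Regression data: X has shape (n_samples, n_features), y has length n_samples
X_reg, y_reg = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)
print(X_reg.shape, y_reg.shape)

# Clustering data: y holds the index of the blob each sample was drawn from
X_blobs, y_blobs = make_blobs(n_samples=90, centers=3, n_features=2, random_state=42)
print(X_blobs.shape, y_blobs.shape)
```

Because the generators already return a plain tuple, there is no Bunch object and no return_X_y parameter to set.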