Download Hugging Face model repositories and datasets quickly and conveniently

Published: 2024-08-03 04:35

Method 1: a CLI tool that downloads Hugging Face models and datasets with aria2/wget + git
- Features
- Usage
Method 2: downloading models (personal usage notes)

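Method 1: a CLI tool that downloads Hugging Face models and datasets with aria2/wget + git

Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f.

How to use it: copy hfd.sh to the target machine, then follow the reference commands below to download a dataset or model.

🤗 Hugging Face model downloader

The official huggingface-cli lacks multithreaded download support, and hf_transfer's error handling is inadequate. This command-line tool works around both by using wget or aria2 for the LFS files and git clone for everything else.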

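Features

⏯️ Resume from breakpoint: you can press Ctrl+C at any time and re-run the command later to continue.
🚀 Multithreaded download: uses multiple threads to speed up downloads.
🚫 File exclusion: use --exclude or --include to skip or select files, which saves time on models that ship the same weights in several formats (e.g., *.bin or *.safetensors).
🔐 Authentication: for gated models that require a Hugging Face login, pass --hf_username and --hf_token.
🪞 Mirror site support: set the HF_ENDPOINT environment variable.
🌍 Proxy support: set the HTTPS_PROXY environment variable.
📦 Simple: depends only on git and aria2c/wget (the copy of the script in this post also checks for curl and git-lfs).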
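As a rough illustration of the mirror setting, the sketch below resolves the endpoint the way the copy of hfd.sh in this post does with the shell expansion ${HF_ENDPOINT:-"https://hf-mirror.com"}: it reads HF_ENDPOINT and falls back to the mirror when the variable is unset or empty. The name resolve_endpoint is hypothetical and exists only for this example.

```python
import os

# Default taken from the hfd.sh copy in this post; other copies may use a different one.
DEFAULT_ENDPOINT = "https://hf-mirror.com"


def resolve_endpoint(env=None):
    """Return HF_ENDPOINT if it is set and non-empty, else the default (hypothetical helper)."""
    env = os.environ if env is None else env
    # `or` treats an empty string like an unset variable, matching ${VAR:-default}.
    return env.get("HF_ENDPOINT") or DEFAULT_ENDPOINT


if __name__ == "__main__":
    print(resolve_endpoint())
```

To download from the official site instead of the mirror, export HF_ENDPOINT=https://huggingface.co before running the script.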

Usage

First, download hfd.sh or clone this repository, then make the script executable.

chmod a+x hfd.sh

For convenience, you can create an alias:

alias hfd="$PWD/hfd.sh"

Usage instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login:

Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
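The sketch below shows, under stated assumptions, how the two credentials are used by the hfd.sh copy in this post: the username and token are embedded in the git clone URL, and the token alone is sent as a Bearer header when wget or aria2c fetches the LFS files. The function names build_clone_url and auth_header are hypothetical; only the URL and header formats come from the script.

```python
def build_clone_url(endpoint, repo_id, username=None, token=None):
    """Return the git clone URL, with credentials embedded when both are given."""
    if username and token:
        # Equivalent of the shell expansion ${HF_ENDPOINT#https://}.
        prefix = "https://"
        host = endpoint[len(prefix):] if endpoint.startswith(prefix) else endpoint
        return f"https://{username}:{token}@{host}/{repo_id}"
    return f"{endpoint}/{repo_id}"


def auth_header(token):
    """Return the header passed to wget/aria2c via --header."""
    return f"Authorization: Bearer {token}"


if __name__ == "__main__":
    # Placeholder credentials, not real ones.
    print(build_clone_url("https://hf-mirror.com", "meta-llama/Llama-2-7b", "myuser", "mytoken"))
    print(auth_header("mytoken"))
```

Because the token ends up in the clone URL, it is also stored in the cloned repository's .git/config, so treat the download directory as sensitive.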

Download a model while excluding certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors
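To make the filtering rule concrete, here is a minimal sketch of the decision hfd.sh makes for each LFS file: a file is skipped if an include pattern is set and the file does not match it, or if an exclude pattern is set and the file matches it. The sketch approximates the script's bash glob comparison with Python's fnmatchcase, which is an assumption (the two agree for simple patterns such as *.safetensors or vae/*, but this is not a full reimplementation). The name should_download is hypothetical.

```python
from fnmatch import fnmatchcase


def should_download(filename, include=None, exclude=None):
    """Apply the include check first, then the exclude check, like the script."""
    if include and not fnmatchcase(filename, include):
        return False
    if exclude and fnmatchcase(filename, exclude):
        return False
    return True


if __name__ == "__main__":
    files = ["flax_model.msgpack", "model.safetensors", "onnx/decoder_model.onnx"]
    kept = [f for f in files if should_download(f, exclude="*.safetensors")]
    print(kept)  # ['flax_model.msgpack', 'onnx/decoder_model.onnx']
```

On the command line it is safer to quote the pattern (for example --exclude '*.safetensors') so that the shell does not expand it against files in the current directory before hfd.sh sees it.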

Download with aria2c and multiple threads (aria2c is the default tool and uses 4 threads unless -x is given):

hfd bigscience/bloom-560m

Output:
During the download, the file URLs are displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
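The URLs in this output follow a fixed pattern that can be sketched as below: the endpoint, the repo ID (prefixed with datasets/ when --dataset is passed), then resolve/main/ and the file path. The function name lfs_url is hypothetical; the pattern itself is the one used in hfd.sh, which always downloads from the main revision.

```python
def lfs_url(endpoint, repo_id, filename, dataset=False):
    """Build the direct download URL for one LFS file on the main revision."""
    repo_path = f"datasets/{repo_id}" if dataset else repo_id
    return f"{endpoint}/{repo_path}/resolve/main/{filename}"


if __name__ == "__main__":
    print(lfs_url("https://huggingface.co", "bigscience/bloom-560m", "onnx/decoder_model.onnx"))
```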

# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs

# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet
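For reference, the sketch below assembles the aria2c invocation that hfd.sh runs for each file, so the effect of -x is easier to see: the thread count is passed to both -x (connections per server) and -s (number of splits). The function name aria2c_argv is hypothetical; the flags are copied from the script.

```python
import os


def aria2c_argv(url, file, threads=4, token=None):
    """Return the aria2c argument list hfd.sh would run for one file."""
    argv = ["aria2c"]
    if token:
        argv.append(f"--header=Authorization: Bearer {token}")
    argv += [
        "--console-log-level=error",
        "--file-allocation=none",
        "-x", str(threads),   # max connections per server
        "-s", str(threads),   # number of pieces downloaded in parallel
        "-k", "1M",           # minimum split size
        "-c", url,            # -c resumes a partial download
        "-d", os.path.dirname(file) or ".",  # like `dirname` in the script
        "-o", os.path.basename(file),
    ]
    return argv


if __name__ == "__main__":
    print(aria2c_argv("https://hf-mirror.com/a/b/resolve/main/x/y.bin", "x/y.bin", threads=16))
```

aria2c accepts at most 16 connections per server, so -x 16 in the reference command is the upper limit rather than an arbitrary choice.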

hfd.sh

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"

Method 2: downloading models (personal usage notes)

This code does not preserve the repository's directory structure; see the improved version below.

import datetime
import os
import threading

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repository to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Fetch the repository info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the list of files in the repository
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Resumable download command (all files land directly in save_path)
        cmds.append(f'wget -c {url} -P {save_path}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # Start one thread per file, then wait for all of them
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())

Preserving the directory structure

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repository to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the repository info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the list of files in the repository
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command, saved into that subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # Start one thread per file, then wait for all of them
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
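The only difference from the first version is where each file is saved: the target directory is derived from the file's path inside the repository. The sketch below isolates that step as a pure function so it can be checked without any network access. The name wget_command is hypothetical, and quoting the arguments with shlex.quote is an addition of this sketch (the script above interpolates them unquoted), included so that paths containing spaces survive the shell.

```python
import os
import shlex


def wget_command(url, save_path, repo_file):
    """Build a resumable wget command that saves repo_file under its own subdirectory."""
    # e.g. save_path="./m", repo_file="onnx/b.bin" -> local_dir="./m/onnx"
    local_dir = os.path.dirname(os.path.join(save_path, repo_file))
    return f"wget -c {shlex.quote(url)} -P {shlex.quote(local_dir)}"


if __name__ == "__main__":
    # example.invalid is a placeholder host, not a real download location.
    print(wget_command("https://example.invalid/a/b.bin", "./m", "onnx/b.bin"))
```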

Downloading datasets

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # ID of the dataset to download
    dataset_id = "openai/webtext"
    # Local storage path
    save_path = './webtext'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the dataset info (the parameter is named repo_id)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )

    # Collect the list of files in the dataset repository
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Resolve the download URL; repo_type="dataset" selects the datasets/ namespace
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command, saved into that subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # Start one thread per file, then wait for all of them
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
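The scripts in this method start one thread, and therefore one wget process, per file, and they ignore each command's exit status. For repositories with many files, a bounded pool may be gentler on the network and makes failures visible. The following is a minimal sketch of that alternative, not part of the original notes; run_command and run_all are hypothetical names, and the runner parameter exists only so the logic can be exercised without launching real downloads.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one shell command and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode


def run_all(cmds, max_workers=4, runner=run_command):
    """Run commands with at most max_workers in flight; return the ones that failed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so codes[i] belongs to cmds[i].
        codes = list(pool.map(runner, cmds))
    return [cmd for cmd, code in zip(cmds, codes) if code != 0]


if __name__ == "__main__":
    # Harmless demo commands; in the scripts above, cmds would be the wget list.
    failed = run_all(["true", "false"], max_workers=2)
    print("failed:", failed)
```

Because every command uses wget -c, the list returned by run_all can simply be passed to run_all again to retry the failed downloads from where they stopped.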