
A quick and convenient way to download Hugging Face model repos and datasets

Posted on 2024-9-12 08:57:00
Method 1: A CLI tool that downloads Hugging Face models and datasets with aria2/wget + git

Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f

How to use it: copy hfd.sh to your machine, then follow the reference commands below to download a dataset or a model.

🤗 Huggingface Model Downloader

The official huggingface-cli lacks multi-threaded download support, and hf_transfer has inadequate error handling. This command-line tool works around both by using wget or aria2 for the LFS files and git clone for everything else.

Features

- ⏯️ Resume from breakpoint: you can press Ctrl+C or re-run the command at any time.
- 🚀 Multi-threaded download: multiple threads speed up the download.
- 🚫 File exclusion: use --exclude or --include to skip or select files, which saves time for models published in duplicate formats (e.g. *.bin and *.safetensors).
- 🔐 Auth support: for gated models that require a Huggingface login, authenticate with --hf_username and --hf_token.
- 🪞 Mirror site support: set the HF_ENDPOINT environment variable.
- 🌍 Proxy support: set the HTTPS_PROXY environment variable.
- 📦 Simple: depends only on git and aria2c/wget (the script below also checks for curl and git-lfs).

Usage

First, download hfd.sh or clone the repo, then make the script executable:

```bash
chmod a+x hfd.sh
```

For convenience, you can create an alias:

```bash
alias hfd="$PWD/hfd.sh"
```

Usage instructions:

```
$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id         The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
```

Download a model:

```bash
hfd bigscience/bloom-560m
```

Download a model that requires login: get a Huggingface token from https://huggingface.co/settings/tokens, then run

```bash
hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
```

Download a model and exclude certain files (e.g. .safetensors). Quote the pattern if files in the current directory could match it, otherwise the shell expands it before hfd sees it:

```bash
hfd bigscience/bloom-560m --exclude *.safetensors
```

Download with aria2c and multiple threads. aria2c is the default tool and uses 4 threads unless -x says otherwise:

```bash
hfd bigscience/bloom-560m
```

Output: during the download, the file URLs are displayed. Excluded files appear as commented-out commands:

```
$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
```

The commands I actually used:

```bash
# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs

# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet
```

hfd.sh

Note that this copy of the script defaults HF_ENDPOINT to https://hf-mirror.com. Set HF_ENDPOINT=https://huggingface.co to download from the official site instead.

```bash
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id         The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensors', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token are not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    # Empty the LFS pointer files so the downloader can write the real content
    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    # Files filtered out by --include/--exclude are printed as comments and skipped
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
```

Method 2: Model download (personal usage notes)

This first version does not preserve the repo's directory structure, because every file is saved directly into save_path. See the improved version below.

```python
import datetime
import os
import threading

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Fetch the repo info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Fetch the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Resumable download command (all files land in save_path)
        cmds.append(f'wget -c "{url}" -P "{save_path}"')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
```

Preserving the directory structure

```python
import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the repo info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Fetch the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command
        cmds.append(f'wget -c "{url}" -P "{local_dir}"')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
```

Dataset download

The dataset version differs in two places: the repo info comes from dataset_info, and the download URL is built with repo_type="dataset".

```python
import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log when it starts and finishes
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # ID of the dataset to download
    dataset_id = "openai/webtext"
    # Local storage path
    save_path = './webtext'

    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the dataset info (the keyword argument is repo_id)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )

    # Fetch the file list
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Resolve the download URL; repo_type="dataset" selects the datasets/ namespace
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command
        cmds.append(f'wget -c "{url}" -P "{local_dir}"')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
```

Shortcomings

- Repos that require authorization are not supported.
- One thread is started per file, so a repo with many files may start a very large number of threads.
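Both shortcomings listed above can probably be eased with a small change. The following is a minimal sketch, not part of the original scripts: it caps the number of concurrent downloads with a thread pool, and it sends the token in an Authorization: Bearer header, which is the header hfd.sh passes to wget. The names build_wget_command, run_command, run_all, max_workers and runner are hypothetical and introduced here only for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_wget_command(url, local_dir, token=None):
    """Build a resumable wget command as an argument list (no shell quoting needed)."""
    cmd = ["wget", "-c", url, "-P", local_dir]
    if token:
        # Same header that hfd.sh sends for gated repos
        cmd.insert(1, f"--header=Authorization: Bearer {token}")
    return cmd


def run_command(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_all(commands, max_workers=4, runner=run_command):
    """Run commands with at most max_workers in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    cmd = build_wget_command(
        "https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack",
        "./bloom-560m",
        token=None,  # placeholder: supply a real token for gated repos
    )
    print(cmd)
    # run_all([cmd], max_workers=4)  # uncomment to actually download (needs wget)
```

In the scripts above, the per-file threading.Thread loop could be swapped for a single run_all call over the command list. Passing an argument list to subprocess.run also avoids the shell-quoting problems of os.system when a file name contains spaces.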
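The Python scripts above pass allow_patterns=None and ignore_patterns=None, so they download every file in the repo. Here is a rough sketch of hfd-style --include/--exclude filtering, using the standard-library fnmatch module as a stand-in. select_files and its parameters are hypothetical names, not part of huggingface_hub or hfd.sh, and the file names are only sample data.

```python
from fnmatch import fnmatch


def select_files(files, include=None, exclude=None):
    """Keep files that match include (if given) and do not match exclude (if given)."""
    selected = []
    for name in files:
        if include is not None and not fnmatch(name, include):
            continue
        if exclude is not None and fnmatch(name, exclude):
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    # Sample file names for illustration only
    repo_files = [
        "config.json",
        "model.safetensors",
        "pytorch_model.bin",
        "onnx/decoder_model.onnx",
    ]
    # Skip the duplicate safetensors weights
    print(select_files(repo_files, exclude="*.safetensors"))
    # Download only the files under onnx/
    print(select_files(repo_files, include="onnx/*"))
```

The filtered list can be used wherever the scripts iterate over filtered_repo_files. Passing wildcard patterns to the allow_patterns and ignore_patterns arguments of filter_repo_objects, which the scripts already call, should give a similar result without the extra helper.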