Quick and convenient ways to download Hugging Face models and datasets

Method 1: a CLI tool that downloads Hugging Face models and datasets with aria2/wget + git

From https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f. How to use: copy hfd.sh to the target machine, then download datasets or models with the reference commands below.

🤗 Hugging Face model downloader

Given that the official huggingface-cli lacks multi-threaded download support, and that error handling in hf_transfer is insufficient, this command-line tool cleverly uses wget or aria2 to handle LFS files and git clone for the remaining files.

Features

⏯️ Resume from breakpoint: you can re-run it at any time or press Ctrl+C.
🚀 Multi-threaded download: uses multiple threads to speed up the download.
🚫 File exclusion: use --exclude or --include to skip or select files, saving time for models published in duplicate formats (e.g., *.bin and *.safetensors).
🔐 Authentication support: for gated models that require a Hugging Face login, authenticate with --hf_username and --hf_token.
🪞 Mirror site support: set via the HF_ENDPOINT environment variable.
🌍 Proxy support: set via the HTTPS_PROXY environment variable.
📦 Simple: depends only on git and aria2c/wget.

Usage

First, download hfd.sh or clone the repository, then grant the script execute permission:

chmod a+x hfd.sh

For convenience, you can create an alias:

alias hfd="$PWD/hfd.sh"

Usage instructions:

$ ./hfd.sh -h
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
Downloads a model or dataset from HuggingFace using the provided repo ID.

Parameters:
repo_id        The HuggingFace repo ID in the format 'org/repo_name'.
--include      (Optional) Flag to specify a string pattern to include files for downloading.
--exclude      (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username  (Optional) HuggingFace username for authentication. **NOT EMAIL**.
--hf_token     (Optional) HuggingFace token for authentication.
--tool         (Optional) Download tool to use. Can be aria2c (default) or wget.
-x             (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset      (Optional) Flag to indicate downloading a dataset.
--local-dir    (Optional) Local directory path where the model or dataset will be stored.

Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Downloading a gated model requires logging in. Get a token from https://huggingface.co/settings/tokens, then:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model while excluding certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors

Download with aria2c and multiple threads:

hfd bigscience/bloom-560m

Output: the file URLs are displayed during the download:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...

# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs

# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet

hfd.sh:

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
cat << EOF
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
Downloads a model or dataset from HuggingFace using the provided repo ID.

Parameters:
repo_id        The HuggingFace repo ID in the format 'org/repo_name'.
--include      (Optional) Flag to specify a string pattern to include files for downloading.
--exclude      (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username  (Optional) HuggingFace username for authentication. **NOT EMAIL**.
--hf_token     (Optional) HuggingFace token for authentication.
--tool         (Optional) Download tool to use. Can be aria2c (default) or wget.
-x             (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset      (Optional) Flag to indicate downloading a dataset.
--local-dir    (Optional) Local directory path where the model or dataset will be stored.

Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
case $1 in
--include) INCLUDE_PATTERN="$2"; shift 2 ;;
--exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
--hf_username) HF_USERNAME="$2"; shift 2 ;;
--hf_token) HF_TOKEN="$2"; shift 2 ;;
--tool) TOOL="$2"; shift 2 ;;
-x) THREADS="$2"; shift 2 ;;
--dataset) DATASET=1; shift ;;
--local-dir) LOCAL_DIR="$2"; shift 2 ;;
*) shift ;;
esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
if ! command -v $1 &>/dev/null; then
echo -e "${RED}$1 is not installed. Please install it first.${NC}"
exit 1
fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
git config --global --add safe.directory "${PWD}"
printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to revert this.\n${NC}"
fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
fi
echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

ensure_ownership

while IFS= read -r file; do
truncate -s 0 "$file"
done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi
printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls
while IFS= read -r file; do
url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
file_dir=$(dirname "$file")
mkdir -p "$file_dir"
if [[ "$TOOL" == "wget" ]]; then
download_cmd="wget -c \"$url\" -O \"$file\""
[[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
else
download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
[[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
fi
[[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
[[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
printf "%s\n" "$download_cmd"
urls+=("$url|$file")
done <<< "$files"
for url_file in "${urls[@]}"; do
IFS='|' read -r url file <<< "$url_file"
printf "${YELLOW}Start downloading ${file}.\n${NC}"
file_dir=$(dirname "$file")
if [[ "$TOOL" == "wget" ]]; then
[[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
else
[[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
fi
[[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"
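The --include/--exclude filtering in the script relies on bash glob matching (`[[ "$file" == $PATTERN ]]`). A rough Python equivalent, useful for previewing which files a pattern would skip before running a long download (the file names here are illustrative, not from a real repo):

```python
from fnmatch import fnmatch

# Illustrative file list; real names come from `git lfs ls-files`
files = ["model.safetensors", "pytorch_model.bin", "vae/diffusion_model.bin"]

exclude_pattern = "*.safetensors"

# A file is downloaded unless it matches the exclude pattern,
# mirroring the script's `[[ "$file" == $EXCLUDE_PATTERN ]]` test
to_download = [f for f in files if not fnmatch(f, exclude_pattern)]
print(to_download)
```

Note that, like bash `[[ ]]` string matching, Python's fnmatch lets `*` cross directory separators, so `--include vae/*` style patterns behave as expected.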
Method 2: model download (personal usage notes)
This version does not preserve the repo's directory structure; see the improved version below.
import datetime
import os
import threading
from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects

# Run a shell command
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))

if __name__ == '__main__':
    # Hugging Face repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local save path
    save_path = './blip2-opt-2.7b'
    # Fetch repo metadata
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )
    # Collect the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )
    cmds = []
    threads = []
    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # wget -c resumes interrupted downloads
        cmds.append(f'wget -c {url} -P {save_path}')
    print(cmds)
    print("Started at %s" % datetime.datetime.now())
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Finished at %s" % datetime.datetime.now())
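hf_hub_url resolves each filename to a URL of the form `<endpoint>/<repo_id>/resolve/<revision>/<filename>`, which is what wget receives above. A minimal sketch of that layout, assuming the default endpoint, without requiring the library:

```python
def resolve_url(repo_id: str, filename: str,
                revision: str = "main",
                endpoint: str = "https://huggingface.co") -> str:
    # Mirrors the resolve-URL layout that huggingface_hub.hf_hub_url produces
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"

url = resolve_url("Salesforce/blip2-opt-2.7b", "config.json")
print(url)
```

Swapping the endpoint for a mirror (e.g. the HF_ENDPOINT used by hfd.sh) changes only the prefix, which is why resuming with wget -c works the same against either host.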
Preserving the directory structure
import datetime
import os
import threading
from pathlib import Path
from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects

# Run a shell command
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))

if __name__ == '__main__':
    # Hugging Face repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local save path
    save_path = './blip2-opt-2.7b'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)
    # Fetch repo metadata
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )
    # Collect the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )
    cmds = []
    threads = []
    # Build the list of commands to run
    for file in filtered_repo_files:
        # Resolve the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Recreate the repo's subdirectories locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # wget -c resumes interrupted downloads
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)
    print("Started at %s" % datetime.datetime.now())
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Finished at %s" % datetime.datetime.now())
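The key difference from the first version is that each repo file keeps its relative path under save_path, so wget's -P flag drops it into the matching subdirectory. A small sketch of that mapping (the file name is illustrative):

```python
import os
from pathlib import Path

save_path = "./blip2-opt-2.7b"
# Illustrative repo file that lives in a subdirectory
file = "onnx/decoder_model.onnx"

# Join the repo-relative path onto the local root, then create its parent dir
local_file = os.path.join(save_path, file)
local_dir = os.path.dirname(local_file)
Path(local_dir).mkdir(parents=True, exist_ok=True)

print(local_dir)
```

With the first version, every file would land flat in save_path and files with the same basename in different subdirectories would collide; this mapping avoids that.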
Dataset download
import datetime
import os
import threading
from pathlib import Path
from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects

# Run a shell command
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))

if __name__ == '__main__':
    # Dataset ID to download
    dataset_id = "openai/webtext"
    # Local save path
    save_path = './webtext'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)
    # Fetch dataset metadata (dataset_info takes repo_id, like repo_info)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )
    # Collect the file list
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )
    cmds = []
    threads = []
    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Resolve the download URL; repo_type="dataset" adds the datasets/ prefix
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
        # Recreate the repo's subdirectories locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # wget -c resumes interrupted downloads
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)
    print("Started at %s" % datetime.datetime.now())
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Finished at %s" % datetime.datetime.now())
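For datasets, the resolve URL gains a `datasets/` namespace prefix, the same prefixing the bash script performs with `MODEL_ID="datasets/$MODEL_ID"`. A minimal sketch of that layout, assuming the default endpoint:

```python
def dataset_resolve_url(dataset_id: str, filename: str,
                        revision: str = "main",
                        endpoint: str = "https://huggingface.co") -> str:
    # Dataset repos live under the datasets/ namespace on the Hub
    return f"{endpoint}/datasets/{dataset_id}/resolve/{revision}/{filename}"

url = dataset_resolve_url("openai/webtext", "README.md")
print(url)
```

This is why passing repo_type="dataset" to hf_hub_url (or --dataset to hfd.sh) is required: without the prefix the request targets a model repo of the same name and typically 404s.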
Shortcomings
Gated repositories that require authorization are not supported.
With many files, a very large number of threads may be spawned.
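The unbounded-thread issue (one thread per file in the scripts above) can be avoided with a fixed-size pool. A sketch using concurrent.futures, with a harmless `echo` standing in for the real wget command line:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run(cmd):
    # Stand-in for execCmd; returns the command's exit code
    return subprocess.run(cmd, shell=True).returncode

# Illustrative command list; real entries look like `wget -c <url> -P <dir>`
cmds = [f"echo downloading file {i}" for i in range(10)]

# At most 4 downloads run concurrently, regardless of file count
with ThreadPoolExecutor(max_workers=4) as pool:
    codes = list(pool.map(run, cmds))

print(all(c == 0 for c in codes))
```

Swapping the thread-per-command loop for a pool keeps resumability (wget -c still picks up partial files) while capping concurrent connections, which is also friendlier to the server.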