AI Large-Model RAG Retrieval-Augmented Generation ③ (Text Vectors: the Word2Vec Vocabulary-to-Vector-Space Mapping Model; Algorithm Principles, Training Steps, Application Scenarios, and Implementation Details with a Python Code Example)


Posted on 2024-9-3 00:50:59
Contents

Part I: The Word2Vec Vocabulary-to-Vector-Space Mapping Model
  1. Introduction to the Word2Vec model
  2. The continuous bag-of-words model (CBOW): algorithm principle
  3. CBOW: model training steps
  4. The skip-gram model: algorithm principle
  5. Skip-gram: model training steps
  6. Text vector representation
  7. Application scenarios for Word2Vec text vectors
Part II: A Complete Word2Vec Code Example
  1. Python libraries that implement the Word2Vec model
  2. Installing the tensorflow package
  3. Code example
  4. Execution results

Part I: The Word2Vec Vocabulary-to-Vector-Space Mapping Model

1. Introduction to the Word2Vec model

Word2Vec is a model that maps vocabulary into a high-dimensional vector space. Its core idea is to learn a vector representation for each word from large amounts of text, so that semantically similar words (or Chinese characters) end up close to each other in the vector space.

Word2Vec has two training architectures:
- the continuous bag-of-words model (CBOW)
- the skip-gram model

The algorithm principles of both are introduced below.

2. The continuous bag-of-words model (CBOW): algorithm principle

The goal of the CBOW algorithm is to predict a center word from its context words. Given the context surrounding some word (the center word), the model's objective is to predict the word at the center of that span of text.

CBOW averages (or takes a weighted sum of) the context words' vectors to predict a vector for the center word, then looks up which word in the vector table lies closest to that vector; that word is the prediction, the center word. Concretely, the averaged or weighted context vectors are passed to an output layer, which uses a softmax activation to predict the center word.

(Figure: CBOW architecture. The input layer holds the context words X1, X2, X3, ..., each represented by a vector of floating-point numbers; the hidden layer computes their average or weighted sum; the output layer produces the vector corresponding to the center word.)

Example: given the sentence "The cat sits on the mat" with "sits" chosen as the center word, the context words are "The", "cat", "on", "the", "mat". From these context words the model predicts the center word; we check whether it can recover "sits".

3. CBOW: model training steps

CBOW training steps:
- Input layer: each node corresponds to one context word, and each context word is represented by a one-hot encoding vector.
- Hidden layer: the context words' encoding vectors are mapped to the hidden layer through a weight matrix; these weights are what the model learns. After training, the rows of this weight matrix serve as the word vectors.
- Output layer: the hidden layer's output is mapped through another weight matrix to a vector of vocabulary size, and a softmax function turns it into a probability distribution over every word.

The training objective is to maximize the accuracy of predicting the center word.

4. The skip-gram model: algorithm principle

The skip-gram algorithm reverses the task: given a center word, predict its context words. In the skip-gram model, the center word's vector is mapped through a weight matrix to the output layer, where a softmax function predicts the probability distribution of each context word.

Example: for the sentence "The cat sits on the mat" with "sits" as the center word, "The", "cat", "on", "the", "mat" are the context words. Given the center word "sits", the model predicts the context, and we check whether it can recover "The", "cat", "on", "the", "mat".

5. Skip-gram: model training steps

Skip-gram training steps:
- Input layer: each node corresponds to the center word, represented by a one-hot encoding vector.
- Hidden layer: the center word's one-hot vector is mapped to the hidden layer through a weight matrix; these weights are what the model learns.
- Output layer: the hidden layer's output is mapped through another weight matrix to a vector of vocabulary size, and a softmax function yields the probability distribution of each context word.

The training objective is to maximize the accuracy of predicting the context.

6. Text vector representation

Once a Word2Vec model is trained, every word is mapped into the high-dimensional vector space, and similar words lie close together. These word vectors (text vectors) can then be used for all kinds of natural language processing tasks, such as word-similarity computation and text classification.

Train on the following text:

```python
# Example text data
sentences = [
    "I love machine learning",
    "Deep learning is amazing",
    "Natural language processing is a fascinating field"
]
```

With the vector dimension set to 50, each word is represented in a 50-dimensional vector space, i.e. by 50 floating-point numbers. Below is the text vector for the word "learning", consisting of 50 floats:

```
Word: learning, Vector: [-0.00321157  0.03927787  0.00616916  0.02789649  0.02203173
  0.03612738  0.00637109  0.04316046 -0.049891    0.02915843
 -0.00426264  0.02841807  0.01823073  0.0149862  -0.02141328
 -0.00687046  0.0535442   0.01235065 -0.046329    0.00192757
 -0.00424403  0.00364727  0.05790862  0.04215468  0.04061833
  0.03017248 -0.03808379  0.05979197  0.03251123 -0.01618787
 -0.05283526 -0.01509981  0.05030754 -0.03224825  0.05769876
 -0.01519872  0.02141866  0.01543435 -0.01191425 -0.00674526
  0.00728445  0.04265702  0.01254657  0.04424815 -0.05862596
 -0.00738266  0.01891772  0.02471734  0.01362135  0.02899224]
```

7. Application scenarios for Word2Vec text vectors

Application scenarios for Word2Vec text vectors:
- Synonym computation: the distance or cosine similarity between word vectors measures semantic similarity.
- Text classification: using text vectors as document features improves classifier performance in spam detection, sentiment analysis, and similar tasks.
- Machine translation: word vectors help map source-language words to target-language words, improving the accuracy and fluency of translation systems.
- Vector retrieval: as a replacement for traditional keyword search, word vectors improve a search engine's relevance ranking so that results better match user intent; even when no identical keyword appears, near-synonyms can still be retrieved.
- Named entity recognition (NER): identifying and classifying entity names in text; word vectors help improve recognition accuracy. Entity names are things like person names, place names, and company names.
- GPT-style text generation: in large-language-model generation tasks such as dialogue generation and automatic writing, word vectors help produce more natural and relevant content.

A minimal sketch of the similarity use case follows below.
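Before moving on to the full TensorFlow example in Part II, here is a small, hedged sketch of the similarity use case, built with the Gensim library that Part II introduces. The hyperparameters (vector_size=50, window=2, min_count=1, sg=1, epochs=200) are illustrative choices for this toy corpus, not values prescribed by the article; Gensim 4.x is assumed.

```python
from gensim.models import Word2Vec

# The three example sentences from this article, pre-tokenized into word lists
sentences = [
    ["i", "love", "machine", "learning"],
    ["deep", "learning", "is", "amazing"],
    ["natural", "language", "processing", "is", "a", "fascinating", "field"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["learning"])                    # the 50-float vector for "learning"
print(model.wv.similarity("machine", "deep"))  # cosine similarity between two words
print(model.wv.most_similar("learning"))       # nearest words in the vector space
```

On a corpus this small the similarity numbers are essentially noise; the point is the shape of the workflow: train once, then read vectors and similarities from model.wv.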
Part II: A Complete Word2Vec Code Example

1. Python libraries that implement the Word2Vec model

Python libraries that implement the Word2Vec model:
- TensorFlow: an open-source machine learning library that provides the deep learning building blocks needed to construct a Word2Vec model. Install it first with pip install tensorflow.
- Gensim: a natural language processing library that ships an efficient, ready-made Word2Vec implementation. Install it first with pip install gensim.
- Keras: a high-level neural network API that can run on the TensorFlow, Theano, and CNTK backends; it provides the layers needed to build and train Word2Vec-style models. Install it first with pip install keras.
- FastText: a library developed by Facebook that extends Word2Vec and is often faster and more accurate. Install it first with pip install fasttext.

2. Installing the tensorflow package

In the Windows cmd command line, run pip install tensorflow to install the tensorflow package for the Python interpreter used by PyCharm.

Packages installed with pip install land in the Lib\site-packages directory of the Python SDK. Here the install directory is D:\001_Develop\022_Python\Python37_64\Lib\site-packages, where D:\001_Develop\022_Python\Python37_64 is the Python SDK's install location.

The tensorflow library takes about 1 GB after installation, so do not install the Python SDK on the C: drive; the system drive will run out of space.

3. Code example

Walkthrough of the example code, which demonstrates building a Word2Vec model with tensorflow:
- First, prepare the data: a Tokenizer converts the text into integer sequences and builds the vocabulary, and the skipgrams function generates training pairs; the skip-gram method is used here to produce (target, context) word pairs.
- Next, build a simple Word2Vec skip-gram model: an embedding layer shared by the target word and the context word, plus a dot-product layer. The model's inputs are a target word and a context word, and its output is the similarity between the two.
- Then, compile the model with binary_crossentropy as the loss function and train it.
- Finally, extract the text vectors from the trained model and print them to the console.

Code example:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams
from tensorflow.keras.layers import Embedding, Dot, Reshape
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Example text data
sentences = [
    "I love machine learning",
    "Deep learning is amazing",
    "Natural language processing is a fascinating field"
]

# Build the vocabulary with a Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                   # build the vocabulary
word_index = tokenizer.word_index                   # word -> index mapping
index_word = {i: w for w, i in word_index.items()}  # index -> word mapping
vocab_size = len(word_index) + 1                    # vocabulary size

# Convert the text into integer sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Generate skip-gram training pairs
pairs, labels = [], []                              # training pairs and labels
for sequence in sequences:
    skipgram_pairs, skipgram_labels = skipgrams(
        sequence, vocabulary_size=vocab_size, window_size=2)  # skip-gram pairs
    pairs.extend(skipgram_pairs)                    # collect the pairs
    labels.extend(skipgram_labels)                  # collect the labels
pairs = np.array(pairs)                             # convert to NumPy arrays
labels = np.array(labels)

# Model parameters
embedding_dim = 50                                  # embedding vector dimension

# Build the model
input_target = tf.keras.Input(shape=(1,))           # target-word input layer
input_context = tf.keras.Input(shape=(1,))          # context-word input layer
embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, name='embedding')  # shared embedding layer
target = embedding(input_target)                    # target-word embedding
context = embedding(input_context)                  # context-word embedding
dot_product = Dot(axes=-1)([target, context])       # dot product of the two embeddings
output = Reshape((1,))(dot_product)                 # reshape the output

model = Model(inputs=[input_target, input_context], outputs=output)  # build the model
model.compile(optimizer=Adam(), loss='binary_crossentropy')  # Adam + binary cross-entropy

# Train the model
model.fit([pairs[:, 0], pairs[:, 1]], labels, epochs=10, batch_size=256)

# Extract the word vectors
word_embeddings = model.get_layer('embedding').get_weights()[0]  # embedding matrix

# Print the word vectors
for word, index in word_index.items():              # iterate over the vocabulary
    print(f'Word: {word}, Vector: {word_embeddings[index]}')
```
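Before looking at the execution results, a natural follow-up to the listing above is querying the learned embedding matrix for a word's nearest neighbors. The sketch below is illustrative rather than part of the original example: nearest_words is a hypothetical helper, and it assumes the word_embeddings, word_index, and index_word variables from the code above are in scope.

```python
import numpy as np

def nearest_words(query, word_embeddings, word_index, index_word, k=3):
    # Hypothetical helper: rank the vocabulary by cosine similarity
    # to the query word's learned vector.
    v = word_embeddings[word_index[query]]
    norms = np.linalg.norm(word_embeddings, axis=1) * np.linalg.norm(v)
    sims = word_embeddings @ v / np.maximum(norms, 1e-9)   # avoid division by zero
    ranked = [i for i in np.argsort(-sims)
              if i in index_word and index_word[i] != query]  # skip padding row 0 and the query
    return [(index_word[i], float(sims[i])) for i in ranked[:k]]

print(nearest_words("learning", word_embeddings, word_index, index_word))
```

With only three sentences and ten epochs the neighbors are not yet meaningful; on a realistic corpus the same query starts returning semantically related words.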
4. Execution results

Running the code above produces the following output; every word has been converted into a vector of 50 floating-point values, and the training loss falls from 4.6802 to 4.2011 over the 10 epochs:

```
D:\001_Develop\022_Python\Python37_64\python.exe D:/002_Project/011_Python/OpenAI/word2vec2.py
2024-08-16 09:28:11.076184: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/10
1/1 [==============================] - 0s 338ms/step - loss: 4.6802
Epoch 2/10
1/1 [==============================] - 0s 2ms/step - loss: 4.5502
Epoch 3/10
1/1 [==============================] - 0s 2ms/step - loss: 4.4719
Epoch 4/10
1/1 [==============================] - 0s 949us/step - loss: 4.4127
Epoch 5/10
1/1 [==============================] - 0s 981us/step - loss: 4.3644
Epoch 6/10
1/1 [==============================] - 0s 969us/step - loss: 4.3234
Epoch 7/10
1/1 [==============================] - 0s 2ms/step - loss: 4.2877
Epoch 8/10
1/1 [==============================] - 0s 2ms/step - loss: 4.2559
Epoch 9/10
1/1 [==============================] - 0s 2ms/step - loss: 4.2272
Epoch 10/10
1/1 [==============================] - 0s 2ms/step - loss: 4.2011
Word: learning, Vector: [...]
Word: is, Vector: [...]
Word: i, Vector: [...]
Word: love, Vector: [...]
Word: machine, Vector: [...]
Word: deep, Vector: [...]
Word: amazing, Vector: [...]
Word: natural, Vector: [...]
Word: language, Vector: [...]
Word: processing, Vector: [...]
Word: a, Vector: [...]
Word: fascinating, Vector: [...]
Word: field, Vector: [...]

Process finished with exit code 0
```

Each bracketed vector contains 50 floating-point values; they are elided above for readability. The full vector for "learning" is shown in Part I, section 6.