Text Data Processing for Large Language Models (LLMs)
After understanding the fundamentals of LLMs, the first step toward building your own LLM is to process the text and convert it into embedding vectors that a computer can work with.
2.0 Overview
2. working with text data.ipynb
This chapter focuses on the text processing methods required for training large language models (LLMs): splitting text into words and subword tokens, using Byte Pair Encoding (BPE) as a more advanced tokenization method, applying a sliding window over the tokenized data to sample training examples, and converting tokens into the numerical vector representations that are fed into the model during training.
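As a rough orientation, the sketch below strings the first two of these steps together: BPE tokenization via the tiktoken library and sliding-window sampling of input-target pairs. The sample text and the max_length and stride values are illustrative only, not the notebook's exact settings.

```python
# A rough sketch of the chapter's pipeline (illustrative, not the notebook's
# exact code): BPE tokenization with tiktoken, then a sliding window that
# samples (input, target) pairs for next-token prediction.
import tiktoken

text = "In the heart of the city stood the old library, a relic from a bygone era."

tokenizer = tiktoken.get_encoding("gpt2")   # GPT-2's byte pair encoding
token_ids = tokenizer.encode(text)          # text -> list of integer token IDs

max_length = 4   # context length of each training example (toy value)
stride = 4       # how far the window shifts per sample (toy value)

inputs, targets = [], []
for i in range(0, len(token_ids) - max_length, stride):
    inputs.append(token_ids[i : i + max_length])
    targets.append(token_ids[i + 1 : i + max_length + 1])  # shifted right by one

print(inputs[0], "->", targets[0])
```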
2.1 Embedding Concepts
Deep neural network models, including LLMs, cannot process raw text directly. We therefore need a way to represent words as vectors of continuous values. Note that different types of media (video, audio, text) require different conversion approaches.
At its core, an embedding is a mapping from discrete objects (such as words, sentences, or images) into a continuous vector space. The primary purpose of embeddings is to convert non-numeric data into a format that neural networks can process.
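A minimal sketch of this idea using PyTorch's nn.Embedding, with a toy vocabulary size and embedding dimension chosen purely for illustration: discrete token IDs index rows of a continuous, trainable weight matrix.

```python
# Mapping discrete IDs to continuous vectors with a lookup table (nn.Embedding).
# vocab_size and embedding_dim are toy values chosen for illustration.
import torch

torch.manual_seed(123)
vocab_size = 6        # number of discrete objects (e.g., words in a tiny vocabulary)
embedding_dim = 3     # dimensionality of the continuous vector space

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([2, 3, 5, 1])   # discrete inputs
vectors = embedding_layer(token_ids)     # continuous vectors, shape (4, 3)
print(vectors)
```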
Word2Vec for Generating Word Embeddings
Several algorithms and frameworks have been developed to generate word embeddings. One of the earliest and most popular is Word2Vec. Word2Vec trains a neural network to either predict a target word from its context or predict the context from a target word. The key idea is that words appearing in similar contexts tend to have similar meanings, so when word embeddings are projected into a two-dimensional space for visualization, semantically similar words cluster together.
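Below is a minimal Word2Vec sketch using the gensim library (an assumption here, not something the chapter relies on), trained on a toy corpus just to show the "similar context, similar meaning" idea in code.

```python
# A toy Word2Vec example with gensim; the corpus and parameters are illustrative.
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "common", "pets"],
]

# sg=1 selects the skip-gram variant (predict context words from a target word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)          # the 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbors in the embedding space
```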
While pretrained models such as Word2Vec can be used to generate embeddings for machine learning models, LLMs typically learn their own embeddings. These embeddings form part of the input layer and are updated continuously during training, with the benefit that they are optimized for the specific task and data at hand.
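The sketch below illustrates this point in PyTorch: the embedding layer is just another set of trainable parameters, so a single optimizer step changes its weights along with the rest of the model. All sizes and token IDs are toy values.

```python
# The embedding layer is trained jointly with the rest of the model.
import torch
import torch.nn as nn

torch.manual_seed(123)
vocab_size, emb_dim = 100, 16                  # toy values

embedding = nn.Embedding(vocab_size, emb_dim)  # part of the model's input layer
head = nn.Linear(emb_dim, vocab_size)          # stand-in for "the rest of the model"
optimizer = torch.optim.AdamW(
    list(embedding.parameters()) + list(head.parameters()), lr=1e-3
)

inputs = torch.tensor([1, 5, 7])    # toy token IDs
targets = torch.tensor([5, 7, 9])   # toy next-token targets

before = embedding.weight[1].clone()
loss = nn.functional.cross_entropy(head(embedding(inputs)), targets)
loss.backward()
optimizer.step()

# False: the embedding row for token ID 1 was updated along with the model
print(torch.allclose(before, embedding.weight[1]))
```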
In addition, high-dimensional embeddings lead to large amounts of data. As concrete examples:
The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768 dimensions.
The largest GPT-3 model (175B parameters) uses an embedding size of 12,288 dimensions.
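For a rough sense of scale, the snippet below estimates the size of the token-embedding matrix alone at these dimensions, assuming a GPT-2-style BPE vocabulary of 50,257 tokens and 32-bit floats (both assumptions made for illustration; actual vocabularies and precisions may differ).

```python
# Back-of-the-envelope size of the token-embedding matrix (vocab_size x emb_dim),
# assuming a 50,257-token BPE vocabulary and float32 weights (illustrative only).
vocab_size = 50_257

for name, emb_dim in [("GPT-2 small", 768), ("GPT-3 175B", 12_288)]:
    params = vocab_size * emb_dim        # one embedding row per vocabulary token
    megabytes = params * 4 / 1024**2     # 4 bytes per float32 parameter
    print(f"{name}: {params:,} parameters, ~{megabytes:.0f} MB")
```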