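In recent years, as large language models (LLMs) have matured, their ability to understand context and handle complex relationships has improved markedly. For text-based entity-relation extraction, LLMs are considerably more accurate than traditional deep-learning approaches such as LSTM+Attention.
This means we can "feed" an entire book to an LLM and have it tell us who the main characters are and how they relate to one another. This article implements that idea with the Chinese-developed ERNIE 4.5 (文心4.5) model and the LangChain toolchain.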
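- jieba: a Chinese word-segmentation tool; other NLP text-processing toolkits such as hanlp can be used instead
- langchain-openai: to simplify calling the LLM and retrieving its responses, the model is accessed through the ChatOpenAI interface of langchain-openai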
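The concrete steps are described below. The demo uses the first few chapters of the classic novel Romance of the Three Kingdoms (《三国演义》). Readers interested in the raw data and code can leave a comment at the end of the article to request them.
This part handles text chunking and word segmentation. Because LLMs limit the number of input tokens, the text has to be split into chunks of a bounded token length. To keep entities and relationships that span chunk boundaries semantically connected, adjacent chunks should overlap by a suitable amount, and both the overlapping start of a chunk and its truncated end should fall on complete sentences. The chunk size and the overlap size are therefore adjusted dynamically according to the text at hand. Since the text is Chinese, a complete sentence usually ends with a mark such as “。” or “?”, which the chunker also has to take into account. The core code is as follows: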
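import jieba


def text2paragraphs(text):
    # Split on line breaks (handles both \r\n and \n) and drop empty lines.
    result = [i.strip() for i in text.splitlines() if i.strip()]
    # Message: "The text splits into N paragraphs."
    print(f"该文本可分为{len(result)}个段落!")
    return result


def is_sentence_end(token):
    # Sentence-ending punctuation, plus the closing quotation mark.
    return token in ['。', '!', '?', "”"]


def find_sentence_boundary_forward(tokens, chunk_size):
    # Index just past the first sentence end at or after chunk_size;
    # falls back to the end of the token list if none is found.
    end = len(tokens)
    for i in range(chunk_size, len(tokens)):
        if is_sentence_end(tokens[i]):
            end = i + 1
            break
    return end


def find_sentence_boundary_backward(tokens, start):
    # Index just past the last sentence end before `start`; 0 if none is found.
    for i in range(start - 1, -1, -1):
        if is_sentence_end(tokens[i]):
            return i + 1
    return 0


def chunk_text(text, chunk_size=300, overlap=50):
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap.")
    paragraphs = text2paragraphs(text)
    chunks = []
    buffer = []
    i = 0
    while i < len(paragraphs):
        # Fill the buffer paragraph by paragraph until it can yield a chunk.
        while len(buffer) < chunk_size and i < len(paragraphs):
            tokens = jieba.lcut(paragraphs[i])
            buffer.extend(tokens)
            i += 1
        while len(buffer) >= chunk_size:
            # End the chunk on a complete sentence.
            end = find_sentence_boundary_forward(buffer, chunk_size)
            chunk = buffer[:end]
            chunks.append(chunk)
            # Start the next chunk on a sentence boundary inside the overlap region.
            start_next = find_sentence_boundary_backward(buffer, end - overlap)
            if start_next == 0:
                start_next = find_sentence_boundary_backward(buffer, end - 1)
            if start_next == 0:
                start_next = end - overlap
            buffer = buffer[start_next:]
    if buffer:
        # Keep the remainder unless it is only the overlap tail of the last chunk.
        # Tokens are compared with tokens (the original indexed the token list by
        # a character count), and an empty chunk list no longer raises an error.
        if not chunks or chunks[-1][-len(buffer):] != buffer:
            chunks.append(buffer)
    return chunks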
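To make the boundary handling easier to follow, here is a minimal, self-contained sketch of the same idea. It is a simplification rather than the article's implementation: it treats every character as a token instead of calling jieba, and `split_with_overlap` and `SENTENCE_END` are hypothetical names introduced only for illustration.

```python
SENTENCE_END = {"。", "!", "?", "”"}


def split_with_overlap(tokens, chunk_size, overlap):
    """Cut tokens into chunks that end on a sentence boundary and
    start on a sentence boundary inside the previous chunk's tail."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap.")
    chunks, start = [], 0
    while len(tokens) - start > chunk_size:
        # Push the cut forward to the first sentence end at or after chunk_size.
        end = len(tokens)
        for i in range(start + chunk_size, len(tokens)):
            if tokens[i] in SENTENCE_END:
                end = i + 1
                break
        chunks.append(tokens[start:end])
        if end == len(tokens):
            return chunks
        # Walk back from (end - overlap) to the nearest sentence boundary.
        candidate = end - overlap
        nxt = candidate
        while nxt > start and tokens[nxt - 1] not in SENTENCE_END:
            nxt -= 1
        if nxt <= start:
            # No boundary found in this chunk: fall back to a hard cut.
            nxt = candidate
        start = nxt
    if start < len(tokens):
        chunks.append(tokens[start:])
    return chunks


text = "甲乙丙。丁戊己。庚辛壬。癸子丑。"
for chunk in split_with_overlap(list(text), chunk_size=6, overlap=3):
    print("".join(chunk))
# 甲乙丙。丁戊己。
# 丁戊己。庚辛壬。
# 庚辛壬。癸子丑。
```

With `chunk_size=6` and `overlap=3`, every chunk ends at a “。”, and each following chunk begins with the last complete sentence of the previous one, which is the property the chunking step relies on.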
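ERNIE-4.5, the latest release of Baidu's ERNIE Bot (文心一言), is used as the example here. The model is configured through ChatOpenAI from langchain_openai; the core code is as follows: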
import os

from langchain_openai import ChatOpenAI

# Configure environment variables; AI_STUDIO_API_KEY can be obtained from your personal account
os.environ["AI_STUDIO_API_KEY"] = "XXX"  # replace with your API_KEY
os.environ["MODEL_URL"] = "https://aistudio.baidu.com/llm/lmapi/v3"
os.environ["DeepSeek_MODEL"] = "deepseek-r1"
os.environ["ERNIE_MODEL"] = "ERNIE-4.5-8K-preview"

# Configure the LLM
llm = ChatOpenAI(
    base_url=os.environ.get("MODEL_URL"),
    api_key=os.environ.get("AI_STUDIO_API_KEY"),
    model=os.environ.get("ERNIE_MODEL"),
    max_tokens=2048,
)
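In addition, a prompt spells out the requirements for the entities and relationships the model should extract. The prompt is adapted from Microsoft's open-source GraphRAG project: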
# This prompt is adapted from Microsoft GraphRAG.
# The template text is kept in Chinese because it instructs the model to answer in Chinese.
system_template = """-目标-
给定相关的文本文档和实体类型列表,从文本中识别出这些类型的所有实体以及所识别实体之间的所有关系。

-步骤-
1.识别所有实体。对于每个已识别的实体,提取以下信息:
-entity_name:实体名称
-entity_type:以下类型之一:[{entity_types}]
-entity_description:对实体属性和活动的综合描述
将每个实体格式化为("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2.从步骤1中识别的实体中,识别彼此*明显相关*的所有实体配对(source_entity, target_entity)。
对于每对相关实体,提取以下信息:
-source_entity:源实体的名称,如步骤1中所标识的
-target_entity:目标实体的名称,如步骤1中所标识的
-relationship_type:关系类型,确保关系类型的一致性和通用性,使用更通用和无时态的关系类型
-relationship_description:解释为什么你认为源实体和目标实体是相互关联的
-relationship_strength:一个数字评分,表示源实体和目标实体之间关系的强度
将每个关系格式化为("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_type>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)

3.实体和关系的所有属性用中文输出,步骤1和2中识别的所有实体和关系输出为一个列表。使用**{record_delimiter}**作为列表分隔符。

4.完成后,输出{completion_delimiter}

######################
-示例-
######################
Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"workmate"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"workmate"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"workmate"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"workmate"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"study"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}9){completion_delimiter}
"""
import time

from langchain_core.messages import HumanMessage, SystemMessage
from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# The original snippet uses system_message_prompt and human_message_prompt without
# showing how they are defined. The two lines below are an assumed reconstruction;
# the human template in particular is a guess that simply passes the chunk through.
system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
human_message_prompt = HumanMessagePromptTemplate.from_template("{input_text}")

tuple_delimiter = " : "
record_delimiter = "\n"
completion_delimiter = "\n\n"
entity_types = ["人物", "职位", "兵器", "战役", "地点"]  # person, position, weapon, battle, location
chat_history = []

chat_prompt = ChatPromptTemplate.from_messages(
    [system_message_prompt, MessagesPlaceholder("chat_history"), human_message_prompt]
)
chain = chat_prompt | llm

# chunks is the list returned by chunk_text(); each chunk is a list of tokens.
input_text = ''.join(chunks[0])
print("原文如下>>> \n")  # "Original text >>>"
print(input_text)

t0 = time.time()
answer = chain.invoke({
    "chat_history": chat_history,
    "entity_types": entity_types,
    "tuple_delimiter": tuple_delimiter,
    "record_delimiter": record_delimiter,
    "completion_delimiter": completion_delimiter,
    "input_text": input_text,
})
t1 = time.time()
print("该模型耗时:", t1 - t0, "秒")  # "Model latency: ... seconds"
print("\n")
print(answer.content)
print("\n")
原文如下>>>

却说孙坚被刘表围住,亏得程普、黄盖、韩当三将死救得脱,折兵大半,夺路引兵回江东。自此孙坚与刘表结怨。且说袁绍屯兵河内,缺少粮草。冀州牧韩馥,遣人送粮以资军用。谋士逢纪说绍曰:“大丈夫纵横天下,何待人送粮为食!冀州乃钱粮广盛之地,将军何不取之?”绍曰:“未有良策。”纪曰:“可暗使人驰书与公孙瓒,令进兵取冀州,约以夹攻,瓒必兴兵。韩馥无谋之辈,必请将军领州事;就中取事,唾手可得。”绍大喜,即发书到瓒处。瓒得书,见说共攻冀州,平分其地,大喜,即日兴兵。绍却使人密报韩馥。馥慌聚荀谌、辛评二谋士商议。谌曰:“公孙瓒将燕、代之众,长驱而来,其锋不可当。兼有刘备、关、张助之,难以抵敌。今袁本初智勇过人,手下名将极广,将军可请彼同治州事,彼必厚待将军,无患公孙瓒矣。”韩馥即差别驾关纯去请袁绍。长史耿武谏曰:“袁绍孤客穷军,仰我鼻息,譬如婴儿在股掌之上,绝其乳哺,立可饿死。奈何欲以州事委之?此引虎入羊群也。”馥曰:“吾乃袁氏之故吏,才能又不如本初。古者择贤者而让之,诸君何嫉妒耶?”耿武叹曰:“冀州休矣!”于是弃职而去者三十余人。独耿武与关纯伏于城外,以待袁绍。数日后,绍引兵至。耿武、关纯拔刀而出,欲刺杀绍。绍将颜良立斩耿武,文丑砍死关纯。绍入冀州,以馥为奋威将军,以田丰、沮授、许攸、逢纪分掌州事,尽夺韩馥之权。馥懊悔无及,遂弃下家小,匹马往投陈留太守张邈去了。却说公孙瓒知袁绍已据冀州,遣弟公孙越来见绍,欲分其地。绍曰:“可请汝兄自来,吾有商议。”越辞归。行不到五十里,道旁闪出一彪军马,口称:“我乃董丞相家将也!”乱箭射死公孙越。从人逃回见公孙瓒,报越已死。瓒大怒曰:“袁绍诱我起兵攻韩馥,他却就里取事;今又诈董卓兵射死吾弟,此冤如何不报!”尽起本部兵,杀奔冀州来。绍知瓒兵至,亦领军出。二军会于磐河之上:绍军于磐河桥东,瓒军于桥西。瓒立马桥上,大呼曰:“背义之徒,何敢卖我!
该模型耗时: 122.43553447723389 秒

**
("entity" : "孙坚" : "人物" : "孙坚被刘表围住,后得程普、黄盖、韩当三将死救得脱,自此与刘表结怨。")
("entity" : "刘表" : "人物" : "刘表围住孙坚,与孙坚结怨。")
("entity" : "程普" : "人物" : "程普是孙坚的部将,参与救援孙坚。")
("entity" : "黄盖" : "人物" : "黄盖是孙坚的部将,参与救援孙坚。")
("entity" : "韩当" : "人物" : "韩当是孙坚的部将,参与救援孙坚。")
("entity" : "江东" : "地点" : "孙坚被刘表围住后,夺路引兵回江东。")
("entity" : "袁绍" : "人物" : "袁绍屯兵河内,缺少粮草,后听取逢纪建议,图谋冀州。")
("entity" : "河内" : "地点" : "袁绍屯兵之处。")
("entity" : "冀州" : "地点" : "冀州牧韩馥所在地,袁绍图谋之地。")
("entity" : "韩馥" : "人物" : "冀州牧,遣人送粮以资袁绍军用,后被袁绍用计夺取冀州。")
("entity" : "逢纪" : "人物" : "袁绍的谋士,建议袁绍图谋冀州。")
("entity" : "公孙瓒" : "人物" : "被袁绍暗使人驰书约其进兵取冀州,后知袁绍已据冀州,遣弟公孙越来见绍欲分其地。")
("entity" : "荀谌" : "人物" : "韩馥的谋士,建议韩馥请袁绍同治州事。")
("entity" : "辛评" : "人物" : "韩馥的谋士,与荀谌一同商议应对公孙瓒之策。")
("entity" : "刘备" : "人物" : "助公孙瓒攻冀州。")
("entity" : "关羽" : "人物" : "助公孙瓒攻冀州,与刘备、张飞一同。")
("entity" : "张飞" : "人物" : "助公孙瓒攻冀州,与刘备、关羽一同。")
("entity" : "关纯" : "人物" : "韩馥差别驾去请袁绍,后被文丑砍死。")
("entity" : "耿武" : "人物" : "韩馥的长史,谏阻韩馥请袁绍同治州事,后被颜良立斩。")
("entity" : "颜良" : "人物" : "袁绍的将领,立斩耿武。")
("entity" : "文丑" : "人物" : "袁绍的将领,砍死关纯。")
("entity" : "奋威将军" : "职位" : "韩馥被袁绍入冀州后所任的职位。")
("entity" : "田丰" : "人物" : "袁绍入冀州后,分掌州事之一。")
("entity" : "沮授" : "人物" : "袁绍入冀州后,分掌州事之一。")
("entity" : "许攸" : "人物" : "袁绍入冀州后,分掌州事之一。")
("entity" : "陈留" : "地点" : "韩馥弃下家小后往投陈留太守张邈之处。")
("entity" : "张邈" : "人物" : "陈留太守,韩馥往投之人。")
("entity" : "公孙越" : "人物" : "公孙瓒之弟,被董丞相家将乱箭射死。")
("entity" : "董丞相" : "人物" : "家将射死公孙越,未明确指出具体身份,但可推断为当时有权势的丞相,如董卓。")
("entity" : "磐河" : "地点" : "公孙瓒与袁绍二军会战之处。")
**
("relationship" : "孙坚" : "刘表" : "敌对" : "孙坚被刘表围住,后结怨。" : 8)
("relationship" : "孙坚" : "程普" : "部将" : "程普是孙坚的部将,参与救援孙坚。" : 9)
("relationship" : "孙坚" : "黄盖" : "部将" : "黄盖是孙坚的部将,参与救援孙坚。" : 9)
("relationship" : "孙坚" : "韩当" : "部将" : "韩当是孙坚的部将,参与救援孙坚。" : 9)
("relationship" : "孙坚" : "江东" : "撤退至" : "孙坚被刘表围住后,夺路引兵回江东。" : 7)
("relationship" : "袁绍" : "逢纪" : "谋士" : "逢纪是袁绍的谋士,建议袁绍图谋冀州。" : 8)
("relationship" : "袁绍" : "冀州" : "图谋" : "袁绍听取逢纪建议,图谋冀州。" : 8)
("relationship" : "袁绍" : "韩馥" : "敌对-利用" : "袁绍用计夺取韩馥的冀州。" : 7)
("relationship" : "韩馥" : "荀谌" : "谋士" : "荀谌是韩馥的谋士,建议韩馥请袁绍同治州事。" : 7)
("relationship" : "韩馥" : "辛评" : "谋士" : "辛评是韩馥的谋士,与荀谌一同商议应对公孙瓒之策。" : 7)
("relationship" : "公孙瓒" : "刘备" : "盟友" : "刘备助公孙瓒攻冀州。" : 6)
("relationship" : "公孙瓒" : "关羽" : "盟友" : "关羽助公孙瓒攻冀州。" : 6)
("relationship" : "公孙瓒" : "张飞" : "盟友" : "张飞助公孙瓒攻冀州。" : 6)
("relationship" : "韩馥" : "关纯" : "派遣" : "韩馥差别驾关纯去请袁绍。" : 7)
("relationship" : "韩馥" : "耿武" : "部属" : "耿武是韩馥的长史,谏阻韩馥请袁绍同治州事。" : 7)
("relationship" : "耿武" : "颜良" : "敌对" : "颜良立斩耿武。" : 9)
("relationship" : "关纯" : "文丑" : "敌对" : "文丑砍死关纯。" : 9)
("relationship" : "袁绍" : "奋威将军" : "任命" : "袁绍入冀州后,任命韩馥为奋威将军。" : 7)
("relationship" : "袁绍" : "田丰" : "分掌州事" : "袁绍入冀州后,田丰分掌州事之一。" : 7)
("relationship" : "袁绍" : "沮授" : "分掌州事" : "袁绍入冀州后,沮授分掌州事之一。" : 7)
("relationship" : "袁绍" : "许攸" : "分掌州事" : "袁绍入冀州后,许攸分掌州事之一。" : 7)
("relationship" : "韩馥" : "张邈" : "投奔" : "韩馥弃下家小后往投陈留太守张邈。" : 6)
("relationship" : "公孙瓒" : "公孙越" : "兄弟" : "公孙越是公孙瓒之弟。" : 9)
("relationship" : "公孙越" : "董丞相" : "敌对" : "公孙越被董丞相家将乱箭射死。" : 8)
("relationship" : "公孙瓒" : "袁绍" : "敌对" : "公孙瓒与袁绍因冀州问题敌对。" : 8)
("relationship" : "公孙瓒" : "磐河" : "会战" : "公孙瓒与袁绍二军会战于磐河之上。" : 7)
("relationship" : "袁绍" : "磐河" : "会战" : "公孙瓒与袁绍二军会战于磐河之上。" : 7)
**
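Comparing the source passage with the extracted records shows that the model identifies the people, places, and other entities in the passage accurately, and that the extracted relationships closely follow the text. It even expands the abbreviated “关、张” into 关羽 and 张飞, and infers that 董丞相 probably refers to 董卓.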
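The run above handles only `chunks[0]`. One way to process the whole book is to loop over every chunk and collect the raw replies in a list. The sketch below assumes that approach and is not taken from the original code: `extract_all`, `invoke`, `max_retries`, and `pause` are hypothetical names, and `invoke` stands for any callable that takes the chunk text and returns the model's reply as a string.

```python
import time


def extract_all(chunks, invoke, max_retries=2, pause=1.0):
    """Call `invoke` on every chunk and collect the raw replies.

    chunks: list of token lists, as produced by the chunking step.
    invoke: hypothetical callable, str -> str, wrapping the model call.
    A chunk that still fails after max_retries retries yields an empty string,
    so the result list stays aligned with the chunk list.
    """
    results = []
    for idx, tokens in enumerate(chunks):
        text = "".join(tokens)
        for attempt in range(max_retries + 1):
            try:
                results.append(invoke(text))
                break
            except Exception as exc:
                if attempt == max_retries:
                    print(f"chunk {idx} failed after {attempt + 1} attempts: {exc}")
                    results.append("")
                else:
                    time.sleep(pause)  # brief pause before retrying
    return results


# Stand-in for the real model call, so the sketch runs offline.
demo = extract_all([["却说", "孙坚"], ["且说", "袁绍"]], lambda text: f"<reply for {text}>")
print(demo)
```

In practice `invoke` could wrap `chain.invoke({...}).content` with the same dictionary used for the single chunk, and the returned list would then serve as the `results` that the parsing step iterates over.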
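3. Parse the entity relationships and write them into a pandas DataFrame
The model output is matched with a regular expression to extract the relationships, which are then written to a structured table for the visualization step that follows. The code is as follows: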
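import re
import pandas as pd

# Extract structured data with a regular expression
pattern = r'''
\(                          # opening parenthesis
"relationship"\s*:\s*       # fixed prefix
"([^"]+)"\s*:\s*            # group 1: source (any characters except ")
"([^"]+)"\s*:\s*            # group 2: target
"([^"]+)"\s*:\s*            # group 3: type
"((?:[^"\\]|\\.)*)"\s*:\s*  # group 4: description (escaped quotes allowed)
(\d+)                       # group 5: weight (digits)
\)                          # closing parenthesis
'''

# results holds the raw model outputs, one string per processed chunk.
# The original does not show where it is built; a single-chunk list is assumed here.
results = [answer.content]

result_matches = []
for text in results:
    # Find all matches (verbose mode: whitespace and comments in the pattern are ignored)
    matches = re.findall(pattern, text, re.VERBOSE)
    result_matches.extend(matches)

df = pd.DataFrame(result_matches, columns=['source', 'target', 'type', 'description', 'weight'])
# The regex captures strings; store the weight as an integer.
df['weight'] = df['weight'].astype(int)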
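The same approach extends to the `"entity"` records, which can be useful for attaching a type or description to each node later on. This is a minimal sketch under the assumption that entity records follow the four-field format shown in the model output; `ENTITY_PATTERN` and `parse_entities` are hypothetical names, and the sample text is copied from that output.

```python
import re

# Verbose-mode pattern: whitespace and "#" comments inside it are ignored.
ENTITY_PATTERN = re.compile(r'''
    \(\s*"entity"\s*:\s*      # record prefix
    "([^"]+)"\s*:\s*          # group 1: entity name
    "([^"]+)"\s*:\s*          # group 2: entity type
    "((?:[^"\\]|\\.)*)"\s*    # group 3: description (escaped quotes allowed)
    \)                        # closing parenthesis
''', re.VERBOSE)


def parse_entities(text):
    """Return one dict per ("entity" : name : type : description) record."""
    return [
        {"name": name, "type": etype, "description": desc}
        for name, etype, desc in ENTITY_PATTERN.findall(text)
    ]


sample = '''("entity" : "河内" : "地点" : "袁绍屯兵之处。")
("entity" : "逢纪" : "人物" : "袁绍的谋士,建议袁绍图谋冀州。")
("relationship" : "袁绍" : "逢纪" : "谋士" : "逢纪是袁绍的谋士,建议袁绍图谋冀州。" : 8)'''

for entity in parse_entities(sample):
    print(entity["name"], entity["type"])
# 河内 地点
# 逢纪 人物
```

Relationship records are skipped because the pattern requires the literal `"entity"` prefix, so the two record kinds can be parsed from the same reply independently.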
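Next, NetworkX is used to build the relationship graph: each row of the DataFrame above becomes an edge labeled with its relationship type, and the graph is then visualized.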
import networkx as nx
import matplotlib.pyplot as plt

# Create a knowledge graph
G = nx.Graph()
for _, row in df.iterrows():
    G.add_edge(row['source'], row['target'], label=row['type'], weight=row['weight'])

# Draw the nodes (entities), the edges (relationships), and their labels.
# Note: Chinese labels only render if matplotlib is configured with a CJK-capable font.
pos = nx.spring_layout(G, seed=42, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(20, 10))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700,
        node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8,
                             label_pos=0.3, verticalalignment='baseline')
plt.title('Relation for SanGuo')
plt.show()
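The goal stated at the outset was to find the main characters, and the graph data can also be summarized numerically for that purpose. One simple heuristic, which is not part of the original code, is to sum the relationship strengths attached to each entity and rank the totals. In the sketch below `rank_entities` is a hypothetical helper, and the sample edges are a handful of the relationships from the model output.

```python
from collections import defaultdict


def rank_entities(edges, top_n=5):
    """Rank entities by the total strength of the relationships they take part in.

    edges: iterable of (source, target, weight); weight may be an int or a digit string.
    Ties are broken by entity name so the ordering is deterministic.
    """
    strength = defaultdict(int)
    for source, target, weight in edges:
        strength[source] += int(weight)
        strength[target] += int(weight)
    return sorted(strength.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


edges = [
    ("孙坚", "刘表", "8"),
    ("孙坚", "程普", "9"),
    ("袁绍", "逢纪", "8"),
    ("袁绍", "韩馥", "7"),
    ("公孙瓒", "袁绍", "8"),
]
print(rank_entities(edges, top_n=3))
# [('袁绍', 23), ('孙坚', 17), ('程普', 9)]
```

With the DataFrame built earlier, the edges could be supplied as `df[['source', 'target', 'weight']].itertuples(index=False)`. Since non-person entities such as places also appear as nodes, filtering by entity type first may give a cleaner character ranking.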