Transform tabular data with composable workflows

Updated: 2024-10-04 11:24:02

This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter has accompanying code that can also be run in Colab.

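txtai executes machine learning workflows to transform data and build AI-powered semantic search applications. txtai supports processing both unstructured and structured data. Structured or tabular data is grouped into rows and columns. This can be a spreadsheet, an API call that returns JSON or XML, or even a list of key-value pairs.

This article walks through examples of using workflows and tabular pipelines to transform and index structured data.

Install dependencies

Install txtai and all dependencies. We will install the api, pipeline and workflow optional extras.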

pip install txtai[api,pipeline,workflow]

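CSV workflow

The first example transforms and indexes a CSV file. The COVID-19 Open Research Dataset (CORD-19) is a repository of medical articles covering COVID-19. This workflow reads the input CSV and builds a semantic search index.
The first step is to download the dataset locally.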

# Get CORD-19 metadata file
!wget .csv
!head -1 metadata.csv > input.csv
!tail -10000 metadata.csv >> input.csv
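Outside a notebook shell, the same subset can be built in Python. The following is a minimal sketch using only the standard library; the helper name `subset_csv` is made up for illustration. Like `head` and `tail`, it works on physical lines, so a quoted field containing a line break would be split.

```python
from collections import deque


def subset_csv(source, target, rows=10000):
    """Copy the header line plus the last `rows` data lines of source into target."""
    with open(source, encoding="utf-8", newline="") as infile:
        header = infile.readline()
        # A deque with maxlen keeps only the most recent `rows` lines in memory
        tail = deque(infile, maxlen=rows)

    with open(target, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(header)
        outfile.writelines(tail)

    return len(tail)
```

A call such as `subset_csv("metadata.csv", "input.csv")` would produce the file used in the rest of this section. One small difference from the shell commands: when the source has fewer than `rows` data lines, this sketch does not repeat the header line.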

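The next section creates a simple workflow consisting of a tabular pipeline. The tabular pipeline builds a list of (id, text, tag) tuples that can be easily loaded into an Embeddings index. For this example, we will use the url column as the id and the title column as the text column. The textcolumns parameter takes a list of columns, which supports indexing text content from multiple columns.
The file input.csv is processed and the first 5 rows are shown.

from txtai.pipeline import Tabular
from txtai.workflow import Task, Workflow

# Create tabular instance mapping input.csv fields
tabular = Tabular("url", ["title"])

# Create workflow
workflow = Workflow([Task(tabular)])

# Print 5 rows of input.csv via workflow
list(workflow(["input.csv"]))[:5]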

[('.1016/j.cmpb.2021.106469; /', 'Computer simulation of the dynamics of a spatial susceptible-infected-recovered epidemic model with time delays in transmission and treatment.', None),
 ('/; .36849/jdd.5544', 'Understanding the Potential Role of Abrocitinib in the Time of SARS-CoV-2', None),
 ('.1186/1471-2458-8-42; /', "Can the concept of Health Promoting Schools help to improve students' health knowledge and practices to combat the challenge of communicable diseases: Case study in Hong Kong?", None),
 ('/; =s5; ; .1016/j.eng.2020.07.018', 'Buying time for an effective epidemic response: The impact of a public holiday for outbreak control on COVID-19 epidemic spread', None),
 ('.1093/pcmedi/pbab016', 'The SARS-CoV-2 spike L452R-E484Q variant in the Indian B.1.617 strain showed significant reduction in the neutralization activity of immune sera', None)]
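To make the (id, text, tag) mapping concrete, here is a standard-library sketch of the idea. It is not txtai's implementation: the function `to_tuples` and the sample rows are hypothetical, and joining multiple text columns with a single space is an assumption made for illustration.

```python
import csv
import io


def to_tuples(rows, idcolumn, textcolumns):
    """Map dict-like rows to (id, text, tag) tuples, with tag left as None."""
    results = []
    for row in rows:
        # Concatenate the non-empty text columns into one string
        text = " ".join(row[column] for column in textcolumns if row.get(column))
        results.append((row[idcolumn], text, None))
    return results


# Hypothetical two-row CSV standing in for input.csv
sample = io.StringIO("url,title,abstract\nu1,First title,First abstract\nu2,Second title,\n")
rows = list(csv.DictReader(sample))

# One text column, then two text columns combined
print(to_tuples(rows, "url", ["title"]))
print(to_tuples(rows, "url", ["title", "abstract"]))
```

The second call shows why textcolumns is a list: each listed column contributes to the text that gets indexed for a row.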

Next, we take the workflow output, build an Embeddings index and run a search query.

from txtai.embeddings import Embeddings

# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})

# Index subset of CORD-19 data
data = list(workflow(["input.csv"]))
embeddings.index(data)

for uid, _ in embeddings.search("insulin"):
    title = [text for url, text, _ in data if url == uid][0]
    print(title, uid)

Importance of diabetes management during the COVID-19 pandemic. .1080/00325481.2021.1978704; /
Position Statement on How to Manage Patients with Diabetes and COVID-19 /; .15605/jafes.035.01.03
Successful blood glucose management of a severe COVID-19 patient with diabetes: A case report /; .1097/md.0000000000020844

The example searched for the term insulin. The top results mention diabetes and blood glucose, terms that are closely associated with insulin.
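The search loop above scans the full data list once per search hit. For larger result sets, a dictionary keyed by id avoids the repeated scan. This is a minimal sketch with made-up data; `hits` stands in for the (id, score) pairs that the search call returns.

```python
# Hypothetical rows in the (id, text, tag) shape produced by the workflow
data = [
    ("u1", "Importance of diabetes management", None),
    ("u2", "Blood glucose management case report", None),
    ("u3", "Unrelated title", None),
]

# Hypothetical (id, score) pairs standing in for search results
hits = [("u2", 0.61), ("u1", 0.55)]

# Build the id -> text lookup once, then resolve each hit in constant time
titles = {uid: text for uid, text, _ in data}

for uid, score in hits:
    print(titles[uid], uid, round(score, 2))
```

If ids repeat in the data, the dictionary keeps the last row for an id, whereas the list scan picks the first, so this shortcut assumes unique ids.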

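JSON service workflow

The next example builds a workflow that runs a query against a remote URL, retrieves the results, then transforms and indexes the tabular data. This example fetches the top results from the Hacker News front page.

The following shows how to build a ServiceTask and print a sample JSON result. Details on how to configure a ServiceTask can be found in the txtai documentation.

from txtai.workflow import ServiceTask

service = ServiceTask(url="", method="get", params={"tags": None}, batch=False, extract="hits")
workflow = Workflow([service])

list(workflow(["front_page"]))[4]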

{'_highlightResult': {'author': {'matchLevel': 'none','matchedWords': [],'value': 'makerdiety'},'title': {'matchLevel': 'none','matchedWords': [],'value': 'Tips For Making a Popular Open Source Project in 2021'},'url': {'matchLevel': 'none','matchedWords': [],'value': '/make-popular-open-source-projects/'}},'_tags': ['story', 'author_makerdiety', 'story_29197806', 'front_page'],'author': 'makerdiety','comment_text': None,'created_at': '2021-11-12T10:06:45.000Z','created_at_i': 1636711605,'num_comments': 53,'objectID': '29197806','parent_id': None,'points': 138,'story_id': None,'story_text': None,'story_title': None,'story_url': None,'title': 'Tips For Making a Popular Open Source Project in 2021','url': '/make-popular-open-source-projects/'}
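The extract parameter names the part of the response that holds the rows. The sketch below illustrates that idea with the standard library only. `extract_rows` is a hypothetical helper, not txtai's code, and the sample payload is an abbreviated, made-up version of the response shape shown above.

```python
import json


def extract_rows(payload, path):
    """Walk a key path (one key or a list of keys) into a parsed response."""
    keys = [path] if isinstance(path, str) else path
    for key in keys:
        payload = payload[key]
    return payload


# Abbreviated, hypothetical response body with a top-level "hits" list
body = json.loads(
    '{"hits": [{"title": "Tips For Making a Popular Open Source Project in 2021", "points": 138}], "page": 0}'
)

print(extract_rows(body, "hits")[0]["title"])

# A list of keys descends through nested sections
nested = {"feed": {"entry": [{"id": "a"}, {"id": "b"}]}}
print(len(extract_rows(nested, ["feed", "entry"])))
```

A single key is enough when the rows sit one level down, as with hits here. A list of keys covers responses where the rows are nested more deeply.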

Next, we will map the JSON data using the tabular pipeline. url will be used as the id column and title as the text to index.

from txtai.workflow import Task

# Recreate service applying the tabular pipeline to each result
service = ServiceTask(action=tabular, url="", method="get", params={"tags": None}, batch=False, extract="hits")
workflow = Workflow([service])

list(workflow(["front_page"]))[4]

('/make-popular-open-source-projects/','Tips For Making a Popular Open Source Project in 2021',None)

As we did before, let's build an Embeddings index and run a search query.

# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})

# Index Hacker News front page
data = list(workflow(["front_page"]))
embeddings.index(data)

for uid, _ in embeddings.search("programming"):
    title = [text for url, text, _ in data if url == uid][0]
    print(title, uid)

A guide to organizing settings in Django /
Useful sed scripts and patterns 
Tips For Making a Popular Open Source Project in 2021 /make-popular-open-source-projects/

XML service workflow

txtai's ServiceTask can consume both JSON and XML. This example runs a query against the arXiv API, transforms the results and indexes them for search.

The following shows how to build a ServiceTask and print the first XML result.

service = ServiceTask(url="", method="get", params={"search_query": None, "max_results": 25}, batch=False, extract=["feed", "entry"])
workflow = Workflow([service])

list(workflow(["all:aliens"]))[:1]

[OrderedDict([('id', '.01522v3'),('updated', '2021-09-06T14:18:23Z'),('published', '2021-02-01T18:27:12Z'),('title','If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare'),('summary',"If life on Earth had to achieve n 'hard steps' to reach humanity's level,\nthen the chance of this event rose as time to the n-th power. Integrating this\nover habitable star formation and planet lifetime distributions predicts >99%\nof advanced life appears after today, unless n<3 and max planet duration\n<50Gyr. That is, we seem early. We offer this explanation: a deadline is set by\n'loud' aliens who are born according to a hard steps power law, expand at a\ncommon rate, change their volumes' appearances, and prevent advanced life like\nus from appearing in their volumes. 'Quiet' aliens, in contrast, are much\nharder to see. We fit this three-parameter model of loud aliens to data: 1)\nbirth power from the number of hard steps seen in Earth history, 2) birth\nconstant by assuming a inform distribution over our rank among loud alien birth\ndates, and 3) expansion speed from our not seeing alien volumes in our sky. We\nestimate that loud alien civilizations now control 40-50% of universe volume,\neach will later control ~10^5 - 3x10^7 galaxies, and we could meet them in\n~200Myr - 2Gyr. If loud aliens arise from quiet ones, a depressingly low\ntransition chance (~10^-4) is required to expect that even one other quiet\nalien civilization has ever been active in our galaxy. Which seems bad news for\nSETI. 
But perhaps alien volume appearances are subtle, and their expansion\nspeed lower, in which case we predict many long circular arcs to find in our\nsky."),('author',[OrderedDict([('name', 'Robin Hanson')]),OrderedDict([('name', 'Daniel Martin')]),OrderedDict([('name', 'Calvin McCarter')]),OrderedDict([('name', 'Jonathan Paulson')])]),('arxiv:comment',OrderedDict([('@xmlns:arxiv', ''),('#text', 'To appear in Astrophysical Journal')])),('link',[OrderedDict([('@href', '.01522v3'),('@rel', 'alternate'),('@type', 'text/html')]),OrderedDict([('@title', 'pdf'),('@href', '.01522v3'),('@rel', 'related'),('@type', 'application/pdf')])]),('arxiv:primary_category',OrderedDict([('@xmlns:arxiv', ''),('@term', 'q-bio.OT'),('@scheme', '')])),('category',[OrderedDict([('@term', 'q-bio.OT'),('@scheme', '')]),OrderedDict([('@term', 'physics.pop-ph'),('@scheme','')])])])]
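XML responses have to become row-like records before the tabular pipeline can map them, and the output above shows each entry as a nested dictionary. As a rough illustration of that step, the sketch below parses a tiny, made-up Atom feed with the standard library's xml.etree.ElementTree. This is a stand-in chosen for the example, not the parser txtai uses, and the helper `entries` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up Atom feed with the same feed/entry nesting as the arXiv response
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>id-1</id><title>First paper</title></entry>
  <entry><id>id-2</id><title>Second paper</title></entry>
</feed>"""

ATOM = "{http://www.w3.org/2005/Atom}"


def entries(xml_text):
    """Return one dict per <entry>, keyed by child tag name without the namespace."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter(ATOM + "entry"):
        rows.append({child.tag.replace(ATOM, ""): child.text for child in entry})
    return rows


rows = entries(FEED)
print(rows)

# The same (id, text, tag) shape used throughout this article
print([(row["id"], row["title"], None) for row in rows])
```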

Next, we will map the XML data using the tabular pipeline. id will be used as the id column and title as the text to index.

from txtai.workflow import Task

# Create tabular pipeline with new mapping
tabular = Tabular("id", ["title"])

# Recreate service applying the tabular pipeline to each result
service = ServiceTask(action=tabular, url="", method="get", params={"search_query": None, "max_results": 25}, batch=False, extract=["feed", "entry"])
workflow = Workflow([service])

list(workflow(["all:aliens"]))[:1]

[('.01522v3','If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare',None)]

As we did before, let's build an Embeddings index and run a search query.

# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})

# Index arXiv query results
data = list(workflow(["all:aliens"]))
embeddings.index(data)

for uid, _ in embeddings.search("alien radio signals"):
    title = [text for url, text, _ in data if url == uid][0]
    print(title, uid)

Calculating the probability of detecting radio signals from alien civilizations .0011v2
Field Trial of Alien Wavelengths on GARR Optical Network .04278v1
Do alien particles exist, and can they be detected? .07403v1

Build a workflow with no code!

The next example shows how to build one of the same workflows above through API configuration. This is a no-code way to build a txtai indexing workflow!

# Index settings
writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2

# Tabular pipeline
tabular:
  idcolumn: id
  textcolumns:
    - title

# Workflow definitions
workflow:
  index:
    tasks:
      - task: service
        action: tabular
        url: =25
        method: get
        params:
          search_query: null
        batch: false
        extract: [feed, entry]
      - action: upsert
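A YAML file parses into nested dictionaries and lists, so the same configuration can also be written as a Python dict. The sketch below mirrors the structure above and reads the task order back out. The `url` value is a placeholder, since the real endpoint is not reproduced here, and `task_actions` is a hypothetical helper.

```python
# Same structure as the YAML configuration, expressed as a Python dict
config = {
    "writable": True,
    "embeddings": {"path": "sentence-transformers/nli-mpnet-base-v2"},
    "tabular": {"idcolumn": "id", "textcolumns": ["title"]},
    "workflow": {
        "index": {
            "tasks": [
                {
                    "task": "service",
                    "action": "tabular",
                    "url": "<service url>",  # placeholder, not a real endpoint
                    "method": "get",
                    "params": {"search_query": None},
                    "batch": False,
                    "extract": ["feed", "entry"],
                },
                {"action": "upsert"},
            ]
        }
    },
}


def task_actions(config, name):
    """List the action of each task in the named workflow, in execution order."""
    return [task["action"] for task in config["workflow"][name]["tasks"]]


print(task_actions(config, "index"))
```

Reading the actions in order gives a quick check that the workflow does what is intended: map rows with the tabular pipeline first, then upsert them into the index.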

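This workflow once again runs the arXiv query and indexes article titles. The workflow configures the same operations that were previously configured in Python.

Let's start an API instance.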

!killall -9 uvicorn
!CONFIG=workflow.yml nohup uvicorn "txtai.api:app" &> api.log &
!sleep 30
!cat api.log

INFO:     Started server process [921]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Next, we will execute the workflow. txtai has API bindings for JavaScript, Java, Rust and Golang. To keep things simple, we will just run the commands via cURL.

# Execute workflow via API call
!curl -X POST "http://localhost:8000/workflow" -H  "accept: application/json" -H  "Content-Type: application/json" -d "{\"name\":\"index\",\"elements\":[\"all:aliens\"]}"

[[".01522v3","If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare",null],["","AliEnFS - a Linux File System for the AliEn Grid Services",null],["","AliEn - EDG Interoperability in ALICE",null],[".05559v1","Oumuamua Is Not a Probe Sent to our Solar System by an Alien\n  Civilization",null],[".3979v1","Robust transitivity and density of periodic points of partially\n  hyperbolic diffeomorphisms",null],[".09210v1","Sampling alien species inside and outside protected areas: does it\n  matter?",null],["","The AliEn system, status and perspectives",null],[".0011v2","Calculating the probability of detecting radio signals from alien\n  civilizations",null],[".04278v1","Field Trial of Alien Wavelengths on GARR Optical Network",null],[".00529v1","Open Category Detection with PAC Guarantees",null],[".3640v1","The Study of Climate on Alien Worlds",null],[".6805v2","Aliens on Earth. Are reports of close encounters correct?",null],[".05078v1","The Imprecise Search for Habitability",null],[".2613v1","Resurgence, Stokes phenomenon and alien derivatives for level-one linear\n  differential systems",null],[".03394v1","That is not dead which can eternal lie: the aestivation hypothesis for\n  resolving Fermi's paradox",null],[".02294v1","Alien Calculus and non perturbative effects in Quantum Field Theory",null],[".0653v1","General and alien solutions of a functional equation and of a functional\n  inequality",null],[".06180v1","Are Alien Civilizations Technologically Advanced?",null],[".05387v1","Simultaneous x, y Pixel Estimation and Feature Extraction for Multiple\n  Small Objects in a Scene: A Description of the ALIEN Network",null],[".4034v1","The q-analogue of the wild fundamental group (II)",null],["","Expanding advanced civilizations in the universe",null],["","AliEn Resource Brokers",null],["","The Renormalization of Composite Operators in Yang-Mills Theories Using\n  General Covariant Gauge",null],[".0501v1","Alienation in Italian cities. 
Social network fragmentation from\n  collective data",null],[".07403v1","Do alien particles exist, and can they be detected?",null]]

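The data is now indexed. Note that the index configuration has an upsert action. Each workflow call will insert new rows or update existing rows. This call could be scheduled with a system cron job to run periodically and build up an index of arXiv article titles.

Now that the index is ready, let's run a search.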
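The insert-or-update behavior can be pictured with a plain dictionary keyed by id. This is only a conceptual sketch with made-up rows. The real index stores embeddings rather than raw rows, which is not modeled here.

```python
def upsert(index, rows):
    """Insert rows with new ids and overwrite rows whose id already exists."""
    for uid, text, tag in rows:
        index[uid] = (text, tag)
    return index


index = {}

# First run: two new rows are inserted
upsert(index, [("id-1", "Old title", None), ("id-2", "Another title", None)])

# Second run: id-1 is updated in place, id-3 is inserted
upsert(index, [("id-1", "New title", None), ("id-3", "Third title", None)])

print(len(index))
print(index["id-1"][0])
```

Because repeated runs never duplicate an id, the same workflow call can be scheduled safely: rows already seen are refreshed and only new ids grow the index.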

# Run a search
!curl -X GET "http://localhost:8000/search?query=radio&limit=3" -H  "accept: application/json"

[{"id":".0011v2","score":0.40350067615509033},{"id":".04278v1","score":0.34062114357948303},{"id":".05387v1","score":0.22262515127658844}]
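The same two calls can be prepared from Python with the standard library. The sketch below only builds the requests, using the endpoints and parameters from the cURL commands above. The helper names are made up, and actually sending the requests requires the API instance to be running locally.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8000"


def workflow_request(name, elements, base=BASE):
    """Build the POST request that runs a named workflow over the elements."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    return Request(f"{base}/workflow", data=body, headers=headers, method="POST")


def search_url(query, limit, base=BASE):
    """Build the GET URL for a search, with the query string safely encoded."""
    return f"{base}/search?{urlencode({'query': query, 'limit': limit})}"


request = workflow_request("index", ["all:aliens"])
print(request.get_method(), request.full_url)
print(search_url("radio", 3))
```

To send them, pass the request object or the URL to `urllib.request.urlopen` and decode the JSON response body.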

Add a translation step to the workflow

Next, we will recreate the workflow, adding an extra step that translates the text into French before indexing. This workflow runs an arXiv query, translates the results and builds a semantic index of titles in French.

# Index settings
writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2

# Tabular pipeline
tabular:
  idcolumn: id
  textcolumns:
    - title

# Translation pipeline
translation:

# Workflow definitions
workflow:
  index:
    tasks:
      - task: service
        action: tabular
        url: =25
        method: get
        params:
          search_query: null
        batch: false
        extract: [feed, entry]
      - action: translation
        args: [fr]
      - action: upsert
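Conceptually, each task receives the output of the previous one, so adding translation means inserting one more step before the upsert. The sketch below models that chaining with plain functions. It is an illustration only: `run`, `tabular`, `translate_titles` and the one-entry lookup table standing in for a translation model are all hypothetical.

```python
def run(tasks, elements):
    """Pass the elements through each task in order, like a simple pipeline."""
    for task in tasks:
        elements = task(elements)
    return elements


def tabular(rows):
    """Map dict rows to (id, text, tag) tuples."""
    return [(row["id"], row["title"], None) for row in rows]


# Toy stand-in for a translation model, covering only the sample title
FRENCH = {"radio signals": "signaux radio"}


def translate_titles(rows):
    """Translate only the text element and leave id and tag untouched."""
    return [(uid, FRENCH.get(text, text), tag) for uid, text, tag in rows]


rows = [{"id": "id-1", "title": "radio signals"}]

# Without and with the extra translation step
print(run([tabular], rows))
print(run([tabular, translate_titles], rows))
```

The existing steps are unchanged between the two runs. Only the task list grows, which is the sense in which these workflows are composable.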

!killall -9 uvicorn
!CONFIG=workflow.yml nohup uvicorn "txtai.api:app" &> api.log &
!sleep 30
!cat api.log

INFO:     Started server process [945]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

As before, we will run the index workflow and a search.

# Execute workflow via API call
!curl -s -X POST "http://localhost:8000/workflow" -H  "accept: application/json" -H  "Content-Type: application/json" -d "{\"name\":\"index\",\"elements\":[\"all:aliens\"]}" > /dev/null

# Run a search
!curl -X GET "http://localhost:8000/search?query=radio&limit=3" -H  "accept: application/json"

[{"id":".0011v2","score":0.5328004956245422},{"id":".4034v1","score":0.2441330999135971},{"id":".01522v3","score":0.22881504893302917}]

Run YAML workflow in Python

Workflow YAML files can also be executed directly in Python. In this case, all input data is passed locally in Python rather than through a network interface. The following section shows how to do this!

import yaml

from txtai.api import API

with open("workflow.yml") as config:
    workflow = yaml.safe_load(config)

app = API(workflow)

# Run the workflow
data = list(app.workflow("index", ["all:aliens"]))

# Run a search
for result in app.search("radio", None):
    text = [row[1] for row in data if row[0] == result["id"]][0]
    print(result["id"], result["score"], text)

.0011v2 0.5328004956245422 Calcul de la probabilité de détection des signaux radio de l'étranger civilisations
.4034v1 0.2441330999135971 Le q-analogue du groupe fondamental sauvage (II)
.01522v3 0.22881504893302917 Si les étrangers louds expliquent le début de l'humanité, les étrangers tranquilles sont aussi rares
 0.21307508647441864 Alien - EDG Interopérabilité en ALICE
.2613v1 0.19786792993545532 Résurgence, phénomène Stokes et dérivés extraterrestres pour le niveau 1 linéaire systèmes différentiels
.3979v1 0.1915999799966812 Transitivité robuste et densité des points périodiques en partie difféomorphismes hyperboliques
.04278v1 0.19029255211353302 Essai sur le terrain des longueurs d'onde aliens sur le réseau optique GARR
.07403v1 0.17492449283599854 Les particules exotiques existent-elles et peuvent-elles être détectées?
.05387v1 0.17426751554012299 Simultanée x, y Pixel Estimation et Extraction de Caractéristiques pour Multiple Petits objets dans une scène : une description du réseau ALIEN
 0.17286795377731323 AliEnFS - un système de fichiers Linux pour les services AliEn Grid