Shell Script to Parse Text into Two Separate Strings

Updated: 2024-10-11 01:17:05

My goal is to use a shell script to parse text from wit.ai, and I can't seem to get it right because the string (named data) can vary widely. I've been trying to use sed, but with no luck. The response from the server looks like this (keep in mind that its size can differ):

data= {"status":"ok"}{"_text":"testing","msg_id":"56a26ccf-f324-455f-ba9b-db21c8c7ed50","outcomes":[{"_text":"testing","confidence":0.289,"entities":{},"intent":"weather"}]}

I would like to parse it into two strings named text and intent.

The desired result is two strings, as follows:

text= "testing"
intent= "weather"

The code I have so far is:

data='{"status":"ok"}{"_text":"testing","msg_id":"56a26ccf-f324-455f-ba9b-db21c8c7ed50","outcomes":[{"_text":"testing","confidence":0.289,"entities":{},"intent":"weather"}]}'
text=$(echo "$data" | cut -d"," -f1)   # cuts the text down to testing but leaves a quote at the end
text=$(echo "${text::-1}")             # this line removes the quote
echo "$data"
echo "$text"

The current result is: {"status":"ok"}{"_text":"testing

I am close: I just need to remove {"status":"ok"}{"_text":" so that I am left with testing, but I can't figure out this last part.

Best answer

The proper way to deal with JSON is to use a parser. There are tons of options, for example:

- jq, the "grep, sed & awk for JSON"
- JSON.sh, a parser written in Bash (and officially recommended on www.json.org)
- json_pp, a pretty printer in Perl

The problem is that all of these reject your data as malformed. If they accepted it, you could query the data directly, as shown in the tutorials for the tools listed above.

Since you can't, we're back to working on the text directly. We can extract the data of interest with grep -o, which prints only the part of each line that matches:

$ grep -o -e '"_text":"[^"]*"' -e '"intent":"[^"]*"' <<< "$data"
"_text":"testing"
"_text":"testing"
"intent":"weather"

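The regex fragment "[^"]*" means "a quote, then zero or more non-quote characters, then another quote". Because [^"] cannot match a quote, the match always stops at the first closing quote, which gives the effect of a non-greedy match between two quotes.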
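To see why the negated character class matters, here is a small sketch that compares it with a plain .* on a made-up sample line (the variable names line, bounded and greedy are only for illustration). It pipes the text through printf instead of using a here-string so that it also runs in a plain sh.

```shell
# Illustrative sample line, not real wit.ai output
line='"_text":"testing","intent":"weather"'

# [^"]* cannot cross a quote, so the match ends at the first closing quote
bounded=$(printf '%s\n' "$line" | grep -o '"_text":"[^"]*"')

# .* is greedy, so the match runs on to the last quote on the line
greedy=$(printf '%s\n' "$line" | grep -o '"_text":".*"')

printf 'bounded: %s\n' "$bounded"
printf 'greedy:  %s\n' "$greedy"
```

The bounded pattern yields only "_text":"testing", while the greedy one swallows the rest of the line up to the final quote.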

To process this further, we can drop the duplicate line with uniq (which works here because the two identical lines are adjacent), then use sed to strip the quotes and the leading underscore from the key and replace the colon with an equals sign followed by a tab:

$ grep -o -e '"_text":"[^"]*"' -e '"intent":"[^"]*"' <<< "$data" | uniq | sed -r 's/"_?(.*)":(.*)/\1=\t\2/'
text= "testing"
intent= "weather"
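If the goal is to end up with two shell variables rather than printed lines, the same grep -o idea can be wrapped in a small function. This is only a sketch under some assumptions: extract_field is a hypothetical helper name, it takes the first match for a key, and it assumes the values contain no escaped quotes. A real JSON parser remains the safer choice whenever the response is valid JSON.

```shell
#!/bin/sh
# Sample response copied from the question
data='{"status":"ok"}{"_text":"testing","msg_id":"56a26ccf-f324-455f-ba9b-db21c8c7ed50","outcomes":[{"_text":"testing","confidence":0.289,"entities":{},"intent":"weather"}]}'

# extract_field KEY TEXT (hypothetical helper)
# Prints the string value of the first "KEY":"value" pair found in TEXT.
# Assumption: the value contains no escaped double quotes.
extract_field() {
    printf '%s\n' "$2" |
        grep -o "\"$1\":\"[^\"]*\"" |           # keep only the "KEY":"value" pairs
        head -n 1 |                             # first occurrence only
        sed -E 's/^"[^"]*":"(.*)"$/\1/'         # strip the key and the surrounding quotes
}

text=$(extract_field _text "$data")
intent=$(extract_field intent "$data")

printf 'text=%s\n' "$text"
printf 'intent=%s\n' "$intent"
```

Taking the first match with head replaces the uniq step, so the result does not depend on the duplicates being adjacent.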

Published: 2023-07-17 14:04:00

Article link: https://www.elefans.com/category/jswz/34/1145678.html