Formatting Postgres JSON output for a nested Elasticsearch structure

Updated: 2024-10-26 14:34:08

I've come to realize that using a SQL database (Postgres) is one of the most efficient ways to port my relational data (40+ GB across 24 CSV files) into Elasticsearch with a nested structure. However, I'm still having a couple of issues with the formatting of the JSON output from Postgres: 1) undesired line feeds (\n), 2) an undesired header line, and 3) an undesired date format. Here is a basic example to demonstrate:

file1
id,age,gender,wave
1,49,M,1
2,72,F,0

file2
id,time,event1
1,2095-04-20 12:28:55,V39
1,2095-04-21 2:27:45,T21
2,2094-05-17 18:17:25,V39

file3
id,time,event2
1,2095-04-22 3:48:53,P90
2,2094-05-18 1:28:23,RT4
2,2094-05-18 4:23:53,W3

after adding these CSVs to a schema named forum and running this SQL code:

with f_1 as (
    SELECT id, json_agg(file1.*) AS tag
    FROM forum.file1
    GROUP BY id
), f_2 as (
    SELECT id, json_agg(file2.*) AS tag
    FROM forum.file2
    GROUP BY id
), f_3 as (
    SELECT id, json_agg(file3.*) AS tag
    FROM forum.file3
    GROUP BY id
)
SELECT ('{"id":' || a.id),
       ('"file1":' || a.tag),
       ('"file2":' || b.tag),
       ('"file3":' || c.tag || '}')
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id;

I get this output (pgAdminIII - Export to file - no quoting):

?column?,?column?,?column?,?column?
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20T12:28:55","event1":"V39"},
 {"id":1,"time":"2095-04-21T02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17T18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18T01:28:23","event2":"RT4"},
 {"id":2,"time":"2094-05-18T04:23:53","event2":"W3"}]}

You can see that the data for a given id spans multiple lines. I need all of the data for a given id to be on one line (i.e. no \n's). A couple of other minor issues which I haven't spent much time on but would like to fix: the first row isn't needed, and I'd like to get rid of the ?column?,?column?,?column?,?column? header without having to open the file after processing is done. Ideally I'd also prefer that there were no T in the date output. I should be able to accommodate the T in Elasticsearch, but thus far I haven't gotten it to accept it. This is the output I desire from Postgres, which works as input into Elasticsearch (using stream2es and a nested mapping structure):

{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}
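The key property of this desired output is that every line is a self-contained JSON document (newline-delimited JSON). One quick way to sanity-check an export before feeding it to stream2es is to parse each line independently; a Python sketch, not part of the original question:

```python
import json

# Each line of the desired export is a complete JSON document (NDJSON),
# which is what line-oriented loaders such as stream2es consume.
desired = [
    '{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],'
    '"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"}],'
    '"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}',
    '{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],'
    '"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],'
    '"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"}]}',
]

# json.loads raises an error if any line is not a standalone JSON document,
# which is exactly the failure mode of the multi-line export above.
docs = [json.loads(line) for line in desired]
```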

Adding to_json does fix the undesired line feeds but it adds \" in place of " which the stream2es parser doesn't like:

SELECT to_json('{"id":' || a.id),
       to_json('"file1":' || a.tag::json),
       to_json('"file2":' || b.tag::json),
       to_json('"file3":' || c.tag::json || '}')

"{\"id\":1","\"file1\":[{\"id\":1,\"age\":49,\"gender\":\"M\",\"wave\":1}]"...
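The escaping happens because to_json here is applied to a text value, so Postgres serializes the whole concatenated string as a JSON string and escapes its embedded quotes. The same effect can be illustrated with Python's json.dumps:

```python
import json

# Serializing an already-assembled JSON fragment as a *string* escapes its
# embedded quotes, just like to_json(text) does in Postgres.
fragment = '{"id":1'
encoded = json.dumps(fragment)
# encoded is "{\"id\":1" -- a quoted, escaped string, not a JSON object
```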

stream2es exception: Exception in thread "stream dispatcher" java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IPersistentMap

Solution

Select everything as one column (instead of four). The format() function will help you write it more clearly. Use

regexp_replace(str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g')

to correct the date format and

replace(str, e'\n', '')

to strip the newline characters.
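Outside the database, the effect of these two replacements can be illustrated with equivalent Python calls on a sample fragment (an illustration, not part of the answer):

```python
import re

# A fragment of the raw export: ISO "T" in the timestamps and an embedded newline.
raw = ('{"id":1,"file2":[{"time":"2095-04-20T12:28:55","event1":"V39"},\n'
       ' {"time":"2095-04-21T02:27:45","event1":"T21"}]}')

# Equivalent of regexp_replace(str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g').
# Anchoring the pattern on the date means the "T" in the event code "T21"
# is left alone.
fixed = re.sub(r'(\d\d\d\d-\d\d-\d\d)T', r'\1 ', raw)
# Equivalent of replace(str, e'\n', ''):
fixed = fixed.replace('\n', '')
```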

Use the COPY command to simplify the task:

COPY (
    with f_1 as (
        SELECT id, json_agg(file1.*) AS tag
        FROM forum.file1
        GROUP BY id
    ), f_2 as (
        SELECT id, json_agg(file2.*) AS tag
        FROM forum.file2
        GROUP BY id
    ), f_3 as (
        SELECT id, json_agg(file3.*) AS tag
        FROM forum.file3
        GROUP BY id
    )
    SELECT replace(
        regexp_replace(
            format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
                   a.id, a.tag, b.tag, c.tag),
            '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
        e'\n', '')
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
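What this COPY query does in SQL can be sketched in Python for clarity: group each file's rows by id, then emit one compact JSON document per id. The sample data below is assumed from the question; file3 is omitted for brevity:

```python
import json
from collections import defaultdict

# Sample rows standing in for forum.file1 and forum.file2.
file1 = [{"id": 1, "age": 49, "gender": "M", "wave": 1},
         {"id": 2, "age": 72, "gender": "F", "wave": 0}]
file2 = [{"id": 1, "time": "2095-04-20 12:28:55", "event1": "V39"},
         {"id": 1, "time": "2095-04-21 02:27:45", "event1": "T21"},
         {"id": 2, "time": "2094-05-17 18:17:25", "event1": "V39"}]

def agg(rows):
    # Analogue of SELECT id, json_agg(t.*) ... GROUP BY id
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row)
    return grouped

f1, f2 = agg(file1), agg(file2)
# One compact JSON document per id, one per line (NDJSON).
ndjson = [json.dumps({"id": i, "file1": f1[i], "file2": f2[i]},
                     separators=(",", ":"))
          for i in sorted(f1)]
```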

To prepend each data line with a command line, you can use a trick: a function that returns two rows. Part of the formatting can be moved into that function as well.

create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
    return next command;
    return next replace(
        regexp_replace(data_str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
        e'\n', '');
end $$;

COPY (
    with f_1 as (
        SELECT id, json_agg(file1.*) AS tag
        FROM forum.file1
        GROUP BY id
    ), f_2 as (
        SELECT id, json_agg(file2.*) AS tag
        FROM forum.file2
        GROUP BY id
    ), f_3 as (
        SELECT id, json_agg(file3.*) AS tag
        FROM forum.file3
        GROUP BY id
    )
    SELECT format_data_line(
        'my command',
        format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
               a.id, a.tag, b.tag, c.tag))
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
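The two-rows-per-record trick can be illustrated with a small Python generator. The '{"index":{}}' action line below is a hypothetical example of the kind of command line a bulk loader might expect in front of each document, standing in for 'my command':

```python
import re

def format_data_line(command, data_str):
    # Python analogue of the plpgsql set-returning function: emit the
    # command line first, then the cleaned-up data line.
    yield command
    cleaned = re.sub(r'(\d\d\d\d-\d\d-\d\d)T', r'\1 ', data_str)
    yield cleaned.replace('\n', '')

# '{"index":{}}' is a hypothetical bulk-loader action line.
lines = list(format_data_line('{"index":{}}',
                              '{"id":1,"time":"2095-04-20T12:28:55"}'))
```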

Published: 2023-10-16 10:40:45
Source: https://www.elefans.com/category/jswz/34/1497336.html