Converting XML data from one column of a table in Google BigQuery into JSON data in another column of the same table

Problem description

I have the following table in Google BigQuery (only a few lines are shown here):

id | loaded_date | data
1 | 2019-10-25 | <collection><row><field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" /><field name="Created" type="datetime" value="2019-10-25 17:35:17Z" /><field name="Type" type="text" value="Session Provisioning Failure" /></row></collection>
2 | 2019-10-25 | <collection><row><field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" /><field name="Created" type="datetime" value="2019-10-25 17:51:32Z" /><field name="Type" type="text" value="Session Provisioning Failure" /></row></collection>
3 | 2019-02-23 | <collection><row><field name="Item Key" type="text" value="Haircolour - Hent klienter til kø" /><field name="Last Generation Time" type="datetime" value="2019-02-23 11:00:36Z" /><field name="Priority" type="number" value="-3" /></row></collection>

My data column is in XML format. I would like to add a fourth column to this table for example called data_json containing the same data as in the data column but in JSON format.

This means that I would like to end up with the following results:

id | loaded_date | data | data_json
1 | 2019-10-25 | Same data as before | {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Avstemming kunder - OMT"},{"-name": "Created","-type": "datetime","-value": "2019-10-25 17:35:17Z"},{"-name": "Type","-type": "text","-value": "Session Provisioning Failure"}]}}}
2 | 2019-10-25 | Same data as before | {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Avstemming kunder - OMT"},{"-name": "Created","-type": "datetime","-value": "2019-10-25 17:51:32Z"},{"-name": "Type","-type": "text","-value": "Session Provisioning Failure"}]}}}
3 | 2019-02-23 | Same data as before | {"collection": {"row": {"field": [{"-name": "Item Key","-type": "text","-value": "Haircolour - Hent klienter til kø"},{"-name": "Last Generation Time","-type": "datetime","-value": "2019-02-23 11:00:36Z"},{"-name": "Priority","-type": "number","-value": "-3"}]}}}

Is there a way to do that using SQL directly in BigQuery, or using Python?

Thanks

Answer

In order to update data in BigQuery you can take a look at Data Manipulation Language, but take into account that it has its own quotas. In your case, I would consider creating a new table from the existing one, and treating the XML field in Python in order to parse it to JSON format.
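As a minimal sketch of that parsing step, using the first sample row from the question: note that xmltodict prefixes XML attributes with "@" by default, so the keys come out as "@name"; passing attr_prefix='-' reproduces the "-name"/"-type"/"-value" keys shown in the desired output.

import json
import xmltodict

# First sample row from the question
xml = ('<collection><row>'
       '<field name="Item Key" type="text" value="Haircolour - Avstemming kunder - OMT" />'
       '<field name="Created" type="datetime" value="2019-10-25 17:35:17Z" />'
       '<field name="Type" type="text" value="Session Provisioning Failure" />'
       '</row></collection>')

# attr_prefix defaults to '@'; '-' matches the "-name"/"-type"/"-value" keys above
parsed = xmltodict.parse(xml, attr_prefix='-')
print(json.dumps(parsed))
# {"collection": {"row": {"field": [{"-name": "Item Key", "-type": "text", ...}]}}}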

I have reproduced the workflow on my end, using Google Cloud Client libraries for Python and it works properly with the attached code below. This code works as follows:

  • Export the table as a CSV file to a GCS bucket
  • Download the CSV file from the GCS bucket to your machine
  • Append a column named "JSON_data" to the input DataFrame
  • Parse the XML column "data" into JSON format and store it in the "JSON_data" column
  • Create a new BigQuery table with the new data

In order to create the BigQuery table I have followed this StackOverflow thread: stackoverflow.com/questions/44947369/load-the-csv-file-into-big-query-auto-detect-schema-using-python-api.

You will have to set your own variables (bucket_name, project, dataset_id, table_id, location). Remember to have your GCS bucket in the same region as your BigQuery dataset.
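If you want to verify the two locations before running the export, here is a small optional check (a sketch assuming project, dataset_id and bucket_name are set as in the script below); the extract job fails when the locations differ:

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
storage_client = storage.Client()

# Both printed locations should match, e.g. "EU" and "EU"
dataset = bq_client.get_dataset('{}.{}'.format(project, dataset_id))
bucket = storage_client.get_bucket(bucket_name)
print('Dataset location:', dataset.location)
print('Bucket location:', bucket.location)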

import xmltodict, json
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd

# Define BigQuery client
client = bigquery.Client()

# Extract job parameters (replace the placeholders with your own values)
bucket_name = "<YOUR_BUCKET_NAME>"
project = "<YOUR_PROJECT_ID>"
dataset_id = "<YOUR_DATASET_ID>"
table_id = "<YOUR_TABLE_ID>"
location = "<YOUR_TABLE_LOCATION>"

def export_dataset(bucket_name, dataset_id, project, table_id):
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_table.csv")
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location=location,
    )  # API request
    extract_job.result()  # Waits for the job to complete.
    print("Exported {}:{}.{} to {}".format(
        project, dataset_id, table_id, destination_uri))

# Execute export job
export_dataset(bucket_name, dataset_id, project, table_id)

# --------------------------------------------
# Retrieve the CSV file from the GCS bucket
source_blob_name = "bq_table.csv"
destination_file_name = "bq_table.csv"

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name, destination_file_name))

# Download CSV from bucket
download_blob(bucket_name, source_blob_name, destination_file_name)

# --------------------------------------------
# Declare XML column name
XML_col = 'data'

# Read CSV as a pandas DataFrame
df = pd.read_csv('bq_table.csv')
# Append the JSON_data column
df['JSON_data'] = ''
# Transform each XML value and collect the results in a list
JSON_arr = [json.dumps(xmltodict.parse(df[XML_col].values[i]))
            for i in range(len(df[XML_col]))]
# Set the transformed data on the JSON_data column
df.loc[:, 'JSON_data'] = JSON_arr
# DataFrame to CSV - generate the output file
df.to_csv('new_data.csv', index=False, sep=',')

# ----------------------------------------------
# Now we will create the new table from the new CSV
new_table = 'new_table'

# Define schema for the table
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("loaded_date", "DATE"),
    bigquery.SchemaField("JSON_data", "STRING"),
]

# stackoverflow.com/questions/44947369/load-the-csv-file-into-big-query-auto-detect-schema-using-python-api
def insertTable(datasetName, tableName, csvFilePath, schema=None):
    """
    This function creates a table in the given dataset in our default
    project and inserts the data given via a CSV file.

    :param datasetName: The name of the dataset in which the table needs
        to be created
    :param tableName: The name of the table to be created
    :param csvFilePath: The path of the file to be inserted
    :param schema: The schema of the table to be created
    :return: returns nothing
    """
    csv_file = open(csvFilePath, 'rb')
    dataset_ref = client.dataset(datasetName)
    from google.cloud.bigquery import Dataset
    dataset = Dataset(dataset_ref)
    table_ref = dataset.table(tableName)
    if schema is not None:
        table = bigquery.Table(table_ref, schema)
    else:
        table = bigquery.Table(table_ref)
    # Drop the table if it already exists, then recreate it
    try:
        client.delete_table(table)
    except Exception:
        pass
    table = client.create_table(table)
    from google.cloud.bigquery import LoadJobConfig
    job_config = LoadJobConfig()
    job_config.source_format = 'CSV'
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    job = client.load_table_from_file(
        csv_file, table_ref, job_config=job_config)
    job.result()  # Waits for the load job to complete.

insertTable(dataset_id, new_table, 'new_data.csv', schema)
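Once the load completes, the JSON strings are queryable with BigQuery's JSON functions. A sketch via the same Python client (it assumes the default "@"-prefixed keys produced by xmltodict.parse above, and 'new_table' as created by the script):

# Query the freshly loaded table; JSON_EXTRACT pulls sub-objects out of the string
query = """
    SELECT
      id,
      loaded_date,
      JSON_EXTRACT(JSON_data, '$.collection.row') AS row_json
    FROM `{}.{}.new_table`
""".format(project, dataset_id)

for row in client.query(query).result():
    print(row.id, row.row_json)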

Please let me know if this worked for you.
