How to handle large amounts of data in TensorFlow?

Problem description

For my project I have a large amount of data, about 60GB spread across npy files, each holding about 1GB and containing about 750k records and labels.

Each record is 345 float32 values and each label is 5 float32 values.
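
(Those numbers line up: roughly 750k records × (345 + 5) float32 values × 4 bytes ≈ 1.05 GB per file, which matches the ~1 GB figure.)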

I have read the tensorflow dataset documentation and the queues/threads documentation as well, but I can't figure out how best to handle the input for training, and then how to save the model and weights for future prediction.

My model is pretty straightforward; it looks like this:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 345], name='x')
y = tf.placeholder(tf.float32, [None, 5], name='y')

wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(x, wi) + bi)

wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)

loss = tf.reduce_mean(tf.squared_difference(y, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)
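
The weight_and_bias helper isn't shown in the question; presumably it just creates a trainable weight matrix and bias vector of the given shape, roughly along these lines (the initializers here are assumptions):

import tensorflow as tf

# Hypothetical reconstruction of the helper used above (not shown in the question).
def weight_and_bias(n_in, n_out):
    w = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.1))
    b = tf.Variable(tf.constant(0.1, shape=[n_out]))
    return w, b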

The way I was training my neural net was to read the files one at a time in a random order, then use a shuffled numpy array to index each file and manually create each batch to feed the train_op using feed_dict. From everything I have read this is very inefficient and I should somehow replace it with datasets or queues and threads, but as I said the documentation was of no help.
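
For reference, a rough sketch of that manual approach (the file names, batch size, and the sess/x/y/train_op variables from the model above are assumptions about code the question does not show):

import numpy as np

# Hypothetical reconstruction of the manual feed_dict loop described above.
file_list = ['datafile1.npy', 'datafile2.npy']   # assumed file names
np.random.shuffle(file_list)
for path in file_list:
    with open(path, 'rb') as fp:
        data = np.load(fp)                       # (~750k, 345) float32
        labels = np.load(fp)                     # (~750k, 5) float32
    idx = np.random.permutation(len(data))       # shuffled index array
    for start in range(0, len(data), 128):       # assumed batch size
        batch = idx[start:start + 128]
        sess.run(train_op, feed_dict={x: data[batch], y: labels[batch]})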

So, what is the best way to handle large amounts of data in tensorflow?

Also, for reference, my data was saved to a numpy file in a two-step operation:

with open('datafile1.npy', 'wb') as fp:
    np.save(fp, data)      # two sequential np.save calls on the same file object
    np.save(fp, labels)    # store both arrays in a single .npy file

Recommended answer

The utilities for npy files indeed allocate the whole array in memory. I'd recommend converting all of your numpy arrays to the TFRecords format and using those files in training. This is one of the most efficient ways to read a large dataset in tensorflow.

Convert to TFRecords

import tensorflow as tf

def array_to_tfrecords(X, y, output_file):
    feature = {
        'X': tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten())),
        'y': tf.train.Feature(float_list=tf.train.FloatList(value=y.flatten()))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    serialized = example.SerializeToString()

    writer = tf.python_io.TFRecordWriter(output_file)
    writer.write(serialized)
    writer.close()
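
As written, array_to_tfrecords serializes the entire X and y arrays into a single Example, while parse_proto below expects one 345-float record (and one 5-float label) per Example. A hedged sketch of a conversion driver that writes one Example per row instead; the file names and the two-step np.load are assumptions based on the save code above:

import numpy as np
import tensorflow as tf

# Hypothetical driver (not part of the original answer): converts one npy file
# into one TFRecord file, writing a separate Example for every record.
def npy_to_tfrecords(npy_file, output_file):
    with open(npy_file, 'rb') as fp:
        data = np.load(fp)       # (~750k, 345) float32
        labels = np.load(fp)     # (~750k, 5) float32
    writer = tf.python_io.TFRecordWriter(output_file)
    for X_row, y_row in zip(data, labels):
        feature = {
            'X': tf.train.Feature(float_list=tf.train.FloatList(value=X_row)),
            'y': tf.train.Feature(float_list=tf.train.FloatList(value=y_row)),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()

npy_to_tfrecords('datafile1.npy', 'file1.tfrecord')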

A complete example that deals with images can be found here.

Read TFRecordDataset

def parse_proto(example_proto):
    features = {
        'X': tf.FixedLenFeature((345,), tf.float32),
        'y': tf.FixedLenFeature((5,), tf.float32),
    }
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features['X'], parsed_features['y']

def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(parse_proto)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    return tf.contrib.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
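
A minimal sketch (not from the original answer) of wiring this pipeline into the training loop. An Iterator created with from_structure has to be initialized against a concrete dataset via make_initializer, so the sketch builds the dataset inline; the file name, shuffle buffer, and batch size are assumptions:

import tensorflow as tf

dataset = tf.contrib.data.TFRecordDataset(["file1.tfrecord"])
dataset = dataset.map(parse_proto).shuffle(10000).repeat().batch(100)

iterator = tf.contrib.data.Iterator.from_structure(dataset.output_types,
                                                   dataset.output_shapes)
init_op = iterator.make_initializer(dataset)
x_batch, y_batch = iterator.get_next()

# Build the model on x_batch / y_batch instead of the feed_dict placeholders,
# e.g. hidden_fc = tf.nn.sigmoid(tf.matmul(x_batch, wi) + bi), then recreate
# loss and train_op on those tensors before running the loop below.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(init_op)
    for step in range(10000):
        sess.run(train_op)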

The Dataset manual can be found here.
