I implemented an iterator class as follows:

```python
import numpy as np
import time

class Data:
    def __init__(self, filepath):
        # Computationally expensive
        print("Computationally expensive")
        time.sleep(10)
        print("Done!")

    def __iter__(self):
        return self

    def __next__(self):
        return np.zeros((2, 2)), np.zeros((2, 2))

count = 0
for batch_x, batch_y in Data("hello.csv"):
    print(batch_x, batch_y)
    count = count + 1
    if count > 5:
        break

count = 0
for batch_x, batch_y in Data("hello.csv"):
    print(batch_x, batch_y)
    count = count + 1
    if count > 5:
        break
```

However, the constructor is computationally expensive, and the for loop may run multiple times. For example, in the code above the constructor is called twice (each for loop creates a new Data object).

How do I separate the constructor from the iterator? I am hoping for code like the following, where the constructor is called only once:

```python
data = Data(filepath)
for batch_x, batch_y in data.get_iterator():
    print(batch_x, batch_y)
for batch_x, batch_y in data.get_iterator():
    print(batch_x, batch_y)
```

Accepted answer
You can iterate over an iterable object directly; for..in doesn't require anything else:

```python
data = Data(filepath)
for batch_x, batch_y in data:
    print(batch_x, batch_y)
for batch_x, batch_y in data:
    print(batch_x, batch_y)
```

That said, depending on how you implement __iter__(), this can be buggy.

For example:

Bad

```python
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._items):  # Or however you check if data is available
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result
```

This is bad because you cannot iterate over the same object twice: after the first loop, self._i still points past the end.
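A quick self-contained demonstration of that failure mode (using a small list in place of the hypothetical load_items(filepath)):

```python
# The "bad" pattern: iterable and iterator are the same object,
# and the position counter is never reset.
class BadData:
    def __init__(self, items):
        self._items = items  # stand-in for load_items(filepath)
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._items):
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result

data = BadData([1, 2, 3])
print(list(data))  # [1, 2, 3]
print(list(data))  # [] -- the object is already exhausted
```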
Good-ish

```python
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        self._i = 0
        return self

    def __next__(self):
        if self._i >= len(self._items):
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result
```

This resets the index every time you are about to iterate, fixing the above. However, it still fails if you nest iteration over the same object.
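The nesting problem can be seen in a short sketch (again substituting a plain list for load_items): both loops share the single self._i counter, so the inner loop fast-forwards the outer one to the end.

```python
# The "good-ish" pattern: the index is reset in __iter__, but there
# is still only one counter shared by all concurrent loops.
class GoodishData:
    def __init__(self, items):
        self._items = items  # stand-in for load_items(filepath)

    def __iter__(self):
        self._i = 0
        return self

    def __next__(self):
        if self._i >= len(self._items):
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result

data = GoodishData([1, 2, 3])
pairs = [(x, y) for x in data for y in data]
print(len(pairs))  # 3, not the expected 9: the outer loop ran only once
```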
Better

To fix that, keep the iteration state in a separate iterator object:

```python
class Data:
    class Iter:
        def __init__(self, data):
            self._data = data
            self._i = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self._i >= len(self._data._items):  # check for available data
                raise StopIteration
            result = self._data._items[self._i]
            self._i += 1
            return result

    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        return self.Iter(self)
```

This is the most flexible approach, but it's unnecessarily verbose if you can use either of the ones below.
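With each loop getting its own Iter instance, nested iteration over the same object now behaves as expected (a self-contained sketch, again with a list standing in for load_items):

```python
# The "better" pattern: every call to __iter__ returns a fresh
# iterator object with its own position counter.
class Data:
    class Iter:
        def __init__(self, data):
            self._data = data
            self._i = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self._i >= len(self._data._items):
                raise StopIteration
            result = self._data._items[self._i]
            self._i += 1
            return result

    def __init__(self, items):
        self._items = items  # stand-in for load_items(filepath)

    def __iter__(self):
        return self.Iter(self)

data = Data([1, 2, 3])
pairs = [(x, y) for x in data for y in data]
print(len(pairs))  # 9 -- each loop advances its own iterator
```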
Simple, using yield

If you use a generator, the language keeps track of the iteration state for you, and it does so correctly even with nested loops:

```python
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        for it in self._items:  # Or whatever is appropriate
            yield it
```

Simple, pass-through to the underlying iterable
If the "computationally expensive" part is loading all the data into memory, you can use the cached data directly:

```python
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        return iter(self._items)
```
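Either of these two simpler variants produces a fresh iterator on every call to iter(), so repeated and nested loops both work. A minimal check of the pass-through version (with a list standing in for load_items):

```python
# The pass-through pattern: delegate to the cached container's own
# iterator, which Python creates anew for each loop.
class Data:
    def __init__(self, items):
        self._items = items  # stand-in for the expensive load_items(filepath)

    def __iter__(self):
        return iter(self._items)

data = Data([1, 2, 3])
print(list(data))  # [1, 2, 3]
print(list(data))  # [1, 2, 3] -- a fresh iterator each time
print(len([(x, y) for x in data for y in data]))  # 9
```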