Python for large data processing



I'm relatively new to Python, and have been able to answer most of my questions based on similar problems answered on forums, but I'm at a point where I'm stuck and could use some help.

I have a simple nested for loop script that generates an output of strings. What I need to do next is have each grouping go through a simulation, based on numerical values that the strings will be matched to.

Really, my question is: how do I go about this in the best way? I'm not sure if multithreading will work, since the strings are generated and then need to undergo the simulation, one set at a time. I was reading about queues and wasn't sure if the strings could be passed into a queue for storage and then undergo the simulation in the same order they entered the queue.
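On the ordering question: the standard library's queue.Queue is strictly FIFO, so items do come back out in the same order they went in. A tiny sketch (the strings here are just placeholders for the generated combinations):

```python
import queue

combos = queue.Queue()  # FIFO: items come out in the order they went in
for s in ["ab", "cd", "ef"]:
    combos.put(s)

# drain the queue; insertion order is preserved
ordered = []
while not combos.empty():
    ordered.append(combos.get())

print(ordered)
```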

Regardless of the research I've done, I'm open to any suggestion anyone can make on the matter.

Thanks!

Edit: I'm not looking for an answer on how to do the simulation, but rather a way to store the combinations while simulations are being computed.

Example:

```python
import itertools

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            print(D)
```

Accepted answer


As was hinted at in the comments, the multiprocessing module is what you're looking for. Threading won't help you because of the Global Interpreter Lock (GIL), which limits execution to one Python thread at a time. In particular, I would look at multiprocessing pools. These objects give you an interface to have a pool of subprocesses do work for you in parallel with the main process, and you can go back and check on the results later.

Your example snippet could look something like this:

```python
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

# By default, this will create a number of workers equal to
# the number of CPU cores you have available.
pool = multiprocessing.Pool()

combination_list = []  # create a list to store the combinations
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            combination_list.append(D)  # append this combination to the list

# simulation_function is the function you're using to actually run your
# simulation - assuming it only takes one parameter: the combination
results = pool.map(simulation_function, combination_list)
```

The call to pool.map is blocking - meaning that once you call it, execution in the main process will halt until all the simulations are complete, but it is running them in parallel. When they complete, whatever your simulation function returns will be available in results, in the same order that the input arguments were in the combination_list.
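One practical note: multiprocessing pickles the worker function and, on some platforms, re-imports the main module in each worker, so a runnable version of the snippet needs the function defined at top level and the pool created under an `if __name__ == "__main__":` guard. Here is a minimal self-contained sketch, with a placeholder simulation_function standing in for the real simulation:

```python
import itertools
import multiprocessing

def simulation_function(combination):
    # Placeholder: a real simulation would use the numerical values
    # matched to these strings; here we just join them and count them.
    return ("".join(combination), len(combination))

if __name__ == "__main__":
    X = ["a", "b"]
    Y = ["c", "d", "e"]
    Z = ["f", "g"]
    combination_list = [A + B + C
                        for A in itertools.combinations(X, 1)
                        for B in itertools.combinations(Y, 2)
                        for C in itertools.combinations(Z, 2)]
    with multiprocessing.Pool() as pool:  # context manager closes the pool
        results = pool.map(simulation_function, combination_list)
    print(results[0])  # result for the first combination; order is preserved
```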

If you don't want to wait for them, you can also use apply_async on your pool and store the result to look at later:

```python
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()

result_list = []  # create a list to store the simulation results
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            # note the extra comma - args must be a tuple
            result_list.append(pool.apply_async(simulation_function, args=(D,)))

# do other stuff
# now iterate over result_list to check the results when they're ready
```

If you use this structure, result_list will be full of multiprocessing.AsyncResult objects, which allow you to check whether each one is ready with result.ready() and, if so, retrieve its value with result.get(). This approach kicks off each simulation as soon as its combination is calculated, instead of waiting until all of them have been calculated before processing starts. The downside is that managing and retrieving the results is a little more complicated: you have to make sure a result is ready before calling get() (or be prepared for it to block or time out), and you need to be ready to catch exceptions that may have been raised in the worker function. The caveats are explained pretty well in the documentation.
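As a rough sketch of that checking loop (again with a placeholder simulation_function), you can poll ready() while the main process does other work, then collect with get(), which re-raises any exception the worker raised:

```python
import multiprocessing
import time

def simulation_function(combination):
    # Placeholder: pretend the simulation just joins the strings.
    return "".join(combination)

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        pending = [pool.apply_async(simulation_function, args=(c,))
                   for c in [("a", "c"), ("b", "d")]]
        # Poll ready() so the main process could do other work in between.
        while not all(r.ready() for r in pending):
            time.sleep(0.05)
        # get() returns the worker's return value, or re-raises its exception.
        results = [r.get() for r in pending]
    print(results)  # same order as the submissions
```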

If calculating the combinations doesn't actually take very long and you don't mind your main process halting until they're all ready, I suggest the pool.map approach.


Published: 2023-08-07 03:55:00
Link: https://www.elefans.com/category/jswz/34/1460792.html
