Using Amazon's Data Pipeline to back up an S3 bucket: how to skip existing files and avoid unnecessary overwriting?

I'm using Amazon's Data Pipeline to copy an S3 bucket to another bucket. It's a pretty straightforward setup that runs nightly. However, every subsequent run copies the same files over and over; I'd rather it skip existing files and copy only the new ones, since this backup is going to get quite large in the future. Is there a way to do this?

Accepted answer

Looking at this thread, it does not seem possible to do the sync with the default CopyActivity:

You can definitely use Data Pipeline to copy one S3 directory to another, with the caveat that, if you use the CopyActivity, it will be a full copy, not an rsync. So if you're operating on a large number of files where only a small fraction have changed, the CopyActivity wouldn't be the most efficient way to do it.

You could also write your own logic to perform the diff and then only sync that, and use the CommandRunnerActivity to schedule and manage it.
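
If you wanted to implement that diff yourself, here is a rough sketch of the idea in shell. It is only an illustration under assumptions: it parses aws s3 ls output, so it breaks on keys containing whitespace, and it compares keys only, not contents or timestamps.

    # List the object keys in each bucket (the key is the 4th column of `aws s3 ls`)
    aws s3 ls --recursive s3://source_bucket | awk '{print $4}' | sort > /tmp/src_keys
    aws s3 ls --recursive s3://target_bucket | awk '{print $4}' | sort > /tmp/dst_keys

    # Copy only the keys that exist in the source but not in the target
    comm -23 /tmp/src_keys /tmp/dst_keys | while read -r key; do
        aws s3 cp "s3://source_bucket/$key" "s3://target_bucket/$key"
    done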

I think they actually mean the ShellCommandActivity, which allows you to schedule a shell command to run.

I can't give you an exact configuration example, but here is an example of a command you could run as a regular cron job to sync two buckets: aws s3 sync s3://source_bucket s3://target_bucket.
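
By default, aws s3 sync only copies files that are missing from the target or that differ in size or modification time, which is exactly the skip-existing behaviour asked about. A minimal sketch of the nightly command (bucket names are placeholders; --dryrun previews without copying, and --delete is optional):

    # Preview which files would be copied, without writing anything
    aws s3 sync s3://source_bucket s3://target_bucket --dryrun

    # The actual nightly run; add --delete only if removals in the
    # source should also be removed from the backup
    aws s3 sync s3://source_bucket s3://target_bucket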

It should be possible to run it with a ShellCommandActivity. See also ShellCommandActivity in AWS Data Pipeline, and the comments on the answer here.
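
For illustration, here is a minimal sketch of what such a pipeline definition might look like. This is untested and assumption-laden: the object names (NightlySchedule, BackupResource, BackupActivity) are invented, the bucket names and pipeline ID are placeholders, and a real definition also needs the usual Default object with roles.

    # Write a minimal pipeline definition with a ShellCommandActivity
    cat > backup-pipeline.json <<'EOF'
    {
      "objects": [
        {
          "id": "NightlySchedule",
          "name": "NightlySchedule",
          "type": "Schedule",
          "period": "1 day",
          "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {
          "id": "BackupResource",
          "name": "BackupResource",
          "type": "Ec2Resource",
          "schedule": { "ref": "NightlySchedule" },
          "terminateAfter": "1 Hour"
        },
        {
          "id": "BackupActivity",
          "name": "BackupActivity",
          "type": "ShellCommandActivity",
          "schedule": { "ref": "NightlySchedule" },
          "runsOn": { "ref": "BackupResource" },
          "command": "aws s3 sync s3://source_bucket s3://target_bucket"
        }
      ]
    }
    EOF

    # Upload it to an existing pipeline (the pipeline ID is a placeholder)
    aws datapipeline put-pipeline-definition \
        --pipeline-id df-EXAMPLE1234567 \
        --pipeline-definition file://backup-pipeline.json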

Update: the comment by @trevorhinesley with the final solution (the default instance launched by the pipeline uses an old aws CLI that has no sync command):

For anyone who comes across this, I had to fire up an EC2 instance, then copy the AMI ID that it used (it's in the info below the list of instances when you select it in the Instances menu under EC2). I used that image ID in the data pipeline and it fixed it!
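
In pipeline terms, that fix amounts to pinning the Ec2Resource to a newer AMI whose preinstalled aws CLI supports s3 sync. A hedged sketch of looking up a recent AMI ID to use (the name filter targets Amazon Linux 2 and is an assumption; adjust for your region and architecture):

    # Look up the most recent Amazon Linux 2 AMI in the current region
    aws ec2 describe-images \
        --owners amazon \
        --filters "Name=name,Values=amzn2-ami-hvm-*-x86_64-gp2" \
        --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
        --output text

    # Then set it on the Ec2Resource object in the pipeline definition:
    #   "imageId": "ami-0123456789abcdef0"   (value is a placeholder)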
