Greetings,
I've taken over from a prior team and am writing ETL jobs which process csv files. I use a combination of shell scripts and perl on ubuntu. The csv files are huge; they arrive as zipped archives. Unzipped, many are more than 30Gb - yes, that's a G
The legacy process is a batch job running on cron that unzips each file entirely, reads and copies the first line of it into a config file, then re-zips the entire file. Some days this takes many many hours of processing time, for no benefit.
Can you suggest a method to only extract the first line (or first few lines) from each file inside a zipped archive, without fully unpacking the archives?
Recommended answer
The unzip command line utility has a -p option which dumps a file to standard out. Just pipe that into head and it'll not bother extracting the whole file to disk.
Alternatively, in Perl, from perldoc Archive::Zip (the code below uses Archive::Zip's member API - memberNamed, readChunk, AZ_OK - not IO::Compress::Zip):
my ($status, $bufferRef);
my $member = $zip->memberNamed( 'xyz.txt' );
$member->desiredCompressionMethod( COMPRESSION_STORED );
$status = $member->rewindData();
die "error $status" unless $status == AZ_OK;
while ( ! $member->readIsDone() ) {
    ( $bufferRef, $status ) = $member->readChunk();
    die "error $status" if $status != AZ_OK && $status != AZ_STREAM_END;
    # do something with $bufferRef:
    print $$bufferRef;
}
$member->endRead();
Modify to suit, e.g. iterate over the file list returned by $zip->memberNames() and read only the first few lines of each member.