Is there a way to optimise my PowerShell function for removing pattern matches from a large file?

Updated: 2024-10-22 21:44:09

I've got a large text file (~20K lines, ~80 characters per line). I've also got a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note, if the pattern from the array appears on a line in the input file, I wish to remove the entire line, not just the pattern.

The input file is CSVish with lines similar to:

A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;

The patterns in the array, which I search for in each line of the input file, resemble the

XX000029

part of the line above.

My somewhat naïve function to achieve this goal looks like this currently:

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    try{
        $FileContent = Get-Content $BigFile
    }catch{
        Write-Error $_
    }

    $IgnorePatterns | ForEach-Object {
        $IgnoreId = $_.IgnoreId
        $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
        Write-Host $FileContent.count
    }
    $FileContent | Set-Content "CleansedBigFile.txt"
}

This works, but is slow.

How can I make it quicker?

Best answer

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Build one alternation pattern from all the escaped patterns.
    # Assumes each item converts cleanly to a string.
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)

        $line=$reader.ReadLine()
        while ($null -ne $line)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}

            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }

        $reader.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}
 

StreamReader is one of the preferred ways to read large text files. We also build a single regex pattern string to match each line against. Each pattern is passed through [regex]::Escape() as a precaution in case regex control characters are present; I have to guess here, since we only see one sample pattern.

If the items in $IgnorePatterns can easily be cast to strings, this should work just fine. A small sample of what $regex looks like would be:

XX000029|XX000028|XX000027
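For the objects described in the question, where each item carries its pattern in an IgnoreId property, the pattern string could be built from that property instead of from the object itself. A minimal sketch, assuming made-up sample data (the third ID contains a dot purely to show what escaping does):

```powershell
# Hypothetical sample data; the real $IgnorePatterns comes from elsewhere.
$IgnorePatterns = @(
    [pscustomobject]@{ IgnoreId = 'XX000029' }
    [pscustomobject]@{ IgnoreId = 'XX000028' }
    [pscustomobject]@{ IgnoreId = 'XX.00027' }
)

# Escape each ID so regex metacharacters are matched literally, then join with "|".
$regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_.IgnoreId) }) -join '|'

$regex   # XX000029|XX000028|XX\.00027
```

Without the escaping, the dot in the third ID would match any character, so a line containing XXA00027 would be dropped as well.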

If $IgnorePatterns is populated from a database you might have less control over this, but since we are using regex you might be able to shrink the pattern set by writing a real regex (instead of just one big alternation), as in my example above. You could reduce that sample to XX00002[7-9], for instance.
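As a quick check of that reduction, the sketch below runs both forms of the pattern against a few made-up lines in the shape of the question's input and compares what survives. The lines and variable names are illustrative only.

```powershell
$alternation = 'XX000029|XX000028|XX000027'
$charClass   = 'XX00002[7-9]'

# Made-up lines in the shape of the question's input file.
$lines = @(
    'A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
    'A;AAA-BBB;XXX;XX000027;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
    'A;AAA-BBB;XXX;XX000026;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
)

# Filter with each pattern; @() keeps the result an array even for a single line.
$keptByAlternation = @($lines | Where-Object { $_ -notmatch $alternation })
$keptByCharClass   = @($lines | Where-Object { $_ -notmatch $charClass })

# Both filters should keep only the XX000026 line.
$keptByAlternation.Count
$keptByCharClass.Count
```

Whether the shorter pattern is measurably faster is a separate question that would need timing on real data.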

I don't know whether the regex itself will provide a performance boost with 1,500 alternatives. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which does not win any awards for speed either (a StreamWriter could be used in its place).
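If the ID always sits in the same field, the regex could be skipped altogether. The sketch below is a different technique from the one in this answer, not a variant of it: it assumes the ID is always the fourth `;`-separated field (as in the sample line) and that an ID must equal that field exactly. The function name Remove-IdsByField and the -OutFile parameter are hypothetical.

```powershell
function Remove-IdsByField {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [string[]]$IgnoreIds,
        [string]$OutFile = "CleansedBigFile.txt"
    )

    # A HashSet lookup does not slow down as the number of IDs grows.
    $lookup = New-Object 'System.Collections.Generic.HashSet[string]'
    foreach ($id in $IgnoreIds) { [void]$lookup.Add($id) }

    # ReadAllLines loads the whole file; Convert-Path gives .NET a full path.
    $kept = foreach ($line in [System.IO.File]::ReadAllLines((Convert-Path $BigFile))) {
        # ASSUMPTION: the ID is always the fourth field (index 3).
        $fields = $line.Split(';')
        if ($fields.Length -lt 4 -or -not $lookup.Contains($fields[3])) { $line }
    }

    $kept | Set-Content $OutFile
}
```

With the question's objects, the IDs could be passed as `$IgnorePatterns | ForEach-Object { $_.IgnoreId }`. Lines with fewer than four fields are kept unchanged in this sketch.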

Reader and Writer

I still have to test this to be sure it works, but this version uses only a StreamReader and a StreamWriter. If it does work better I am just going to replace the above code.

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)

        # Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")

        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}

            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }

        # Don't cross the streams!
        $reader.Close()
        $writer.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}

You might need some error handling in there for the streams, but it does appear to work.
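One way to add that error handling is to wrap the loop in try/finally, so both streams are closed even if reading or writing fails partway. A minimal sketch under stated assumptions: the function name Remove-IdsFromFileSafe and the -OutFile parameter are hypothetical, and both paths are resolved to full paths first because .NET resolves relative paths against the process working directory, which can differ from PowerShell's current location.

```powershell
function Remove-IdsFromFileSafe {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns,
        [string]$OutFile = "CleansedBigFile.txt"
    )

    # Same pattern construction as above; assumes items convert cleanly to strings.
    $regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_) }) -join "|"

    # -ErrorAction Stop turns a missing input file into a terminating error.
    $inPath  = Convert-Path -LiteralPath $BigFile -ErrorAction Stop
    # The output file may not exist yet, so resolve it without requiring that it does.
    $outPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutFile)

    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($inPath)
        $writer = New-Object System.IO.StreamWriter($outPath)

        # ReadLine() returns $null at end of file.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line -notmatch $regex) { $writer.WriteLine($line) }
        }
    }
    finally {
        # Runs whether or not an exception was thrown; Dispose flushes the writer.
        if ($null -ne $writer) { $writer.Dispose() }
        if ($null -ne $reader) { $reader.Dispose() }
    }
}
```

The null checks in the finally block cover the case where the reader opens successfully but creating the writer throws.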

Published: 2023-07-24 01:51:00
Link: https://www.elefans.com/category/jswz/34/1240124.html