I've got a CSV file that contains more than 60 columns and 2,000,000 lines. I'm trying to count the number of null values per variable (per column), and then sum that new row to get the total number of null values in the entire CSV. For example, if we get this file as input:
We expect this other file in output:
I know how to count the number of null values per line, but I haven't figured out how to count the number of null values per column.
Recommended answer
There has to be a better way to do this, but I wrote a really nasty piece of JavaScript that does the job.
It has some problems with different column types, because it doesn't set the column type. (It should set all columns to integer, but I don't know if that is possible from JavaScript.)
You have to run the "Identify last row in a stream" step first and save its result to a field named last (or change the script).
var nulls;
var seen;

if (!seen) {
    // Initialize array
    seen = 1;
    nulls = [];
    for (var i = 0; i < getInputRowMeta().size(); i++) {
        nulls[i] = 0;
    }
}

for (var i = 0; i < getInputRowMeta().size(); i++) {
    if (row[i] == null) {
        nulls[i] += 1;
    }
    // Hack to find empty strings
    else if (getInputRowMeta().getValueMeta(i).getType() == 2 && row[i].length() == 0) {
        nulls[i] += 1;
    }
}

// Don't store any values
trans_Status = SKIP_TRANSFORMATION;

// Only store the nulls at the last row
if (last == true) {
    putRow(nulls);
}
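To check the counting logic outside Pentaho, here is a minimal standalone sketch in plain JavaScript. It is not the script above and uses no Pentaho API: the function names `countNullsPerColumn` and `totalNulls` and the sample rows are made up for illustration, and it assumes each row has already been parsed into an array of cells. Like the script, it counts both nulls and empty strings.

```javascript
// Standalone sketch, not Pentaho-specific: count null or empty cells per column.
// `rows` is a hypothetical array of arrays, one inner array per CSV line.
function countNullsPerColumn(rows, columnCount) {
  const nulls = new Array(columnCount).fill(0);
  for (const row of rows) {
    for (let i = 0; i < columnCount; i++) {
      const value = row[i];
      // Treat null, undefined, and empty strings as "null".
      if (value === null || value === undefined || value === "") {
        nulls[i] += 1;
      }
    }
  }
  return nulls;
}

// Sum the per-column counts to get the total for the whole file.
function totalNulls(nullsPerColumn) {
  return nullsPerColumn.reduce((sum, n) => sum + n, 0);
}

// Made-up sample data, for illustration only.
const sampleRows = [
  ["a", null, "1"],
  ["", "b", null],
  ["c", null, "3"],
];
const perColumn = countNullsPerColumn(sampleRows, 3);
console.log(perColumn);             // per-column counts: 1, 2, 1
console.log(totalNulls(perColumn)); // 4
```

Keeping one running counter per column means the rows only need to be read once, which matters for a file with 2,000,000 lines.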
Counting the number of null values per column with Pentaho