I am working on a personal database project. My input file comes from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After splitting it into 1400 separate files (named 00001.txt, ..., 01400.txt) and applying stemming to them, I store them separately in a folder, let's call it StemmedFolder, in the following format:
In StemmedFolder, 00001.txt contains:
investig aerodynam wing slipstream brenckman experiment investig aerodynam wing
In StemmedFolder, 00756.txt contains:
remark eddi viscos compress mix flow lu ting
And so on....
I wrote the code that does:
(I can provide my code for these 4 sections in case somebody needs to see the implementation or wants to change or edit anything.)
The output for each file is written to a separate file (1400 files, named 00001.txt, 00002.txt, ...) in a folder, let's call it FrequenceyFolder, in the following format:
In FrequenceyFolder, 00001.txt contains:
00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....
In FrequenceyFolder, 00999.txt contains:
00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....
In FrequenceyFolder, 01400.txt contains:
01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
______________
Now my question:
I need to combine these 1400 files into a single txt file that looks like this, with some calculation:
'aerodynam' totalFrequency=3 docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2 docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1 doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1 doc: [[Doc_01235,1]]
Thanks for spending time reading this long post
Solution:
You can use a Map of Lists.
    Map<String, List<FileInformation>> statistics = new HashMap<>();
In the above map, the key will be the word and the value will be a List<FileInformation> object describing the statistics of the individual files containing the word. The FileInformation class can be declared as follows:
    class FileInformation {
        int occurrenceCount;
        String fileName;
        // getters and setters
    }
To populate the above Map, use the following steps:
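As a minimal sketch of populating the map, assuming each file in FrequenceyFolder holds one docId,word,count record per line (the samples in the question suggest this, but it is an assumption), something like the following could work. The names StatisticsLoader, addRecord, and load are hypothetical, and FileInformation is restated with a constructor only so that the block compiles on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: class and method names are illustrative only.
final class StatisticsLoader {

    // Same fields as the FileInformation class in the answer.
    static final class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Adds one "docId,word,count" record to the map.
    // Blank or malformed lines (such as "....") are skipped silently.
    static void addRecord(Map<String, List<FileInformation>> statistics, String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            return;
        }
        int count;
        try {
            count = Integer.parseInt(parts[2].trim());
        } catch (NumberFormatException e) {
            return;
        }
        String docId = parts[0].trim();
        String word = parts[1].trim();
        // Create the list on first sight of the word, then append this file's entry.
        statistics.computeIfAbsent(word, k -> new ArrayList<>())
                  .add(new FileInformation(docId, count));
    }

    // Reads every *.txt file in the folder (assumed: one record per line).
    static Map<String, List<FileInformation>> load(Path folder) throws IOException {
        Map<String, List<FileInformation>> statistics = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    addRecord(statistics, line);
                }
            }
        }
        return statistics;
    }
}
```

Calling StatisticsLoader.load(Paths.get("FrequenceyFolder")) would then return the populated map; the folder path here is only a placeholder.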
Once you have the Map populated, printing the statistics should be a piece of cake.
    for (String word : statistics.keySet()) {
        List<FileInformation> fileInfos = statistics.get(word);
        for (FileInformation fileInfo : fileInfos) {
            // sum up the occurrenceCount for the word to get the total frequency
        }
    }
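The loop above could be fleshed out roughly as follows to produce lines in the format the question asks for. This is only a sketch under a few assumptions: words are printed in alphabetical order, documents are listed by descending count (as the sample output appears to do), and totalFrequency is the sum of the counts, as the comment in the loop suggests. The sample in the question seems to show the number of documents instead; if that is what is wanted, fileInfos.size() would replace the sum. StatisticsPrinter, formatWord, and formatAll are hypothetical names, and FileInformation is restated so that the block compiles on its own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: class and method names are illustrative only.
final class StatisticsPrinter {

    // Same fields as the FileInformation class in the answer.
    static final class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Builds one output line for a word, e.g.
    // 'word' totalFrequency=N docs: [[Doc_00001,5],[Doc_00002,3]]
    static String formatWord(String word, List<FileInformation> fileInfos) {
        // Copy before sorting so the caller's list is left untouched.
        List<FileInformation> sorted = new ArrayList<>(fileInfos);
        sorted.sort(Comparator.comparingInt((FileInformation f) -> f.occurrenceCount).reversed());

        int total = 0;
        StringBuilder docs = new StringBuilder();
        for (FileInformation f : sorted) {
            total += f.occurrenceCount; // assumption: total = sum of per-file counts
            if (docs.length() > 0) {
                docs.append(',');
            }
            docs.append("[Doc_").append(f.fileName).append(',')
                .append(f.occurrenceCount).append(']');
        }
        return "'" + word + "' totalFrequency=" + total + " docs: [" + docs + "]";
    }

    // Formats every word; the TreeMap copy yields alphabetical key order.
    static List<String> formatAll(Map<String, List<FileInformation>> statistics) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<FileInformation>> e : new TreeMap<>(statistics).entrySet()) {
            lines.add(formatWord(e.getKey(), e.getValue()));
        }
        return lines;
    }
}
```

The resulting lines could then be written to a single txt file, for example with Files.write from java.nio.file.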