Java data structure for providing a random pair based on a large data set at run-time

Is there a smart way to create a 'JSON-like' structure of String-Float pairs? A key isn't needed, as the data will be grabbed randomly, although an incrementing key from 0 to n might aid random retrieval of the associated data. Due to the size of the data set (10k pairs of values), I need this to be saved out to an external file.

The reason is how my data will be compiled. To save someone from entering data into an array manually, the data set will be Excel-based, saved out to CSV, and parsed by a temporary Java program into a file format (for example JSON) that can be added to my project's resources folder. I can then retrieve data from this set without my application having to load a huge array into memory on application creation. I can quite easily parse the CSV to 'fill up' an array (or similar) at run-time, but I fear that on a mobile device the memory overhead will be significant.

I have reviewed the answers to "Suitable Java data structure for parsing large data file" and "Data structure options for efficiently storing sets of integer pairs on disk?" and have not been able to draw a definitive conclusion.

I have tried saving to a .json file, but I'm not sure whether I can request a random entry from it, and it seems quite cumbersome for holding such a simple structure. Is a TreeMap or Hashtable where I should be focusing my search?

To provide some context, my application will be running on Android and needs to reference a definition (a String of approx. 500 characters) and a conversion factor (a Float). I need to retrieve a random data entry. The user may only make 2 or 3 requests during a session, so I see no point in loading a 10k-element array into memory. Query: will modern Android phones easily munch through this type of query, so that it is perhaps only an issue if I am parsing millions of entries at run-time?

I am open to using SQLite to hold my data if this will provide the functionality required. Please note that the data set must be derived from a file format that is easily exportable from Excel (CSV, TXT etc.).

Any advice you can give me would be much appreciated.

Best answer

Here's one possible design that requires a minimal memory footprint while providing fast access:

Start with a data file of comma-separated or tab-separated values so you have line breaks between your data pairs.

Keep an array of long values holding the byte offsets at which lines start in the data file. When you know where the lines are, you can use InputStream.skip() to advance to the desired line. This leverages the fact that skip() is typically quite a bit faster than read() for InputStreams.

You would have some setup code that would run at initialization time to index the lines.

An enhancement would be to index only every nth line so that the array is smaller. So if n is 100 and you're accessing line 1003, you use index entry 10 to skip past the first 1000 lines, then read past two more lines to get to line 1003. This lets you tune the size of the array to use less memory, at the cost of reading up to n-1 extra lines per lookup.
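The every-nth-line idea can be sketched in plain Java, independent of Android. This is a minimal illustration rather than the code from the sample project: the class name SparseLineIndex and its methods are hypothetical, lines are 0-based, and it assumes lines end in '\n' (a trailing '\r' is stripped, so "\r\n" files work too).

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: remembers the byte offset of every step-th line.
class SparseLineIndex {
    private final int step;        // index every step-th line
    private final long[] offsets;  // offsets[i] = offset where 0-based line i*step starts
    private final int lineCount;

    private SparseLineIndex(int step, long[] offsets, int lineCount) {
        this.step = step;
        this.offsets = offsets;
        this.lineCount = lineCount;
    }

    int lineCount() { return lineCount; }

    // One pass over the stream to build the index.
    static SparseLineIndex build(InputStream in, int step) throws IOException {
        long[] offsets = new long[16];   // offsets[0] == 0: line 0 starts the file
        int used = 1;
        long pos = 0;
        int lines = 0;
        boolean lineOpen = false;        // true if bytes were seen since the last '\n'
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == '\n') {
                lines++;
                lineOpen = false;
                if (lines % step == 0) {
                    if (used == offsets.length) offsets = Arrays.copyOf(offsets, used * 2);
                    offsets[used++] = pos;
                }
            } else {
                lineOpen = true;
            }
        }
        if (lineOpen) lines++;           // last line without a trailing '\n'
        return new SparseLineIndex(step, Arrays.copyOf(offsets, used), lines);
    }

    // 'in' must be a fresh stream positioned at the start of the same data.
    String readLine(InputStream in, int line) throws IOException {
        if (line < 0 || line >= lineCount) throw new IndexOutOfBoundsException("line " + line);
        skipFully(in, offsets[line / step]);
        int toSkip = line % step;        // lines still to read past
        int b;
        while (toSkip > 0 && (b = in.read()) != -1) {
            if (b == '\n') toSkip--;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while ((b = in.read()) != -1 && b != '\n') buf.write(b);
        String text = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return text.endsWith("\r") ? text.substring(0, text.length() - 1) : text;
    }

    // skip() may skip fewer bytes than requested, so loop until done.
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) n -= skipped;
            else if (in.read() == -1) throw new EOFException();
            else n--;
        }
    }
}
```

On Android the two streams would come from opening the same raw resource twice: once for build() at start-up and once per lookup for readLine().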

I thought this was an interesting problem, so I put together some code to test my idea. It uses a 4 MB sample CSV file with about 36K lines of data that I downloaded from a big-data website. Most of the lines are longer than 100 chars.

Here's a code snippet for the setup phase:

    // Setup: record the byte offset at which every MULTIPLE-th line break ends.
    // Index 0 is never written here, so mLines[0] must already be 0 (start of file).
    long start = SystemClock.elapsedRealtime();
    int lineCount = 0;
    try (InputStream in = getResources().openRawResource(R.raw.fl_insurance_sample)) {
        int index = 0;
        int charCount = 0;  // bytes read so far
        int cIn;
        while ((cIn = in.read()) != -1) {
            charCount++;
            char ch = (char) cIn;  // this was for debugging
            if (ch == '\n' || ch == '\r') {
                lineCount++;
                if (lineCount % MULTIPLE == 0) {
                    index = lineCount / MULTIPLE;
                    // grow the index array in chunks of 100 entries
                    if (index == mLines.length) {
                        mLines = Arrays.copyOf(mLines, mLines.length + 100);
                    }
                    mLines[index] = charCount;
                }
            }
        }
        // trim the array to the entries actually used
        mLines = Arrays.copyOf(mLines, index + 1);
    } catch (IOException e) {
        Log.e(TAG, "error reading raw resource", e);
    }
    long elapsed = SystemClock.elapsedRealtime() - start;

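I discovered my data file was actually separated by carriage returns rather than line feeds. It must have been created on an Apple computer. Hence the test for '\r' as well as '\n'. Note that this test counts a Windows-style "\r\n" ending as two line breaks, so a file with those endings would need to treat the pair as one.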

Here's a snippet from the code to access the line:

    long start = SystemClock.elapsedRealtime();
    int ch;
    int line = Integer.parseInt(editText.getText().toString().trim());
    // Lines are 1-based; the index covers at most mLines.length * MULTIPLE lines.
    if (line < 1 || line > mLines.length * MULTIPLE) {
        mTextView.setText("invalid line: " + line);
        return;
    }
    line--;
    int index = (line / MULTIPLE);
    // in is assumed to be a freshly opened InputStream on the same resource.
    // skip() may skip fewer bytes than requested; robust code should check its result.
    in.skip(mLines[index]);
    int rem = line % MULTIPLE;  // lines still to read past
    while (rem > 0) {
        ch = in.read();
        if (ch == -1) {
            return;  // readLine will fail
        } else if (ch == '\n' || ch == '\r') {
            rem--;
        }
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String text = reader.readLine();
    long elapsed = SystemClock.elapsedRealtime() - start;

My test program used an EditText so that I could input the line number.
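For the random retrieval the question asks about, the line number could come from a random generator instead of an EditText, and the fetched line could then be split into the definition and its conversion factor. The sketch below is only an illustration: RandomEntryPicker, Entry, randomLine and parse are hypothetical names, and it assumes the Float is the last comma-separated field and that the export does not wrap fields in quotes.

```java
import java.util.Random;

// Hypothetical helper for choosing a line and parsing "definition,factor".
class RandomEntryPicker {

    // One String-Float pair from the data file.
    static final class Entry {
        final String definition;
        final float factor;

        Entry(String definition, float factor) {
            this.definition = definition;
            this.factor = factor;
        }
    }

    // Returns a 1-based line number in [1, lineCount].
    static int randomLine(Random rng, int lineCount) {
        return rng.nextInt(lineCount) + 1;
    }

    // Splits on the LAST comma so that commas inside the definition survive.
    static Entry parse(String line) {
        int comma = line.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("no comma in line: " + line);
        }
        String definition = line.substring(0, comma).trim();
        float factor = Float.parseFloat(line.substring(comma + 1).trim());
        return new Entry(definition, factor);
    }
}
```

With only 2 or 3 requests per session, creating one Random and calling randomLine() per request is enough; the result replaces the parsed EditText value in the access code.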

So to give you some idea of performance, with a MULTIPLE value of 10 the first phase averaged around 1600 ms to read through the entire file, and accessing the last record in the file averaged about 30 ms.

To get down to 30ms access with only a 29312-byte memory footprint is pretty good, I think.
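The 29312-byte figure is consistent with a long[] index at a MULTIPLE of 10: the setup code ends with floor(lineCount / MULTIPLE) + 1 entries of 8 bytes each, and 3664 × 8 = 29312. The small sketch below shows how the footprint scales with MULTIPLE. The exact line count is not given above, so 36,634 is a hypothetical value picked to match "about 36K lines"; IndexFootprint and indexBytes are likewise made-up names, and the array's object header is ignored.

```java
// Hypothetical calculator for the size of the line-offset index.
class IndexFootprint {

    // Bytes used by the long[] payload (array header excluded).
    static long indexBytes(int lineCount, int multiple) {
        long entries = lineCount / multiple + 1L;  // matches index + 1 in the setup code
        return entries * Long.BYTES;
    }

    public static void main(String[] args) {
        int lineCount = 36_634;  // hypothetical; the text only says "about 36K"
        for (int multiple : new int[] {1, 10, 100}) {
            System.out.println("MULTIPLE=" + multiple + " -> "
                    + indexBytes(lineCount, multiple) + " bytes");
        }
    }
}
```

A larger MULTIPLE shrinks the array proportionally but means more lines are read sequentially on each lookup, so the access time would be expected to rise.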

You can see the sample project on GitHub.
