Can Solr preserve the original HTML formatting of documents in its results?

Updated: 2024-10-27 06:24:59

Question

How do I preserve the original formatting of HTML documents in the results returned by Solr?

I am trying to add search to one of my company's websites. It holds millions of documents that do not share a common format, so formatting each document individually is impractical.

I am using a Solr 4.1 nightly build from the Apache site, which has built-in support for Solr Cell and Tika, i.e. I do not need to configure them separately.

Does Solr Cell or Tika retain this formatting anywhere?

If the formatting is not retained, I will need to fetch each document from its physical file location using Solr's resourcename field and then apply highlighting and other ready-made Solr features myself, but that process is too tedious.

If I have to use HTMLStripCharFilterFactory, as Jayendra suggests in the answer, what can I use as the request handler? Can I still extract metadata tags in that case?

Can someone guide me?

Thank you for all your support.

Recommended answer

Solr Cell with Tika does not preserve the original formatting of a document. You get only the text that Tika extracts from the documents fed to Solr.

Otherwise, you have to feed the HTML document in as a normal Solr field and apply the HTMLStripCharFilterFactory char filter, which lets you keep both copies: the raw HTML and the stripped text.
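As a rough illustration of the setup described above, the sketch below holds a schema.xml fragment in a Python string and parses it to confirm that it is well-formed. The field type name `text_html`, the field name `html_content`, and the choice of tokenizer and lowercase filter are assumptions made for the example and do not come from the original answer; only `HTMLStripCharFilterFactory`, `stored`, and `indexed` are taken from it.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment. The names "text_html" and "html_content"
# are illustrative only; adapt them to your own schema.
SCHEMA_FRAGMENT = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- Strips markup before tokenizing; affects only the indexed terms. -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- stored="true" keeps the raw HTML; indexed="true" makes the stripped text searchable. -->
    <field name="html_content" type="text_html" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def describe_fragment(xml_text):
    """Parse the fragment and return (char filter classes, field attributes)."""
    root = ET.fromstring(xml_text)
    char_filters = [cf.get("class") for cf in root.iter("charFilter")]
    field = root.find("./fields/field")
    return char_filters, dict(field.attrib)


char_filters, field_attrs = describe_fragment(SCHEMA_FRAGMENT)
print(char_filters)
print(field_attrs)
```

The char filter sits inside the analyzer, so it changes only what gets tokenized for the index. The stored value of the field is not passed through the analyzer at all.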

With stored=true, Solr keeps the original document, HTML markup included, in the field. With indexed=true, however, searches run only against the text content and not against the HTML elements.
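The stored-versus-indexed distinction can be mimicked in a few lines of plain Python. This is only an approximation built on the standard library's `html.parser`, not Solr's analyzer, and the helper names (`strip_html`, `make_doc`, `matches`) and the sample document are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def make_doc(doc_id, html):
    # "stored" mimics what a stored=true field returns unchanged;
    # "indexed_text" mimics the markup-free text that is searchable.
    return {"id": doc_id, "stored": html, "indexed_text": strip_html(html).lower()}


def matches(doc, term):
    """Naive whole-token match against the stripped, lowercased text."""
    return term.lower() in doc["indexed_text"].split()


doc = make_doc(
    "doc-1",
    "<div class='note'><h1>Quarterly Report</h1><p>Revenue <b>grew</b>.</p></div>",
)
print(matches(doc, "revenue"))  # a word from the content
print(matches(doc, "div"))      # a tag name
print(doc["stored"])            # the original markup, untouched
```

In this model a query for a content word finds the document, a query for a tag name or attribute value does not, and the value handed back for display is still the original HTML.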

Published: 2023-11-28 14:47:54
Article link: https://www.elefans.com/category/jswz/34/1642847.html