Fetching a URL's HTML and parsing it so the website can be viewed offline

Updated: 2024-10-24 14:19:03
Problem description

Hello all, today I had some work to do with HTML parsing. The task was: using the java.net.URL class, get all the HTML content from www.google/ and write it to a file that can be used to view the website offline. The biggest problem turned out to be fetching HTML element attributes, such as src from an <img></img> tag, href from an <a> tag, and so on. So far I have got to the src attribute by using regular expressions and the BufferedReader/BufferedWriter classes. A code sample:

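import java.io.*;
import java.net.URL;
import java.util.logging.*;
import java.util.regex.*;

public class JavaNetworking {
    public static void main(String[] args) {
        // No leading ".*": with find() it would hide all but the last <img> on a line.
        Pattern p = Pattern.compile("<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
        // try-with-resources closes (and flushes) both streams.
        // The post shows "www.google/"; the scheme and ".com" are assumed here.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new URL("https://www.google.com/").openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("D:/HTMLFile.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
                wr.write(s);
                wr.newLine(); // readLine() drops the line break
                Matcher m = p.matcher(s);
                while (m.find()) {
                    System.out.println(m.group(1));
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(JavaNetworking.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}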

For this particular URL the output is: "/textinputassistant/tia.png"

What I wanted to ask is: can someone give a better example of how to do this? I have read on various forums that regex + Java is a hideous monster, so to speak. I have an algorithm in mind that could make things easier for an experienced programmer, unlike me :) Here it is:

- read all the HTML from the URL
- copy it to a string variable
- search the string for "<img"
- when an "<img ...>" tag is found, copy it to a new string variable
- search for the "src" or "href" attribute
- extract the attribute's value (System.out.println("..") will do just fine for now)

This looks like a straightforward problem, and I think this approach could work just fine, but I still think it's better to ask for an opinion from a community made up of far more experienced professionals :)
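The steps listed in the question can be sketched without any regular expressions, using only String searches. This is a minimal illustration and not a full HTML parser: the class name TagScanner and its method names are made up for this example, it assumes the page has already been read into one string, and it only handles double-quoted attribute values with no spaces around the "=" sign.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original post): scans HTML held in a String.
final class TagScanner {

    /** Case-insensitive indexOf; returns -1 when needle does not occur at or after from. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Extracts the double-quoted value of attr from one tag, or null if absent. */
    static String attributeValue(String tag, String attr) {
        String key = attr + "=\"";
        int at = indexOfIgnoreCase(tag, key, 0);
        while (at >= 0) {
            // Require whitespace before the name so "src" does not match "data-src".
            if (at > 0 && Character.isWhitespace(tag.charAt(at - 1))) {
                int start = at + key.length();
                int end = tag.indexOf('"', start);
                return end < 0 ? null : tag.substring(start, end);
            }
            at = indexOfIgnoreCase(tag, key, at + 1);
        }
        return null;
    }

    /** Collects the attr values from every tag named tagName (e.g. "img" or "a"). */
    static List<String> attributeValues(String html, String tagName, String attr) {
        List<String> values = new ArrayList<>();
        String open = "<" + tagName;
        int pos = indexOfIgnoreCase(html, open, 0);
        while (pos >= 0) {
            int nameEnd = pos + open.length();
            int close = html.indexOf('>', nameEnd);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }
            // Require whitespace after the name so "<a" does not match "<area".
            if (nameEnd < close && Character.isWhitespace(html.charAt(nameEnd))) {
                String tag = html.substring(pos, close); // the "new string variable"
                String value = attributeValue(tag, attr);
                if (value != null) {
                    values.add(value);
                }
            }
            pos = indexOfIgnoreCase(html, open, nameEnd);
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<p><IMG alt=\"x\" src=\"/textinputassistant/tia.png\">"
                + "<a href=\"/about\">About</a><area href=\"/skip\"></p>";
        System.out.println(attributeValues(html, "img", "src"));
        System.out.println(attributeValues(html, "a", "href"));
    }
}
```

With the sample string in main, the two calls print [/textinputassistant/tia.png] and [/about]. Because the scan runs over the whole document rather than line by line, a tag that is split across several lines is still found. A ">" character inside an attribute value would cut the tag short, which is one of the reasons hand-written scanning like this stays fragile.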

Recommended answer

Please read the RegEx Tutorial at vogella. And please do some research here; we have had questions like this in the past.
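The answer only points to a tutorial, so as one possible illustration of the regex route, here is a small sketch that applies java.util.regex to the whole document instead of to single lines. The class name AttributeRegex, the two pattern constants, and the findAll helper are assumptions made for this example; the patterns only cover double-quoted values and are not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post or the linked tutorial).
final class AttributeRegex {

    // "\\b" keeps "<a" from matching "<area"; "\\s" before the attribute name
    // keeps "src" from matching "data-src". Group 1 is the quoted value.
    static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns group 1 of every match of p in html, in document order. */
    static List<String> findAll(Pattern p, String html) {
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<IMG alt=\"x\" src=\"/textinputassistant/tia.png\">\n"
                + "<a class=\"nav\"\n   href=\"/about\">About</a>";
        System.out.println(findAll(IMG_SRC, html));
        System.out.println(findAll(A_HREF, html));
    }
}
```

Since "[^>]" and "\\s" both match line breaks, a tag whose attributes are spread over several lines is still matched when the pattern runs over the full page text, which a line-by-line loop would miss.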

Published: 2023-06-07 21:56:21
Article link: https://www.elefans.com/category/jswz/34/568477.html
Tags: offline, website, URL, HTML
