Using a Socket, I can send an HTTP request to a server and read the HTML response. My objective is to get every image, whether it is a PNG, JPEG, GIF, or any other image type.
However, looking at the responses from different websites, I noticed that some images are not referenced by an HTML <img> tag and are instead set in CSS. How can I extract both <img> images and CSS images (e.g. background-image)? Is it a good idea to use regex to get the image URLs from <img> tags?
Please do not refer me to HTTP classes such as Apache HttpClient. My problem is not with the HTTP protocol.
Best answer
As other answers have already mentioned, ideally you would use a tool that understands how to parse, render and recurse into HTTP resources (e.g. .html, .css, .js, .png, .gif, .jpg).
That being said, if you were feeling particularly masochistic (and I suspect you are), you could do this yourself...
It's not a perfect solution, but if I were going to attack this with a blunt instrument, I'd use regular expressions (I won't go into the specifics of regex; it's already widely documented on the interwebs). My process would be:

1. HTTP GET my base page.
2. Strip out all strings that match your definition of a "resource" (using regex).
3. Optionally recurse into those resources, for more strings.

You've already mentioned that you can perform HTTP requests/responses (using Sockets), so I won't cover that here.
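The three steps above could be wired together roughly as follows. This is only a sketch: the class name `ResourceCrawler`, the `fetch` and `extract` parameters, and the `maxDepth` limit are all hypothetical names introduced for illustration. `fetch` stands in for your own Socket code, and `extract` stands in for whatever regex-based extraction you settle on.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

public class ResourceCrawler {

    /**
     * Level-by-level walk over resources, starting at startUrl.
     *
     * fetch   (assumption) returns the body for a URL, or null on failure;
     *         plug in your own Socket-based request here.
     * extract (assumption) returns absolute resource URLs found in a body,
     *         given the URL the body was fetched from.
     * maxDepth 0 scans only the start page, 1 also scans what it links to, etc.
     */
    public static Set<String> crawl(String startUrl,
                                    Function<String, String> fetch,
                                    BiFunction<String, String, Set<String>> extract,
                                    int maxDepth) {
        Set<String> found = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>();
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(startUrl);

        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Set<String> next = new LinkedHashSet<>();
            for (String url : frontier) {
                if (!visited.add(url)) {
                    continue; // already scanned this URL
                }
                String body = fetch.apply(url);
                if (body == null) {
                    continue; // fetch failed
                }
                for (String resource : extract.apply(url, body)) {
                    // Only text resources are worth scanning again.
                    if (found.add(resource) && isText(resource)) {
                        next.add(resource);
                    }
                }
            }
            frontier = next;
        }
        return found;
    }

    /** Crude check by extension: images are collected but never re-scanned. */
    private static boolean isText(String url) {
        String lower = url.toLowerCase();
        return lower.endsWith(".css") || lower.endsWith(".js") || lower.endsWith(".html");
    }
}
```

The `visited` set and the depth limit are there so that pages linking to each other do not make the loop run forever.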
Voila!
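    // Place these members inside a class of your choice.
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Regular expression to match file types - .js/.css/.png/.jpg/.gif
     */
    public static final Pattern resources = Pattern.compile(
            "([^\"'\n({}]+\\.(js|css|png|jpg|gif))",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /**
     * Pulls out "resources" from the provided text.
     */
    public static Set<String> findResources(URL url, String text) {
        Matcher matcher = resources.matcher(text);
        Set<String> found = new HashSet<>(); // renamed so it no longer shadows the Pattern field

        // Everything up to and including the last "/" of the page URL.
        String urlStr = url.toString();
        int endIndex = urlStr.lastIndexOf("/") + 1;
        String parentPath = endIndex > 0 ? urlStr.substring(0, endIndex) : urlStr;

        while (matcher.find()) {
            String resource = matcher.group(1);
            String fqResource;
            if (resource.startsWith("//")) {
                // Protocol-relative: reuse the page's scheme.
                fqResource = url.getProtocol() + ":" + resource;
            } else if (resource.startsWith("http://") || resource.startsWith("https://")) {
                // Already absolute.
                fqResource = resource;
            } else if (resource.startsWith("/")) {
                // Root-relative.
                fqResource = getBaseUrl(url) + resource;
            } else {
                // Relative to the page's directory.
                fqResource = parentPath + resource;
            }
            // Drop any query string.
            if (fqResource.contains("?")) {
                fqResource = fqResource.substring(0, fqResource.indexOf("?"));
            }
            found.add(fqResource);
        }
        return found;
    }

    // Not shown in the original snippet; assumed here to return
    // scheme://host[:port] with no trailing slash.
    private static String getBaseUrl(URL url) {
        return url.getProtocol() + "://" + url.getAuthority();
    }

The regular expression: looks for well-formed strings ending in .css/.js/.png/.gif/.jpg.
The method: retrieves all matching strings from the given text (aka the 'HTTP response'), tries to build a fully qualified URL for each one, and returns them as a Set.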
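If matching on file extensions turns out to be too broad or too narrow, a variation is to target the two places the question asks about: the src attribute of <img> tags and CSS url(...) tokens. The sketch below is one possible way to do that, not the approach from the answer above. The class name `ImageUrlExtractor` and its patterns are illustrative assumptions, and relative paths are resolved with `java.net.URI.resolve` instead of string concatenation.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUrlExtractor {

    // Quoted src attribute inside an <img> tag. Unquoted values are not handled,
    // and attributes such as data-src also match because "-" is a word boundary.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // CSS url(...) token with optional quotes. This also matches fonts and
    // other non-image resources; filter afterwards if that matters.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*[\"']?([^\"')]+?)\\s*[\"']?\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns absolute URLs for <img src> values and CSS url(...) values in text.
     * baseUrl is the URL the text was fetched from; for a stylesheet, pass the
     * stylesheet's own URL, since CSS paths are relative to the stylesheet.
     */
    public static Set<String> extract(String baseUrl, String text) {
        URI base = URI.create(baseUrl);
        Set<String> out = new LinkedHashSet<>();
        collect(IMG_SRC, 2, base, text, out);
        collect(CSS_URL, 1, base, text, out);
        return out;
    }

    private static void collect(Pattern pattern, int group, URI base,
                                String text, Set<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String raw = m.group(group).trim();
            if (raw.isEmpty() || raw.startsWith("data:")) {
                continue; // nothing to download for empty values or inline data URIs
            }
            try {
                out.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI reference; skip it.
            }
        }
    }
}
```

Like any regex over HTML, this will miss some cases (images injected by JavaScript, unquoted attributes) and pick up some false positives (markup inside comments or scripts), so treat it as a starting point.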
I've uploaded a full example here (with sample output). Have fun!