Scraping province, city (area code / postal code), and county data with Java (revised version that fixes the "缺少***邮政或区号" error)

Updated: 2024-10-11 15:24:14


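You have probably seen articles like this before; I read this one, for example (thanks to its author):.

I copied the code from that article, but ran into one small problem: it kept reporting the error "缺少***邮政或区号!" ("missing postal code or area code for ***").

Reading through it, the code itself looks fine. So what is wrong? Pasting the URL into a browser makes it clear.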

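Following the original code, set a breakpoint (I won't go into the details) and look up Beijing, for example. The URL looks like this:

.asp?action=area2zone&area=%E5%8C%97%E4%BA%AC

Open it and you will find that the query returns nothing.

No need to panic. Go to: .asp and look up Beijing by hand. The address then looks like this:

.asp?action=area2zone&area=%B1%B1%BE%A9

Clearly, the only difference is the value of area.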

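My guess: a character-encoding problem. %E5%8C%97%E4%BA%AC is 北京 (Beijing) percent-encoded from its UTF-8 bytes, while %B1%B1%BE%A9 is the same two characters percent-encoded from their GBK bytes.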
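One way to check this guess is to decode both query values, each with its candidate charset, and compare the results. The following is only a small sketch: the class name DecodeCheck and its helper are invented for this illustration, and it assumes the JVM ships the GBK charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeCheck {
    // Hypothetical helper: decode a percent-encoded value with the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String fromCode = "%E5%8C%97%E4%BA%AC"; // value produced by the original code
        String fromSite = "%B1%B1%BE%A9";       // value seen in the browser
        // Both values decode to the same two characters when the matching charset is used.
        System.out.println(decode(fromCode, "UTF-8").equals(decode(fromSite, "GBK"))); // true
    }
}
```

If the comparison prints true, both URLs carry the same city name and differ only in the byte encoding behind the percent escapes.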

The original article has this line: String encode = URLEncoder.encode(var);

Try changing it to: String encode = URLEncoder.encode(var, "GBK");

And with that, the problem was solved!
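A likely explanation is that the one-argument URLEncoder.encode(String) is deprecated and falls back to the platform's default encoding, so on a UTF-8 machine it produces the UTF-8 form that the site does not understand. The sketch below shows the two-argument form with both charsets; the class name EncodeArea and its helper are made up for this example, and the city name is written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeArea {
    // Hypothetical helper: percent-encode an area name with an explicit charset.
    static String encodeArea(String area, String charset) {
        try {
            return URLEncoder.encode(area, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String beijing = "\u5317\u4eac"; // the two characters of "Beijing" in Chinese
        System.out.println(encodeArea(beijing, "UTF-8")); // %E5%8C%97%E4%BA%AC
        System.out.println(encodeArea(beijing, "GBK"));   // %B1%B1%BE%A9
    }
}
```

Naming the charset explicitly keeps the request independent of whatever default the machine running the crawler happens to have.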

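After a few more small changes, the result is the code below. (The output is slightly off: Hainan province ends up with no cities, and Haikou, Sanya, and so on land in the county-level file. That is an error in the weather-forecast site's data, so the crawler simply reproduces it.)

Note: this is someone else's work. I am only describing the problem I ran into; thanks to the original author!

The modified code follows (paste it directly if you want to save time):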

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map.Entry;
import java.util.TreeMap;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;
import org.junit.Test;

public class GetAreaXmlTest {

    /**
     * @param var city name
     * @return String array: index 0 is the postal code, index 1 is the area code
     */
    @SuppressWarnings("deprecation")
    private String[] getZipCode(String var) {
        String[] code = new String[2];
        // Markers that surround the values in the page's HTML
        String zipCode_S = "邮编:";
        String zipCode_E = "&nbsp;";
        String qhCode_S = "区号:";
        String qhCode_E = "</td>";
        try {
            // The site expects GBK-encoded query values
            String encode = URLEncoder.encode(var, "GBK");
            URL url = new URL(".asp?area=" + encode + "&action=area2zone");
            // .asp?action=area2zone&area=%B1%B1%BE%A9
            // %E5%8C%97%E4%BA%AC
            try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), "GBK"))) {
                for (String line; (line = br.readLine()) != null;) {
                    int zipNum = line.indexOf(zipCode_S);
                    if (zipNum > 1) {
                        String str = line.substring(zipNum + zipCode_S.length());
                        str = str.substring(0, str.indexOf(zipCode_E));
                        code[0] = str;
                    }
                    int qhNum = line.indexOf(qhCode_S);
                    if (qhNum > 1) {
                        String str = line.substring(qhNum + qhCode_S.length());
                        str = str.substring(0, str.indexOf(qhCode_E));
                        code[1] = str;
                        break;
                    }
                }
            }
        } catch (Exception e) {
            System.out.println(var + "\t错误" + e.toString());
        }
        return code;
    }

    /**
     * Main routine
     */
    @Test
    public void main() {
        // 1: fetch all provinces
        TreeMap<String, String> provincesBuffer = getAddressInfo("//data/city3jdata/china.html");
        Element prcEle = DocumentHelper.createElement("Provinces");
        // 2: fetch the cities of each province
        Element citysEle = DocumentHelper.createElement("Citys");
        // 3: fetch the districts and counties of each city
        Element distEle = DocumentHelper.createElement("Districts");
        int p = 1;
        int c = 1;
        int d = 1;
        for (Entry<String, String> prc : provincesBuffer.entrySet()) {
            Element province = DocumentHelper.createElement("Province");
            province.addAttribute("ID", "" + (p))
                    .addAttribute("ProvinceName", prc.getValue())
                    .addText(prc.getValue());
            // Fetch the cities, then look up each city's postal code and area code
            TreeMap<String, String> cityBuffer = getAddressInfo("/" + prc.getKey() + ".html");
            for (Entry<String, String> citys : cityBuffer.entrySet()) {
                Element city = DocumentHelper.createElement("City");
                String[] zipCode = getZipCode(citys.getValue());
                if (zipCode[0] == null || zipCode[1] == null)
                    System.out.println("缺少" + citys.getValue() + "邮政或区号!");
                city.addAttribute("ID", "" + c)
                        .addAttribute("CityName", citys.getValue())
                        .addAttribute("PID", p + "")
                        .addAttribute("ZipCode", zipCode[0])
                        .addAttribute("AreaCode", zipCode[1])
                        .addText(citys.getValue());
                TreeMap<String, String> distsBuffer = getAddressInfo(
                        "/" + prc.getKey() + "" + citys.getKey() + ".html");
                for (Entry<String, String> dists : distsBuffer.entrySet()) {
                    String value = dists.getValue();
                    // Skip the entry that merely repeats the city name
                    if (value.equals(citys.getValue()))
                        continue;
                    Element district = DocumentHelper.createElement("District");
                    district.addAttribute("ID", "" + (d++))
                            .addAttribute("DistrictName", dists.getValue())
                            .addAttribute("CID", c + "")
                            .addText(dists.getValue());
                    distEle.add(district);
                }
                citysEle.add(city);
                c++;
            }
            prcEle.add(province);
            p++;
        }
        // 4: save to local files
        saveInf("e:\\Provinces.xml", prcEle);
        saveInf("e:\\Citys.xml", citysEle);
        saveInf("e:\\Districts.xml", distEle);
    }

    /**
     * Save XML
     *
     * @param savePath path where the XML file is saved
     * @param varEle   root element
     */
    private void saveInf(String savePath, Element varEle) {
        Document varDoc = DocumentHelper.createDocument();
        varDoc.add(varEle);
        try {
            XMLWriter xmlwri = new XMLWriter(new FileOutputStream(new File(savePath)),
                    new OutputFormat("\t", true, "UTF-8"));
            xmlwri.write(varDoc);
            xmlwri.close();
        } catch (Exception e) {
            System.out.println(savePath + "失败,原因如下");
            throw new RuntimeException(e);
        }
    }

    /**
     * Fetch information
     *
     * @param address URL path
     * @return key: entry code, value: entry name
     */
    private TreeMap<String, String> getAddressInfo(String address) {
        TreeMap<String, String> china = new TreeMap<String, String>();
        BufferedReader br = null;
        String buffer = null;
        try {
            URL url = new URL(address);
            br = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
            // The response is a single line of flat JSON
            buffer = br.readLine();
        } catch (Exception e) {
            System.out.println("错误:" + e.getMessage());
        } finally {
            if (br != null)
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
        }
        if (buffer == null)
            return china;
        // Strip braces and quotes, then split into code:name pairs
        buffer = buffer.replaceAll("\\{|\\}|\"", "");
        String[] splits = buffer.split(",");
        for (String sp : splits) {
            String[] split = sp.split(":");
            if (split != null && split.length == 2)
                china.put(split[0], split[1]);
            else
                System.out.println(address);
        }
        return china;
    }
}
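The marker-based extraction in getZipCode can also be written as a small standalone helper, sketched below. This is not part of the original code: the class MarkerExtract, the method between, and the sample line are all invented for illustration. Unlike the listing above, it returns null when the end marker is missing instead of relying on the surrounding catch block, and it accepts a start marker at any position in the line.

```java
public class MarkerExtract {
    // Hypothetical helper: return the text between the first `start` marker and
    // the next `end` marker after it, or null if either marker is missing.
    static String between(String line, String start, String end) {
        int from = line.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = line.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return line.substring(from, to);
    }

    public static void main(String[] args) {
        // "\u90ae\u7f16:" and "\u533a\u53f7:" spell the two markers used in getZipCode.
        // The sample line is made up and only mimics the general shape of the page.
        String line = "<td>\u90ae\u7f16:100000&nbsp;\u533a\u53f7:010</td>";
        System.out.println(between(line, "\u90ae\u7f16:", "&nbsp;")); // 100000
        System.out.println(between(line, "\u533a\u53f7:", "</td>"));  // 010
    }
}
```

Returning null for a missing marker fits the existing check in main(), which already reports a city when either array entry is null.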



Published: 2024-03-13 22:28:52