Identifying the technologies behind a website with builtwith on Python 3

Updated: 2024-10-26 16:22:02


Original author:


1. First, install builtwith with pip install builtwith:

    C:\Users\Administrator>pip install builtwith
    Collecting builtwith
      Downloading builtwith-1.3.2.tar.gz
    Installing collected packages: builtwith
      Running setup.py install for builtwith ... done
    Successfully installed builtwith-1.3.2

2. Create a new project in PyCharm and enter the following test code:

    import builtwith
    tech_used = builtwith.parse('')
    print(tech_used)

Running it produces the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 1, in <module>
        import builtwith
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
        except Exception, e:
                        ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

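The cause is that builtwith was written for Python 2.x, so a few places need to be changed. Double-click the offending file in PyCharm's error output to open it, then make the following three kinds of changes:
1. The Python 2 form "except Exception, e" is no longer supported; change it to "except Exception as e".
2. The Python 2 print statement is a function in Python 3, so its arguments must be wrapped in parentheses.
3. builtwith uses the Python 2 urllib2 package, which does not exist in Python 3, so the urllib2-related code must be rewritten.
Items 1 and 2 are easy to fix; the changes that follow deal with item 3.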
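For reference, items 1 and 2 look like this in Python 3. This is a minimal sketch rather than code taken from builtwith; the function name safe_divide is hypothetical and exists only to give the except clause something to catch.

```python
def safe_divide(a, b):
    """Divide a by b, printing the error and returning None on failure."""
    try:
        return a / b
    except Exception as e:   # Python 2 spelling: "except Exception, e:"
        print("Error:", e)   # Python 2 spelling: print "Error:", e
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```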
First, replace import urllib2 with the following:

    import urllib.request
    import urllib.error

Then replace the urllib2 calls as follows:

    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
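Put together, the replacement calls can be wrapped in a small helper. The sketch below is an assumption about how one might structure it, not builtwith's actual code: the names build_request and fetch, the default user agent string, and the timeout value are all hypothetical.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="builtwith"):
    """Build a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, user_agent="builtwith", timeout=10):
    """Return the raw response body as bytes, or None if the request fails."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print("Error:", e)
        return None
```

Note that fetch returns bytes, not str, which matters for the regular-expression matching that builtwith performs on the page.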

Running the project again produces the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 3, in <module>
        builtwith.parse('')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

  
This happens because in Python 3 response.read() returns bytes rather than str, while the regular-expression patterns are str, so the data must be decoded first. Change the following code:

    if html is None:
        html = response.read()

to:

    if html is None:
        html = response.read()
        html = html.decode('utf-8')
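The TypeError can be reproduced without builtwith or any network access. The sketch below uses a made-up HTML fragment and pattern (both are illustrative assumptions) to show that a str pattern fails on bytes and works once the bytes are decoded.

```python
import re

# A str pattern, like the ones builtwith compiles from its rule set.
pattern = re.compile("jquery", flags=re.IGNORECASE)

# Sample bytes standing in for what response.read() returns.
raw = b"<script src='jQuery.min.js'></script>"

try:
    pattern.search(raw)          # str pattern against bytes
except TypeError as e:
    error_message = str(e)
    print("Error:", error_message)

text = raw.decode("utf-8")       # bytes -> str
match = pattern.search(text)     # now the types agree
print(match.group(0))
```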

Running it again gives the final result:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    {'javascript-frameworks': ['jQuery']}

    Process finished with exit code 0

However, if the site is changed to 'www.163', running it fails again:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 2, in <module>
        tech_used = builtwith.parse('')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

This still looks like an encoding problem. Setting the encoding to 'GBK' makes it run successfully:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    {'web-servers': ['Nginx']}

    Process finished with exit code 0
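The same kind of decoding failure can be shown in isolation. In this sketch the sample string is an assumption picked for illustration: it is encoded as GBK, fails to decode as UTF-8, and decodes cleanly as GBK.

```python
# Sample GBK-encoded bytes, standing in for a page served in GBK.
raw = "网易".encode("gbk")

try:
    raw.decode("utf-8")          # wrong codec for these bytes
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print("Error:", e)

text = raw.decode("gbk")         # matching codec
print(text)
```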

So do different sites need different decodings? Here is one way to detect the encoding a site uses.
We need to install a package called chardet:

    C:\Users\Administrator>pip install chardet
    Collecting chardet
      Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
        100% |████████████████████████████████| 184kB 616kB/s
    Installing collected packages: chardet
    Successfully installed chardet-2.3.0

    C:\Users\Administrator>

Passing bytes to chardet's detect function returns a dict with two values: the detected encoding and a confidence score:

    {'encoding': 'utf-8', 'confidence': 0.99}

Modify the corresponding builtwith code as follows:

    encode_type = chardet.detect(html)
    if encode_type['encoding'] == 'utf-8':
        html = html.decode('utf-8')
    else:
        html = html.decode('gbk')

Remember to import chardet!
With chardet detecting the character encoding, builtwith can handle both of the sites above.
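The utf-8-or-gbk check covers the two sites tested here, but it assumes every non-UTF-8 page is GBK. A slightly more general variation, sketched below, tries whatever encoding the detector reports and then falls back to a fixed list. The function name decode_html and its parameters are hypothetical; the detect parameter exists so that the logic can be exercised without chardet installed, and defaults to chardet.detect when chardet is available.

```python
def decode_html(raw, detect=None, fallbacks=("utf-8", "gbk")):
    """Decode page bytes using a detected encoding, then fixed fallbacks."""
    if detect is None:
        try:
            import chardet          # third-party; installed via pip
            detect = chardet.detect
        except ImportError:
            # No detector available: rely on the fallbacks only.
            detect = lambda _raw: {"encoding": None}

    guess = detect(raw).get("encoding")
    candidates = ([guess] if guess else []) + list(fallbacks)

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue                # wrong or unknown codec, try the next

    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")
```

Inside builtwith this would replace the if/else above with a single call such as html = decode_html(html).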


Published: 2024-02-12 04:44:23
Article link: https://www.elefans.com/category/jswz/34/1686078.html