collect_links module
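Purpose: collect links by breadth-first traversal (a friend-link collection tool)
Concepts involved: breadth-first tree traversal (queue), proxy pattern, multithreading, producer-consumer, cross-thread data sharing through a blocking queue
Help: python collect_links.py -h
Usage example: python collect_links.py -s 'https://hexo.yuanjh.cn' -suf '/links'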
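The producer-consumer arrangement with a blocking queue listed among the concepts could look roughly like the sketch below. It is not taken from collect_links.py: the names `producer`, `consumer`, `run`, `NUM_WORKERS` and `STOP` are hypothetical, and the "processing" step is a placeholder that only appends a suffix.

```python
import queue
import threading

NUM_WORKERS = 3   # hypothetical number of consumer threads
STOP = None       # sentinel telling a consumer to exit


def producer(tasks, urls):
    """Put every URL on the shared queue, then one sentinel per consumer."""
    for url in urls:
        tasks.put(url)            # blocks while the bounded queue is full
    for _ in range(NUM_WORKERS):
        tasks.put(STOP)


def consumer(tasks, results):
    """Take URLs until the sentinel arrives; get() blocks while the queue is empty."""
    while True:
        url = tasks.get()
        if url is STOP:
            break
        results.put(url + "/links")   # placeholder for real page processing


def run(urls):
    tasks = queue.Queue(maxsize=2)    # bounded queue shared across threads
    results = queue.Queue()
    workers = [threading.Thread(target=consumer, args=(tasks, results))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    producer(tasks, urls)             # the producer runs in the calling thread here
    for w in workers:
        w.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)                # arrival order depends on thread scheduling


processed = run(["https://a.example", "https://b.example", "https://c.example"])
print(processed)
```

`queue.Queue` does its own locking, so the threads need no explicit lock to share it. Sending one sentinel per consumer is one common way to shut the workers down cleanly.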
Execution steps
1. https://hexo.yuanjh.cn => (1, https://hexo.yuanjh.cn/links)
2. (1, https://hexo.yuanjh.cn/links) => [(2, http://xxx.yy.com/links), (2, https://zz.ff.cn/links)]
3. [(2, http://xxx.yy.com/links), (2, https://zz.ff.cn/links)] => [(3, http://xxx.yy.zz/links), (3, https://zz.ff.zz/links)]
=> repeat this step
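The steps above can be sketched as a queue-driven breadth-first loop over (depth, URL) pairs. This is a minimal illustration under stated assumptions, not the module's actual code: `collect`, `site_root` and `fetch_links` are hypothetical names, `fetch_links` stands in for whatever downloads and parses a links page, and the sketch assumes each discovered link is reduced to its site root before the suffix is appended and that `max_count` limits the number of pages collected.

```python
from collections import deque
from urllib.parse import urlsplit


def site_root(url):
    """Reduce a URL to scheme://host, dropping path, query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def collect(seed_url, suffix, fetch_links, max_count=100, max_depth=10):
    """Breadth-first collection of (depth, links_page_url) pairs.

    fetch_links is a hypothetical callable: given a links-page URL it
    returns the URLs of the sites listed on that page.
    """
    first = site_root(seed_url) + suffix
    pending = deque([(1, first)])    # FIFO queue gives breadth-first order
    seen = {first}
    collected = []
    while pending and len(collected) < max_count:
        depth, page = pending.popleft()
        collected.append((depth, page))
        if depth >= max_depth:
            continue                 # do not expand past the depth limit
        for link in fetch_links(page):
            nxt = site_root(link) + suffix
            if nxt not in seen:      # visit each links page at most once
                seen.add(nxt)
                pending.append((depth + 1, nxt))
    return collected


# Stand-in for real HTTP fetching: an in-memory map of page -> listed sites.
FAKE_WEB = {
    "https://a.example/links": ["https://b.example", "https://c.example/about"],
    "https://b.example/links": ["https://a.example", "https://d.example"],
    "https://c.example/links": ["https://d.example"],
}

result = collect("https://a.example", "/links", lambda u: FAKE_WEB.get(u, []))
for depth, page in result:
    print(depth, page)
```

Passing the fetcher in as a callable keeps the traversal testable without network access. A real run would supply a function that performs the HTTP request and extracts the links.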
class collect_links.CollectLinks(seed_url: object, suffix: object, max_count: object = 100, max_depth: object = 10)
Bases: object
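Recursively collects links.
Variables: - seed_url (str) -- seed URL to start from
- suffix (str) -- path suffix appended to each site (for example '/links')
- max_count (int) -- maximum number of links to collect
- max_depth (int) -- maximum depth to which links are followed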
headers = {'Accept': 'application/json, text/javascript, */*; q=0.01', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,tr;q=0.6,fr;q=0.5,zh-TW;q=0.4', 'Connection': 'keep-alive', 'Cookie': 'BIDUPSID=8C26E1690527F4CB4ED508565EBE810E; PSTM=1586487982; BAIDUID=8C26E1690527F4CBE9EBFA9A228B6F9B:FG=1; BD_HOME=1; H_PS_PSSID=30971_1422_21088_30839_31186_31217_30823_31163; BD_UPN=123353', 'Referer': 'https://www.baidu.com/', 'Sec-Fetch-Dest': 'empty', 'Sec-Fetch-Mode': 'cors', 'Sec-Fetch-Site': 'same-origin', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
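As an illustration of how a header dictionary like this one is typically attached to an outgoing request, here is a minimal sketch using the standard library. Which HTTP client collect_links.py actually uses is not stated here, so `urllib.request` is an assumption, `build_request` is a hypothetical helper, and `HEADERS` is a shortened subset of the documented value.

```python
from urllib.request import Request

# Shortened, illustrative subset of the documented headers attribute.
HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://www.baidu.com/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
}


def build_request(url, headers=HEADERS):
    """Create a Request carrying the given headers; nothing is sent yet."""
    return Request(url, headers=headers)


req = build_request("https://hexo.yuanjh.cn/links")
# urllib stores header names capitalized as "User-agent", "Referer", ...
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send it. The sketch leaves out `Accept-Encoding` because `urllib` does not decompress gzip or br responses automatically.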