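Two days ago someone sent me a private message asking me to crawl the missing-children information on this site, http://bbs.baobeihuijia.com/f..., so that missing children can be searched for more effectively based on where they went missing. I felt duty-bound to help with something like this. If the crawl put load on the site's server, I apologize.
As before, I used the third-party scraping package BeautifulSoup, along with Selenium+Chrome and Selenium+PhantomJS, to crawl the information.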
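After analyzing the structure of the site, the work again splits into three steps.
Step 1: Get the links of all pagination pages of the board http://bbs.baobeihuijia.com/f...
Step 2: Get the links of the posts listed on each pagination page
Step 3: Get the information to crawl from each post link: ID number, name, sex, date of birth, height when missing, time of disappearance, place of disappearance, and whether the case was reported to the police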
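The three steps nest naturally into two loops. The following is only a minimal sketch of that shape, not the code used in this post: crawl_board and its three callable parameters are hypothetical names, standing in for the real functions shown later.

```python
from typing import Callable, Iterable, List, Optional


def crawl_board(
    board_url: str,
    list_pages: Callable[[str], Iterable[str]],
    list_posts: Callable[[str], Iterable[str]],
    extract_record: Callable[[str], Optional[object]],
) -> List[object]:
    """Run the three steps in order and collect the extracted records."""
    records = []
    for page_url in list_pages(board_url):        # step 1: pagination pages
        for post_url in list_posts(page_url):     # step 2: posts on each page
            record = extract_record(post_url)     # step 3: fields of one post
            if record is not None:                # skip posts with nothing to extract
                records.append(record)
    return records
```

Passing the steps in as callables keeps the loop itself easy to try out with fake data before pointing it at a real site.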
I started with BeautifulSoup, but the administrator had set up a redirect on the site, so I switched to Selenium. Once again, my apologies to the site administrator.
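The Selenium code later in this post copes with the redirect by reloading the page while its title is still that of the redirect page. A rough, standard-library-only sketch of that idea with a bounded number of attempts is shown here. The names page_title and load_real_page, the fetch callable and the redirect_title parameter are all illustrative assumptions, not part of the original code.

```python
import re
from typing import Callable, Optional

# Matches the contents of the first <title> element in an HTML string.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html: str) -> Optional[str]:
    """Return the page title, or None if the page has no <title>."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def load_real_page(
    fetch: Callable[[str], str],
    url: str,
    redirect_title: str,
    max_attempts: int = 10,
) -> Optional[str]:
    """Fetch url until the title is no longer the redirect page's title.

    Returns the HTML of the real page, or None after max_attempts tries.
    """
    for _ in range(max_attempts):
        html = fetch(url)
        if page_title(html) != redirect_title:
            return html
    return None
```

With Selenium, fetch could be a small function that calls driver.get(url) and returns driver.page_source, and redirect_title would be the title string that the code below compares against.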
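1. Get the links of all pagination pages of the board http://bbs.baobeihuijia.com/f...
Analysis shows that the pagination links sit in the div with class "pg" (the element the code below looks up), as a tags whose href has the form forum-<board>-<page>.html.
BeautifulSoup version: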
```python
# Uses the imports and the headers1..headers3 dictionaries from the full
# listing at the end of the post, plus the random module.
import random


def GetALLPageUrl(siteUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to
    proxy_handler = urllib.request.ProxyHandler({"http": "111.76.129.200:808"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page, picking one of the browser headers at random
    # ("headers1 or headers2 or headers3" would always evaluate to headers1)
    req = request.Request(siteUrl, headers=random.choice([headers1, headers2, headers3]))
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Split http://bbs.baobeihuijia.com/forum-191-1.html into its parts
    # so that the page links can be assembled
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex + 1]       # http://bbs.baobeihuijia.com/
    tempbianhaoqian = siteUrl[siteindex + 1:-6]  # forum-191-

    # Collect the information we want
    bianhao = []  # page numbers
    pageUrl = []  # page links
    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.find_all("a", href=re.compile(r"forum-([0-9]+)-([0-9]+)\.html")):
        lianjie = templist2.attrs["href"]
        index1 = lianjie.rfind("-")  # position of the last "-"
        index2 = lianjie.rfind(".")  # position of the last "."
        bianhao.append(int(lianjie[index1 + 1:index2]))
    bianhaoMax = max(bianhao)  # largest page number

    for i in range(1, bianhaoMax + 1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"  # assemble the page link
        pageUrl.append(temppageUrl)
    return pageUrl  # list of page links
```
Selenium version:
```python
# Get the links of every page of the current board.
# siteUrl is the link of the board's first page.
# Uses the imports from the full listing at the end of the post.
from selenium.common.exceptions import TimeoutException


def GetALLPageUrl(siteUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to; this opener
    # only affects urllib requests, not the browser that Selenium drives
    proxy_handler = urllib.request.ProxyHandler({"http": "123.207.143.51:8080"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Use the third-party selenium package to open a browser
    driver = webdriver.Chrome()  # open Chrome
    # driver = webdriver.PhantomJS()  # headless PhantomJS (older Selenium releases only)
    driver.set_page_load_timeout(10)
    # driver.implicitly_wait(30)
    try:
        try:
            driver.get(siteUrl)  # load the page twice
            driver.get(siteUrl)
        except TimeoutException:  # Selenium's own timeout, not the built-in TimeoutError
            driver.refresh()
        # Parse the source as it stands after the browser has executed the page
        bsObj = BeautifulSoup(driver.page_source, "html.parser")
        # Title text of the site's redirect page
        while bsObj.find("title").get_text() == "页面重载开启":
            print("Still on the redirect page; reloading to reach the real page")
            driver.get(siteUrl)
            bsObj = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # close the browser exactly once, with or without an error

    # Split http://bbs.baobeihuijia.com/forum-191-1.html into its parts
    # so that the page links can be assembled
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex + 1]       # http://bbs.baobeihuijia.com/
    tempbianhaoqian = siteUrl[siteindex + 1:-6]  # forum-191-

    # Collect the information we want
    bianhao = []  # page numbers
    pageUrl = []  # page links
    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.find_all("a", href=re.compile(r"forum-([0-9]+)-([0-9]+)\.html")):
        lianjie = templist2.attrs["href"]
        index1 = lianjie.rfind("-")  # position of the last "-"
        index2 = lianjie.rfind(".")  # position of the last "."
        bianhao.append(int(lianjie[index1 + 1:index2]))
    bianhaoMax = max(bianhao)  # largest page number

    for i in range(1, bianhaoMax + 1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"  # assemble the page link
        print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl  # list of page links
```
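The slicing with rfind and [:-6] above only works when the starting URL ends in a one-digit page number. As an alternative, here is a small sketch that pulls the prefix and the page number out with a single regular expression. build_page_urls and PAGE_RE are hypothetical names, and the sample hrefs in the usage note are made up.

```python
import re
from typing import Iterable, List

# Captures everything up to and including "forum-<board>-" and the page number.
PAGE_RE = re.compile(r"^(?P<prefix>.*forum-\d+-)(?P<page>\d+)\.html$")


def build_page_urls(site_url: str, hrefs: Iterable[str]) -> List[str]:
    """Build the links of pages 1..N, where N is the largest page number
    found among the pagination hrefs (1 if none of them match)."""
    site_match = PAGE_RE.match(site_url)
    if site_match is None:
        raise ValueError("not a forum page URL: " + site_url)
    prefix = site_match.group("prefix")

    page_numbers = []
    for href in hrefs:
        href_match = PAGE_RE.match(href)
        if href_match is not None:  # ignore links that are not pagination links
            page_numbers.append(int(href_match.group("page")))
    last_page = max(page_numbers, default=1)

    return [prefix + str(i) + ".html" for i in range(1, last_page + 1)]
```

For example, calling it with "http://bbs.baobeihuijia.com/forum-191-1.html" and the hrefs "forum-191-2.html" and "forum-191-12.html" yields twelve links, from page 1 to page 12.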
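2. Get the links of the posts listed on each pagination page
Each post's link is in the href of an a tag with class "s xst", inside a tbody whose id has the form normalthread_<number>.
So I wrote the following code:
BeautifulSoup version: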
```python
# Get the links of all posts on the current board page.
# Uses the imports and the headers1..headers3 dictionaries from the full
# listing at the end of the post, plus the random module.
import random


def GetCurrentPageTieziUrl(PageUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to
    proxy_handler = urllib.request.ProxyHandler({"http": "121.22.252.85:8000"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page, picking one of the browser headers at random
    req = request.Request(PageUrl, headers=random.choice([headers1, headers2, headers3]))
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Reduce http://bbs.baobeihuijia.com/forum-191-1.html to
    # http://bbs.baobeihuijia.com/ so that the post links can be assembled
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex + 1]  # http://bbs.baobeihuijia.com/
    TieziUrl = []
    # Collect the information we want
    for templist1 in bsObj.find_all("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.find_all("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # assemble the post link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # list of post links
```
Selenium version:
```python
# Get the links of all posts on the current board page.
# Returns 1 if the page could not be loaded.
# Uses the imports from the full listing at the end of the post.
from selenium.common.exceptions import TimeoutException


def GetCurrentPageTieziUrl(PageUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to; this opener
    # only affects urllib requests, not the browser that Selenium drives
    proxy_handler = urllib.request.ProxyHandler({"http": "110.73.30.157:8123"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Use the third-party selenium package to open a browser
    driver = webdriver.Chrome()  # open Chrome
    # driver = webdriver.PhantomJS()  # headless PhantomJS (older Selenium releases only)
    driver.set_page_load_timeout(10)
    try:
        try:
            driver.get(PageUrl)  # load the page twice
            driver.get(PageUrl)
        except TimeoutException:  # Selenium's own timeout, not the built-in TimeoutError
            driver.refresh()
        # Parse the source as it stands after the browser has executed the page
        bsObj = BeautifulSoup(driver.page_source, "html.parser")
        n = 0
        # Title text of the site's redirect page
        while bsObj.find("title").get_text() == "页面重载开启":
            print("Still on the redirect page; reloading to reach the real page")
            driver.get(PageUrl)
            bsObj = BeautifulSoup(driver.page_source, "html.parser")
            n = n + 1
            if n == 10:  # give up after 10 reloads
                return 1
    except Exception as e:
        print("Could not load", PageUrl, e)
        time.sleep(1)
        return 1
    finally:
        driver.quit()  # close the browser exactly once, on every path

    # Reduce http://bbs.baobeihuijia.com/forum-191-1.html to
    # http://bbs.baobeihuijia.com/ so that the post links can be assembled
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex + 1]  # http://bbs.baobeihuijia.com/
    TieziUrl = []
    # Collect the information we want
    for templist1 in bsObj.find_all("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.find_all("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # assemble the post link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # list of post links
```
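Gluing the site prefix onto each href assumes every href is relative to the site root. A slightly more forgiving sketch, using urljoin from the standard library, is given below. absolute_post_links is a hypothetical helper, and dropping repeated links is an extra I added, not something the code above does.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def absolute_post_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve each href against the page it was found on.

    Relative, root-relative and already absolute hrefs are all handled by
    urljoin; repeated links are dropped while keeping the original order.
    """
    seen = set()
    links = []
    for href in hrefs:
        link = urljoin(page_url, href)
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

The hrefs would come from the same a tags the functions above iterate over, for example [a.attrs["href"] for a in ...].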
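3. Get the information to crawl from each post link (ID number, name, sex, date of birth, height when missing, time of disappearance, place of disappearance, and whether the case was reported to the police) and write it to a CSV file
Looking at each post page shows that the missing-person information is all in font tags inside the ul of the td with class "t_f".
BeautifulSoup version: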
```python
# Get the missing-person information from the current post page.
# tieziUrl is the link of the current post page.
# Uses the imports, the headers1..headers3 dictionaries and the global CSV
# writer from the full listing at the end of the post, plus the random module.
import random

# Field labels as they appear on the page, with the CSV column each one fills
FIELD_COLUMNS = [
    ("宝贝回家编号", 0), ("寻亲编号", 0), ("登记编号", 0),
    ("姓", 1), ("性", 2), ("出生日期", 3), ("失踪时身高", 4),
    ("失踪时间", 5), ("失踪日期", 5), ("失踪地点", 6), ("是否报案", 7),
]


def CurrentPageMissingPopulationInformation(tieziUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to
    proxy_handler = urllib.request.ProxyHandler({"http": "210.136.17.78:8080"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page, picking one of the browser headers at random
    req = request.Request(tieziUrl, headers=random.choice([headers1, headers2, headers3]))
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Find the information we want
    td = bsObj.find("td", {"class": "t_f"})
    templist1 = td.ul if td is not None else None
    if templist1 is None:  # no ul element on this page: leave the function
        return
    mycsv = ["NULL"] * 8  # one slot per CSV column
    for templist2 in templist1.find_all("font", size=re.compile(r"^[0-9]*$")):
        tempText = templist2.get_text()
        for label, column in FIELD_COLUMNS:
            if label in tempText[0:6]:
                print(tempText)
                # Keep what follows the colon; an empty value is stored as NULL
                mycsv[column] = tempText[tempText.find(":") + 1:] or "NULL"
                break  # one label per line, so the value is never re-tested
    try:
        writer.writerow(mycsv)  # write the row to the CSV file
    finally:
        time.sleep(1)  # pause after each post; 1 second for now
```
Selenium version:
```python
# Get the missing-person information from the current post page.
# tieziUrl is the link of the current post page.
# Returns 1 if the page could not be loaded or holds no information.
# Uses the imports and the global csvfile and writer from the full listing
# at the end of the post.
from selenium.common.exceptions import TimeoutException

# Field labels as they appear on the page, with the CSV column each one fills
FIELD_COLUMNS = [
    ("宝贝回家编号", 0), ("寻亲编号", 0), ("登记编号", 0),
    ("姓", 1), ("性", 2), ("出生日期", 3), ("失踪时身高", 4),
    ("失踪时间", 5), ("失踪日期", 5), ("失踪地点", 6), ("是否报案", 7),
]


def CurrentPageMissingPopulationInformation(tieziUrl):
    # Access the site through a proxy IP
    # Proxy IPs can be obtained from http://http.zhimaruanjian.com/
    # The dictionary key is the URL scheme the proxy applies to; this opener
    # only affects urllib requests, not the browser that Selenium drives
    proxy_handler = urllib.request.ProxyHandler({"http": "128.199.169.17:80"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Use the third-party selenium package to open a browser
    driver = webdriver.Chrome()  # open Chrome
    # driver = webdriver.PhantomJS()  # headless PhantomJS (older Selenium releases only)
    driver.set_page_load_timeout(10)
    # driver.implicitly_wait(30)
    try:
        try:
            driver.get(tieziUrl)  # load the page twice
            driver.get(tieziUrl)
        except TimeoutException:  # Selenium's own timeout, not the built-in TimeoutError
            driver.refresh()
        # Parse the source as it stands after the browser has executed the page
        bsObj = BeautifulSoup(driver.page_source, "html.parser")
        # Title text of the site's redirect page
        while bsObj.find("title").get_text() == "页面重载开启":
            print("Still on the redirect page; reloading to reach the real page")
            driver.get(tieziUrl)
            bsObj = BeautifulSoup(driver.page_source, "html.parser")
    except Exception as e:
        print("Could not load", tieziUrl, e)
        time.sleep(0.5)
        return 1
    finally:
        driver.quit()  # close the browser exactly once, on every path

    # Find the information we want
    td = bsObj.find("td", {"class": "t_f"})
    templist1 = td.ul if td is not None else None
    if templist1 is None:  # no ul element on this page: leave the function
        print("The current post page has no ul element")
        return 1
    mycsv = ["NULL"] * 8  # one slot per CSV column
    for templist2 in templist1.find_all("font", size=re.compile(r"^[0-9]*$")):
        tempText = templist2.get_text()
        for label, column in FIELD_COLUMNS:
            if label in tempText[0:6]:
                print(tempText)
                # Keep what follows the colon; an empty value is stored as NULL
                mycsv[column] = tempText[tempText.find(":") + 1:] or "NULL"
                break  # one label per line, so the value is never re-tested
    try:
        writer.writerow(mycsv)  # write the row to the CSV file
        csvfile.flush()         # push this row to disk right away
    finally:
        print("Post information written")
        time.sleep(5)  # pause after each post; 5 seconds for now
```
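The field extraction can also be tried out on its own, without a browser or a CSV file. The sketch below is only an illustration: parse_fields is a hypothetical helper, the labels and sample lines in the usage note are made up and must be replaced by the page's own text, and accepting the full-width colon as well as the ASCII one is my assumption about how the posts may be typed.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Splits on the first ASCII or full-width colon.
COLON_RE = re.compile("[:\uff1a]")


def parse_fields(
    lines: Iterable[str],
    columns: Sequence[Tuple[str, int]],
    width: int = 8,
    empty: str = "NULL",
) -> List[str]:
    """Turn "label:value" lines into one fixed-width row.

    A line is assigned to the first label found in its first six characters;
    lines without a known label are ignored, and missing values become `empty`.
    """
    row = [empty] * width
    for line in lines:
        head = line[:6]
        for label, column in columns:
            if label in head:
                parts = COLON_RE.split(line, maxsplit=1)
                value = parts[1].strip() if len(parts) == 2 else ""
                row[column] = value or empty
                break  # stop at the first matching label
    return row
```

For instance, with columns [("姓", 1), ("性", 2)] and the lines "姓 名:张三" and "性 别:男", slots 1 and 2 of the row are filled and the other six stay "NULL".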
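The full code is attached below. It is for reference only and must not be used commercially. Web crawlers can easily put a heavy load on a site's server, and I accept no legal responsibility for any consequences of anyone using this code. I am posting it so that people can learn about crawlers: please just study the structure of the site and do not use this code to add to its load. I was banned by IP myself for improper use, so take that as a warning. The only purpose of crawling the missing-person information was to analyze the spatial patterns of disappearances, and I apologize for any inconvenience this caused the site.
Full code: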
[python]?view plain?copy 1.#__author__?=?"Administrator"?? 2.#coding=utf-8?? 3.import?io?? 4.import?os?? 5.import?sys?? 6.import?math?? 7.import?urllib?? 8.from?urllib.request?import??urlopen?? 9.from?urllib.request?import?urlretrieve?? 10.from?urllib??import?request?? 11.from?bs4?import?BeautifulSoup?? 12.import?re?? 13.import?time?? 14.import?socket?? 15.import?csv?? 16.from?selenium?import?webdriver?? 17.?? 18.socket.setdefaulttimeout(5000)#設置全局超時函數(shù)?? 19.?? 20.?? 21.?? 22.sys.stdout?=?io.TextIOWrapper(sys.stdout.buffer,encoding="gb18030")?? 23.#sys.stdout?=?io.TextIOWrapper(sys.stdout.buffer,encoding="utf-8")?? 24.#設置不同的headers,偽裝為不同的瀏覽器?? 25.headers1={"User-Agent":"Mozilla/5.0?(Windows?NT?6.1;?WOW64;?rv:23.0)?Gecko/20100101?Firefox/23.0"}?? 26.headers2={"User-Agent":"Mozilla/5.0?(Windows?NT?6.3;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/45.0.2454.101?Safari/537.36"}?? 27.headers3={"User-Agent":"Mozilla/5.0?(Windows?NT?6.1)?AppleWebKit/537.11?(KHTML,?like?Gecko)?Chrome/23.0.1271.64?Safari/537.11"}?? 28.headers4={"User-Agent":"Mozilla/5.0?(Windows?NT?10.0;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/53.0.2785.104?Safari/537.36?Core/1.53.2372.400?QQBrowser/9.5.10548.400"}?? 29.headers5={"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",?? 30."Connection":"keep-alive",?? 31."Host":"bbs.baobeihuijia.com",?? 32."Referer":"http://bbs.baobeihuijia.com/forum-191-1.html",?? 33."Upgrade-Insecure-Requests":"1",?? 34."User-Agent":"Mozilla/5.0?(Windows?NT?6.1;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/51.0.2704.103?Safari/537.36"}?? 35.?? 36.headers6={"Host":?"bbs.baobeihuijia.com",?? 37."User-Agent":?"Mozilla/5.0?(Windows?NT?6.1;?WOW64;?rv:51.0)?Gecko/20100101?Firefox/51.0",?? 38."Accept":?"textml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",?? 39."Connection":?"keep-alive",?? 40."Upgrade-Insecure-Requests":"?1"?? 41.}?? 42.#得到當前頁面失蹤人口信息?? 43.#pageUrl為當前帖子頁面鏈接?? 
44.def?CurrentPageMissingPopulationInformation(tieziUrl):?? 45.????#設置代理IP訪問?? 46.????#代理IP可以上http://http.zhimaruanjian.com/獲取?? 47.????proxy_handler=urllib.request.ProxyHandler({"post":"128.199.169.17:80"})?? 48.????proxy_auth_handler=urllib.request.ProxyBasicAuthHandler()?? 49.????opener?=?urllib.request.build_opener(urllib.request.HTTPHandler,?proxy_handler)?? 50.????urllib.request.install_opener(opener)?? 51.?? 52.????try:?? 53.????????#掉用第三方包selenium打開瀏覽器登陸?? 54.????????#driver=webdriver.Chrome()#打開chrome?? 55.???????driver=webdriver.Chrome()#打開無界面瀏覽器Chrome?? 56.???????#driver=webdriver.PhantomJS()#打開無界面瀏覽器PhantomJS?? 57.???????driver.set_page_load_timeout(10)?? 58.???????#driver.implicitly_wait(30)?? 59.???????try:?? 60.???????????driver.get(tieziUrl)#登陸兩次?? 61.???????????driver.get(tieziUrl)?? 62.???????except?TimeoutError:?? 63.???????????driver.refresh()?? 64.?? 65.???????#print(driver.page_source)?? 66.???????html=driver.page_source#將瀏覽器執(zhí)行后的源代碼賦給html?? 67.????????#獲取網(wǎng)頁信息?? 68.????#抓捕網(wǎng)頁解析過程中的錯誤?? 69.???????try:?? 70.???????????#req=request.Request(tieziUrl,headers=headers5)?? 71.???????????#html=urlopen(req)?? 72.???????????bsObj=BeautifulSoup(html,"html.parser")?? 73.???????????#html.close()?? 74.???????except?UnicodeDecodeError?as?e:?? 75.???????????print("-----UnicodeDecodeError?url",tieziUrl)?? 76.???????except?urllib.error.URLError?as?e:?? 77.???????????print("-----urlError?url:",tieziUrl)?? 78.???????except?socket.timeout?as?e:?? 79.???????????print("-----socket?timout:",tieziUrl)?? 80.?? 81.?? 82.???????while(bsObj.find("title").get_text()?==?"頁面重載開啟"):?? 83.???????????print("當前頁面不是重加載后的頁面,程序會嘗試刷新一次到跳轉后的頁面 ")?? 84.???????????driver.get(tieziUrl)?? 85.???????????html=driver.page_source#將瀏覽器執(zhí)行后的源代碼賦給html?? 86.???????????bsObj=BeautifulSoup(html,"html.parser")?? 87.????except?Exception?as?e:?? 88.????????driver.close()?#?Close?the?current?window.?? 89.????????driver.quit()#關閉chrome瀏覽器?? 90.????????time.sleep(0.5)?? 91.?? 
92.????driver.close()?#?Close?the?current?window.?? 93.????driver.quit()#關閉chrome瀏覽器?? 94.?? 95.?? 96.????#查找想要的信息?? 97.????templist1=bsObj.find("td",{"class":"t_f"}).ul?? 98.????if?templist1==None:#判斷是否不包含ul字段,如果不,跳出函數(shù)?? 99.????????print("當前帖子頁面不包含ul字段")?? 100.????????return?1?? 101.????mycsv=["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"]#初始化提取信息列表?? 102.????for?templist2?in?templist1.findAll("font",size=re.compile("^([0-9]+)*$")):?? 103.????????tempText=templist2.get_text()?? 104.????????#print(tempText[0:4])?? 105.????????if?"寶貝回家編號"?in?tempText[0:6]:?? 106.????????????print(tempText)?? 107.????????????index=tempText.find(":")?? 108.????????????tempText=tempText[index+1:]?? 109.????????????#mycsv.append(tempText)?? 110.????????????if?len(tempText)==0:?? 111.????????????????tempText="NULL"?? 112.????????????mycsv[0]=tempText?? 113.????????if?"尋親編號"?in?tempText[0:6]:?? 114.????????????print(tempText)?? 115.????????????index=tempText.find(":")?? 116.????????????tempText=tempText[index+1:]?? 117.????????????if?len(tempText)==0:?? 118.????????????????tempText="NULL"?? 119.????????????#mycsv.append(tempText)?? 120.????????????mycsv[0]=tempText?? 121.????????if?"登記編號"?in?tempText[0:6]:?? 122.????????????print(tempText)?? 123.????????????index=tempText.find(":")?? 124.????????????tempText=tempText[index+1:]?? 125.????????????if?len(tempText)==0:?? 126.????????????????tempText="NULL"?? 127.????????????#mycsv.append(tempText)?? 128.????????????mycsv[0]=tempText?? 129.????????if?"姓"?in?tempText[0:6]:?? 130.????????????print(tempText)?? 131.????????????index=tempText.find(":")?? 132.????????????tempText=tempText[index+1:]?? 133.????????????#mycsv.append(tempText)?? 134.????????????mycsv[1]=tempText?? 135.????????if"性"?in?tempText[0:6]:?? 136.????????????print(tempText)?? 137.????????????index=tempText.find(":")?? 138.????????????tempText=tempText[index+1:]?? 139.????????????#mycsv.append(tempText)?? 140.????????????mycsv[2]=tempText?? 
141.????????if?"出生日期"?in?tempText[0:6]:?? 142.????????????print(tempText)?? 143.????????????index=tempText.find(":")?? 144.????????????tempText=tempText[index+1:]?? 145.????????????#mycsv.append(tempText)?? 146.????????????mycsv[3]=tempText?? 147.????????if?"失蹤時身高"?in?tempText[0:6]:?? 148.????????????print(tempText)?? 149.????????????index=tempText.find(":")?? 150.????????????tempText=tempText[index+1:]?? 151.????????????#mycsv.append(tempText)?? 152.????????????mycsv[4]=tempText?? 153.????????if?"失蹤時間"?in?tempText[0:6]:?? 154.????????????print(tempText)?? 155.????????????index=tempText.find(":")?? 156.????????????tempText=tempText[index+1:]?? 157.????????????#mycsv.append(tempText)?? 158.????????????mycsv[5]=tempText?? 159.????????if?"失蹤日期"?in?tempText[0:6]:?? 160.????????????print(tempText)?? 161.????????????index=tempText.find(":")?? 162.????????????tempText=tempText[index+1:]?? 163.????????????#mycsv.append(tempText)?? 164.????????????mycsv[5]=tempText?? 165.????????if?"失蹤地點"?in?tempText[0:6]:?? 166.????????????print(tempText)?? 167.????????????index=tempText.find(":")?? 168.????????????tempText=tempText[index+1:]?? 169.????????????#mycsv.append(tempText)?? 170.????????????mycsv[6]=tempText?? 171.????????if?"是否報案"?in?tempText[0:6]:?? 172.????????????print(tempText)?? 173.????????????index=tempText.find(":")?? 174.????????????tempText=tempText[index+1:]?? 175.????????????#mycsv.append(tempText)?? 176.????????????mycsv[7]=tempText?? 177.????try:?? 178.????????writer.writerow((str(mycsv[0]),str(mycsv[1]),str(mycsv[2]),str(mycsv[3]),str(mycsv[4]),str(mycsv[5]),str(mycsv[6]),str(mycsv[7])))#寫入CSV文件?? 179.????????csvfile.flush()#馬上將這條數(shù)據(jù)寫入csv文件中?? 180.????finally:?? 181.????????print("當前帖子信息寫入完成 ")?? 182.????????time.sleep(5)#設置爬完之后的睡眠時間,這里先設置為1秒?? 183.?? 184.?? 185.#得到當前板塊所有的頁面鏈接?? 186.#siteUrl為當前版塊的頁面鏈接?? 187.def?GetALLPageUrl(siteUrl):?? 188.????#設置代理IP訪問?? 189.????#代理IP可以上http://http.zhimaruanjian.com/獲取?? 
190.????proxy_handler=urllib.request.ProxyHandler({"post":"123.207.143.51:8080"})?? 191.????proxy_auth_handler=urllib.request.ProxyBasicAuthHandler()?? 192.????opener?=?urllib.request.build_opener(urllib.request.HTTPHandler,?proxy_handler)?? 193.????urllib.request.install_opener(opener)?? 194.?? 195.????try:?? 196.????????#掉用第三方包selenium打開瀏覽器登陸?? 197.????????#driver=webdriver.Chrome()#打開chrome?? 198.???????driver=webdriver.Chrome()#打開無界面瀏覽器Chrome?? 199.???????#driver=webdriver.PhantomJS()#打開無界面瀏覽器PhantomJS?? 200.???????driver.set_page_load_timeout(10)?? 201.???????#driver.implicitly_wait(30)?? 202.???????try:?? 203.???????????driver.get(siteUrl)#登陸兩次?? 204.???????????driver.get(siteUrl)?? 205.???????except?TimeoutError:?? 206.???????????driver.refresh()?? 207.?? 208.???????#print(driver.page_source)?? 209.???????html=driver.page_source#將瀏覽器執(zhí)行后的源代碼賦給html?? 210.????????#獲取網(wǎng)頁信息?? 211.????#抓捕網(wǎng)頁解析過程中的錯誤?? 212.???????try:?? 213.???????????#req=request.Request(tieziUrl,headers=headers5)?? 214.???????????#html=urlopen(req)?? 215.???????????bsObj=BeautifulSoup(html,"html.parser")?? 216.???????????#print(bsObj.find("title").get_text())?? 217.???????????#html.close()?? 218.???????except?UnicodeDecodeError?as?e:?? 219.???????????print("-----UnicodeDecodeError?url",siteUrl)?? 220.???????except?urllib.error.URLError?as?e:?? 221.???????????print("-----urlError?url:",siteUrl)?? 222.???????except?socket.timeout?as?e:?? 223.???????????print("-----socket?timout:",siteUrl)?? 224.?? 225.?? 226.?? 227.???????while(bsObj.find("title").get_text()?==?"頁面重載開啟"):?? 228.???????????print("當前頁面不是重加載后的頁面,程序會嘗試刷新一次到跳轉后的頁面 ")?? 229.???????????driver.get(siteUrl)?? 230.???????????html=driver.page_source#將瀏覽器執(zhí)行后的源代碼賦給html?? 231.???????????bsObj=BeautifulSoup(html,"html.parser")?? 232.????except?Exception?as?e:?? 233.?? 234.????????driver.close()?#?Close?the?current?window.?? 235.????????driver.quit()#關閉chrome瀏覽器?? 236.????????#time.sleep()?? 237.?? 
238.????driver.close()?#?Close?the?current?window.?? 239.????driver.quit()#關閉chrome瀏覽器?? 240.?? 241.?? 242.????#http://bbs.baobeihuijia.com/forum-191-1.html變成http://bbs.baobeihuijia.com,以便組成頁面鏈接?? 243.????siteindex=siteUrl.rfind("/")?? 244.????tempsiteurl=siteUrl[0:siteindex+1]#http://bbs.baobeihuijia.com/?? 245.????tempbianhaoqian=siteUrl[siteindex+1:-6]#forum-191-?? 246.?? 247.????#爬取想要的信息?? 248.????bianhao=[]#存儲頁面編號?? 249.????pageUrl=[]#存儲頁面鏈接?? 250.?? 251.????templist1=bsObj.find("div",{"class":"pg"})?? 252.????#if?templist1==None:?? 253.????????#return?? 254.????for?templist2?in?templist1.findAll("a",href=re.compile("forum-([0-9]+)-([0-9]+).html")):?? 255.????????if?templist2==None:?? 256.????????????continue?? 257.????????lianjie=templist2.attrs["href"]?? 258.????????#print(lianjie)?? 259.????????index1=lianjie.rfind("-")#查找-在字符串中的位置?? 260.????????index2=lianjie.rfind(".")#查找.在字符串中的位置?? 261.????????tempbianhao=lianjie[index1+1:index2]?? 262.????????bianhao.append(int(tempbianhao))?? 263.????bianhaoMax=max(bianhao)#獲取頁面的最大編號?? 264.?? 265.????for?i?in?range(1,bianhaoMax+1):?? 266.????????temppageUrl=tempsiteurl+tempbianhaoqian+str(i)+".html"#組成頁面鏈接?? 267.????????print(temppageUrl)?? 268.????????pageUrl.append(temppageUrl)?? 269.????return?pageUrl#返回頁面鏈接列表?? 270.?? 271.#得到當前版塊頁面所有帖子的鏈接?? 272.def?GetCurrentPageTieziUrl(PageUrl):?? 273.????#設置代理IP訪問?? 274.????#代理IP可以上http://http.zhimaruanjian.com/獲取?? 275.????proxy_handler=urllib.request.ProxyHandler({"post":"110.73.30.157:8123"})?? 276.????proxy_auth_handler=urllib.request.ProxyBasicAuthHandler()?? 277.????opener?=?urllib.request.build_opener(urllib.request.HTTPHandler,?proxy_handler)?? 278.????urllib.request.install_opener(opener)?? 279.?? 280.????try:?? 281.????????#掉用第三方包selenium打開瀏覽器登陸?? 282.????????#driver=webdriver.Chrome()#打開chrome?? 283.???????driver=webdriver.Chrome()#打開無界面瀏覽器Chrome?? 284.???????#driver=webdriver.PhantomJS()#打開無界面瀏覽器PhantomJS?? 285.???????driver.set_page_load_timeout(10)?? 
        # Selenium signals a page-load timeout with TimeoutException,
        # not with the built-in TimeoutError.
        from selenium.common.exceptions import TimeoutException
        try:
            driver.get(PageUrl)  # load the page twice
            driver.get(PageUrl)
        except TimeoutException:
            driver.refresh()

        # print(driver.page_source)
        html = driver.page_source  # page source after the browser has executed it

        # Catch errors raised while parsing the page.
        try:
            # req = request.Request(tieziUrl, headers=headers5)
            # html = urlopen(req)
            bsObj = BeautifulSoup(html, "html.parser")
            # html.close()
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", PageUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", PageUrl)
        except socket.timeout as e:
            print("-----socket timeout:", PageUrl)

        # Reload while the browser is still on the interim redirect page,
        # giving up after 10 attempts.
        n = 0
        while bsObj.find("title").get_text() == "頁面重載開啟":
            print("Still on the pre-redirect page; reloading to reach the redirected page")
            driver.get(PageUrl)
            html = driver.page_source  # page source after the browser has executed it
            bsObj = BeautifulSoup(html, "html.parser")
            n = n + 1
            if n == 10:
                driver.quit()  # shut down the Chrome session
                return 1

    except Exception as e:
        driver.quit()  # shut down the Chrome session
        time.sleep(1)
        return 1  # report failure to the caller, as the retry limit above does

    driver.quit()  # shut down the Chrome session

    # Reduce http://bbs.baobeihuijia.com/forum-191-1.html to
    # http://bbs.baobeihuijia.com/ so that thread links can be built from it.
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex + 1]  # http://bbs.baobeihuijia.com/
    # print(tempsiteurl)
    TieziUrl = []
    # Collect the information we need.
    for templist1 in bsObj.find_all("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.find_all("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # build the thread link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # list of thread links


# CurrentPageMissingPopulationInformation("http://bbs.baobeihuijia.com/thread-213126-1-1.html")
# GetALLPageUrl("http://bbs.baobeihuijia.com/forum-191-1.html")
# GetCurrentPageTieziUrl("http://bbs.baobeihuijia.com/forum-191-1.html")

if __name__ == "__main__":
    csvfile = open("E:/MissingPeople.csv", "w+", newline="", encoding="gb18030")
    writer = csv.writer(csvfile)
    # Header row: ID, name, sex, date of birth, height when missing,
    # date missing, place missing, whether reported to the police.
    writer.writerow(("寶貝回家編號", "姓名", "性別", "出生日期", "失蹤時身高", "失蹤時間", "失蹤地點", "是否報案"))
    pageurl = GetALLPageUrl("https://bbs.baobeihuijia.com/forum-191-1.html")  # board: searching for missing children
    # pageurl = GetALLPageUrl("http://bbs.baobeihuijia.com/forum-189-1.html")  # board: abducted children returning home
    time.sleep(5)
    print("All page links collected!")
    n = 0
    for templist1 in pageurl:
        # print(templist1)
        tieziurl = GetCurrentPageTieziUrl(templist1)
        time.sleep(5)
        if tieziurl == 1:
            print("Could not load this page of threads!")
            continue
        print("All thread links on page " + str(templist1) + " collected!")
        for templist2 in tieziurl:
            # print(templist2)
            n = n + 1
            print("Collecting record " + str(n) + "!")
            time.sleep(5)
            tempzhi = CurrentPageMissingPopulationInformation(templist2)
            if tempzhi == 1:
                print("Record " + str(n) + " is empty!")
                continue
    print("")
    print("Scraping finished! It is now safe to close the program!")
    csvfile.close()
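The page-link step in GetALLPageUrl (find the highest page number among the pager links, then build every URL from page 1 up to that number) can be looked at separately from the browser code. The following is only a minimal sketch under assumptions: the function name `build_page_urls` and the sample hrefs are hypothetical and not part of the original script, and the hrefs are assumed to have been extracted from the pager already.

```python
import re

# Matches hrefs such as "forum-191-37.html" and captures board id and page number.
PAGE_LINK = re.compile(r"forum-(\d+)-(\d+)\.html")


def build_page_urls(site_url, hrefs):
    """Return the URL of every page of the board, from page 1 up to the
    highest page number that appears in hrefs."""
    # "https://host/forum-191-1.html" -> base "https://host", file "forum-191-1.html"
    base, _, first_page = site_url.rpartition("/")
    match = PAGE_LINK.fullmatch(first_page)
    if match is None:
        raise ValueError(f"unexpected board URL: {site_url}")
    prefix = f"forum-{match.group(1)}-"

    page_numbers = [1]  # the board has at least the page we started from
    for href in hrefs:
        found = PAGE_LINK.search(href)
        if found:
            page_numbers.append(int(found.group(2)))

    return [f"{base}/{prefix}{n}.html" for n in range(1, max(page_numbers) + 1)]


# Hypothetical pager hrefs, for illustration only.
urls = build_page_urls(
    "https://bbs.baobeihuijia.com/forum-191-1.html",
    ["forum-191-2.html", "forum-191-3.html", "forum-191-5.html"],
)
print(len(urls))  # 5
print(urls[-1])   # https://bbs.baobeihuijia.com/forum-191-5.html
```

Starting the list of page numbers with 1 means a board without pager links still yields its first page instead of failing on `max()` of an empty list.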
[Figure: screenshot of the resulting CSV file]
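The CSV output can be reproduced without the crawler. This sketch reuses the header row and the `newline=""` and `gb18030` settings from the script, but the helper names `write_rows` and `read_rows` are hypothetical, the file goes to a temporary directory instead of `E:/`, and the single data row contains placeholder values only.

```python
import csv
import os
import tempfile

# Same header row as in the script.
HEADER = ("寶貝回家編號", "姓名", "性別", "出生日期",
          "失蹤時身高", "失蹤時間", "失蹤地點", "是否報案")


def write_rows(path, rows):
    """Write the header followed by rows to a GB18030-encoded CSV file."""
    # newline="" leaves line endings to the csv module, which avoids
    # blank lines between rows on Windows.
    with open(path, "w", newline="", encoding="gb18030") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(path):
    """Read the file back as a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding="gb18030") as csvfile:
        return list(csv.reader(csvfile))


# Placeholder values only; no real record is reproduced here.
placeholder = tuple(f"<field {i}>" for i in range(1, len(HEADER) + 1))
path = os.path.join(tempfile.mkdtemp(), "MissingPeople.csv")
write_rows(path, [placeholder])
rows = read_rows(path)
print(rows[0] == list(HEADER))  # True
print(len(rows))                # 2
```

Opening the file in a `with` block closes it even if a row fails to write, which the explicit `csvfile.close()` at the end of the script does not guarantee.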
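Copyright of this article belongs to the author; do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.
When reproducing, please cite this article's address: http://specialneedsforspecialkids.com/yun/30691.html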