

A small Python crawler project: finding missing persons — scraping missing-children information into a CSV file, ready to load into a database

susheng / 2,049 reads

Abstract: A couple of days ago someone messaged me asking to scrape the missing-children information posted on this site, planning to use the locations where the children went missing to search for them more effectively. A cause like this deserves help without hesitation; if the crawl puts any load on the site's servers, I apologize. As before, the crawl uses the third-party package BeautifulSoup, plus Selenium + Chrome and Selenium + PhantomJS.

A couple of days ago someone messaged me asking to scrape the missing-children information posted at http://bbs.baobeihuijia.com/f..., planning to use the locations where the children went missing to search for them more effectively. A cause like this deserves help without hesitation; if the crawl puts any load on the site's servers, I apologize.

As before, I use the third-party package BeautifulSoup, plus Selenium + Chrome and Selenium + PhantomJS, to scrape the information.
After analysing the site's structure, the job again breaks into three steps.

Step 1: collect the links of all pagination pages in the http://bbs.baobeihuijia.com/f... board
Step 2: collect the links of all threads posted on each pagination page
Step 3: from each thread, extract the fields we want: registration number, name, sex, date of birth, height when missing, time missing, place missing, and whether a police report was filed
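Wired together, the three steps form a simple pipeline: step 1 feeds page links to step 2, which feeds thread links to step 3. A minimal sketch of that control flow, with stub fetchers standing in for the real functions defined later in the post (the names and return shapes here are my assumption, not the article's API):

```python
def run_pipeline(get_all_page_urls, get_thread_urls, extract_info, board_url):
    """Run step 1 -> step 2 -> step 3 over a whole board."""
    rows = []
    for page_url in get_all_page_urls(board_url):      # step 1: pagination pages
        for thread_url in get_thread_urls(page_url):   # step 2: threads per page
            rows.append(extract_info(thread_url))      # step 3: fields per thread
    return rows

# Stub fetchers so the control flow can be exercised without touching the network
pages = lambda board: [board + "-1.html", board + "-2.html"]
threads = lambda page: [page + "#t1", page + "#t2"]
info = lambda thread: {"url": thread}

result = run_pipeline(pages, threads, info, "forum-191")
```

Each stage only hands URLs to the next, so the three real functions can be developed and debugged independently.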

I started with plain BeautifulSoup, but the administrator enabled a page-reload defence against crawlers, so I switched to the Selenium approach. Once again, my apologies to the site administrator.

1. Collect the links of all pagination pages in the http://bbs.baobeihuijia.com/f... board

Analysis shows the pagination links live under the `<div class="pg">` pager element, so I wrote the following code.

BeautifulSoup version:


```python
def GetALLPageUrl(siteUrl):
    # Route requests through a proxy IP
    # (proxy IPs can be obtained from http://http.zhimaruanjian.com/)
    proxy_handler = urllib.request.ProxyHandler({"https": "111.76.129.200:808"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch and parse the page (the original "headers1 or headers2 or headers3"
    # always evaluates to headers1, so a single headers dict is used here)
    req = request.Request(siteUrl, headers=headers1)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Split http://bbs.baobeihuijia.com/forum-191-1.html into the pieces
    # needed to rebuild every page link
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex + 1]        # http://bbs.baobeihuijia.com/
    tempbianhaoqian = siteUrl[siteindex + 1:-6]   # forum-191-

    bianhao = []   # page numbers
    pageUrl = []   # page links
    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.findAll("a", href=re.compile(r"forum-([0-9]+)-([0-9]+)\.html")):
        lianjie = templist2.attrs["href"]
        index1 = lianjie.rfind("-")   # position of the last '-'
        index2 = lianjie.rfind(".")   # position of the last '.'
        bianhao.append(int(lianjie[index1 + 1:index2]))
    bianhaoMax = max(bianhao)         # highest page number shown in the pager

    for i in range(1, bianhaoMax + 1):
        pageUrl.append(tempsiteurl + tempbianhaoqian + str(i) + ".html")  # rebuild each page link
    return pageUrl  # list of all page links
```
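The slicing trick above can be checked on its own: `rfind` locates the last `/`, and dropping the trailing `1.html` (six characters) leaves the `forum-191-` stem. A standalone sketch:

```python
def split_board_url(siteUrl):
    """Split a board URL into the site prefix and the 'forum-<id>-' stem."""
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex + 1]       # everything up to and including the last '/'
    tempbianhaoqian = siteUrl[siteindex + 1:-6]  # drop the trailing '1.html' (6 characters)
    return tempsiteurl, tempbianhaoqian

prefix, stem = split_board_url("http://bbs.baobeihuijia.com/forum-191-1.html")
```

Note that the fixed `-6` only works while the trailing page number has one digit, which is why the function should always be called with the board's first page (`forum-191-1.html`), never a later one such as `forum-191-12.html`.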
Selenium version:

```python
# Collect all page links of the current board
# siteUrl is the URL of the board's first page
def GetALLPageUrl(siteUrl):
    # Route requests through a proxy IP (the original code used the invalid
    # scheme key "post"; ProxyHandler expects "http" or "https")
    proxy_handler = urllib.request.ProxyHandler({"http": "123.207.143.51:8080"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Open a browser via the third-party package selenium
    driver = webdriver.Chrome()        # Chrome
    # driver = webdriver.PhantomJS()   # or the headless browser PhantomJS
    try:
        driver.set_page_load_timeout(10)
        try:
            driver.get(siteUrl)  # load twice: the first load only reaches the redirect page
            driver.get(siteUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source  # page source after the browser has executed it
        # Catch errors raised while parsing the page
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError:
            print("-----UnicodeDecodeError url", siteUrl)
        except urllib.error.URLError:
            print("-----urlError url:", siteUrl)
        except socket.timeout:
            print("-----socket timeout:", siteUrl)

        # "頁面重載開啟" is the title of the site's anti-crawler reload page
        while bsObj.find("title").get_text() == "頁面重載開啟":
            print("Still on the reload page; refreshing once more to reach the real page")
            driver.get(siteUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    finally:
        driver.close()  # close the current window
        driver.quit()   # shut down the browser

    # Split http://bbs.baobeihuijia.com/forum-191-1.html into the pieces
    # needed to rebuild every page link
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex + 1]        # http://bbs.baobeihuijia.com/
    tempbianhaoqian = siteUrl[siteindex + 1:-6]   # forum-191-

    bianhao = []   # page numbers
    pageUrl = []   # page links
    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.findAll("a", href=re.compile(r"forum-([0-9]+)-([0-9]+)\.html")):
        lianjie = templist2.attrs["href"]
        index1 = lianjie.rfind("-")   # position of the last '-'
        index2 = lianjie.rfind(".")   # position of the last '.'
        bianhao.append(int(lianjie[index1 + 1:index2]))
    bianhaoMax = max(bianhao)         # highest page number shown in the pager

    for i in range(1, bianhaoMax + 1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"  # rebuild each page link
        print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl  # list of all page links
```
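Extracting the highest page number from the pager's hrefs is pure string work, so it can be verified offline against sample hrefs in the site's `forum-<board>-<page>.html` format (the hrefs below are made-up examples):

```python
import re

def max_page_number(hrefs):
    """Pull the trailing page number out of each 'forum-<board>-<page>.html' href."""
    pattern = re.compile(r"forum-([0-9]+)-([0-9]+)\.html")
    bianhao = []
    for lianjie in hrefs:
        if not pattern.search(lianjie):
            continue
        index1 = lianjie.rfind("-")   # the last '-' precedes the page number
        index2 = lianjie.rfind(".")   # the last '.' follows it
        bianhao.append(int(lianjie[index1 + 1:index2]))
    return max(bianhao)

n = max_page_number(["forum-191-1.html", "forum-191-57.html", "forum-191-8.html"])
```

Taking the maximum rather than the last link matters because Discuz pagers often list only a window of pages plus the final one.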

2. Collect the links of all threads posted on each pagination page

Each thread link sits in the `href` of the `<a class="s xst">` elements inside the `<tbody id="normalthread_...">` rows, so I wrote the following code.

BeautifulSoup version:


```python
# Collect the links of all threads on the current board page
def GetCurrentPageTieziUrl(PageUrl):
    # Route requests through a proxy IP ("http", not the invalid key "post")
    # (proxy IPs can be obtained from http://http.zhimaruanjian.com/)
    proxy_handler = urllib.request.ProxyHandler({"http": "121.22.252.85:8000"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch and parse the page
    req = request.Request(PageUrl, headers=headers1)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # http://bbs.baobeihuijia.com/forum-191-1.html -> http://bbs.baobeihuijia.com/
    # so that thread links can be rebuilt
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex + 1]  # http://bbs.baobeihuijia.com/
    TieziUrl = []
    # Extract the thread links
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # build the thread link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # list of thread links
```
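Thread hrefs on Discuz boards are usually relative (e.g. `thread-12345-1-1.html`), which is why the code prepends `tempsiteurl`. The stdlib `urljoin` does the same job and also copes gracefully when a href is already absolute; a sketch with a made-up thread id:

```python
from urllib.parse import urljoin

base = "http://bbs.baobeihuijia.com/"
relative = "thread-12345-1-1.html"  # hypothetical relative href from an <a class="s xst">
absolute = "http://bbs.baobeihuijia.com/thread-12345-1-1.html"

joined = urljoin(base, relative)
joined_abs = urljoin(base, absolute)  # urljoin leaves already-absolute hrefs untouched
```

Plain concatenation works here only because `tempsiteurl` ends with `/` and the hrefs never start with one; `urljoin` removes that fragile coupling.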
Selenium version:

```python
# Collect the links of all threads on the current board page
def GetCurrentPageTieziUrl(PageUrl):
    # Route requests through a proxy IP ("http", not the invalid key "post")
    proxy_handler = urllib.request.ProxyHandler({"http": "110.73.30.157:8123"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Open a browser via the third-party package selenium
    driver = webdriver.Chrome()        # Chrome
    # driver = webdriver.PhantomJS()   # or the headless browser PhantomJS
    try:
        driver.set_page_load_timeout(10)
        try:
            driver.get(PageUrl)  # load twice: the first load only reaches the redirect page
            driver.get(PageUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source  # page source after the browser has executed it
        # Catch errors raised while parsing the page
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError:
            print("-----UnicodeDecodeError url", PageUrl)
        except urllib.error.URLError:
            print("-----urlError url:", PageUrl)
        except socket.timeout:
            print("-----socket timeout:", PageUrl)

        n = 0
        while bsObj.find("title").get_text() == "頁面重載開啟":  # anti-crawler reload page
            print("Still on the reload page; refreshing once more to reach the real page")
            driver.get(PageUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
            n = n + 1
            if n == 10:   # give up after ten refreshes
                return 1
    finally:
        driver.close()  # close the current window
        driver.quit()   # shut down the browser

    # http://bbs.baobeihuijia.com/forum-191-1.html -> http://bbs.baobeihuijia.com/
    # so that thread links can be rebuilt
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex + 1]  # http://bbs.baobeihuijia.com/
    TieziUrl = []
    # Extract the thread links
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # build the thread link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # list of thread links
```
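The `n == 10` guard above is a bounded-retry pattern worth isolating: keep reloading while the page still shows the anti-crawler title, but give up after a fixed budget of attempts instead of looping forever. A minimal sketch with a fake loader in place of the browser:

```python
def fetch_past_reload(load_page, is_reload_page, max_attempts=10):
    """Reload until the page is real or the attempt budget is spent.

    Returns the page on success, or None when every attempt hit the reload screen.
    """
    for _ in range(max_attempts):
        page = load_page()
        if not is_reload_page(page):
            return page
    return None

# Fake loader: the first two loads return the reload screen, the third the real page
responses = iter(["reload", "reload", "real-page"])
result = fetch_past_reload(lambda: next(responses), lambda p: p == "reload")
```

Capping the retries is what keeps one stubborn page from stalling the whole crawl.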

3. From each thread, extract the fields we want — registration number, name, sex, date of birth, height when missing, time missing, place missing, and whether a police report was filed — and write them to the CSV file

Inspecting individual threads shows that the missing-person details sit in the `<ul>` inside the `<td class="t_f">` post body, so I wrote the following code.

BeautifulSoup version:


```python
# Extract the missing-person information from the current thread page
# tieziUrl is the thread's URL
def CurrentPageMissingPopulationInformation(tieziUrl):
    # Route requests through a proxy IP ("http", not the invalid key "post")
    # (proxy IPs can be obtained from http://http.zhimaruanjian.com/)
    proxy_handler = urllib.request.ProxyHandler({"http": "210.136.17.78:8080"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch and parse the page
    req = request.Request(tieziUrl, headers=headers1)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Locate the information block
    templist1 = bsObj.find("td", {"class": "t_f"}).ul
    if templist1 is None:  # the post body has no <ul> block: nothing to extract
        return
    mycsv = ["NULL"] * 8  # one slot per CSV column
    # Map the label that opens a line to the CSV column it fills; the three ID
    # labels all fill column 0, and 失蹤時間/失蹤日期 both fill column 5
    field_column = {"寶貝回家編號": 0, "尋親編號": 0, "登記編號": 0,
                    "姓": 1, "性": 2, "出生日期": 3, "失蹤時身高": 4,
                    "失蹤時間": 5, "失蹤日期": 5, "失蹤地點": 6, "是否報案": 7}
    for templist2 in templist1.findAll("font", size=re.compile("^([0-9]+)*$")):
        if len(templist2) == 0:
            continue
        tempText = templist2.get_text()
        for label, column in field_column.items():
            if label in tempText[0:6]:  # labels appear within the first six characters
                print(tempText)
                value = tempText[tempText.find("：") + 1:]  # keep the text after the full-width colon
                mycsv[column] = value if value else "NULL"
    try:
        writer.writerow(tuple(str(field) for field in mycsv))  # write the record to the CSV file
    finally:
        time.sleep(1)  # sleep after each thread; one second for now
```
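The label matching and colon splitting are pure string operations and can be tested offline. A sketch using the same column map (the sample post lines are made up; the full-width colon `：` matches the forum's formatting):

```python
FIELD_COLUMN = {"寶貝回家編號": 0, "尋親編號": 0, "登記編號": 0, "姓": 1, "性": 2,
                "出生日期": 3, "失蹤時身高": 4, "失蹤時間": 5, "失蹤日期": 5,
                "失蹤地點": 6, "是否報案": 7}

def extract_fields(lines):
    """Fill the 8-column record from lines shaped like '<label>：<value>'."""
    record = ["NULL"] * 8
    for text in lines:
        for label, column in FIELD_COLUMN.items():
            if label in text[0:6]:                  # labels appear within the first 6 characters
                value = text[text.find("：") + 1:]  # keep everything after the full-width colon
                record[column] = value if value else "NULL"
    return record

row = extract_fields(["寶貝回家編號：88888", "性別：男", "失蹤地點：四川省成都市"])
```

Checking only `text[0:6]` keeps single-character labels like 姓 and 性 from matching text that merely mentions them later in a line.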


Selenium version:


```python
# Extract the missing-person information from the current thread page
# tieziUrl is the thread's URL
def CurrentPageMissingPopulationInformation(tieziUrl):
    # Route requests through a proxy IP ("http", not the invalid key "post")
    proxy_handler = urllib.request.ProxyHandler({"http": "128.199.169.17:80"})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    # Open a browser via the third-party package selenium
    driver = webdriver.Chrome()        # Chrome
    # driver = webdriver.PhantomJS()   # or the headless browser PhantomJS
    try:
        driver.set_page_load_timeout(10)
        try:
            driver.get(tieziUrl)  # load twice: the first load only reaches the redirect page
            driver.get(tieziUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source  # page source after the browser has executed it
        # Catch errors raised while parsing the page
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError:
            print("-----UnicodeDecodeError url", tieziUrl)
        except urllib.error.URLError:
            print("-----urlError url:", tieziUrl)
        except socket.timeout:
            print("-----socket timeout:", tieziUrl)

        while bsObj.find("title").get_text() == "頁面重載開啟":  # anti-crawler reload page
            print("Still on the reload page; refreshing once more to reach the real page")
            driver.get(tieziUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    finally:
        driver.close()  # close the current window
        driver.quit()   # shut down the browser

    # Locate the information block
    templist1 = bsObj.find("td", {"class": "t_f"}).ul
    if templist1 is None:  # the post body has no <ul> block: nothing to extract
        print("This thread page contains no <ul> block")
        return 1
    mycsv = ["NULL"] * 8  # one slot per CSV column
    # Map the label that opens a line to the CSV column it fills; the three ID
    # labels all fill column 0, and 失蹤時間/失蹤日期 both fill column 5
    field_column = {"寶貝回家編號": 0, "尋親編號": 0, "登記編號": 0,
                    "姓": 1, "性": 2, "出生日期": 3, "失蹤時身高": 4,
                    "失蹤時間": 5, "失蹤日期": 5, "失蹤地點": 6, "是否報案": 7}
    for templist2 in templist1.findAll("font", size=re.compile("^([0-9]+)*$")):
        tempText = templist2.get_text()
        for label, column in field_column.items():
            if label in tempText[0:6]:  # labels appear within the first six characters
                print(tempText)
                value = tempText[tempText.find("：") + 1:]  # keep the text after the full-width colon
                mycsv[column] = value if value else "NULL"
    try:
        writer.writerow(tuple(str(field) for field in mycsv))  # write the record to the CSV file
        csvfile.flush()  # push this record to disk immediately
    finally:
        print("Finished writing this thread's record")
        time.sleep(5)  # sleep after each thread (five seconds here)
```
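The CSV side can be exercised in isolation too: `writer.writerow` takes one tuple per record, and flushing after every row means a crash mid-crawl loses at most the current record. A sketch writing to an in-memory buffer instead of the script's real on-disk file (whose name is not shown in this excerpt):

```python
import csv
import io

buffer = io.StringIO()   # stands in for the real on-disk CSV file
writer = csv.writer(buffer)

mycsv = ["BBHJ-001", "張三", "男", "2001-05-01", "110cm", "2005-06-01", "成都", "是"]
writer.writerow(tuple(str(field) for field in mycsv))
buffer.flush()           # with a real file object this pushes the row to disk immediately

content = buffer.getvalue()
```

The per-row flush trades a little throughput for durability, a reasonable choice when each record costs a full browser round-trip to obtain.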

The complete code follows. It is for reference only and must not be used commercially. Web crawlers can place a heavy load on a site's servers, and I accept no legal responsibility for any consequences of anyone running this code. I post it only so readers can study crawling and the structure of the program; please do not use it in a way that burdens the site. My own IP was banned for careless use, so take that as a warning. The missing-person data were scraped only to analyse the spatial patterns of disappearances; my apologies to the site for any inconvenience caused.

The complete code:


    1. [python]?view plain?copy
    2. 1.#__author__?=?"Administrator"??
    3. 2.#coding=utf-8??
    4. 3.import?io??
    5. 4.import?os??
    6. 5.import?sys??
    7. 6.import?math??
    8. 7.import?urllib??
    9. 8.from?urllib.request?import??urlopen??
    10. 9.from?urllib.request?import?urlretrieve??
    11. 10.from?urllib??import?request??
    12. 11.from?bs4?import?BeautifulSoup??
    13. 12.import?re??
    14. 13.import?time??
    15. 14.import?socket??
    16. 15.import?csv??
    17. 16.from?selenium?import?webdriver??
    18. 17.??
    19. 18.socket.setdefaulttimeout(5000)#設置全局超時函數??
    20. 19.??
    21. 20.??
    22. 21.??
    23. 22.sys.stdout?=?io.TextIOWrapper(sys.stdout.buffer,encoding="gb18030")??
    24. 23.#sys.stdout?=?io.TextIOWrapper(sys.stdout.buffer,encoding="utf-8")??
    25. 24.#設置不同的headers,偽裝為不同的瀏覽器??
    26. 25.headers1={"User-Agent":"Mozilla/5.0?(Windows?NT?6.1;?WOW64;?rv:23.0)?Gecko/20100101?Firefox/23.0"}??
    27. 26.headers2={"User-Agent":"Mozilla/5.0?(Windows?NT?6.3;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/45.0.2454.101?Safari/537.36"}??
    28. 27.headers3={"User-Agent":"Mozilla/5.0?(Windows?NT?6.1)?AppleWebKit/537.11?(KHTML,?like?Gecko)?Chrome/23.0.1271.64?Safari/537.11"}??
    29. 28.headers4={"User-Agent":"Mozilla/5.0?(Windows?NT?10.0;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/53.0.2785.104?Safari/537.36?Core/1.53.2372.400?QQBrowser/9.5.10548.400"}??
    30. 29.headers5={"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",??
    31. 30."Connection":"keep-alive",??
    32. 31."Host":"bbs.baobeihuijia.com",??
    33. 32."Referer":"http://bbs.baobeihuijia.com/forum-191-1.html",??
    34. 33."Upgrade-Insecure-Requests":"1",??
    35. 34."User-Agent":"Mozilla/5.0?(Windows?NT?6.1;?WOW64)?AppleWebKit/537.36?(KHTML,?like?Gecko)?Chrome/51.0.2704.103?Safari/537.36"}??
    36. 35.??
    37. 36.headers6={"Host":?"bbs.baobeihuijia.com",??
    38. 37."User-Agent":?"Mozilla/5.0?(Windows?NT?6.1;?WOW64;?rv:51.0)?Gecko/20100101?Firefox/51.0",??
    39. 38."Accept":?"textml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",??
    40. 39."Connection":?"keep-alive",??
    41. 40."Upgrade-Insecure-Requests":"?1"??
    42. 41.}??
    43. 42.#得到當前頁面失蹤人口信息??
    44. 43.#pageUrl為當前帖子頁面鏈接??
    45. 44.def?CurrentPageMissingPopulationInformation(tieziUrl):??
    46. 45.????#設置代理IP訪問??
    47. 46.????#代理IP可以上http://http.zhimaruanjian.com/獲取??
    48. 47.????proxy_handler=urllib.request.ProxyHandler({"post":"128.199.169.17:80"})??
    49. 48.????proxy_auth_handler=urllib.request.ProxyBasicAuthHandler()??
    50. 49.????opener?=?urllib.request.build_opener(urllib.request.HTTPHandler,?proxy_handler)??
    51. 50.????urllib.request.install_opener(opener)??
    52. 51.??
    53. 52.????try:??
    54. 53.????????#掉用第三方包selenium打開瀏覽器登陸??
    55. 54.????????#driver=webdriver.Chrome()#打開chrome??
    56. 55.???????driver=webdriver.Chrome()#打開無界面瀏覽器Chrome??
    57. 56.???????#driver=webdriver.PhantomJS()#打開無界面瀏覽器PhantomJS??
    58. 57.???????driver.set_page_load_timeout(10)??
    59. 58.???????#driver.implicitly_wait(30)??
    60. 59.???????try:??
    61. 60.???????????driver.get(tieziUrl)#登陸兩次??
    62. 61.???????????driver.get(tieziUrl)??
    63. 62.???????except?TimeoutError:??
    64. 63.???????????driver.refresh()??
    65. 64.??
    66. 65.???????#print(driver.page_source)??
    67. 66.???????html=driver.page_source#將瀏覽器執行后的源代碼賦給html??
    68. 67.????????#獲取網頁信息??
    69. 68.????#抓捕網頁解析過程中的錯誤??
    70. 69.???????try:??
    71. 70.???????????#req=request.Request(tieziUrl,headers=headers5)??
    72. 71.???????????#html=urlopen(req)??
    73. 72.???????????bsObj=BeautifulSoup(html,"html.parser")??
    74. 73.???????????#html.close()??
    75. 74.???????except?UnicodeDecodeError?as?e:??
    76. 75.???????????print("-----UnicodeDecodeError?url",tieziUrl)??
    77. 76.???????except?urllib.error.URLError?as?e:??
    78. 77.???????????print("-----urlError?url:",tieziUrl)??
    79. 78.???????except?socket.timeout?as?e:??
    80. 79.???????????print("-----socket?timout:",tieziUrl)??
    81. 80.??
    82. 81.??
    83. 82.???????while(bsObj.find("title").get_text()?==?"頁面重載開啟"):??
    84. 83.???????????print("當前頁面不是重加載后的頁面,程序會嘗試刷新一次到跳轉后的頁面
    85. ")??
    86. 84.???????????driver.get(tieziUrl)??
    87. 85.???????????html=driver.page_source#將瀏覽器執行后的源代碼賦給html??
    88. 86.???????????bsObj=BeautifulSoup(html,"html.parser")??
    89. 87.????except?Exception?as?e:??
    90. 88.????????driver.close()?#?Close?the?current?window.??
    91. 89.????????driver.quit()#關閉chrome瀏覽器??
    92. 90.????????time.sleep(0.5)??
    93. 91.??
    94. 92.????driver.close()?#?Close?the?current?window.??
    95. 93.????driver.quit()#關閉chrome瀏覽器??
    96. 94.??
    97. 95.??
    98. 96.????#查找想要的信息??
    99. 97.????templist1=bsObj.find("td",{"class":"t_f"}).ul??
    100. 98.????if?templist1==None:#判斷是否不包含ul字段,如果不,跳出函數??
    101. 99.????????print("當前帖子頁面不包含ul字段")??
    102. 100.????????return?1??
    103. 101.????mycsv=["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"]#初始化提取信息列表??
    104. 102.????for?templist2?in?templist1.findAll("font",size=re.compile("^([0-9]+)*$")):??
    105. 103.????????tempText=templist2.get_text()??
    106. 104.????????#print(tempText[0:4])??
    107. 105.????????if?"寶貝回家編號"?in?tempText[0:6]:??
    108. 106.????????????print(tempText)??
    109. 107.????????????index=tempText.find(":")??
    110. 108.????????????tempText=tempText[index+1:]??
    111. 109.????????????#mycsv.append(tempText)??
    112. 110.????????????if?len(tempText)==0:??
    113. 111.????????????????tempText="NULL"??
    114. 112.????????????mycsv[0]=tempText??
    115. 113.????????if?"尋親編號"?in?tempText[0:6]:??
    116. 114.????????????print(tempText)??
    117. 115.????????????index=tempText.find(":")??
    118. 116.????????????tempText=tempText[index+1:]??
    119. 117.????????????if?len(tempText)==0:??
    120. 118.????????????????tempText="NULL"??
    121. 119.????????????#mycsv.append(tempText)??
    122. 120.????????????mycsv[0]=tempText??
    123. 121.????????if?"登記編號"?in?tempText[0:6]:??
    124. 122.????????????print(tempText)??
    125. 123.????????????index=tempText.find(":")??
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "姓" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[1] = tempText
        if "性" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[2] = tempText
        if "出生日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[3] = tempText
        if "失蹤時身高" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[4] = tempText
        if "失蹤時間" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失蹤日期" in tempText[0:6]:  # some threads label the date "失蹤日期"; same column
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失蹤地點" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[6] = tempText
        if "是否報案" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[7] = tempText
    try:
        # Write this record to the CSV file
        writer.writerow((str(mycsv[0]), str(mycsv[1]), str(mycsv[2]), str(mycsv[3]),
                         str(mycsv[4]), str(mycsv[5]), str(mycsv[6]), str(mycsv[7])))
        csvfile.flush()  # flush so the record is written to disk immediately
    finally:
        print("Finished writing the current thread's record\n")
        time.sleep(5)  # pause between threads to avoid hammering the server

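The chain of if-blocks above maps each labeled line ("姓名:…", "失蹤地點:…") to a fixed CSV column. As a hedged sketch of the same mapping (the field labels and column order are taken from the code above, but `FIELD_COLUMNS` and `fill_record` are names of my own invention, not part of the original script), a dict keeps the label-to-column logic in one place:

```python
# -*- coding: utf-8 -*-
# Hypothetical refactor of the if-chain above: one dict maps each field label
# to its CSV column. Labels and column order follow the original code
# (both date labels deliberately share column 5).
FIELD_COLUMNS = {
    "寶貝回家編號": 0, "姓名": 1, "性別": 2, "出生日期": 3,
    "失蹤時身高": 4, "失蹤時間": 5, "失蹤日期": 5,
    "失蹤地點": 6, "是否報案": 7,
}

def fill_record(lines, num_columns=8):
    """Return a list of column values, with "NULL" where a field is missing."""
    record = ["NULL"] * num_columns
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, not a labeled field
        for prefix, col in FIELD_COLUMNS.items():
            if label.strip().startswith(prefix):
                record[col] = value.strip() or "NULL"
                break
    return record
```

Unlike the original single-character tests (`"姓"`, `"性"`), this sketch matches full labels, so it would need adjusting if the forum abbreviates them.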
# Get all page links of the current board
# siteUrl is the URL of the board's first page
def GetALLPageUrl(siteUrl):
    # Route requests through a proxy IP
    # (proxy IPs can be obtained from http://http.zhimaruanjian.com/)
    proxy_handler = urllib.request.ProxyHandler({"http": "123.207.143.51:8080"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to drive a browser
        driver = webdriver.Chrome()  # launch Chrome
        # driver = webdriver.PhantomJS()  # or the headless browser PhantomJS
        driver.set_page_load_timeout(10)
        try:
            driver.get(siteUrl)  # load twice to get past the redirect page
            driver.get(siteUrl)
        except TimeoutException:  # selenium.common.exceptions.TimeoutException
            driver.refresh()

        html = driver.page_source  # page source after the browser has executed it

        # Catch errors raised while parsing the page
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", siteUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", siteUrl)
        except socket.timeout as e:
            print("-----socket timeout:", siteUrl)

        while bsObj.find("title").get_text() == "頁面重載開啟":
            print("Current page is still the reload page; refreshing once more to reach the real page\n")
            driver.get(siteUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    except Exception as e:
        driver.close()  # close the current window
        driver.quit()   # shut down the browser

    driver.close()  # close the current window
    driver.quit()   # shut down the browser

    # Split http://bbs.baobeihuijia.com/forum-191-1.html into its parts
    # so that page links can be rebuilt
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex+1]       # http://bbs.baobeihuijia.com/
    tempbianhaoqian = siteUrl[siteindex+1:-6]  # forum-191-

    # Collect the information we want
    bianhao = []  # page numbers
    pageUrl = []  # page links

    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.findAll("a", href=re.compile("forum-([0-9]+)-([0-9]+).html")):
        if templist2 is None:
            continue
        lianjie = templist2.attrs["href"]
        index1 = lianjie.rfind("-")  # position of the last "-" in the link
        index2 = lianjie.rfind(".")  # position of the last "." in the link
        tempbianhao = lianjie[index1+1:index2]
        bianhao.append(int(tempbianhao))
    bianhaoMax = max(bianhao)  # highest page number of the board

    for i in range(1, bianhaoMax+1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"  # build the page link
        print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl  # return the list of page links

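The slicing in `GetALLPageUrl` rebuilds every page URL from the board's first-page URL and the highest page number found in the pager. A minimal illustration of that string arithmetic (values taken from the code above):

```python
# Minimal illustration of the URL arithmetic used in GetALLPageUrl above.
siteUrl = "http://bbs.baobeihuijia.com/forum-191-1.html"
siteindex = siteUrl.rfind("/")               # index of the last "/"
tempsiteurl = siteUrl[0:siteindex+1]         # "http://bbs.baobeihuijia.com/"
tempbianhaoqian = siteUrl[siteindex+1:-6]    # "forum-191-" (drops the 6 chars "1.html")
page3 = tempsiteurl + tempbianhaoqian + str(3) + ".html"
print(page3)  # http://bbs.baobeihuijia.com/forum-191-3.html
```

Note the `-6` assumes the current page number is a single digit; for the first page of a board (`…-1.html`) that always holds.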
# Get the links of all threads on one board page
def GetCurrentPageTieziUrl(PageUrl):
    # Route requests through a proxy IP
    # (proxy IPs can be obtained from http://http.zhimaruanjian.com/)
    proxy_handler = urllib.request.ProxyHandler({"http": "110.73.30.157:8123"})
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to drive a browser
        driver = webdriver.Chrome()  # launch Chrome
        # driver = webdriver.PhantomJS()  # or the headless browser PhantomJS
        driver.set_page_load_timeout(10)
        try:
            driver.get(PageUrl)  # load twice to get past the redirect page
            driver.get(PageUrl)
        except TimeoutException:  # selenium.common.exceptions.TimeoutException
            driver.refresh()

        html = driver.page_source  # page source after the browser has executed it

        # Catch errors raised while parsing the page
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", PageUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", PageUrl)
        except socket.timeout as e:
            print("-----socket timeout:", PageUrl)

        n = 0
        while bsObj.find("title").get_text() == "頁面重載開啟":
            print("Current page is still the reload page; refreshing once more to reach the real page\n")
            driver.get(PageUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
            n = n + 1
            if n == 10:
                driver.close()  # close the current window
                driver.quit()   # shut down the browser
                return 1        # give up on this page after ten refreshes

    except Exception as e:
        driver.close()  # close the current window
        driver.quit()   # shut down the browser
        time.sleep(1)

    driver.close()  # close the current window
    driver.quit()   # shut down the browser

    # Split http://bbs.baobeihuijia.com/forum-191-1.html into its parts
    # so that thread links can be rebuilt
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex+1]  # http://bbs.baobeihuijia.com/
    TieziUrl = []
    # Collect the information we want
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        if templist1 is None:
            continue
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            if templist2 is None:
                continue
            tempteiziUrl = tempsiteurl + templist2.attrs["href"]  # build the thread link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl  # return the list of thread links

# CurrentPageMissingPopulationInformation("http://bbs.baobeihuijia.com/thread-213126-1-1.html")
# GetALLPageUrl("http://bbs.baobeihuijia.com/forum-191-1.html")
# GetCurrentPageTieziUrl("http://bbs.baobeihuijia.com/forum-191-1.html")

if __name__ == "__main__":
    csvfile = open("E:/MissingPeople.csv", "w+", newline="", encoding="gb18030")
    writer = csv.writer(csvfile)
    writer.writerow(("寶貝回家編號", "姓名", "性別", "出生日期", "失蹤時身高", "失蹤時間", "失蹤地點", "是否報案"))
    pageurl = GetALLPageUrl("https://bbs.baobeihuijia.com/forum-191-1.html")  # "尋找失蹤寶貝" board
    # pageurl = GetALLPageUrl("http://bbs.baobeihuijia.com/forum-189-1.html")  # "被拐寶貝回家" board
    time.sleep(5)
    print("All page links fetched!\n")
    n = 0
    for templist1 in pageurl:
        tieziurl = GetCurrentPageTieziUrl(templist1)
        time.sleep(5)
        print("All thread links on page " + str(templist1) + " fetched!\n")
        if tieziurl == 1:
            print("Could not fetch the thread list for this page!\n")
            continue
        else:
            for templist2 in tieziurl:
                n = n + 1
                print("\nCollecting record number " + str(n) + "!")
                time.sleep(5)
                tempzhi = CurrentPageMissingPopulationInformation(templist2)
                if tempzhi == 1:
                    print("\nRecord " + str(n) + " is empty!")
                    continue
    print("")
    print("Scraping finished! You can safely close the program.")
    csvfile.close()
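The article's goal is to make these records easy to load into a database. As a hedged sketch of that last step (the CSV path and gb18030 encoding match the main block above; the `sqlite3` database, table name, and column names are my own choices, not part of the original script):

```python
# -*- coding: utf-8 -*-
# Hypothetical follow-up: load the CSV produced by the scraper into SQLite.
# Assumes the file layout and gb18030 encoding used in the main block above.
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS missing_people (
        bbhj_id TEXT, name TEXT, gender TEXT, birth_date TEXT,
        height TEXT, missing_time TEXT, missing_place TEXT, reported TEXT)""")
    with open(csv_path, "r", encoding="gb18030", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row written by the scraper
        conn.executemany(
            "INSERT INTO missing_people VALUES (?, ?, ?, ?, ?, ?, ?, ?)", reader)
    conn.commit()
    conn.close()

# csv_to_sqlite("E:/MissingPeople.csv", "E:/MissingPeople.db")
```

Any other database would work the same way; only the connection and the `INSERT` placeholders change.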

Screenshot of the resulting CSV file: (image not preserved)

The copyright of this article belongs to its author; please do not repost without permission. If this article violates any rules, you may contact the administrator to have it removed.

When reposting, please cite the original address: http://specialneedsforspecialkids.com/yun/67865.html

