When writing a Python crawler you will often run into exceptions that interrupt it and make it exit unexpectedly; an ideal crawler should be able to keep running when these exceptions occur. Below are a few common exceptions and how to handle them:
Exception 1: requests.exceptions.ProxyError
For this error, the explanation given on Stack Overflow is:
The ProxyError exception is not actually the requests.exceptions exception; it is an exception with the same name from the embedded urllib3 library, and it is wrapped in a MaxRetryError exception.
In other words, this is not actually the exception from requests.exceptions; it is an exception of the same name from the embedded urllib3 library, and it is wrapped inside a MaxRetryError. One more note: this exception usually appears when the proxy server is unreachable.
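As a minimal sketch (the proxy address and target URL below are placeholders, not taken from the original post), the crawler can catch this exception directly so that an unreachable proxy does not kill it:

import requests

proxies = {'http': 'http://127.0.0.1:8888', 'https': 'http://127.0.0.1:8888'}  # placeholder proxy
url = 'http://example.com'  # placeholder target

try:
    r = requests.get(url, proxies=proxies, timeout=10)
except requests.exceptions.ProxyError as e:
    # raised when the proxy server cannot be reached; log it and keep crawling
    print(e)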
Exception 2: requests.exceptions.ConnectionError
For this error, the explanation given on Stack Overflow is:
In the event of a network problem (e.g. DNS failure, refused connection, etc), Requests will raise a ConnectionError exception.
In other words, this exception is raised in the event of a network problem (such as a DNS failure or a refused connection); it is an exception that ships with the Requests library itself.
One solution is to catch the base-class exception, which covers all of these cases:
import sys
import requests

try:
    r = requests.get(url, params={'s': thing})  # url and thing are defined elsewhere
except requests.exceptions.RequestException as e:  # This is the correct syntax
    print(e)
    sys.exit(1)
Another solution is to handle each kind of exception separately; three exceptions are distinguished here:
try:
    r = requests.get(url, params={'s': thing})
except requests.exceptions.Timeout:
    # maybe set up for a retry, or continue in a retry loop
    pass
except requests.exceptions.TooManyRedirects:
    # tell the user their URL was bad and try a different one
    pass
except requests.exceptions.RequestException as e:
    # catastrophic error, bail out
    print(e)
    sys.exit(1)
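Since the goal is a crawler that keeps running, a common pattern (just a sketch; the urls list and the retry count of 3 are placeholders) is to wrap each request in this kind of handling and continue or retry instead of calling sys.exit:

import time
import requests

urls = ['http://example.com/1', 'http://example.com/2']  # placeholder URL list

for url in urls:
    for attempt in range(3):  # up to 3 attempts per URL
        try:
            r = requests.get(url, timeout=10)
            break  # success, move on to the next URL
        except requests.exceptions.Timeout:
            time.sleep(2)  # wait a moment, then retry this URL
        except requests.exceptions.RequestException as e:
            print(e)  # log the failure and skip this URL
            break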
Exception 3: requests.exceptions.ChunkedEncodingError
For this error, the explanation given on Stack Overflow is:
The link you included in your question is simply a wrapper that executes urllib’s read() function, which catches any incomplete read exceptions for you. If you don’t want to implement this entire patch, you could always just throw in a try/catch loop where you read your links.
In other words, the link given in the question is simply a wrapper that calls urllib's read() function and catches any incomplete-read exception for you. If you do not want to implement that whole patch, just wrap the place where you read your link in a try/except:
import urllib2
import httplib

try:
    page = urllib2.urlopen(urls).read()  # urls is defined elsewhere (Python 2)
except httplib.IncompleteRead as e:
    # keep the partial data that was received before the read broke off
    page = e.partial
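If you are using the requests library directly (as in the earlier examples), the same idea applies: catch requests.exceptions.ChunkedEncodingError at the point where the response is read. This is only a sketch with a placeholder URL, not code from the original answer:

import requests

url = 'http://example.com'  # placeholder URL

try:
    page = requests.get(url, timeout=10).content
except requests.exceptions.ChunkedEncodingError as e:
    # the server closed the connection before the chunked body finished; log and retry or skip
    print(e)
    page = None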