How to Program an Email Spider in Python
- 1). Define two regular expressions to match email addresses and hyperlinks in the code of the Web page:
import urllib
import threading
import re
r = re.compile('(?<=href\=\"mailto:).*?@.*?.[\w]{0,3}(?=\")') # Mails
r1 = re.compile('(?<=href\=\").*?(?=\")') # Links - 2). Define a class constructor that takes a Web page URL as its argument. The constructor will take the URL as a starting point, then begin the "Spider" class as a separate thread:
class Spider(threading.Thread):
def __init__(self,address):
self.url = address
threading.Thread.__init__ ( self ) - 3). Define the "run" method, which executes whenever a new thread of type "Spider" begins. This method processes the Web page with "urllib.urlopen", pulls emails from the code by using the "r" regular expression and stores them in a log file. It then takes the hyperlinks and downloads the information from that URL, starting a new thread to process that Web page:
def run(self):
source = urllib.urlopen(self.url).read()
mails = r.findall(source)
mails = list(set(mails))
log = open('log.txt','a')
for i in mails:
if re.match("^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+.)+[a-z]{2,4}$", i) != None:
if (i+'\n') not in (open('log.txt','r').readlines()):
print 'Saved: ',i
log.write(i+'\n')
count += 1
log.close()
urls = r1.findall(source)
for url in urls:
Crawl(url).start() - 4). Run the Spider class by calling a new thread of type "Spider" and supplying it with a URL:
Spider('www.google.com').start()
Source...