How to Program an Email Spider in Python

By Technology Last updated Saturday, June/30/2029

1). Define two regular expressions to match email addresses and hyperlinks in the code of the Web page:

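import urllib      # Python 2: urlopen moved to urllib.request in Python 3
import threading
import re

# Email addresses inside href="mailto:..." attributes
r = re.compile(r'(?<=href="mailto:)[^"@]+@[^"@]+\.\w{2,4}(?=")')
# Every href="..." value (this also matches the mailto: targets)
r1 = re.compile(r'(?<=href=")[^"]*(?=")')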
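
As a quick check of what the two patterns capture, the sketch below runs them against a small, made-up HTML snippet. The sample markup and the names "mail_re", "link_re" and "sample" are assumptions for illustration; the patterns are restated so the snippet runs on its own, and they assume attribute values are wrapped in double quotes.

```python
import re

# Patterns restated so the snippet runs on its own (hypothetical names)
mail_re = re.compile(r'(?<=href="mailto:)[^"@]+@[^"@]+\.\w{2,4}(?=")')
link_re = re.compile(r'(?<=href=")[^"]*(?=")')

# Made-up page source used only for this demonstration
sample = (
    '<a href="mailto:info@example.com">Write to us</a>'
    '<a href="http://example.com/about">About</a>'
)

found_mails = mail_re.findall(sample)   # addresses from mailto: links
found_links = link_re.findall(sample)   # every href value, in page order
print(found_mails)
print(found_links)
```

Note that the link pattern also returns the "mailto:" target, so a crawler would need to skip such values before trying to download them.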

2). Define a "Spider" class that subclasses "threading.Thread". Its constructor takes a Web page URL as its argument, stores it as the starting point and initializes the thread, so that each "Spider" object runs as a separate thread:

class Spider(threading.Thread):
    def __init__(self, address):
        self.url = address
        threading.Thread.__init__(self)
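
To see how a "threading.Thread" subclass behaves before any networking is added, here is a minimal sketch. The "EchoSpider" class and the "visited" list are hypothetical stand-ins: "run" only records the URL instead of downloading it. Calling "start" invokes "run" in a new thread, and "join" waits for that thread to finish.

```python
import threading

visited = []                      # shared list the threads append to
visited_lock = threading.Lock()   # guards the list across threads


class EchoSpider(threading.Thread):
    """Hypothetical stand-in for Spider: records its URL instead of fetching it."""

    def __init__(self, address):
        threading.Thread.__init__(self)
        self.url = address

    def run(self):
        # Executed in the new thread once start() is called
        with visited_lock:
            visited.append(self.url)


threads = [EchoSpider(u) for u in ('http://example.com/a', 'http://example.com/b')]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait until every thread has finished

print(sorted(visited))
```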

3). Define the "run" method, which executes whenever a "Spider" thread is started. The method downloads the Web page with "urllib.urlopen", extracts the email addresses with the "r" regular expression and appends every valid address that is not already present to the log file "log.txt". It then extracts the hyperlinks with "r1" and starts a new thread for each one to process the linked page:

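    # Method of the Spider class, placed below __init__
    def run(self):
        source = urllib.urlopen(self.url).read()
        mails = list(set(r.findall(source)))       # unique addresses on this page
        log = open('log.txt', 'a')                 # append mode creates the file if missing
        saved = open('log.txt', 'r').readlines()   # addresses logged so far
        for i in mails:
            # Keep only addresses that look syntactically valid
            if re.match(r"^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$", i) is not None:
                if (i + '\n') not in saved:
                    print 'Saved: ', i
                    log.write(i + '\n')
        log.close()
        # Process every linked page in its own thread
        urls = r1.findall(source)
        for url in urls:
            Spider(url).start()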
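
The harvesting part of "run" can also be tried without network access or threads. The following is a sketch under stated assumptions, not the article's method: "harvest" is a hypothetical helper that takes page source as a string and returns the valid addresses plus the links worth following. It assumes Python 3 ("urllib.parse.urljoin"; in Python 2 the same function lives in the "urlparse" module), and it resolves relative links against the page URL and skips "mailto:" targets, two things the "run" method above does not do.

```python
import re
from urllib.parse import urljoin  # Python 3; "from urlparse import urljoin" in Python 2

MAIL_RE = re.compile(r'(?<=href="mailto:)[^"@]+@[^"@]+\.\w{2,4}(?=")')
LINK_RE = re.compile(r'(?<=href=")[^"]*(?=")')
VALID_RE = re.compile(r'^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$')


def harvest(page_url, source):
    """Return (emails, links) found in source. Hypothetical helper."""
    # Unique, syntactically valid addresses in a stable order
    emails = sorted(m for m in set(MAIL_RE.findall(source)) if VALID_RE.match(m))
    links = []
    for href in LINK_RE.findall(source):
        if href.startswith('mailto:'):
            continue                        # not a page to download
        absolute = urljoin(page_url, href)  # resolve relative links
        if absolute.startswith(('http://', 'https://')) and absolute not in links:
            links.append(absolute)
    return emails, links


# Made-up page source used only for this demonstration
page = (
    '<a href="mailto:info@example.com">Mail</a>'
    '<a href="mailto:info@example.com">Mail again</a>'
    '<a href="/contact">Contact</a>'
    '<a href="http://example.org/">Elsewhere</a>'
)
emails, links = harvest('http://example.com/index.html', page)
print(emails)
print(links)
```

Returning plain lists keeps the parsing logic separate from downloading and logging, so it can be checked on sample input before a crawl is started.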
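
4). Start the crawl by creating a "Spider" thread with a full starting URL and calling its "start" method. Include the "http://" scheme; without it, "urllib.urlopen" treats the string as a local file name:

Spider('http://www.google.com').start()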
