How to Program an Email Spider in Python

    • 1). Import the modules the spider needs, then define two regular expressions that match email addresses and hyperlinks in the HTML source of a Web page:

      import urllib
      import threading
      import re

      r = re.compile(r'(?<=href="mailto:)[^"]+?@[^"]+?\.\w{2,4}(?=")')  # Mails
      r1 = re.compile(r'(?<=href=")[^"]*(?=")')  # Links
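
To see what the two patterns capture, the sketch below runs them against a small hand-written HTML snippet. The patterns follow the ones in this step, with the dot escaped and each match kept inside the quotes. The sample markup and the names "MAIL_RE", "LINK_RE" and "sample_html" are made up for illustration, so treat this as a sanity check rather than a complete parser.

```python
import re

# Lookbehind and lookahead keep only the text between href=" and the closing quote.
MAIL_RE = re.compile(r'(?<=href="mailto:)[^"]+?@[^"]+?\.\w{2,4}(?=")')
LINK_RE = re.compile(r'(?<=href=")[^"]*(?=")')

# Hypothetical page source, invented for this example.
sample_html = (
    '<a href="mailto:jane@example.com">Mail Jane</a> '
    '<a href="http://example.com/about">About</a>'
)

mails = MAIL_RE.findall(sample_html)
links = LINK_RE.findall(sample_html)
print(mails)
print(links)
```

Note that the link pattern also returns "mailto:" targets, because it matches any "href" value; a crawler has to filter those out before following links.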

    • 2). Define a "Spider" class that subclasses "threading.Thread". Its constructor takes a Web page URL as its argument, stores it as the starting point and initializes the underlying thread, so that each "Spider" instance can later run as a separate thread:

      class Spider(threading.Thread):
          def __init__(self, address):
              self.url = address
              threading.Thread.__init__(self)
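
The subclassing pattern in this step can be tried on its own. The sketch below uses a hypothetical "Worker" class in place of "Spider" and does no downloading; it only shows that calling "start" causes "run" to execute in a new thread and that "join" waits for it to finish. The "result" attribute is an addition for the demonstration.

```python
import threading


class Worker(threading.Thread):
    """Hypothetical stand-in for Spider: stores an argument, works in run()."""

    def __init__(self, address):
        threading.Thread.__init__(self)  # initialise the Thread base class
        self.url = address
        self.result = None

    def run(self):
        # start() arranges for this method to execute in the new thread.
        self.result = "visited " + self.url


worker = Worker("http://example.com")
worker.start()  # returns immediately; run() executes in the background
worker.join()   # block until the thread has finished
print(worker.result)
```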

    • 3). Define the "run" method, which executes in the new thread whenever a "Spider" is started. This method downloads the Web page with "urllib.urlopen", pulls email addresses from its source with the "r" regular expression and appends any new ones to a log file. It then extracts the hyperlinks with the "r1" regular expression and starts a new "Spider" thread for each linked page:

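          def run(self):
              # Download the page source (Python 2 API; Python 3 uses urllib.request.urlopen)
              source = urllib.urlopen(self.url).read()
              # Collect mailto: addresses, dropping duplicates
              mails = list(set(r.findall(source)))
              log = open('log.txt', 'a')
              for i in mails:
                  # Keep only strings that look like valid addresses
                  if re.match(r"^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$", i) is not None:
                      # Skip addresses that are already in the log
                      if (i + '\n') not in open('log.txt', 'r').readlines():
                          print('Saved: ' + i)
                          log.write(i + '\n')
              log.close()
              # Follow every absolute link in a new Spider thread
              urls = r1.findall(source)
              for url in urls:
                  if url.startswith('http'):
                      Spider(url).start()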
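
The parsing half of "run" can be tried without touching the network. The sketch below assumes Python 3, where "urllib.urlopen" no longer exists (the equivalent call is "urllib.request.urlopen"), and it works on a string of HTML that has already been downloaded. The function name "extract", the sample page and the use of "urllib.parse.urljoin" to resolve relative links are additions for illustration, not part of the original spider.

```python
import re
from urllib.parse import urljoin

MAIL_RE = re.compile(r'(?<=href="mailto:)[^"]+(?=")')
LINK_RE = re.compile(r'(?<=href=")[^"]*(?=")')
VALID_MAIL = re.compile(r'^[_.0-9a-z-]+@([0-9a-z][0-9a-z-]+\.)+[a-z]{2,4}$')


def extract(source, base_url):
    """Return (sorted valid addresses, absolute http(s) links) found in source."""
    mails = sorted(m for m in set(MAIL_RE.findall(source)) if VALID_MAIL.match(m))
    links = []
    for href in LINK_RE.findall(source):
        absolute = urljoin(base_url, href)  # resolves relative links
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return mails, links


# Hypothetical page source, invented for this example.
page = (
    '<a href="mailto:jane@example.com">Jane</a>'
    '<a href="mailto:not an address">Broken</a>'
    '<a href="/contact">Contact</a>'
    '<a href="http://example.org/">Elsewhere</a>'
)
mails, links = extract(page, "http://example.com/index.html")
print(mails)
print(links)
```

Resolving each link against the page's own URL matters because many pages use relative links such as "/contact", which cannot be opened as they stand.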

    • 4). Start the crawl by creating a "Spider" instance with a starting URL and calling its "start" method:

      Spider('http://www.google.com').start()
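
A spider started this way follows every link it finds and does not stop on its own. One common safeguard, sketched below under assumptions that are not in the original, is to remember which URLs have been visited and to cap the number of pages. The sketch is written for Python 3 and is single-threaded for clarity. The names "crawl", "fetch", "max_pages", "FAKE_SITE" and "fake_fetch" are hypothetical, and the in-memory site stands in for real downloads so the example runs offline.

```python
import re
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'(?<=href=")[^"]*(?=")')


def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first, each URL at most once, up to max_pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            source = fetch(url)
        except Exception:
            continue  # skip pages that cannot be downloaded
        visited.append(url)
        for href in LINK_RE.findall(source):
            link = urljoin(url, href)
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" so the example needs no network access.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<a href="/">Home</a><a href="/b">B</a>',
    "http://example.com/b": '<a href="/a">A</a>',
}


def fake_fetch(url):
    return FAKE_SITE[url]


order = crawl("http://example.com/", fake_fetch)
print(order)
```

For a real crawl, "fetch" could be a small function that wraps "urllib.request.urlopen" and decodes the response body to text.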
