This is the 3rd post of a series. If you haven’t read the previous posts, here they are:
In the previous post, I talked about the program introduced by the company and its flaws. Here, I’ll take you through how I made my own, better version of the software using Python. If you haven’t started using Python yet, you should read this.
The whole source code is available on GitHub.
What I wanted my program to do:
1. Take a file named “keywords” as input.
This is exactly the same as the program made by the company. Here’s the Python code to read the file and store the keywords in a list:
keywordlist = []
with open('keywords', 'r') as f:
    for line in f:
        # Strip the trailing newline so it doesn't end up in the search query
        keywordlist.append(line.strip())
2. Run a Google search and pull URLs for each of the keywords.
Google doesn’t allow bots to run search queries, so I had to use an external module, GoogleScraper.py, to accomplish this.
GoogleScraper.py has a function scrape that returns the URLs on a search page. Here’s the function geturls that returns the domain names for a specific keyword.
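# scrape comes from the GoogleScraper.py module mentioned above
from GoogleScraper import scrape

def geturls(keyword, results_per_page, pages):
    result = []
    temp = scrape(keyword, results_per_page, pages, 0)
    for url in temp:
        # Extracting only domain names from URL: keep three labels when the
        # second-to-last one is short (e.g. "com" in .com.au), otherwise two
        hostname = url.hostname.split(".")
        hostname = ".".join(len(hostname[-2]) < 4 and hostname[-3:] or hostname[-2:])
        result.append(hostname)
    return result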
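To make the `and`/`or` expression in geturls easier to follow, here is a small sketch that applies the same rule to a few sample URLs. The helper name `base_domain` and the sample URLs are made up for illustration, and it parses plain URL strings with `urllib.parse.urlparse` from the standard library instead of using the objects returned by scrape.

```python
from urllib.parse import urlparse


def base_domain(url):
    """Apply the same heuristic as geturls to a single URL string."""
    parts = urlparse(url).hostname.split(".")
    # If the second-to-last label is short (e.g. "com" in ".com.au" or
    # "co" in ".co.uk"), keep three labels; otherwise keep two.
    keep = parts[-3:] if len(parts[-2]) < 4 else parts[-2:]
    return ".".join(keep)


samples = [
    "http://www.example.com/page",
    "https://shop.example.com.au/",
    "http://www.bbc.co.uk/news",
    "http://www.ibm.com/",
]
for url in samples:
    print(url, "->", base_domain(url))
```

The last sample shows a limit of the rule: a short name such as ibm is treated like a second-level suffix, so the result is www.ibm.com rather than ibm.com.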
3. Do WHOIS searches for each of the domain names.
I used FreeWHOIS.US to do WHOIS searches. Here is the function that creates the URL to pull WHOIS info from. It takes the domain name as an input.
def whois_urlcreator(domain):
    # base_url = "http://www.whoisfly.com/"
    base_url = "http://www.freewhois.us/index.php?query="
    fullurl = base_url + domain + "&submit=Whois"
    return fullurl
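As a possible variation, the query string can be built with `urllib.parse.urlencode`, which escapes any unusual characters in the domain. This is only a sketch: the function name `whois_url` and the constant `BASE_URL` are hypothetical, while the parameter names `query` and `submit` are taken from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.freewhois.us/index.php"


def whois_url(domain):
    """Build the lookup URL, letting urlencode escape the parameter values."""
    return BASE_URL + "?" + urlencode({"query": domain, "submit": "Whois"})


print(whois_url("example.com"))
```

For an ordinary domain name this produces the same URL as plain string concatenation.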
Next, I use the URL generated to obtain the WHOIS info.
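import urllib.request

def getwhoisinfo(whoisurl):
    # Return the page body as text, or an empty string if the request or decoding fails
    try:
        with urllib.request.urlopen(whoisurl) as f:
            result = f.read().decode('utf-8')
    except (OSError, UnicodeDecodeError):
        result = ""
    return result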
After that, I extract only the e-mail IDs from the WHOIS info obtained.
import re

def getwhoisemail(whoisinfo):
    # The dot before the TLD is escaped so it only matches a literal "."
    r = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
    results = r.findall(whoisinfo)
    return list(set(results))
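Here is a quick way to try the pattern without hitting the network. The WHOIS text below is invented sample data, and the pattern is the one from getwhoisemail with the dot before the TLD escaped.

```python
import re

# The pattern from getwhoisemail, with the dot before the TLD escaped.
EMAIL_RE = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

# Invented sample text, only loosely shaped like a WHOIS record.
sample_whois = """
Registrant Email: admin@example.com
Admin Email: admin@example.com
Tech Email: hostmaster@example.org
Name Server: ns1.example.com
"""

# set() drops the duplicate address; sorted() makes the order predictable.
emails = sorted(set(EMAIL_RE.findall(sample_whois)))
print(emails)
```

The name server line is ignored because it contains no @ sign, and the repeated admin address appears only once in the result.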
4. Store domain names, statuses and email IDs in an Excel file.
For each domain name, I create an object with the following definition:
class excelitem(object):
    def __init__(self, domain, status, emails):
        self.domain = domain
        self.status = status  # status=0 if no emails found, 1 if any e-mails found
        self.emails = emails

    def showstatus(self):
        print(self.status)
I use the functions listed above to create a list of objects from the keywords file.
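The glue step could look roughly like the sketch below. It is not the code from the repository: `build_items`, `ExcelItem` and the two lookup arguments are hypothetical names, and skipping domains that were already seen is an assumption. In the real program the lookups would be geturls and a chain of whois_urlcreator, getwhoisinfo and getwhoisemail; here they are replaced by dictionaries so the example runs offline.

```python
class ExcelItem:
    """Minimal stand-in for the excelitem class described above."""

    def __init__(self, domain, status, emails):
        self.domain = domain
        self.status = status  # 0 if no emails found, 1 otherwise
        self.emails = emails


def build_items(keywords, get_domains, get_emails):
    """Return one ExcelItem per unique domain found for the keywords."""
    items = []
    seen = set()
    for keyword in keywords:
        for domain in get_domains(keyword):
            if domain in seen:
                continue  # assumption: each domain is looked up only once
            seen.add(domain)
            emails = get_emails(domain)
            items.append(ExcelItem(domain, 1 if emails else 0, emails))
    return items


# Stub lookups so the sketch runs without network access.
fake_search = {"shoes": ["example.com", "example.org"], "boots": ["example.com"]}
fake_whois = {"example.com": ["admin@example.com"], "example.org": []}

items = build_items(["shoes", "boots"], fake_search.get, fake_whois.get)
for item in items:
    print(item.domain, item.status, item.emails)
```

Passing the lookups in as arguments keeps the loop easy to test, since the slow network calls can be swapped for stubs.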
After this comes the main improvement over the company’s software: I write all these objects to an Excel file. When I view this file, I can see exactly which domains I’ve been able to get WHOIS emails for, and later I can fill in the rest by doing manual WHOIS searches.
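The post doesn’t show the writing step, so here is a minimal sketch of one way to do it. It swaps in the standard library `csv` module and writes a CSV file, which Excel can open, rather than a native Excel workbook. The function name `write_items`, the column headers and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def write_items(rows, out):
    """Write (domain, status, emails) rows as CSV to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["Domain", "Status", "Emails"])
    for domain, status, emails in rows:
        # Join multiple addresses into one cell.
        writer.writerow([domain, status, "; ".join(emails)])


# Invented sample rows: status is 1 if any e-mails were found, else 0.
rows = [
    ("example.com", 1, ["admin@example.com", "tech@example.com"]),
    ("example.org", 0, []),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_items(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open("results.csv", "w", newline="")` and pass that file object to `write_items`. Rows with status 0 are the ones to fill in by hand later.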
Here’s a sample of what the Excel file looks like:
1. WHOIS info for GoDaddy domains, Special TLDs
GoDaddy stopped providing complete WHOIS info to third-party sites and now requires that you search on their site only and enter a captcha for every search. Similarly, certain TLDs such as .com.au require a captcha too. For these, I’ll have to update the Excel file manually.
The complete project is available on GitHub.