Scrape Emails using Python

I am working on a small side business of creating simple, mobile-friendly websites using Carrd.co. The council I recently joined, which is connected with my church, uses a website provider to host information about its events, officers, and other relevant topics. The website template has not been updated in years, so the site is not mobile friendly. Quite frankly, there is too much stuff on the website, so the user experience is very poor. When the council posts information in the church's weekly bulletin, it never links to its site. The site could be a great place to host additional information and get in front of the parishioners.

I had the thought to use Carrd.co to create a digital business card so I could share my member number and link to relevant information I found useful when deciding to join my council. From there it developed into wanting to build a mobile-friendly landing page for my council. The main reason is to host more event information so parishioners can learn more about each event, sign up, and contact the council for more details. I created a template for my council and had the idea that I could offer it to other councils in the entire organization. I believe there are councils out there that do not need a full-fledged website with all the options my council is currently subscribed to. They could benefit from an inexpensive and simple site like the one I built using Carrd.co.

I was able to locate the emails and found that the format and placement stay the same for each council in my state. There are roughly 100 councils, so I could have manually opened each link and copied and pasted the email addresses, but I had the thought to ask GitHub Copilot to help me build a Python script to scrape the emails. Within about 20 minutes, I had this working script. It is amazing how much time GitHub Copilot saves me.

The script below opens the Excel file I had with the individual URLs of the ~100 councils. It then creates a DataFrame (table) to store the information the script will grab from each website. Using a for-loop, it reads each link from the "URL" column of the Excel file, opens the HTML page, and parses it. It finds the table on the page containing the contact information and the row containing the specified text, then appends the URL from the Excel file and the second column (column 1) of that row to the results DataFrame. Finally, it saves all the results to a new Excel file called Results.xlsx.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Load the Excel file
df = pd.read_excel("C:\\Users\\Documents\\emails.xlsx")

# Create a new instance of the Google Chrome driver
driver = webdriver.Chrome()

# Create an empty DataFrame to store the results
results = pd.DataFrame(columns=['Link', 'Text'])

# Iterate over the first 5 links in the Excel file
for link in df['URL'][:5]:  # Delete the [:5] to run the entire list
    # Go to the link
    driver.get(link)

    # Get the HTML of the page and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Find the table
    table = soup.find("table")

    # Find all rows in the table
    rows = table.find_all("tr")

    # Iterate over the rows
    for row in rows:
        # Find all columns in the row
        cols = row.find_all("td")

        # Skip rows that don't have at least two columns (e.g. header rows)
        if len(cols) < 2:
            continue

        # If the first column's text is the row name you're interested in
        if cols[0].text.strip() == "Text":
            # Add the link and the text to the results DataFrame
            new_row = pd.DataFrame({'Link': [link], 'Text': [cols[1].text.strip()]})
            results = pd.concat([results, new_row], ignore_index=True)

# Close the browser
driver.quit()

# Write the results to a new Excel file
results.to_excel("C:\\Users\\Documents\\Results.xlsx", index=False)

This script can iterate through ~100 URLs in a matter of minutes. It could probably be faster, but I would need to research how to speed it up. I did run into a problem where the website stopped me from running the script and grabbing the emails. I then turned on my Google One VPN and was able to run the script without any issues. It must have detected the scraping bot and blocked my IP address.
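One idea for speeding it up, assuming the contact tables are plain HTML and do not need JavaScript to render, would be to skip the Chrome browser entirely and fetch each page with the requests library, pausing briefly between requests so the site is less likely to block the scraper. The sketch below is just that idea, not my final script; the "Email" label and the file paths are placeholders for whatever your pages and files actually use.

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load the Excel file with the list of URLs (placeholder path)
df = pd.read_excel("C:\\Users\\Documents\\emails.xlsx")
rows_out = []

for link in df['URL']:
    # Fetch the raw HTML without launching a browser (assumes no JavaScript is required)
    response = requests.get(link, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the contact table; skip the page if there isn't one
    table = soup.find("table")
    if table is None:
        continue

    for row in table.find_all("tr"):
        cols = row.find_all("td")
        # "Email" is a placeholder for whatever label appears in the first column
        if len(cols) >= 2 and cols[0].text.strip() == "Email":
            rows_out.append({'Link': link, 'Text': cols[1].text.strip()})

    # Short pause between requests to avoid hammering the site
    time.sleep(1)

# Write the results to a new Excel file (placeholder path)
results = pd.DataFrame(rows_out)
results.to_excel("C:\\Users\\Documents\\Results.xlsx", index=False)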
