Using AI to Turn Research Posts into Structured Data
Imagine this: you’re knee-deep in research papers, market reports, and strategy publications, trying to sift through mountains of data to find actionable insights. It’s like looking for a needle in a haystack. What if you could streamline this process, turning these massive posts into structured, digestible data? In this lesson, we will build on our previous AI analyst tool and show you how to distill large amounts of information into actionable insights using real-world examples.
Let’s take Market Ethos by Craig Basinger as our case study. A fellow portfolio manager overseeing over a billion dollars in assets, Craig regularly shares his market views through numerous publications. His work is packed with deep market insight, but finding a specific reference or tracking a recurring theme across his publications can be challenging. What if you had an AI-driven assistant to help you navigate these reports and extract the data you need, multiplying the value of already insightful content?
In this lesson, we’ll:

- Scrape the Market Ethos publication page and convert it into a structured table of titles, dates, and URLs
- Extract the full text of each article
- Feed that structured content to our AI analyst tool to distill it into actionable insights
By the end, you’ll be equipped with powerful techniques to enhance your research efficiency and uncover valuable insights hidden in extensive publications.
Before diving into data analysis, it’s crucial to get our data ready. Preparing the data properly is half the battle, as it sets the foundation for all subsequent analysis. In this lesson, we’ll show you how to turn website content into a structured table that we can easily process.
Below is a screenshot of the Market Ethos webpage that we’ll be working with.
Our goal is to extract information from this webpage and convert it into a table format, making it ready for further analysis.
We will use Python’s `requests` library to fetch the webpage content and `BeautifulSoup` from the `bs4` library to parse the HTML. BeautifulSoup is a powerful web-scraping library that lets us extract data from HTML and XML files. We’ll look for specific HTML tags to extract the required information.
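To make the pattern concrete before the full script, here is a minimal fetch-and-parse sketch. The `h1` and `a` lookups are illustrative examples, not yet the selectors we need for the article cards:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and parse the raw HTML into a navigable soup object
response = requests.get('https://www.purposeinvest.com/thoughtful/market-ethos')
soup = BeautifulSoup(response.text, 'html.parser')

# find() returns the first matching tag; find_all() returns every match
first_heading = soup.find('h1')
links = soup.find_all('a', href=True)

print(first_heading.text.strip() if first_heading else 'No h1 found')
print(f'Found {len(links)} links on the page')
```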
Here’s how we do it:
1. Use `requests.get(url)` to fetch the content of the webpage.
2. Use `BeautifulSoup` to parse the HTML content and create a soup object for easy navigation.
3. Find all the `div` elements that contain the articles.
4. For each `div`, we extract the title, published date, and URL of the article.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the webpage
url = 'https://www.purposeinvest.com/thoughtful/market-ethos'
response = requests.get(url)
webpage_content = response.text

# Parse the HTML
soup = BeautifulSoup(webpage_content, 'html.parser')

# Find all target divs (the class string matches the article cards on the page)
target_divs = soup.find_all(
    'div',
    class_='mb-5 flex w-full max-w-2xl flex-col p-0 md:mb-0 md:flex-row md:p-2 2xl:w-1/2 undefined undefined'
)

# Initialize an empty list to store extracted information
articles_info = []

# Extract information from each article card
date_class = ('font-bold uppercase text-teal-1 metro:text-white '
              'nest:text-white crypto:text-white dark:text-white')
for div in target_divs:
    # Extracting the title
    title_tag = div.find('h1')
    title = title_tag.text.strip() if title_tag else 'No Title'

    # Extracting the published date
    date_tag = div.find('span', class_=date_class)
    date = date_tag.text.strip() if date_tag else 'No Date'

    # Extracting the URL in the "Read" button
    read_button = div.find('a', href=True)
    url = read_button['href'] if read_button else 'No URL'

    # Append the extracted information to the list
    articles_info.append({'Title': title, 'Published Date': date, 'URL': url})

# Create a DataFrame
df = pd.DataFrame(articles_info)

# Parse the dates and convert relative links into absolute URLs
df['Published Date'] = pd.to_datetime(df['Published Date'])
df['URL'] = 'https://www.purposeinvest.com' + df['URL']

# Display the DataFrame
df.head()
```
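A quick word of caution before moving on: this kind of scraping is brittle. The long utility-class strings we match on can change whenever the site is restyled, a failed request returns an error page rather than raising, and `pd.to_datetime` will raise on the `'No Date'` fallback. The hardening below is our own suggestion, not part of the original walkthrough, and the `User-Agent` string is an arbitrary example:

```python
import requests
import pandas as pd

# Some sites reject the default requests User-Agent, so identify the script explicitly
headers = {'User-Agent': 'Mozilla/5.0 (research-scraper-tutorial)'}
response = requests.get(
    'https://www.purposeinvest.com/thoughtful/market-ethos',
    headers=headers,
    timeout=10,
)
response.raise_for_status()  # raise on 4xx/5xx instead of silently parsing an error page

# errors='coerce' turns unparseable values such as 'No Date' into NaT instead of raising
dates = pd.to_datetime(pd.Series(['July 2, 2024', 'No Date']), errors='coerce')
print(dates)
```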
Now that we have a structured table with titles, published dates, and URLs, the next step is to extract the actual article content. We’ll use BeautifulSoup again to scrape the text of each article from its respective URL.
```python
for idx, row in df.iterrows():
    article_url = row['URL']
    ...
```
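Since the source truncates here, the following is a sketch of one plausible completion under stated assumptions: we fetch each article URL, parse it, and store the visible page text in a new column. The `'Content'` column name and the blanket `get_text()` call are our own choices; a production scraper would target the specific tag or class that wraps the article body.

```python
import requests
from bs4 import BeautifulSoup

article_texts = []
for idx, row in df.iterrows():
    article_url = row['URL']
    article_response = requests.get(article_url)
    article_soup = BeautifulSoup(article_response.text, 'html.parser')
    # Assumption: grab all visible text rather than a specific article container
    article_texts.append(article_soup.get_text(separator=' ', strip=True))

# Hypothetical column for the scraped article text, ready for the AI analyst step
df['Content'] = article_texts
```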