Using AI to Turn Research Posts into Structured Data
Imagine this: you’re knee-deep in research papers, market reports, and strategy publications, trying to sift through mountains of data to find actionable insights. It’s like looking for a needle in a haystack. What if you could streamline this process, turning these massive posts into structured, digestible data? In this lesson, we will build on our previous AI analyst tool and show you how to distill large amounts of information into actionable insights using real-world examples.
Let’s take Market Ethos by Craig Basinger as our case study. A fellow portfolio manager overseeing over a billion dollars in assets, Craig regularly shares his market views through numerous publications. His work is packed with deep market insight, but finding a specific reference or tracking a recurring theme across his publications can be challenging. What if you had an AI-driven assistant to help you navigate these reports and extract the data you need, multiplying the value of already insightful content?
In this lesson, we’ll:

- Scrape the Market Ethos publication page and convert it into a structured table of titles, dates, and URLs
- Extract the full text of each article
- Feed that structured content to our AI analyst tool to distill it into actionable insights
By the end, you’ll be equipped with powerful techniques to enhance your research efficiency and uncover valuable insights hidden in extensive publications.
Before diving into data analysis, it’s crucial to get our data ready. Preparing the data properly is half the battle, as it sets the foundation for all subsequent analysis. In this lesson, we’ll show you how to turn website content into a structured table that we can easily process.
Below is a screenshot of the Market Ethos webpage that we’ll be working with.
Our goal is to extract information from this webpage and convert it into a table format, making it ready for further analysis.
We will use Python’s `requests` library to fetch the webpage content and `BeautifulSoup` from the `bs4` library to parse the HTML. BeautifulSoup is a powerful web-scraping library that lets us extract data from HTML and XML files. We’ll look for specific HTML tags to extract the required information.
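To make the pattern concrete before the full script, here is a minimal fetch-and-parse sketch. The `h1` and `a` lookups are illustrative examples, not yet the selectors we need for the article cards:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and parse the raw HTML into a navigable soup object
response = requests.get('https://www.purposeinvest.com/thoughtful/market-ethos')
soup = BeautifulSoup(response.text, 'html.parser')

# find() returns the first matching tag; find_all() returns every match
first_heading = soup.find('h1')
links = soup.find_all('a', href=True)

print(first_heading.text.strip() if first_heading else 'No h1 found')
print(f'Found {len(links)} links on the page')
```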
Here’s how we do it:
1. Use `requests.get(url)` to fetch the content of the webpage.
2. Use `BeautifulSoup` to parse the HTML content and create a soup object for easy navigation.
3. Find all the `div` elements that contain the articles.
4. For each `div`, we extract the title, published date, and URL of the article.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the webpage
url = 'https://www.purposeinvest.com/thoughtful/market-ethos'
response = requests.get(url)
webpage_content = response.text

# Parse the HTML
soup = BeautifulSoup(webpage_content, 'html.parser')

# Find all target divs (the class string matches the article cards on the page)
target_divs = soup.find_all(
    'div',
    class_='mb-5 flex w-full max-w-2xl flex-col p-0 md:mb-0 md:flex-row md:p-2 2xl:w-1/2 undefined undefined'
)

# Initialize an empty list to store extracted information
articles_info = []

# Extract information from each article card
date_class = ('font-bold uppercase text-teal-1 metro:text-white '
              'nest:text-white crypto:text-white dark:text-white')
for div in target_divs:
    # Extracting the title
    title_tag = div.find('h1')
    title = title_tag.text.strip() if title_tag else 'No Title'

    # Extracting the published date
    date_tag = div.find('span', class_=date_class)
    date = date_tag.text.strip() if date_tag else 'No Date'

    # Extracting the URL in the "Read" button
    read_button = div.find('a', href=True)
    url = read_button['href'] if read_button else 'No URL'

    # Append the extracted information to the list
    articles_info.append({'Title': title, 'Published Date': date, 'URL': url})

# Create a DataFrame
df = pd.DataFrame(articles_info)

# Parse the dates and convert relative links into absolute URLs
df['Published Date'] = pd.to_datetime(df['Published Date'])
df['URL'] = 'https://www.purposeinvest.com' + df['URL']

# Display the DataFrame
df.head()
```
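A quick word of caution before moving on: this kind of scraping is brittle. The long utility-class strings we match on can change whenever the site is restyled, a failed request returns an error page rather than raising, and `pd.to_datetime` will raise on the `'No Date'` fallback. The hardening below is our own suggestion, not part of the original walkthrough, and the `User-Agent` string is an arbitrary example:

```python
import requests
import pandas as pd

# Some sites reject the default requests User-Agent, so identify the script explicitly
headers = {'User-Agent': 'Mozilla/5.0 (research-scraper-tutorial)'}
response = requests.get(
    'https://www.purposeinvest.com/thoughtful/market-ethos',
    headers=headers,
    timeout=10,
)
response.raise_for_status()  # raise on 4xx/5xx instead of silently parsing an error page

# errors='coerce' turns unparseable values such as 'No Date' into NaT instead of raising
dates = pd.to_datetime(pd.Series(['July 2, 2024', 'No Date']), errors='coerce')
print(dates)
```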
Now that we have a structured table with titles, published dates, and URLs, the next step is to extract the actual article content. We’ll use BeautifulSoup again to scrape the text of each article from its respective URL.
```python
for idx, row in df.iterrows():
    article_url = row['URL']
    ...
```
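Since the source truncates here, the following is a sketch of one plausible completion under stated assumptions: we fetch each article URL, parse it, and store the visible page text in a new column. The `'Content'` column name and the blanket `get_text()` call are our own choices; a production scraper would target the specific tag or class that wraps the article body.

```python
import requests
from bs4 import BeautifulSoup

article_texts = []
for idx, row in df.iterrows():
    article_url = row['URL']
    article_response = requests.get(article_url)
    article_soup = BeautifulSoup(article_response.text, 'html.parser')
    # Assumption: grab all visible text rather than a specific article container
    article_texts.append(article_soup.get_text(separator=' ', strip=True))

# Hypothetical column for the scraped article text, ready for the AI analyst step
df['Content'] = article_texts
```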