Creating an Article Relevance Analyzer 🧠 powered by Natural Language Processing (NLP)

A Natural Language Processing (NLP) app using Java


📗 Introduction

Recently, I've dived into the fascinating world of ML and AI! As someone who enjoys devouring docs and articles, I find myself constantly reading and taking notes. However, I often end up wasting precious time on irrelevant ones, so I decided to find a way to be more efficient and productive.

Imagine having an app that automatically scours popular articles, pinpointing the ones relevant to your interests. That's exactly what I aim to build: an article relevance analyzer!

In today's digital era, Natural Language Processing (NLP) has emerged as a powerful tool for engaging with intelligent software. Think of voice assistants like Google's Assistant, Apple's Siri, and Microsoft's Cortana, which can understand and respond to our everyday language. But have you ever wondered how this "smart" software actually works?

In this article, we'll explore the inner workings of these applications and even delve into developing our own natural language processing software.

So, get ready to join me on this journey as we dive into the process of building an article relevance analyzer!

📗 Stanford NLP Library

Natural language processing (NLP) apps, just like other machine learning apps, rely on a combination of small, simple, and intuitive algorithms working together. To save time and effort, it's often best to use an external library that already implements and integrates these algorithms seamlessly.

In our example, I'll introduce you to the Stanford NLP library, a Java-based library for natural language processing that supports multiple languages.

One specific algorithm we'll focus on is the part-of-speech (POS) tagger. It's a clever algorithm that automatically labels each word in a text with its corresponding part of speech. By analyzing the words' characteristics and taking the context into account, this tagger does an awesome job of classifying the words in the text.

Now, I must admit, the nitty-gritty details of how the POS tagger algorithm works are a bit too much for this article. But don't worry! If you're curious and want to explore further, you can dive into the topic here to satisfy your hunger for knowledge.

📗 Getting Started!

I'm assuming you've got the Java basics down, perhaps from your adventures with Data Structures and Algorithms! But hey, hold on a sec, I'm guessing you might not be familiar with setting up a Maven project. (Professionals, please forgive me for a moment!)

✅ Set up a Java development environment: Ensure that you have the Java Development Kit (JDK) installed on your system. You can download and install the JDK from the official Oracle website.

✅ Create a new Maven project: Open a command prompt or terminal window, navigate to the directory where you want to create your project, and execute the following command:

mvn archetype:generate -DgroupId=com.example -DartifactId=article-analyzer -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

This will create a new Maven project with the specified group ID (com.example) and artifact ID (article-analyzer).

✅ Now, add the Stanford NLP library as a dependency. Since you are using Maven, simply include the following inside the <dependencies> section of your pom.xml file.

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.6.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.6.0</version>
  <classifier>models</classifier>
</dependency>

✅ Furthermore, since our app will scrape article content from web pages, we need to add two more dependencies.

<dependency>
  <groupId>de.l3s.boilerpipe</groupId>
  <artifactId>boilerpipe</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>net.sourceforge.nekohtml</groupId>
  <artifactId>nekohtml</artifactId>
  <version>1.9.22</version>
</dependency>

Now you are good to go!

📗 Scraping and Cleaning

The first part of our analyzer will involve retrieving articles and extracting their content from web pages.

When retrieving articles from random sources, the pages are usually filled with extraneous information (embedded videos, outbound links, advertisements, etc.) that is irrelevant to the article itself. This is where Boilerpipe comes into play!

Boilerpipe is an extremely powerful and efficient algorithm for removing "clutter": it identifies the main content of an article by analyzing the different content blocks, using features like the length of an average sentence, the types of tags used in a content block, and the density of links. The Boilerpipe algorithm has proven to be competitive with other, much more computationally expensive algorithms, such as those based on machine vision.
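(If you're curious, Boilerpipe exposes those per-block features through its TextBlock objects. Here's a small, optional sketch that prints them for each block of a page; it isn't needed for the analyzer itself, and the class name BlockInspector is just an illustrative choice.)

package com.example;

import java.net.URL;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BlockInspector {
    public static void main(String[] args) throws Exception {
        // Parse a page into Boilerpipe's block-based representation.
        TextDocument doc = new BoilerpipeSAXInput(
                HTMLFetcher.fetch(new URL("https://en.wikipedia.org/wiki/Chicken")).toInputSource())
                .getTextDocument();

        // Let the article extractor decide which blocks count as main content.
        CommonExtractors.ARTICLE_EXTRACTOR.process(doc);

        // Print the per-block features the algorithm works with.
        for (TextBlock block : doc.getTextBlocks()) {
            System.out.printf("content=%b words=%d linkDensity=%.2f%n",
                    block.isContent(), block.getNumWords(), block.getLinkDensity());
        }
    }
}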

To retrieve and clean articles, we can use the Boilerpipe library's built-in functions. Let's implement a function called extractFromUrl that takes a URL as input, uses Boilerpipe to fetch the HTML document, extracts the text, and cleans it.

✅ Create the BoilerPipeExtractor class: create a new Java class named BoilerPipeExtractor.java under the src/main/java/com/example directory and copy the following code into it.

package com.example;

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerPipeExtractor {
    public static String extractFromUrl(String userUrl)
            throws java.io.IOException,
            org.xml.sax.SAXException,
            de.l3s.boilerpipe.BoilerpipeProcessingException {
        // Download the raw HTML document from the given URL.
        final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(userUrl));
        // Parse the HTML into Boilerpipe's TextDocument representation.
        final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        // Keep only the main article text, dropping clutter such as navigation, ads, and links.
        return CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
    }
}

The Boilerpipe library offers various extractors that work based on the boilerpipe algorithm. One extractor we're particularly interested in is the ArticleExtractor, which is designed to give us the juiciest bits of HTML-formatted news articles.

ArticleExtractor pays close attention to the HTML tags used in each content block and also takes into account the density of outbound links. This makes it a perfect fit for our task, and much more suitable than the faster-but-simpler DefaultExtractor.
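If you'd like to see the difference for yourself, you can run both extractors over the same page and compare how much text each one keeps. A quick, optional sketch (the class name and the character-count comparison are just illustrative):

package com.example;

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class ExtractorComparison {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://en.wikipedia.org/wiki/Chicken");

        // Each extractor mutates the TextDocument it processes, so parse the page once per extractor.
        TextDocument forArticle = new BoilerpipeSAXInput(HTMLFetcher.fetch(url).toInputSource()).getTextDocument();
        TextDocument forDefault = new BoilerpipeSAXInput(HTMLFetcher.fetch(url).toInputSource()).getTextDocument();

        String articleText = CommonExtractors.ARTICLE_EXTRACTOR.getText(forArticle);
        String defaultText = CommonExtractors.DEFAULT_EXTRACTOR.getText(forDefault);

        // ARTICLE_EXTRACTOR is usually much stricter about what it keeps.
        System.out.println("ArticleExtractor kept " + articleText.length() + " characters");
        System.out.println("DefaultExtractor kept " + defaultText.length() + " characters");
    }
}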

The best part? The built-in functions in the library handle everything for us seamlessly! Here's the breakdown:

  1. HTMLFetcher.fetch: This function grabs the HTML document for us.

  2. getTextDocument: With this function, we extract the text document from the HTML.

  3. CommonExtractors.ARTICLE_EXTRACTOR.getText: And voila! This function uses the boilerpipe algorithm to extract the relevant text from the article.

With these functions at our disposal, we can effortlessly extract the most relevant content from HTML-formatted articles. It's like having a superpower!

Now, let's walk through an example to better understand how it works. To keep things simple, we'll use the Wikipedia page about chickens. You can find the page here. Feed this URL to the function and see what comes out.

✅ Update the App class (App.java, which the quickstart archetype generated for you) with the following code:

package com.example;

public class App {
    public static void main(String[] args) throws java.io.IOException,
            org.xml.sax.SAXException,
            de.l3s.boilerpipe.BoilerpipeProcessingException {
        // Hint that console output should use UTF-8 so non-ASCII article text prints cleanly.
        System.setProperty("console.encoding", "UTF-8");

        // Download the chicken article and print its extracted main text.
        String urlString = "https://en.wikipedia.org/wiki/Chicken";
        String text = BoilerPipeExtractor.extractFromUrl(urlString);
        System.out.println(text);
    }
}

Your output should be the main body of the article, without the ads, HTML tags, and outbound links. Here is a snippet from the beginning of what I got when I ran this:

Female crouching (receptive posture) or stepping aside or running away (if unwilling to copulate)
Male mounting
Male treading with both feet on hen's back...

And that is indeed the main body of the article. It's hard to imagine this being much simpler to implement.

📗 Tagging Parts of Speech

Now that you've successfully extracted the main article body, it's time to dig deeper and determine whether the article mentions topics that are of interest to you.

You might be tempted to do a basic string or regular expression search, but hold on a moment! There are some drawbacks to that approach.

Firstly, a simple string search can produce false positives. For example, an article that only mentions "Microsoft Excel" would be tagged as mentioning "Microsoft", the company. Not quite accurate, right?

Secondly, depending on how a regular expression search is constructed, it can miss genuine mentions and produce false negatives.

Lastly, if you're dealing with a large number of keywords and processing numerous articles, searching the entire text of every article for every keyword can take a long time, resulting in poor performance. Time is precious!
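To make the first drawback concrete, here's a tiny illustration of how a naive substring search misfires (the sentence and keyword are made up for the example):

public class NaiveSearchDemo {
    public static void main(String[] args) {
        String article = "The report was formatted as a Microsoft Excel spreadsheet.";

        // A plain substring search "finds" Microsoft even though the article
        // is about a spreadsheet file, not about the company itself.
        boolean mentionsKeyword = article.toLowerCase().contains("microsoft");
        System.out.println("Naive match for 'Microsoft': " + mentionsKeyword); // prints true
    }
}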

Here's where Stanford's CoreNLP library comes to the rescue with its powerful features. It addresses all three of these problems.

For our analyzer, we'll employ the parts-of-speech (POS) tagger. This clever tool allows us to find all the proper nouns in the article and compare them to our list of keywords of interest.

By leveraging NLP technology, we not only improve the accuracy of the search and minimize false positives and negatives, but we also significantly reduce the amount of text we need to compare, since proper nouns are just a small fraction of the article's content. Clever, isn't it?

To speed up the analysis, we'll preprocess the article into a data structure with a low membership-query cost. This way, we can quickly identify the relevant keywords without wasting precious time.

Stanford CoreNLP offers a super convenient tagger called MaxentTagger that can perform POS tagging in just a few lines of code. It's like having a helpful assistant to handle the heavy lifting for you!

Here is a simple implementation:

package com.example;

import java.util.HashSet;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class ArticleAnalyzer {
    // Keywords the user cares about (filled in later).
    private HashSet<String> article;
    // Path of the pre-trained English POS model bundled in the stanford-corenlp "models" dependency.
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
    private MaxentTagger tagger;

    public ArticleAnalyzer() {
        tagger = new MaxentTagger(modelPath);
    }

    // Returns the input text with each word annotated with its part-of-speech tag.
    public String tagPos(String input) {
        return tagger.tagString(input);
    }
}

The tagger function, tagPos, takes a string as input and outputs a string containing the words of the original string along with their corresponding parts of speech. In your main function, instantiate an ArticleAnalyzer and feed the output of the scraper into the tagger function, as sketched below.
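A minimal sketch of that wiring might look like the following (the URL is just a placeholder; any article works):

package com.example;

public class App {
    public static void main(String[] args) throws Exception {
        ArticleAnalyzer analyzer = new ArticleAnalyzer();

        // Scrape an article and run the POS tagger over its main text.
        String text = BoilerPipeExtractor.extractFromUrl("https://en.wikipedia.org/wiki/Chicken");
        System.out.println(analyzer.tagPos(text));
    }
}

Running something like this on a news article produces tagged output along these lines: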

MILAN/PARIS_NN Italy_NNP 's_POS Luxottica_NNP -LRB-_-LRB- LUX.MI_NNP -RRB-_-RRB- and_CC France_NNP 's_POS Essilor_NNP -LRB-_-LRB- ESSI.PA_NNP -RRB-_-RRB- have_VBP agreed_VBN a_DT 46_CD billion_CD euro_NN -LRB-_-LRB- $_$ 49_CD billion_CD -RRB-_-RRB- merger_NN to_TO create_VB a_DT global_JJ eyewear_NN powerhouse_NN with_IN annual_JJ revenue_NN of_IN more_JJR than_IN 15_CD billion_CD euros_NNS ._. The_DT all-share_JJ deal_NN is_VBZ one_CD of_IN Europe_NNP 's_POS largest_JJS cross-border_JJ tie-ups_NNS and_CC brings_VBZ together_RB Luxottica_NNP ,_, the_DT world_NN 's_POS top_JJ spectacles_NNS maker_NN with_IN brands_NNS such_JJ as_IN Ray-Ban_NNP and_CC Oakley_NNP ,_, with_IN leading_VBG lens_NN manufacturer_NN Essilor_NNP ._. ``_`` Finally_RB ..._: two_CD products_NNS which_WDT are_VBP naturally_RB complementary_JJ --_: namely_RB frames_NNS and_CC lenses_NNS --_: will_MD be_VB designed_VBN ,_, manufactured_VBN and_CC distributed_VBN under_IN the_DT same_JJ roof_NN ,_, ''_'' Luxottica_NNP 's_POS 81-year-old_JJ founder_NN Leonardo_NNP Del_NNP Vecchio_NNP said_VBD in_IN a_DT statement_NN on_IN Monday_NNP ._. Shares_NNS in_IN Luxottica_NNP were_VBD up_RB by_IN 8.6_CD percent_NN at_IN 53.80_CD euros_NNS by_IN 1405_CD GMT_NNP -LRB-_-LRB- 9:05_CD a.m._NN ET_NNP -RRB-_-RRB- ,_, with_IN Essilor_NNP up_IN 12.2_CD percent_NN at_IN 114.60_CD euros_NNS ._. The_DT merger_NN between_IN the_DT top_JJ players_NNS in_IN the_DT 95_CD billion_CD eyewear_NN market_NN is_VBZ aimed_VBN at_IN helping_VBG the_DT businesses_NNS to_TO take_VB full_JJ advantage_NN of_IN expected_VBN strong_JJ demand_NN for_IN prescription_NN spectacles_NNS and_CC sunglasses_NNS due_JJ to_TO an_DT aging_NN global_JJ population_NN and_CC increasing_VBG awareness_NN about_IN...

📗 Creating a Tagged Output Set

So far, we've accomplished quite a bit! We've developed functions to download, clean, and tag news articles. But there's still one important task left: figuring out if the article mentions any of the keywords that are of interest to the user.

To tackle this, we need to gather all the proper nouns and then check if any of our keywords appear among them. It's like searching for hidden gems within the text!

To find those proper nouns, we'll start by splitting the tagged string output into individual tokens, using spaces as delimiters. Then, for each token, we'll split it again on the underscore (_) and check whether the part of speech indicates a proper noun. It's like a little detective work to identify those significant names!

Once we have gathered all the proper nouns, we want to store them in a data structure that suits our purpose. In this example, we'll opt for a HashSet. This data structure has some nifty characteristics: it allows no duplicate entries and doesn't maintain the order of entries, but most importantly, it offers lightning-fast membership queries. Since we only care about checking whether a word is present or not, a HashSet is the perfect fit for our needs. It's like having a supercharged tool to quickly find what we're looking for!

Here is the implementation that accomplishes this:

public static HashSet<String> extractProperNouns(String taggedOutput) {
    HashSet<String> propNounSet = new HashSet<>();
    // Tokens look like "word_TAG" and are separated by spaces.
    String[] split = taggedOutput.split(" ");
    // Collects consecutive proper nouns so multi-word names stay together.
    List<String> propNounList = new ArrayList<>();
    for (String token : split) {
        String[] splitTokens = token.split("_");
        if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
            // Proper noun: keep building the current run.
            propNounList.add(splitTokens[0]);
        } else {
            // Run ended: store it as a single (possibly multi-word) entry.
            if (!propNounList.isEmpty()) {
                propNounSet.add(StringUtils.join(propNounList, " "));
                propNounList.clear();
            }
        }
    }
    // Don't lose a run that ends at the very last token.
    if (!propNounList.isEmpty()) {
        propNounSet.add(StringUtils.join(propNounList, " "));
        propNounList.clear();
    }
    return propNounSet;
}

The function now returns a set containing the individual proper nouns as well as runs of consecutive proper nouns (joined by spaces).
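As a quick sanity check, here's a small usage sketch (the tagged string is made up for the example) showing how the resulting set supports fast membership queries:

package com.example;

import java.util.HashSet;

public class ProperNounDemo {
    public static void main(String[] args) {
        // A made-up tagged string in the "word_TAG" format produced by tagPos.
        String tagged = "Shares_NNS in_IN Luxottica_NNP rose_VBD while_IN Ray-Ban_NNP sales_NNS grew_VBD ._.";

        HashSet<String> properNouns = ArticleAnalyzer.extractProperNouns(tagged);
        System.out.println(properNouns);                       // e.g. [Ray-Ban, Luxottica]
        System.out.println(properNouns.contains("Luxottica")); // true: constant-time membership query
    }
}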

📗 Articles vs. PropNouns Comparison

Guess what? We're almost there!

In the previous parts, we created some awesome stuff! First, we built a scraper that can download and extract the main text of an article. Then, we made a tagger that can analyze the article and find proper nouns. Finally, we developed a processor that takes those tagged proper nouns and puts them into a HashSet.

Now, the last step is super easy! We just need to take that HashSet and compare it with our list of keywords.

To make it happen, update your ArticleAnalyzer class so that it looks like this:

package com.example;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.util.StringUtils;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

import static com.example.BoilerPipeExtractor.extractFromUrl;

public class ArticleAnalyzer {
    private HashSet<String> article;
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
    private MaxentTagger tagger;

    public ArticleAnalyzer() {
        tagger = new MaxentTagger(modelPath);
        article = new HashSet<>();
    }

    public String tagPos(String input) {
        return tagger.tagString(input);
    }

    public static HashSet<String> extractProperNouns(String taggedOutput) {
        HashSet<String> propNounSet = new HashSet<>();
        String[] split = taggedOutput.split(" ");
        List<String> propNounList = new ArrayList<>();
        for (String token : split) {
            String[] splitTokens = token.split("_");
            if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
                propNounList.add(splitTokens[0]);
            } else {
                if (!propNounList.isEmpty()) {
                    propNounSet.add(StringUtils.join(propNounList, " "));
                    propNounList.clear();
                }
            }
        }
        if (!propNounList.isEmpty()) {
            propNounSet.add(StringUtils.join(propNounList, " "));
            propNounList.clear();
        }
        return propNounSet;
    }

    // Registers a keyword/topic the user is interested in.
    public void addKeyword(String keyword) {
        article.add(keyword);
    }

    // Returns true if any of the article's proper nouns matches a registered keyword
    // (the comparison is case-insensitive).
    public boolean areKeywordsMentioned(HashSet<String> articleProperNouns) {
        HashSet<String> lowercaseArticle = new HashSet<>();
        for (String topics : article) {
            lowercaseArticle.add(topics.toLowerCase());
        }
        for (String properNoun : articleProperNouns) {
            if (lowercaseArticle.contains(properNoun.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    // Full pipeline: scrape the article, tag it, extract its proper nouns, and check for keywords.
    public boolean analyzeArticle(String urlString) throws IOException, SAXException, BoilerpipeProcessingException {
        String articleText = extractFromUrl(urlString);
        String tagged = tagPos(articleText);
        HashSet<String> properNounsSet = extractProperNouns(tagged);
        return areKeywordsMentioned(properNounsSet);
    }
}

📗 Connecting the dots

Finally, we can use the app! The completed ArticleAnalyzer class is the one shown above, so all that's left is to update App.java to drive it.


App.java

package com.example;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import org.xml.sax.SAXException;

import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException, SAXException, BoilerpipeProcessingException {
        ArticleAnalyzer analyzer = new ArticleAnalyzer();

        // Register a topic of interest, then analyze an article for it.
        analyzer.addKeyword("chicken");
        boolean mentioned = analyzer.analyzeArticle("https://en.wikipedia.org/wiki/Chicken");

        if (mentioned) {
            System.out.println("Article is relevant! You might learn something.");
        } else {
            System.out.println("Article is irrelevant!");
        }
    }
}

📗 Building an NLP App -> Easy peasy!

We've just finished our wild adventure in the magical world of Natural Language Processing (NLP). Hold on tight and let's recap the highlights!

In this article, we explored the intricacies of building a powerful NLP app. Step by step, we uncovered the process of extracting valuable insights from articles downloaded from URLs. Trust me, it was quite an adventure!

On our journey, we used cutting-edge tech like Boilerpipe and Stanford NLP to streamline development and achieve remarkable results. With Boilerpipe, we decluttered web pages like pros, extracting only the most relevant text. Talk about a clean sweep!

And then there's Stanford NLP, the language-processing wizard. It helped us perform advanced linguistic analysis, detecting entities and references. We even played linguistic detectives, spotting references to the keywords we care about. Holmes and Watson would be proud!

By combining these technologies, we transformed a complex task into an efficient process. We've come a long way, but remember, we've only scratched the surface of NLP's possibilities. This field is vast and always evolving, promising endless innovation and opportunities. The future looks exciting!


Thank you so much for your valuable time!

If you have any questions or comments, feel free to reach out to me :)

Hi there! Let's connect and collaborate!

Here are some ways to reach me:

🔹 GitHub: github.com/mithindev

🔹 Twitter: twitter.com/MithinDev

🔹 LinkedIn: linkedin.com/in/mithindev

Looking forward to connecting with you!

PS: Ignore the hashes.


#NLPAdventure #PowerfulApps #CuttingEdgeTech #DeclutterLikeAPro #LinguisticDetectives #TransformingTasks #EndlessPossibilities #ExcitingFuture #NLPJourney #TechExploration #EfficientProcessing #InnovationUnleashed #MagicalNLP #StreamlinedDevelopment #InsightsUnlocked #IntelligentSoftware #LinguisticAnalysis #IncredibleResults
