Creating an Article Relevance Analyzer Powered by Natural Language Processing (NLP)
- A Natural Language Processing (NLP) App using Java
Introduction
Recently, I've dived into the fascinating world of ML and AI! As someone who enjoys devouring docs and articles, I find myself constantly reading and taking notes. However, I often end up wasting precious time on irrelevant ones! So, I decided to find a way to be more efficient and productive.
Imagine having an app that automatically scours popular articles, pinpointing the ones relevant to your interests. That's exactly what I aim to build: an article relevance analyzer!
In today's digital era, Natural Language Processing (NLP) has emerged as a powerful tool for engaging with intelligent software. Think of voice assistants like Google's Assistant, Apple's Siri, and Microsoft's Cortana, which can understand and respond to our everyday language. But have you ever wondered how this "smart" software actually works?
In this article, we'll explore the inner workings of these applications and even delve into developing our own natural language processing software.
So, get ready to join me on this journey as we dive into the process of building an article relevance analyzer!
Stanford NLP Library
Natural language processing (NLP) apps, just like other machine learning apps, rely on a combination of small, simple, and intuitive algorithms working together. To save time and effort, it's often best to use an external library that already implements and integrates these algorithms seamlessly.
In our example, I'll introduce you to the Stanford NLP library, a Java-based library for natural language processing that supports multiple languages.
One specific algorithm we'll focus on is the part-of-speech (POS) tagger. It's a clever algorithm that automatically labels each word in a text with its corresponding part of speech. By analyzing each word's characteristics and taking the context into account, the tagger does an awesome job of classifying the words in the text.
Now, I must admit, the nitty-gritty details of how the POS tagger algorithm works are a bit too much for this article. But don't worry! If you're curious and want to explore further, you can dive into the topic here to satisfy your hunger for knowledge.
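To make the idea concrete, here is a small, invented example of the kind of output a POS tagger produces (the exact tags depend on the model, but these follow the standard Penn Treebank conventions):

Stanford_NNP University_NNP is_VBZ located_VBN in_IN California_NNP ._.

Each word is paired with its tag: NNP marks a proper noun, VBZ a present-tense verb, VBN a past participle, and IN a preposition. Proper nouns (NNP) are exactly what we'll be hunting for later.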
Getting Started!
I'm assuming you've got the Java basics down, perhaps from your adventures with Data Structures and Algorithms! But hey, hold on a sec, I'm guessing you might not be familiar with setting up a Maven project. (Professionals, please forgive me for a moment!)
Set up a Java development environment: ensure that you have the Java Development Kit (JDK) installed on your system. You can download and install the JDK from the official Oracle website.
Create a new Maven project: open a command prompt or terminal window, navigate to the directory where you want to create your project, and execute the following command:
mvn archetype:generate -DgroupId=com.example -DartifactId=article-analyzer -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
This will create a new Maven project with the specified group ID (com.example) and artifact ID (article-analyzer).
Now, add the Stanford NLP library as a dependency. Since you are using Maven, simply include the following inside the <dependencies> section of your pom.xml file:
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
    <classifier>models</classifier>
</dependency>
Furthermore, since our app will scrape article content from web pages, we need to add two more dependencies:
<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.22</version>
</dependency>
Now you are good to go!
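For orientation, the quickstart archetype from the step above generates roughly the following layout (the exact files may vary slightly depending on the archetype version). The classes we write below will live under src/main/java/com/example:

article-analyzer/
├── pom.xml
└── src/
    ├── main/java/com/example/App.java
    └── test/java/com/example/AppTest.java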
Scraping and Cleaning
The first part of our analyzer will involve retrieving articles and extracting their content from web pages.
When retrieving articles from random sources, the pages are usually filled with extraneous information (embedded videos, outbound links, advertisements, etc.) that is irrelevant to the article itself. This is where Boilerpipe comes into play!
Boilerpipe is an extremely powerful and efficient algorithm for removing "clutter": it identifies the main content of an article by analyzing the different content blocks using features like the average sentence length, the types of tags used in each block, and the density of links. The Boilerpipe algorithm has proven to be competitive with other, much more computationally expensive algorithms, such as those based on machine vision.
To retrieve and clean articles, we can use the Boilerpipe library's built-in functions. Let's implement a function called extractFromUrl that takes a URL as input, uses Boilerpipe to fetch the HTML document, extracts the text, and cleans it.
Now, create the BoilerPipeExtractor class: create a new Java class named BoilerPipeExtractor.java under the src/main/java/com/example directory and copy the following code into it.
package com.example;

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerPipeExtractor {

    public static String extractFromUrl(String userUrl)
            throws java.io.IOException,
            org.xml.sax.SAXException,
            de.l3s.boilerpipe.BoilerpipeProcessingException {
        final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(userUrl));
        final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        return CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
    }
}
The Boilerpipe library offers various extractors that work based on the fantastic boilerpipe algorithm. One extractor we're particularly interested in is the ArticleExtractor, which is designed to give us the juiciest bits from HTML-formatted news articles.
ArticleExtractor pays close attention to the HTML tags used in each content block and also takes into account the density of outbound links. This makes it a much better fit for our task than the faster-but-simpler DefaultExtractor.
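If you ever want to trade a bit of accuracy for speed, swapping extractors is a one-line change in extractFromUrl (a sketch; DefaultExtractor is the simpler built-in alternative mentioned above):

// Faster but less precise than ARTICLE_EXTRACTOR:
return CommonExtractors.DEFAULT_EXTRACTOR.getText(doc);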
The best part? The built-in functions in the library handle everything for us seamlessly! Here's the breakdown:
HTMLFetcher.fetch: This function grabs the HTML document for us.
getTextDocument: With this function, we extract the text document from the HTML.
CommonExtractors.ARTICLE_EXTRACTOR.getText: And voila! This function uses the boilerpipe algorithm to extract the relevant text from the article.
With these functions at our disposal, we can effortlessly extract the most relevant content from HTML-formatted articles. It's like having a superpower!
Now, let's walk through an example to better understand how it works. To keep things simple, we'll use the Wikipedia page about chickens; you can find the page here. Feed this URL to the function and see what comes out.
Add the following code to your App class (which contains the main function):
package com.example;

public class App {

    public static void main(String[] args) throws java.io.IOException,
            org.xml.sax.SAXException,
            de.l3s.boilerpipe.BoilerpipeProcessingException {
        System.setProperty("console.encoding", "UTF-8");

        String urlString = "https://en.wikipedia.org/wiki/Chicken";
        String text = BoilerPipeExtractor.extractFromUrl(urlString);
        System.out.println(text);
    }
}
You should see the main body of the article as your output, without the ads, HTML tags, and outbound links. Here is the beginning of what I got when I ran it:
Female crouching (receptive posture) or stepping aside or running away (if unwilling to copulate)
Male mounting
Male treading with both feet on hen's back...
And that is indeed the main body of the article. It's hard to imagine this being much simpler to implement.
Tagging Parts of Speech
Now that you've successfully extracted the main article body, it's time to dig deeper and determine whether the article mentions topics that are of interest to you.
You might be tempted to do a basic string or regular expression search, but hold on a moment! There are some drawbacks to that approach.
Firstly, a simple string search could result in false positives. Imagine an article mentioning "Microsoft Excel" being tagged as mentioning "Microsoft." Not quite accurate, right?
Secondly, with regular expression searches, depending on how they're constructed, you might encounter false negatives.
Lastly, if you're dealing with a large number of keywords and processing numerous articles, searching through the entire text of every article for each keyword could take forever, resulting in poor performance. Time is precious!
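To make the first drawback concrete, here's a minimal, hypothetical sketch of the false-positive problem with naive substring matching (the text and keyword are invented for illustration):

String text = "The quarterly report was built entirely in Microsoft Excel.";
// A naive substring search flags this article as being "about Microsoft",
// even though it only mentions the Excel product in passing.
boolean mentionsMicrosoft = text.contains("Microsoft"); // true -> false positive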
Here's where Stanford's CoreNLP library comes to the rescue with its powerful features. It solves all three of these problems amazingly well.
For our analyzer, we'll employ the parts-of-speech (POS) tagger. This clever tool allows us to find all the proper nouns in the article and compare them against our list of keywords of interest.
By leveraging NLP technology, we not only improve the accuracy of our matching and minimize false positives and negatives, but we also significantly reduce the amount of text we need to compare, since proper nouns are only a small fraction of the article's content. Clever, isn't it?
To speed up the analysis, we'll store the extracted proper nouns in a data structure with a low membership-query cost. This way, we can quickly check whether any keyword of interest appears, without wasting precious time.
Stanford CoreNLP offers a super convenient tagger called MaxentTagger that can perform POS tagging with just a few lines of code. It's like having a helpful assistant handle the heavy lifting for you!
Here is a simple implementation:
package com.example;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.util.HashSet;

public class ArticleAnalyzer {

    private HashSet<String> article;
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
    private MaxentTagger tagger;

    public ArticleAnalyzer() {
        tagger = new MaxentTagger(modelPath);
    }

    public String tagPos(String input) {
        return tagger.tagString(input);
    }
}
The tagger function, tagPos, takes a string as input and outputs a string that contains the words of the original string along with their corresponding parts of speech. In your main function, instantiate an ArticleAnalyzer, feed the output of the scraper into the tagger function, and you should see tagged output along these lines (this particular sample came from a news article):
MILAN/PARIS_NN Italy_NNP 's_POS Luxottica_NNP -LRB-_-LRB- LUX.MI_NNP -RRB-_-RRB- and_CC France_NNP 's_POS Essilor_NNP -LRB-_-LRB- ESSI.PA_NNP -RRB-_-RRB- have_VBP agreed_VBN a_DT 46_CD billion_CD euro_NN -LRB-_-LRB- $_$ 49_CD billion_CD -RRB-_-RRB- merger_NN to_TO create_VB a_DT global_JJ eyewear_NN powerhouse_NN with_IN annual_JJ revenue_NN of_IN more_JJR than_IN 15_CD billion_CD euros_NNS ._. The_DT all-share_JJ deal_NN is_VBZ one_CD of_IN Europe_NNP 's_POS largest_JJS cross-border_JJ tie-ups_NNS and_CC brings_VBZ together_RB Luxottica_NNP ,_, the_DT world_NN 's_POS top_JJ spectacles_NNS maker_NN with_IN brands_NNS such_JJ as_IN Ray-Ban_NNP and_CC Oakley_NNP ,_, with_IN leading_VBG lens_NN manufacturer_NN Essilor_NNP ._. ``_`` Finally_RB ..._: two_CD products_NNS which_WDT are_VBP naturally_RB complementary_JJ --_: namely_RB frames_NNS and_CC lenses_NNS --_: will_MD be_VB designed_VBN ,_, manufactured_VBN and_CC distributed_VBN under_IN the_DT same_JJ roof_NN ,_, ''_'' Luxottica_NNP 's_POS 81-year-old_JJ founder_NN Leonardo_NNP Del_NNP Vecchio_NNP said_VBD in_IN a_DT statement_NN on_IN Monday_NNP ._. Shares_NNS in_IN Luxottica_NNP were_VBD up_RB by_IN 8.6_CD percent_NN at_IN 53.80_CD euros_NNS by_IN 1405_CD GMT_NNP -LRB-_-LRB- 9:05_CD a.m._NN ET_NNP -RRB-_-RRB- ,_, with_IN Essilor_NNP up_IN 12.2_CD percent_NN at_IN 114.60_CD euros_NNS ._. The_DT merger_NN between_IN the_DT top_JJ players_NNS in_IN the_DT 95_CD billion_CD eyewear_NN market_NN is_VBZ aimed_VBN at_IN helping_VBG the_DT businesses_NNS to_TO take_VB full_JJ advantage_NN of_IN expected_VBN strong_JJ demand_NN for_IN prescription_NN spectacles_NNS and_CC sunglasses_NNS due_JJ to_TO an_DT aging_NN global_JJ population_NN and_CC increasing_VBG awareness_NN about_IN...
Creating a Tagged Output Set
So far, we've accomplished quite a bit! We've developed functions to download, clean, and tag articles. But there's still one important task left: figuring out whether the article mentions any topics that are of interest to the user.
To tackle this, we need to gather all the proper nouns and then check whether any of our keywords appear among them. It's like searching for hidden gems within the text!
To find those proper nouns, we'll start by splitting the tagged output string into individual tokens, using spaces as delimiters. Then, for each token, we'll split it again on the underscore (_) and check whether the part of speech indicates a proper noun. It's a little detective work to identify those significant names!
Once we have gathered all the proper nouns, we want to store them in a data structure that suits our purpose. In this example, we'll opt for a HashSet. This data structure has some nifty characteristics: it ensures no duplicate entries and doesn't maintain the order of entries, but most importantly, it offers very fast (on average constant-time) membership queries. Since we only care about checking whether a word is present or not, the HashSet is the perfect fit for our needs. It's like having a supercharged tool to quickly find what we're looking for!
Here is the implementation that accomplishes this:
public static HashSet<String> extractProperNouns(String taggedOutput) {
    HashSet<String> propNounSet = new HashSet<>();
    String[] split = taggedOutput.split(" ");
    List<String> propNounList = new ArrayList<>();
    for (String token : split) {
        String[] splitTokens = token.split("_");
        if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
            propNounList.add(splitTokens[0]);
        } else {
            if (!propNounList.isEmpty()) {
                propNounSet.add(StringUtils.join(propNounList, " "));
                propNounList.clear();
            }
        }
    }
    if (!propNounList.isEmpty()) {
        propNounSet.add(StringUtils.join(propNounList, " "));
        propNounList.clear();
    }
    return propNounSet;
}
The function returns a set containing both individual proper nouns and runs of consecutive proper nouns joined by spaces (e.g., "Leonardo Del Vecchio").
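Here's a quick sanity check of that behavior, calling the method we just wrote on a small tagged fragment modeled on the sample output above (the exact print order may vary, since a HashSet is unordered):

String tagged = "Luxottica_NNP 's_POS founder_NN Leonardo_NNP Del_NNP Vecchio_NNP spoke_VBD ._.";
System.out.println(extractProperNouns(tagged));
// Prints something like: [Luxottica, Leonardo Del Vecchio]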
Articles vs. Proper Nouns Comparison
Guess what? We're almost there!
In the previous parts, we created some awesome stuff! First, we built a scraper that can download and extract the main text of an article. Then, we made a tagger that can analyze the article and find proper nouns. Finally, we developed a processor that takes those tagged proper nouns and puts them into a HashSet.
Now, the last step is super easy! We just need to take that HashSet and compare it with our list of keywords.
To make it happen, update your ArticleAnalyzer class so that it looks like this:
package com.example;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.util.StringUtils;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

import static com.example.BoilerPipeExtractor.extractFromUrl;

public class ArticleAnalyzer {

    // Keywords the user cares about.
    private HashSet<String> article;
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
    private MaxentTagger tagger;

    public ArticleAnalyzer() {
        tagger = new MaxentTagger(modelPath);
        article = new HashSet<>();
    }

    // Tags every word in the input with its part of speech.
    public String tagPos(String input) {
        return tagger.tagString(input);
    }

    // Collects single and consecutive proper nouns (NNP) from the tagged output.
    public static HashSet<String> extractProperNouns(String taggedOutput) {
        HashSet<String> propNounSet = new HashSet<>();
        String[] split = taggedOutput.split(" ");
        List<String> propNounList = new ArrayList<>();
        for (String token : split) {
            String[] splitTokens = token.split("_");
            if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
                propNounList.add(splitTokens[0]);
            } else {
                if (!propNounList.isEmpty()) {
                    propNounSet.add(StringUtils.join(propNounList, " "));
                    propNounList.clear();
                }
            }
        }
        if (!propNounList.isEmpty()) {
            propNounSet.add(StringUtils.join(propNounList, " "));
            propNounList.clear();
        }
        return propNounSet;
    }

    // Registers a keyword of interest.
    public void addKeyword(String keyword) {
        article.add(keyword);
    }

    // Case-insensitive check: does any keyword appear among the article's proper nouns?
    public boolean areKeywordsMentioned(HashSet<String> articleProperNouns) {
        HashSet<String> lowercaseArticle = new HashSet<>();
        for (String topics : article) {
            lowercaseArticle.add(topics.toLowerCase());
        }
        for (String properNoun : articleProperNouns) {
            if (lowercaseArticle.contains(properNoun.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    // Full pipeline: scrape, tag, extract proper nouns, compare with keywords.
    public boolean analyzeArticle(String urlString) throws IOException, SAXException, BoilerpipeProcessingException {
        String articleText = extractFromUrl(urlString);
        String tagged = tagPos(articleText);
        HashSet<String> properNounsSet = extractProperNouns(tagged);
        return areKeywordsMentioned(properNounsSet);
    }
}
Connecting the Dots
Finally, we can use the app!
App.java
package com.example;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import org.xml.sax.SAXException;

import java.io.IOException;

public class App {

    public static void main(String[] args) throws IOException, SAXException, BoilerpipeProcessingException {
        ArticleAnalyzer analyzer = new ArticleAnalyzer();

        // Register the topics we care about, then analyze an article.
        analyzer.addKeyword("chicken");
        boolean mentioned = analyzer.analyzeArticle("https://en.wikipedia.org/wiki/Chicken");

        if (mentioned) {
            System.out.println("Article is relevant! You might learn something.");
        } else {
            System.out.println("Article is irrelevant!");
        }
    }
}
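One convenient way to compile and run the app from the project root is via the Maven exec plugin (a sketch; adapt it to your IDE or build setup if you prefer):

mvn compile exec:java -Dexec.mainClass="com.example.App"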
Building an NLP App -> Easy Peasy!
We've just finished our wild adventure in the magical world of Natural Language Processing (NLP). Hold on tight and let's recap the highlights!
In this article, we explored the process of building a practical NLP app. Step by step, we uncovered how to extract valuable insights from articles downloaded from URLs. Trust me, it was quite an adventure!
On our journey, we used tools like Boilerpipe and Stanford NLP to streamline development and achieve remarkable results. With Boilerpipe, we decluttered web pages like pros, extracting only the most relevant text. Talk about a clean sweep!
And then there's Stanford NLP, the language-processing wizard. It helped us perform linguistic analysis, tagging parts of speech and spotting references to the topics we care about. Holmes and Watson would be proud!
By combining these technologies, we transformed a complex task into an efficient process. We've come a long way, but remember, we've only scratched the surface of NLP's possibilities. This field is vast and always evolving, promising endless innovation and opportunities. The future looks exciting!
Thank you so much for your valuable time.
If you have any questions or comments, feel free to reach out to me :)
Hi there! Let's connect and collaborate!
Here are some ways to reach me:
GitHub: github.com/mithindev
Twitter: twitter.com/MithinDev
LinkedIn: linkedin.com/in/mithindev
Looking forward to connecting with you!
PS: Ignore the hashes.
#NLPAdventure #PowerfulApps #CuttingEdgeTech #DeclutterLikeAPro #LinguisticDetectives #TransformingTasks #EndlessPossibilities #ExcitingFuture #NLPJourney #TechExploration #EfficientProcessing #InnovationUnleashed #MagicalNLP #StreamlinedDevelopment #InsightsUnlocked #IntelligentSoftware #LinguisticAnalysis #IncredibleResults