Extracting text from HTML in Python: a very fast approach


Source: Rushter.com

When working with NLP problems, sometimes you need a large corpus of text. The internet is the biggest source of text, but unfortunately extracting text from arbitrary HTML pages is a hard and painful task.

Let’s suppose we need to extract the full text from various web pages and want to strip all HTML tags. The usual default is the get_text method from the BeautifulSoup package, which can use lxml as its parser. It’s a well-tested solution, but it can be very slow wh …
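
To make the task concrete, here is a minimal sketch of tag stripping that uses only the standard library. It is not the BeautifulSoup approach mentioned above (that would be a call along the lines of BeautifulSoup(html, "lxml").get_text()), and it says nothing about speed. The names TextExtractor and extract_text are hypothetical, chosen for illustration, and skipping the contents of script and style elements is an assumption about what "full text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of any library.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        # Join the collected pieces with single spaces.
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string (assumed already decoded)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

Joining pieces with a single space is a simplification: it keeps words from adjacent elements apart, but it can also insert a space between an inline tag and the punctuation that follows it.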