How to scrape content from a website using PHP?

Introduction:

Have you ever wanted to get data from another website, but there was no API available for it? This is where web scraping comes in: if the data is not exposed through an API, we can simply scrape it off the site.

What is Web Scraping?

To put it briefly, web scraping is the process of retrieving data from a web document. All the work is done by a piece of code called a scraper. First, it sends a “fetch” request to a specific website. It then parses the HTML document returned in the response. Once this is done, the scraper searches for the required data within the document and converts it into a specific format.

So yes, web scraping allows us to extract information from websites. But keep in mind that there are some legal questions around web scraping. Some site owners consider it an attack on the site whose data is being scraped. Therefore, it is advisable to read the terms of use of the specific website you want to scrape, because otherwise you may do something illegal without knowing it.

There are many web scraping techniques, but here I will explain two that are used to scrape data from web documents.

1. Document parsing – parsing HTML or XML documents into, for example, a DOM (Document Object Model). PHP offers the DOM extension for this.

2. Regular expressions for scraping Web Documents.

What is Document parsing?

Document parsing is the process of converting HTML into a DOM (Document Object Model) that we can traverse.

What are Regular expressions?

A regular expression is a sequence of symbols and characters expressing a string or a pattern to be searched for within a longer piece of text.

i. Document parsing example

Since scraping is a delicate matter, it is always better to get permission from a website's owners before scraping their data. To be safe from a legal perspective, we use our own website for this tutorial. First, we need the content of the page we want to scrape; we can get this with PHP's file_get_contents() function.
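A minimal sketch of this first step; the URL below is a placeholder for a page you have permission to scrape:

// Fetch the raw HTML of the page as a string.
$html = file_get_contents('https://example.com/');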

Next, we declare a new DOMDocument. It is used to transform the HTML string returned by file_get_contents() into a true document object model that we can traverse.
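Continuing the sketch above:

// Create an empty DOM document that will hold the parsed HTML.
$doc = new DOMDocument();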

Now we need to suppress the libxml errors so that they are not printed to the screen but stored in memory instead.
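That is a single call to libxml_use_internal_errors():

// Keep libxml parsing errors in memory instead of printing them to the screen.
libxml_use_internal_errors(true);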

Then we check that real HTML code was actually returned and use the loadHTML() method on the DOMDocument instance we created earlier to load it, passing the returned HTML string as the argument.
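Something like the following, using the $html and $doc variables from the earlier snippets:

// Only parse if file_get_contents() actually returned HTML.
if ($html !== false && $html !== '') {
    $doc->loadHTML($html);
}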

Now we have to clear any errors that were collected. Most of the time, real-world HTML triggers unpleasant parsing errors. Common examples are inline styles (style attributes embedded in elements), invalid attributes, and invalid elements. Elements and attributes are considered invalid if they are not part of the HTML specification for the DOCTYPE declared in the page. Next, we declare an instance of DOMXPath. This allows us to run queries against the DOM document we have created; it takes the DOMDocument instance as an argument.
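Continuing the same sketch:

// Discard the errors collected while parsing messy real-world HTML.
libxml_clear_errors();

// DOMXPath lets us run XPath queries against the parsed document.
$xpath = new DOMXPath($doc);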


Finally, we simply write the query for the specific elements we want to get. If you have used jQuery before, this process is similar to selecting DOM items. Here we select all the H2 tags that have an id (the comments title, i.e. "Leave a Reply") and the H1 for the post title; the double slash // at the start of the expression matches the H2 and H1 wherever they appear in the document. The value of the id does not matter: as long as the element has an id, it is selected. The nodeValue property contains the text of each selected node. You can find the full code below.
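Here is a sketch of the full code for these steps. The URL and the exact XPath expressions (//h1 for the post title, //h2[@id] for the comments title) are assumptions based on the description above, not the original listing:

<?php
// Fetch the page we have permission to scrape (placeholder URL).
$html = file_get_contents('https://example.com/sample-post/');

// Turn the HTML string into a DOM we can traverse, keeping libxml noise in memory.
$doc = new DOMDocument();
libxml_use_internal_errors(true);

if ($html !== false && $html !== '') {
    $doc->loadHTML($html);
}

libxml_clear_errors();

// Query the document with XPath.
$xpath = new DOMXPath($doc);

// The post title: every H1 anywhere in the document.
foreach ($xpath->query('//h1') as $node) {
    echo $node->nodeValue . PHP_EOL;
}

// The comments title (e.g. "Leave a Reply"): every H2 that carries an id attribute.
foreach ($xpath->query('//h2[@id]') as $node) {
    echo $node->nodeValue . PHP_EOL;
}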

ii. Scraping with regular expressions example

Again, we need the content of the page we are scraping. We can get it with file_get_contents() as above, saving the HTML contents of our homepage as a string in the variable $html.
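As a sketch, again using a placeholder URL:

// Save the homepage HTML as a string in $html.
$html = file_get_contents('https://example.com/');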

In this example, we have to decide what we want to extract; in this case, we only want the images on our homepage. We use the preg_match_all() function to extract the matching data with a regular expression.
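One possible pattern for this, assuming we want the src attribute of every img tag (the exact regular expression is an assumption, not the original author's):

// Match every <img ... src="..."> and capture the value of the src attribute.
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);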

Now we'll just add a counter to see how many matches we got back.
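For example:

// $matches[1] holds the captured src values; count them.
echo count($matches[1]) . ' images found';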

Here is what the final code looks like:
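A sketch of the full regular-expression version, with the same assumed URL and pattern as above:

<?php
// Fetch the homepage HTML as a string (placeholder URL).
$html = file_get_contents('https://example.com/');

// Extract every image source with a regular expression.
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

// Report how many images were found and list their URLs.
echo count($matches[1]) . " images found\n";
foreach ($matches[1] as $src) {
    echo $src . "\n";
}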

Conclusion:

We learned some basic concepts of web scraping with PHP.  Do not forget to use your knowledge responsibly and always ask for permission before scraping any web pages.
