INSTANT OR SLOW SCRAPING?
A. Collect images and text using a PHP proxy

Chinese is hard to learn, and Chinese words are hard to look up in dictionaries. A tool called an annotator helps with this: you copy and paste a string of Chinese text into it and get back the individual words translated into English (or another language).

Now suppose you find a comic-book edition of the novel "Romance of the Three Kingdoms" with explanations in Chinese, and you want to use an annotator on it. Unfortunately, most of the "3 kingdoms" picture books on the internet store the text as part of the picture, so it cannot be copied as a string. But I happened to find the following web site: http://www.e3ol.com. Each of its pages has only one picture and one block of "real" text. On these pages you can copy the text with the mouse, but it is very difficult to copy the image, because every click on an image switches to the next image.

To capture images and text easily, I used jQuery and a little PHP to create a PHP proxy web page. It gets the source code of a page, finds the URL of the image, displays the image and extracts the description text for the picture. After that, the button "CLICK ME (add annotation)" appears; clicking it adds a simplified annotation table underneath the description lines.
NOTE: A Chinese font needs to be installed on your PC. If you are using "Windows 7", just right-click the font file and select "Install this font". Finally, you can save the web page as a PDF file. If you embed this PDF in a web page, you can copy the text from it. The sad news is that the site e3ol.com only includes the first few chapters of the novel.

B. Instant or Slow scraping

After reading a few examples in the ebook "Web Scraping with PHP - Instant PHP Web Scraping", I found them very useful. I use the name "slow scraping" for scraping in which the program is changed to work more slowly so that the user's IP is not blocked. On this site I do not add delays between requests, both because that is harder to program and because many users do not like to wait. I chose another method of "slow scraping". It is very simple, basic and primitive. Instead of sending the request directly to the website, for example "http://target.com", as below:

    <?php $html = file_get_contents('http://target.com'); // s1 ?>

I use the Firefox browser to save the page as "local-target-com.html" (select "Web Page, HTML only"). Then in my program I only need to change the statement above to:

    $html = file_get_contents('local-target-com.html'); // s2

Indeed, while I was writing PHP programs to scrape "http://www.e3ol.com/picture/html/1/1/../1_1_...shtml#pictop", my IP was blocked because I used the "Get Data" button (which calls file_get_contents()) too many times. Fortunately, I could still open other pages of the website in the browser. So I saved many pages (from 9 to 43) and put them in the zip file for you to try. In contrast, the same program also calls file_get_contents() directly (instant scraping) for "https://www./chinese/dictionary", and that site seemed less likely to block me.
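The change from statement s1 to s2 could also be wrapped in a small helper, so the rest of the program does not care whether a page has been saved yet. This is only a sketch under assumptions: `load_saved_page()` is a made-up name that does not exist in slow-scraping.php, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not in the original slow-scraping.php):
// read a page that was saved from the browser instead of requesting it again.
function load_saved_page(string $path): ?string
{
    // is_readable() avoids the warning file_get_contents() raises for a missing file.
    if (!is_readable($path)) {
        return null;
    }
    $html = file_get_contents($path);
    return $html === false ? null : $html;
}

// Usage: create a stand-in for a saved page, then load it.
$demo = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($demo, '<html><body>saved copy</body></html>');

$html = load_saved_page($demo);
echo $html === null ? "not saved yet\n" : strlen($html) . " bytes loaded\n";

unlink($demo);
```

Returning null for a missing file lets the caller decide what to do (for example, ask the user to save that page first) rather than silently sending a new request to the site.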
C. What's in the files "slow-scraping.zip" and "index.html"

IN THE ZIP FILE (about 1 MB):
- index.html, slow-scraping.php
- 35 HTML files saved with Firefox, from e3ol-page9.html to e3ol-page43.html (to be used with file_get_contents())
- 3kindoms-chapter1-en.pdf (Chapter 1 for you to read)
- 2 images (jpg)

You can download this file from: https://www.mediafire.com/file/b47m7h1zk7n7db7/slow-scraping.rar

IN index.html: the PDF file is embedded. Read it, then try to use the annotations to translate the descriptions from Chinese to English. You can type your English translation into the "textarea" placed below each description. At the bottom of "index.html" you can copy a Chinese description, paste it into the embedded "https://www.bing.com/translator" and hear how it sounds in Chinese. Finally, you can look at the pictures from the Chinese ebooks "San Guo Yan Yi".

To scrape HTML we can use the PHP DOMDocument class, the PHP DOMXPath class, or the function preg_match with regular expressions. For example, in my file "slow-scraping.php":

    $dom_doc = new DOMDocument();
    libxml_use_internal_errors(TRUE); //disable libxml errors
    if(!empty($html)){ //if any html is actually returned
        $dom_doc->loadHTML($html);
        libxml_clear_errors(); //remove errors
        $dom_xpath = new DOMXPath($dom_doc);

        // SCRAPING local-e3ol-com.html
        $chinese = $dom_xpath->query('//div[@class="div_clear text"]');
        $mandarin = $chinese->item(0)->nodeValue;
        $image = $dom_xpath->query('//div[@class="div_clear con_pic"]/img[@src]');
        $imgURL = "http://www.e3ol.com/". $image->item(0)->getAttribute("src");

        // SCRAPING https://www.mdbg.net
        $url1 = 'https://www.mdbg.net/chinese/dictionary?page=worddict&wdrst=0&wdqb='.urlencode($mandarin);
        $html1 = file_get_contents($url1);
        preg_match("/<body[^>]*>(.*?)<\/body>/is", $html1, $matches);
        $body = $matches[1];
    }
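To see what the two XPath queries return, they can be tried on a tiny piece of markup. The sketch below is an illustration only: the sample HTML is invented, and only the class names "div_clear text" and "div_clear con_pic" are taken from the queries in slow-scraping.php. It also checks the length of each result before calling item(0), which the original excerpt does not do.

```php
<?php
// Invented sample markup; only the class names come from the original queries.
$html = '<html><body>'
      . '<div class="div_clear con_pic"><img src="picture/demo.jpg"></div>'
      . '<div class="div_clear text">Sample description</div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$textNodes  = $xpath->query('//div[@class="div_clear text"]');
$imageNodes = $xpath->query('//div[@class="div_clear con_pic"]/img[@src]');

// query() returns a DOMNodeList; item(0) is null when nothing matched,
// so test the length first.
$text = $textNodes->length > 0 ? trim($textNodes->item(0)->nodeValue) : '';
$src  = $imageNodes->length > 0 ? $imageNodes->item(0)->getAttribute('src') : '';

echo $text, "\n";   // prints "Sample description"
echo $src, "\n";    // prints "picture/demo.jpg"
```

Note that `@class="..."` compares the whole attribute value, so a div whose class list is in a different order or has an extra class would not match.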
- To use this web page, you need to have "XAMPP" (a server package with Apache, MySQL and PHP) installed on your PC.
- Open the folder "slow-scraping", right-click the file "index.html" and select "Open with Firefox".
- In the address bar, replace: file:///D:/xampp/htdocs/slow-scraping/index.html -> localhost/slow-scraping/index.html.
- Press the "End" key, then press "Enter".
- If your PC is slow, wait until you see 35 buttons (from 9 to 43, one for each of the 35 pages).
- In the text box, page 9 has been selected automatically. Just click the "Get Data" button.
You will see a jpg image and the description lines appear.
- Next, click "Click Me (add annotation)". The next page (page 10) is then selected in the text box.
- Continue clicking "Get Data" . . .
- After getting the data from the last page (page 43), do two things:
1- Save the page as "Web Page, complete" (all jpg images are saved too).
2- Click the "Show less to print as PDF" button, then print the page as a PDF file.
When done, click the button "Click here to return to the main page".
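Before starting the steps above, it may help to confirm that all 35 saved pages are really in the folder. The following is a hypothetical pre-flight check, not something included in the download: `missing_pages()` is an invented name, and the file-name pattern is the one index.html builds ('e3ol-page' + number + '.html').

```php
<?php
// Hypothetical pre-flight check (not part of the original zip file).
// Builds the expected names e3ol-page9.html ... e3ol-page43.html
// and returns those that are not found in the given folder.
function missing_pages(string $dir, int $first = 9, int $last = 43): array
{
    $missing = [];
    foreach (range($first, $last) as $n) {
        $name = "e3ol-page{$n}.html";
        if (!is_file($dir . DIRECTORY_SEPARATOR . $name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Usage: check the folder this script is stored in.
$missing = missing_pages(__DIR__);
echo count($missing) === 0
    ? "All saved pages are present.\n"
    : "Missing: " . implode(', ', $missing) . "\n";
```

Saved as a separate .php file inside the "slow-scraping" folder, it can be opened through localhost in the same way as index.html.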
SOURCE CODE OF index.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="generator" content="PSPad editor, www.pspad.com">
<title>Slow Scraping</title>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<style>
body{ background-color:#CCCC99; font-family:Arial; font-size:10pt; margin:0px; padding:0px; }
#area1{ width:980px; height:300px; }
#div2{ width:980px; height:300px; border:1px gray solid; overflow:auto; }
#pre1{ width:980px; height:400px; overflow:auto; }
#areaDataB, #imageDiv{ padding:20px; }
/* an id that starts with a digit cannot be written as #3k-image */
[id="3k-image"]{ border:1px solid gray; }
button{ background-color:#DDDDDD; }
.phpBut{ width:40px; }
.tab5 td{ vertical-align:top; padding:3px; }
.cell0{ width:50px; }
.cell1{ width:70px; }
.cell2{ width:620px; }
.cell3{ width:60px; }
.onePage{ }
.txtArea5{ font-family:Arial; font-size:11pt; }
</style>
</head>
<body style="background-color:#CCCC99" onload="init()">
<div align="center">
<div class="more">
  <div style="font-family:Times;width:100%;height:34px; background-color:#336633;color:white;font-size:20pt;">
    <b><i>Instant or Slow scraping?</i></b>
  </div><br>
  <pre style="background-color:white;padding:20px;width:640px;font-size:12pt;font-family:Arial;text-align:left;border:1px solid gray">
Everyone likes instant scraping, but it is more likely to get the
user's IP blocked or blacklisted. The usual advice for avoiding a
block is to make scraping look like normal browsing. There is
another way that is very easy to understand: use a web browser to
save the source code. That way we "avoid multiple requests" to a
web site that does not welcome scraping.
</pre>
  <div style="width:640px;height:230px;border:3px red solid;background:#FFFFCC;text-align:left;padding:15px">
    <b><i>- To use this web page, you need to have "XAMPP" (a server package with Apache, MySQL and PHP) installed<br>
    on your PC<br>
    - Open the folder "slow-scraping", right-click the file "index.html", select "Open with Firefox".<br>
    - Replace: <span style="color:#3333CC">file:///D:/xampp/htdocs</span>/slow-scraping/index.html -&gt; <span style="color:#3333CC">localhost</span>/slow-scraping/index.html. <br>
    - Press the "End" key, then press "Enter".<br>
    - If your PC is slow, wait until you see 35 buttons (from 9 to 43, one for each of the 35 pages)<br>
    - In the text box, page 9 has been selected automatically. Just click the "Get Data" button.<br>
    You will see a jpg image and the description lines appear.<br>
    - Next click "Click Me (add annotation)". Then, in the text box, the next page is selected (page 10).<br>
    - Continue clicking "Get Data" . . .<br>
    - After getting the data from the last page (page 43), do two things:<br>
    1- Save the page as "Web Page, complete" (all jpg images are saved too).<br>
    2- Click the "Show less to print as PDF" button, then print the page as a PDF file.<br>
    When done, click the button "Click here to return to the main page".
    </i></b>
  </div>
  <br>
  <div id="buttonDiv" style="width:880px; height:50px;overflow:auto; text-align:center;">
    Please wait until the buttons appear here!
  </div>
  <span style="color:red"><b>NOTE:</b></span> Don't try to type in the box below (it has the "readonly" attribute)
  <br> <br>
  <form>
    <input type="text" id="urlSt" name="urlSt" readonly value="e3ol-page9.html"
      style="background-color:white;padding:4px;width:550px;height:32px;overflow:auto;border:1px gray solid;text-align:left">
    <br><br>
    <span id="waitSP" style="color:red;font-weight:bold"></span>
    <button onclick="getData();return false">Get Data</button>
  </form>
  <button id="addAnnotation" onclick="simplifyTable()" style="color:red;display:none">
    Click Me <span style="color:black">(add annotation)</span></button>
</div> <!-- class = "more" -->
<br>
<div id="imageDiv" style="width:800px;border:1px solid gray;background-color:#FFFFCC;">
  <button id="moreBut" onclick="moreLess()">Show less to print as PDF</button>
  <br>
</div>
<div class="more">
  <textarea id="area5" style="display:none;width:1000px;height:300px;">
  </textarea>
  <br>
  <b style="font-size:16pt">INTERESTING EMBEDDED WEB PAGES</b>
  <br><br>
  <span style="color:red;font-weight:bold">NOTE:</span> In the embedded pages, apart from the few buttons or links you need, <br>
  it is generally best not to click on other buttons or links, because it is easy to get lost.
  <br><br>
  <div style="text-align:left;border:3px solid #666666;width:500px;height:65px;padding:8px;background:#FFFFCC">
    <span style="color:green;font-weight:bold">https://www.bing.com/translator</span><br>
    Usage: Copy the description of one of the pages above (in Chinese)<br>
    Paste it here. Finally click the button with the "speaker image" to listen.
  </div>
  <br>
  <iframe id="bing-com" style='width:880px;height:410px;background-color:white' src="https://www.bing.com/translator">
  </iframe>
  <br><br>
  <span style="color:green;font-weight:bold">First chapter of Romance of the Three Kingdoms</span> (local PDF file)
  <br><br>
  <iframe src="3kindoms-chapter1-en.pdf" width="880" height="360" style="background-color:#CCCC99;">
    <p><b>Example fallback content</b>: This browser does not support PDFs. Please download the PDF to view it:
    <a href="3kindoms-chapter1-en.pdf">Download PDF</a>.</p>
  </iframe>
  <br> <br>
  <div style="text-align:left;border:3px solid #666666;width:500px;height:80px;padding:8px;background:#FFFFCC">
    <span style="color:green;font-weight:bold">http://www.readers365.com/sgyylhh/1-000.htm</span><br>
    You can select one of the 60 ebooks by clicking on the links located at the top of each book<br>
    Look at the pictures and, if you can, read the commentaries (they are in traditional Chinese)<br>
  </div><br>
  <iframe id="pictures" style='width:880px;height:520px;background-color:white' src="http://www.readers365.com/sgyylhh/1-000.htm">
  </iframe>
  <br> <br><br>
  <div style="width:100%;height:26px; background-color:#336633;color:white;padding-top:6px">
    <b><i>Return to my blog (http://phanhung20.blogspot.com/)</i></b>
    <button onclick="window.location = 'http://phanhung20.blogspot.com';">OK</button>
  </div>
</div> <!-- more2 -->
</div> <!-- center -->
<script>
var moreSt = true; // global
var curPage = 9;

// Called by a page button: select the saved file for that page.
function phpBut(thi){
  var nn = parseInt(thi.innerHTML, 10);
  curPage = nn;
  var url = 'e3ol-page' + nn + '.html';
  $('#urlSt').val(url);
}

// Post the selected file name to slow-scraping.php and append the result.
function getData(){
  $('#waitSP').html('PLEASE WAIT!');
  var data = $('form:first').serialize();
  $.post(
    'slow-scraping.php',
    data,
    function(response){
      // The PHP script starts its reply with the 7 characters "Success".
      var xx = response.substring(0,7);
      var response1 = response.replace(/^.{7}/g,'');
      $('#waitSP').html('');
      if(xx == 'Success'){
        alert(xx + '!');
        $('#addAnnotation').show();
        if(curPage >= 43){curPage = 8}  // after the last page, start again at 9
        curPage++;
        var uu = 'e3ol-page' + curPage + '.html';
        $('#urlSt').val(uu);
        $('#imageDiv').append(response1 + '<br><hr size="4" color="#669900" width="780" style="margin:0px">');
      }else{
        alert('ERROR!');
      }
    },
    'html'
  );
}

function init(){
  // $('#addAnnotation').hide();
  $('#areaData').val('');
  $('#areaDataB').val('');
  makeButtons();
}

// Create one button per saved page (9 to 43).
function makeButtons(){
  var st = '';
  for(var i=9;i<44;i++){
    st += '<button class="phpBut" onclick="phpBut(this)">' + i + '</button>';
    if(i%26 == 0){st += '<br>'}
  }
  $('#buttonDiv').html(st);
}

// Turn the hidden dictionary result (#div2) into a simplified annotation table.
function simplifyTable(){
  $('table.wordresults th').eq(2).remove();
  $('#addAnnotation').hide();
  $('table.wordresults th').eq(3).html('English');
  $('table.wordresults tr').each(function(){
    $(this).find('td').last().css('width', '30px');
    $(this).find('td').eq(2).remove();
  });
  var mm = '', xx = '', zz = '';
  $('tr.row').each(function(){
    var yy = $(this).find('td:first').html();
    if(yy != ''){
      zz = yy;
      xx = $(this).find('td').eq(1).find('div.pinyin span').eq(0).html();
      if($(this).find('td').eq(1).find('div.pinyin span').eq(1).length > 0){
        xx = xx + ' ' + $(this).find('td').eq(1).find('div.pinyin span').eq(1).html();
      }
    }
    mm += ' <span style="color:blue">' + zz + '</span>' + '[' + xx + ']';
  });
  $('#div0').html(mm);
  var pp = '';
  $('tr.row').each(function(){
    zz = '';
    $(this).find('td').each(function(idx, va){
      zz += $(this).html() + '</td><td class="cell' + (idx+1) + '">';
    });
    pp += '<tr><td class="cell0">' + zz + '</td></tr>';
  });
  var qq = '<table class="tab5" border="1" cellpadding="0" cellspacing="0" width="780" style="background-color:white;border-collapse:collapse;">' + pp + '</table>';
  qq = '<textarea class="txtArea5" style="width:800px;height:100px"></textarea><br><br>' +
       '<div class="div4tab5" style="width:800px;height:240px;overflow:auto">' + qq + '</div>';
  $('div.onePage:last').append(qq);
  $('.tab5:last tr').each(function(){
    $(this).find('td:last').remove();
  });
  $('.tab5 a').on("click", function (e) { e.preventDefault(); });
  if($('#div2').length > 0){ $('#div2').remove(); }
}

// Hide or show everything except the collected pages (for printing as PDF).
function moreLess(){
  if(moreSt){
    $('.more').hide();
    $('.div4tab5').css('height','');
    document.body.style.backgroundColor = "#FFFFCC";
    $('#imageDiv').css("border", "none");
    moreSt = false;
    $('#moreBut').html('Click here to return to the main page');
  }else{
    $('.more').show();
    $('.div4tab5').css('height','240px');
    document.body.style.backgroundColor = "#CCCC99";
    $('#imageDiv').css("border", "1px solid gray");
    moreSt = true;
    $('#moreBut').html('Show less to print as PDF');
  }
}
</script>
</body>
</html>
SOURCE CODE OF slow-scraping.php
<?php
$url = isset($_POST['urlSt']) ? $_POST['urlSt'] : '';

// file_get_contents() does not throw an exception on failure; it returns false
// and raises a warning. So the return value is tested instead of using try/catch.
$html = ($url === '') ? false : @file_get_contents($url);
if ($html === false) {
    // Anything that does not start with "Success" makes index.html show "ERROR!".
    echo "<p>Could not gather data!!!</p>";
    exit;
}
$success = 'Success';

$dom_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
    $dom_doc->loadHTML($html);
    libxml_clear_errors(); //remove errors
    $dom_xpath = new DOMXPath($dom_doc);

    // SCRAPING local-e3ol-com.html
    $chinese = $dom_xpath->query('//div[@class="div_clear text"]');
    $mandarin = $chinese->item(0)->nodeValue;
    $image = $dom_xpath->query('//div[@class="div_clear con_pic"]/img[@src]');
    $imgURL = "http://www.e3ol.com/". $image->item(0)->getAttribute("src");

    // SCRAPING https://www.mdbg.net
    $url1 = 'https://www.mdbg.net/chinese/dictionary?page=worddict&wdrst=0&wdqb='.urlencode($mandarin);
    $html1 = file_get_contents($url1);
    preg_match("/<body[^>]*>(.*?)<\/body>/is", $html1, $matches);
    $body = isset($matches[1]) ? $matches[1] : ''; // empty if no <body> was found

    // Build the block that index.html appends to #imageDiv
    $ss = $imgURL.'<br>';
    $ss .= '<br><img id="3k-image" style="border:1px solid gray" src="'.$imgURL.'"><br><br>';
    $ss .= '<div class="mandarinSp" style="text-align:left">'.$mandarin.'<br></div>';
    $ss .= '<div id="div2" style="display:none;width:750px; height:500px;overflow:auto;background-color:white;text-align:left">'.$body.'</div><br>';
    echo $success.'<br><div class="onePage">'.$ss.'</div>';
}
?>
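Because slow-scraping.php passes the posted value straight to file_get_contents(), any file name or URL sent to it would be read. One possible safeguard, shown here only as a sketch, is to accept just the names that index.html can produce. `is_allowed_page()` is a hypothetical function that is not in the original script, the range 9 to 43 is taken from the saved files described above, and the scalar type declarations need PHP 7 or later.

```php
<?php
// Hypothetical guard (not in the original script): accept only the
// file names index.html can send, e3ol-page9.html ... e3ol-page43.html.
function is_allowed_page(string $name): bool
{
    // \z anchors at the very end, so a trailing newline is rejected too.
    if (!preg_match('/^e3ol-page(\d{1,2})\.html\z/', $name, $m)) {
        return false;
    }
    $n = (int) $m[1];
    return $n >= 9 && $n <= 43;
}

// Usage: a few sample values and the decision for each.
$samples = ['e3ol-page9.html', 'e3ol-page44.html', '../secret.txt', 'http://target.com'];
foreach ($samples as $name) {
    echo $name, ' => ', is_allowed_page($name) ? 'allowed' : 'rejected', "\n";
}
```

In the script, such a check would run before file_get_contents($url), and a rejected name would be answered with the same "Could not gather data" message.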