Getting a Web Page's Character Encoding in Java

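When a crawler fetches a page from the web, you need to know the page's original character encoding to display it correctly; otherwise the text comes out garbled. The first step is fetching the page content, and the simplest way to do that is the JDK's built-in HttpURLConnection class. For more complex crawling, use an open-source crawler framework such as Crawler4j, Web-Harvest, JSpider, WebMagic, Heritrix, or Nutch. This article is not about crawling as such; fetching page content just happens to require it, so I mention the frameworks in passing and leave the details for you to explore. To keep things simple, I use the JDK's HttpURLConnection class to fetch the page. Sample code: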

   Java code

package com.yida.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
/**
 * Fetches the content of a web page.
 * @author Lanxiaowei
 *
 */
public class FetchWebPageTest {
    public static void main(String[] args) throws IOException {
        // Hard-coded here; determining it automatically is the topic of this article
        String charset = "UTF-8";
        String line;
        StringBuilder buffer = new StringBuilder();
        URL url = new URL("http://www.baidu.com");
        // Open a connection to the URL
        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
        try {
            // Get the server's response code
            int responseCode = urlConnection.getResponseCode();
            String contentType = urlConnection.getContentType();
            // Print the Content-Type value; the page encoding can be extracted from it
            System.out.println("content-type:" + contentType);
            if (responseCode == HttpURLConnection.HTTP_OK) {
                // Read the page's input stream; try-with-resources closes the reader
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(urlConnection.getInputStream(), charset))) {
                    while ((line = reader.readLine()) != null) {
                        buffer.append(line).append("\n");
                    }
                }
                System.out.println(buffer.toString());
            } else {
                System.out.println("Could not fetch the page source; server response code: " + responseCode);
            }
        } finally {
            urlConnection.disconnect();
        }
    }
}

The key line is this one:

   Java code

BufferedReader reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(),charset));

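Here charset is the character encoding of the page content. The sample above hard-codes it as UTF-8, but what we really want is to determine it automatically, because not every page is encoded in UTF-8. That is the topic of this article: how to find out a page's original character encoding.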

First, we can get it from the return value of the URLConnection class's getContentType() method, for example:

   Java code

String contentType = urlConnection.getContentType();

Printed as in the first example, the return value looks something like this:

   Output

content-type:text/html; charset=utf-8

Then we extract the character encoding from that string. From here on it is plain string processing, nothing difficult.
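For readers who want to see that string processing spelled out, here is a minimal sketch. The class name ContentTypeCharset and its parse method are hypothetical helpers of mine, not part of the original sample code, and the sketch assumes the usual `type/subtype; name=value` layout with no spaces around the equals sign.

```java
import java.util.Locale;

/** Hypothetical helper (not from the original article): reads the charset parameter of a Content-Type value. */
final class ContentTypeCharset {

    /** Returns the declared charset name, or null if the value carries none. */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, as in charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=utf-8")); // prints utf-8
        System.out.println(parse("text/html"));                // prints null
    }
}
```

A null result is the signal to fall back to another way of finding the encoding.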

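Of course, the return value of getContentType() is not guaranteed to include a character encoding, so in that case we need another route. As we all know, HTML page source usually contains a <meta> tag, for example: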

   Html code

<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Welcome page</title>
</head>
<body>
Welcome page
</body>
</html>

The key part is here:

   Html code

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

We can extract the encoding from it with a regular expression. Sample code:

   Java code

package com.yida.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Extracts the page encoding from the meta element of an HTML page.
 * A regular expression is used here; you could also use an HTML parser instead
 * (real-world HTML is often not well-formed XML, so a strict XML parser may reject it).
 * @author Lanxiaowei
 *
 */
public class CharsetExtractTest {
    public static void main(String[] args) {
        //test1();
        test2();
    }

    /**
     * The usual case
     */
    public static void test1() {
        String content="<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
                "<head>\n" +
                "<meta http-equiv = \"Content-Type\" content = \"text/html; charset=utf-8\" />\n" +
                "<meta content=\"java获取 html网页编码\" name=\"Keywords\"/>...\n";
        Pattern pattern = Pattern.compile("<meta\\s+http-equiv\\s*=\\s*\"Content-Type\"\\s+content\\s*=\\s*\"[\\s\\S]*?charset=(\\S+?)\"\\s*/>");
        Matcher matcher=pattern.matcher(content);
        if(matcher.find()){
            System.out.println(matcher.group(1));
        }
    }

    /**
     * An unusual case, e.g. the http-equiv and content attributes are swapped
     */
    public static void test2() {
        String content="<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
                "<head>\n" +
                "<meta content = \"text/html; charset=utf-8\" http-equiv = \"Content-Type\" />\n" +
                "<meta content=\"java获取 html网页编码\" name=\"Keywords\"/>...\n";
        String regex = "(<meta\\s+http-equiv\\s*=\\s*\"Content-Type\"\\s+content\\s*=\\s*\"[\\s\\S]*?charset=(\\S+?)\"\\s*/>)|" +
                "(<meta\\s+content\\s*=\\s*\"[\\s\\S]*?charset=(\\S+?)\"\\s+http-equiv\\s*=\\s*\"Content-Type\"\\s*/>)";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher=pattern.matcher(content);
        if(matcher.find()){
            // Group 2 is set when the first alternative matched, group 4 when the second did
            String charset = matcher.group(2) != null ? matcher.group(2) : matcher.group(4);
            System.out.println(charset);
        }
    }
}
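The two patterns above only match double-quoted attributes in a self-closing tag. As a hedged alternative, the sketch below first isolates each meta tag and then looks for `charset=` inside it, which also covers swapped attributes, single quotes, uppercase tags, and the short HTML5 form `<meta charset="...">`. The class MetaCharsetSketch is my own illustration, not the author's code, and it is still a heuristic: a meta tag whose content merely mentions `charset=` would produce a false match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not from the original article): order-independent charset lookup in meta tags. */
final class MetaCharsetSketch {
    // Any <meta ...> tag, regardless of case
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // charset=VALUE inside a tag; the value may be preceded by a quote
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset declared in a meta tag, or null if none is found. */
    static String extract(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            Matcher charset = CHARSET.matcher(tags.group());
            if (charset.find()) {
                return charset.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<meta content = \"text/html; charset=utf-8\" http-equiv = \"Content-Type\" />")); // prints utf-8
        System.out.println(extract("<META charset='GBK'>"));                    // prints GBK
        System.out.println(extract("<meta name=\"Keywords\" content=\"x\">")); // prints null
    }
}
```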

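Unfortunately, not every page contains such a meta tag, because some authors do not follow the HTML specification. What do we do then? Fortunately there is one more option: IBM's icu4j library. icu4j can infer the character encoding from a given sequence of bytes and reports a confidence value for each candidate. Sample code: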

   Java code

package com.yida.test;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Detects content encoding with icu4j. The content can come from a string, a byte array, an input stream, and so on.
 *
 * @author Lanxiaowei
 *
 */
public class Icu4jTest {
    public static void main(String[] args) {
        // Sample Chinese text; non-ASCII input gives the detector something to work with
        String data = "ICU4J 是IBM的国际化开发组件ICU的Java语言实现版本。";
        String encode = getEncode(data);
        System.out.println("encode:" + encode);
    }

    public static String getEncode(String data) {
        // getBytes() encodes with the platform default charset, so that is the encoding being detected here
        return getEncode(data.getBytes());
    }

    public static String getEncode(byte[] data) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        return report(detector);
    }

    public static String getEncode(InputStream data)
            throws IOException {
        CharsetDetector detector = new CharsetDetector();
        // The detector relies on mark()/reset(), which raw network streams usually do not support
        detector.setText(data.markSupported() ? data : new BufferedInputStream(data));
        return report(detector);
    }

    /** Prints every candidate and returns the name of the best match, or null if there is none. */
    private static String report(CharsetDetector detector) {
        // detect() returns the match with the highest confidence
        CharsetMatch match = detector.detect();
        if (match == null) {
            return null;
        }
        System.out.println("The Content in " + match.getName());
        System.out.println("All possibilities");
        for (CharsetMatch m : detector.detectAll()) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:"
                    + m.getConfidence());
        }
        return match.getName();
    }
}
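The three approaches work best as a fallback chain: response header first, then the meta tag, then byte-level detection. Below is a rough sketch of how they could be chained using only the JDK. CharsetResolverSketch is a hypothetical class of mine, and the detector is passed in as a plain function so that an icu4j call (or anything else) can be plugged in. One assumption deserves a loud label: to search for the meta tag before the encoding is known, the sketch decodes the first bytes as ISO-8859-1, which only works for ASCII-compatible encodings (it would miss a UTF-16 page).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not from the original article): header, then meta tag, then detector, then default. */
final class CharsetResolverSketch {
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** body must be non-null; contentType and detector may be null. */
    static Charset resolve(String contentType, byte[] body,
                           Function<byte[], String> detector, Charset fallback) {
        // 1. charset parameter of the Content-Type response header
        Charset cs = toCharset(find(CHARSET_PARAM, contentType));
        if (cs != null) {
            return cs;
        }
        // 2. meta tag: ISO-8859-1 maps every byte to one char, so ASCII markup stays readable
        String head = new String(body, 0, Math.min(body.length, 4096), StandardCharsets.ISO_8859_1);
        Matcher tags = META_TAG.matcher(head);
        while (tags.find()) {
            cs = toCharset(find(CHARSET_PARAM, tags.group()));
            if (cs != null) {
                return cs;
            }
        }
        // 3. pluggable byte-level detector, e.g. a wrapper around icu4j
        if (detector != null) {
            cs = toCharset(detector.apply(body));
            if (cs != null) {
                return cs;
            }
        }
        return fallback;
    }

    private static String find(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    private static Charset toCharset(String name) {
        try {
            return name == null ? null : Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat it as "not declared"
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"US-ASCII\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(resolve("text/html", page, null, StandardCharsets.UTF_8)); // prints US-ASCII
    }
}
```

With the icu4j sample above on the classpath, the detector argument could be something like `Icu4jTest::getEncode`, although the overloaded method may need an explicit lambda such as `bytes -> Icu4jTest.getEncode(bytes)`.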

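One more thing to note: if the fetched page content is garbled no matter whether you decode it as GBK or as UTF-8, the server has most likely GZip-compressed the response, so the input stream has to be decompressed before it is decoded. How do you tell whether the server compressed the page? Usually from the response headers. If the response headers contain the following:

   Plaintext code

Content-Encoding: gzip

then the page was GZip-compressed and must be decompressed before its content is read. (Accept-Encoding: gzip,deflate, by contrast, is the request header with which a client announces that it accepts compressed responses.) Sample decompression code:

   Java code

InputStream is = urlConnection.getInputStream();
// Wrap the raw input stream in a GZIPInputStream (from java.util.zip)
GZIPInputStream gis = new GZIPInputStream(is);
BufferedReader reader = new BufferedReader(new InputStreamReader(gis,charset));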
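The decision itself can be made in code rather than by eye. The sketch below wraps the stream only when the encoding value says gzip; with a real connection that value would come from `urlConnection.getContentEncoding()`. GzipAwareReader and its methods are hypothetical names of mine, and the main method compresses a string locally so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch (not from the original article): decompress only when the response says gzip. */
final class GzipAwareReader {

    /** Wraps raw in a GZIPInputStream when contentEncoding is "gzip"; otherwise returns it unchanged. */
    static InputStream unwrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    /** Decompresses if needed, then decodes the bytes with the given charset. */
    static String decode(byte[] body, String contentEncoding, Charset charset) {
        try (InputStream in = unwrap(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: compresses bytes the way a server would for Content-Encoding: gzip. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(compressed, "gzip", StandardCharsets.UTF_8)); // prints <html>hello</html>
    }
}
```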

All of the sample code above has been packaged and uploaded to my Baidu Netdisk. Download link:

http://pan.baidu.com/s/1kTgQi0Z

If you have any further questions, add me on QQ (7-3-6-0-3-1-3-0-5) or join the QQ group, and we can discuss and learn together!

Reposted from: http://iamyida.iteye.com/blog/2206228
