How to fix the Nutch error "unzipBestEffort returned null"
This article explains where the "unzipBestEffort returned null" error comes from when running Nutch, and how to fix it.
Error message: fetch of http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html failed with: java.io.IOException: unzipBestEffort returned null
The full error output is:
2014-03-12 16:48:38,031 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
    at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
    at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2014-03-12 16:48:38,031 INFO fetcher.Fetcher - fetch of http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html failed with: java.io.IOException: unzipBestEffort returned null
2014-03-12 16:48:38,031 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
The stack trace shows that the exception is thrown at line 317, in the processGzipEncoded method of src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (the lib-http plugin):
    byte[] content;
    if (getMaxContent() >= 0) {
        content = GZIPUtils.unzipBestEffort(compressed, getMaxContent());
    } else {
        content = GZIPUtils.unzipBestEffort(compressed);
    }
    if (content == null)
        throw new IOException("unzipBestEffort returned null");
Line 164 of nutch2.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (the protocol-http plugin) calls processGzipEncoded:
    readPlainContent(in);
    String contentEncoding = getHeader(Response.CONTENT_ENCODING);
    if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
        content = http.processGzipEncoded(content, url);
    } else if ("deflate".equals(contentEncoding)) {
        content = http.processDeflateEncoded(content, url);
    } else {
        if (Http.LOG.isTraceEnabled()) {
            Http.LOG.trace("fetched " + content.length + " bytes from " + url);
        }
    }
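Inspecting the URL with Firebug in Firefox shows that the response carries the headers Content-Encoding: gzip and Transfer-Encoding: chunked. The code above always reads the body with readPlainContent, so a chunked body most likely reaches the gzip decoder with its chunk framing still in place, which would explain why decompression fails.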
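To illustrate why the combination of these two headers matters, here is a minimal, standalone sketch using only the JDK. It does not use any Nutch classes; the class name ChunkedGzipDemo and its helper methods are hypothetical and exist only for this demonstration. It gzips a short string, wraps the result in HTTP chunk framing, and checks whether each version can be decompressed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Nutch.
public class ChunkedGzipDemo {

    // Compress bytes with gzip, as a server sending Content-Encoding: gzip would.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    // Wrap a body in one chunk plus the terminating zero-length chunk,
    // i.e. the wire format used with Transfer-Encoding: chunked.
    static byte[] chunk(byte[] body) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((Integer.toHexString(body.length) + "\r\n").getBytes(StandardCharsets.US_ASCII));
        out.write(body);
        out.write("\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    // Return true if the bytes can be fully decompressed as gzip, false otherwise.
    static boolean canGunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello nutch".getBytes(StandardCharsets.UTF_8));
        System.out.println("plain gzip body decodes:   " + canGunzip(compressed));
        System.out.println("chunk-framed body decodes: " + canGunzip(chunk(compressed)));
    }
}
```

The chunk-framed bytes begin with an ASCII chunk-size line rather than the gzip magic number, so the decoder rejects them. The chunk framing therefore has to be removed before the body is handed to the gzip step.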
The fix is as follows:
1. In nutch2.7/src/java/org/apache/nutch/metadata/HttpHeaders.java, add a field:
public final static String TRANSFER_ENCODING = "Transfer-Encoding";
2. In nutch2.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java, replace the call readPlainContent(in); on line 160 with the following code:
    String transferEncoding = getHeader(Response.TRANSFER_ENCODING);
    if (transferEncoding != null && "chunked".equalsIgnoreCase(transferEncoding.trim())) {
        readChunkedContent(in, line);
    } else {
        readPlainContent(in);
    }
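For readers unfamiliar with chunked transfer coding, the sketch below shows roughly what a chunked reader has to do. It is not Nutch's readChunkedContent implementation; the class ChunkedDecoder and its methods are hypothetical, use only the JDK, and ignore details such as trailer headers and content limits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of chunked decoding, not Nutch code.
public class ChunkedDecoder {

    // Read one line terminated by LF (a preceding CR is dropped) as ASCII text.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') break;
            if (c != '\r') sb.append((char) c);
        }
        if (c == -1 && sb.length() == 0) {
            throw new EOFException("unexpected end of stream");
        }
        return sb.toString();
    }

    // Strip the chunk framing and return the concatenated chunk payloads.
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';'); // drop optional chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16); // size is hexadecimal
            if (size == 0) break; // a zero-length chunk ends the body
            byte[] buf = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(buf, off, size - off);
                if (n == -1) throw new EOFException("chunk truncated");
                off += n;
            }
            body.write(buf);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Trailer headers, if any, are ignored in this sketch.
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        String decoded = new String(decode(new ByteArrayInputStream(wire)), StandardCharsets.US_ASCII);
        System.out.println(decoded); // prints "Wikipedia"
    }
}
```

Once the framing is removed in this way, the remaining bytes are the actual gzip stream, which is what the decompression step expects to receive.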
3. The HTTP content length limit must not be set to a negative value; use a large integer instead:
    <property>
      <name>http.content.limit</name>
      <value>655360000</value>
    </property>
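4. Because both core code and plugin code were changed, Nutch has to be recompiled, repackaged, and redeployed. Run the default target (runtime) of nutch2.7/build.xml:
    cd nutch2.7
    ant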