A Guide to the Pitfalls of Java's URL Class

zhisheng, published 2019-08-16 10:53 / 2249 reads

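Summary: getHostAddress(), which the hashCode() method of the URL class relies on, can take a long time, so storing URL objects in a hash-table-based container is simply a disaster. A comparison of URL and URI over 50 insertions shows the gap, so hash-table-based containers are best kept away from URL. It also turns out that the / at the end of a URL is best included. These are some of the pitfalls I ran into this weekend.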

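Background

I have recently been building an RSS reader for my own use. One step fetches a JSON file containing the list of RSS sources from the server, then downloads and parses the RSS content based on that file. The core code is as follows: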

class PresenterImpl(val context: Context, val activity: MainActivity) : IPresenter {
    private val URL_API = "https://vimerzhao.github.io/others/rssreader/RSS.json"

    override fun getRssResource(): RssSource {
        val gson = GsonBuilder().create()
        return gson.fromJson(getFromNet(URL_API), RssSource::class.java)
    }

    private fun getFromNet(url: String): String {
        val result = URL(url).readText()
        return result
    }

    ......
}

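This had always worked well until two days ago, when I bought the domain vimerzhao.top and redirected the original domain vimerzhao.github.io to it. The tool then stopped working, even though entering URL_API in a browser still returned the data.

So why did URL.readText() fail to get the data?

Redirects Are Not Supported

This can be tested with the following code: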

import java.net.*;
import java.io.*;

public class TestRedirect {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/others/rssreader/RSS.json");
            URL url2 = new URL("http://vimerzhao.top/others/rssreader/RSS.json");
            read(url1);
            System.out.println("=--------------------------------=");
            read(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The result is as follows:


301 Moved Permanently

301 Moved Permanently
nginx


=--------------------------------=
{"theme":"tech","author":"zhaoyu","email":"dutzhaoyu@gmail.com","version":"0.01","contents":[{"category":"綜合版塊","websites":[{"tag":"門戶網站","url":["http://geek.csdn.net/admin/news_service/rss","http://blog.jobbole.com/feed/","http://feed.cnblogs.com/blog/sitehome/rss","https://segmentfault.com/feeds","http://www.codeceo.com/article/category/pick/feed"]},{"tag":"知名社區","url":["https://stackoverflow.com/feeds","https://www.v2ex.com/index.xml"]},{"tag":"官方博客","url":["https://www.blog.google/rss/","https://blog.jetbrains.com/feed/"]},{"tag":"個人博客-行業","url":["http://feed.williamlong.info/","https://www.liaoxuefeng.com/feed/articles"]},{"tag":"個人博客-學術","url":["http://www.norvig.com/rss-feed.xml"]}]},{"category":"編程語言","websites":[{"tag":"Kotlin","url":["https://kotliner.cn/api/rss/latest"]},{"tag":"Python","url":["https://www.python.org/dev/peps/peps.rss/"]},{"tag":"Java","url":["http://www.codeceo.com/article/category/develop/java/feed"]}]},{"category":"行業動態","websites":[{"tag":"Android","url":["http://www.codeceo.com/article/category/develop/android/feed"]}]},{"category":"亂七八遭","websites":[{"tag":"Linux-綜合","url":["https://linux.cn/rss.xml","http://www.linuxidc.com/rssFeed.aspx","http://www.codeceo.com/article/tag/linux/feed"]},{"tag":"Linux-發行版","url":["https://blog.linuxmint.com/?feed=rss2","https://manjaro.github.io/feed.xml"]}]}]}

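The HTTP status code is 301, meaning a redirect occurred. In a browser this happens so fast that we never see the 301 page. Note that URL.readText() is a Kotlin extension function which ultimately calls the openStream method of the URL class. Part of the source is shown below: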

.....
/**
 * Reads the entire content of this URL as a String using UTF-8 or the specified [charset].
 *
 * This method is not recommended on huge files.
 *
 * @param charset a character set to use.
 * @return a string with this URL entire content.
 */
@kotlin.internal.InlineOnly
public inline fun URL.readText(charset: Charset = Charsets.UTF_8): String = readBytes().toString(charset)

/**
 * Reads the entire content of the URL as byte array.
 *
 * This method is not recommended on huge files.
 *
 * @return a byte array with this URL entire content.
 */
public fun URL.readBytes(): ByteArray = openStream().use { it.readBytes() }

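So the test code above explains why URL.readText() failed.
Whether it is reasonable for URL not to follow this redirect, and why it does not, remains to be explored. One relevant fact: HttpURLConnection follows redirects by default, but it does not follow a redirect that changes protocol, and here the redirect goes from https to http.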
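One way around an unfollowed redirect is to handle the redirect by hand. The following is only a minimal sketch of that idea, not the fix used in the original tool: the class RedirectFollower, its helper methods, and the limit of 5 hops are all assumptions made for illustration. It disables automatic redirect handling, reads the Location header itself, and reports when a hop changes protocol.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original tool.
class RedirectFollower {
    private static final int MAX_HOPS = 5; // assumed limit, guards against redirect loops

    // Resolves a Location header value (absolute or relative) against the current address.
    static URI resolveLocation(URI current, String location) {
        return current.resolve(location);
    }

    // True when a redirect switches scheme, e.g. https -> http.
    static boolean changesProtocol(URI from, URI to) {
        return !from.getScheme().equalsIgnoreCase(to.getScheme());
    }

    // Status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    static String fetch(String address) throws IOException {
        URI current = URI.create(address);
        for (int hop = 0; hop <= MAX_HOPS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // every redirect is handled below
            if (isRedirect(conn.getResponseCode())) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header: " + current);
                }
                URI next = resolveLocation(current, location);
                if (changesProtocol(current, next)) {
                    System.err.println("Protocol change: " + current + " -> " + next);
                }
                current = next;
                continue;
            }
            // Not a redirect: read the whole body as UTF-8 text.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            }
        }
        throw new IOException("More than " + MAX_HOPS + " redirects starting from " + address);
    }

    public static void main(String[] args) throws IOException {
        System.out.print(fetch("https://vimerzhao.github.io/others/rssreader/RSS.json"));
    }
}
```

Following a redirect from https to http silently downgrades the connection, so a real tool may prefer to reject such hops instead of only logging them.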

An Unstable equals Method

First, look at the documentation of equals (URL, Java Platform SE 7):

Compares this URL for equality with another object.
If the given object is not a URL then this method immediately returns false.
Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
Since hosts comparison requires name resolution, this operation is a blocking operation.
Note: The defined behavior for equals is known to be inconsistent with virtual hosting in HTTP.

Now look at this code:

import java.net.*;
public class TestEquals {
    public static void main(String args[]) {
        try {
            // home page of vimerzhao's blog
            URL url1 = new URL("https://vimerzhao.github.io/");
            // home page of zhanglanqing's blog
            URL url2 = new URL("https://zhanglanqing.github.io/");
            // the domain that vimerzhao's blog home page is redirected to
            URL url3 = new URL("http://vimerzhao.top/");
            System.out.println(url1.equals(url2));
            System.out.println(url1.equals(url3));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What output would the definition lead you to expect? Running it gives:

true
false

You may have guessed right, but if I disconnect the computer from the network and run it again, the result is:

false
false

Yet all three domains actually resolve to the same IP address, as ping shows:

zhaoyu@Inspiron ~/Project $ ping vimezhao.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 396.692/396.692/396.692/0.000 ms
zhaoyu@Inspiron ~/Project $ ping zhanglanqing.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
rtt min/avg/max/mdev = 396.009/396.009/396.009/0.000 ms
zhaoyu@Inspiron ~/Project $ ping vimezhao.top
ping: unknown host vimezhao.top
zhaoyu@Inspiron ~/Project $ ping vimerzhao.top
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=409 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 409.978/409.978/409.978/0.000 ms

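First consider the case with a network connection. vimerzhao.github.io and zhanglanqing.github.io are my blog and a classmate's blog. Their content differs, but they resolve to the same IP, and the protocol, port, and so on are identical, so they compare equal. vimerzhao.github.io and vimerzhao.top point to the same blog, but one uses https and the other http; the protocols differ, so they compare unequal. This probably runs against most people's intuition: URLs pointing to different blogs are equal, while URLs pointing to the same blog are not!
Now for the result without a network. Start with the source of URL: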

    public boolean equals(Object obj) {
        if (!(obj instanceof URL))
            return false;
        URL u2 = (URL)obj;

        return handler.equals(this, u2);
    }

Then the source of the handler object:

    protected boolean equals(URL u1, URL u2) {
        String ref1 = u1.getRef();
        String ref2 = u2.getRef();
        return (ref1 == ref2 || (ref1 != null && ref1.equals(ref2))) &&
               sameFile(u1, u2);
    }

The source of sameFile:

    protected boolean sameFile(URL u1, URL u2) {
        // Compare the protocols.
        if (!((u1.getProtocol() == u2.getProtocol()) ||
              (u1.getProtocol() != null &&
               u1.getProtocol().equalsIgnoreCase(u2.getProtocol()))))
            return false;

        // Compare the files.
        if (!(u1.getFile() == u2.getFile() ||
              (u1.getFile() != null && u1.getFile().equals(u2.getFile()))))
            return false;

        // Compare the ports.
        int port1, port2;
        port1 = (u1.getPort() != -1) ? u1.getPort() : u1.handler.getDefaultPort();
        port2 = (u2.getPort() != -1) ? u2.getPort() : u2.handler.getDefaultPort();
        if (port1 != port2)
            return false;

        // Compare the hosts.
        if (!hostsEqual(u1, u2))
            return false; // this line is hit when there is no network connection

        return true;
    }

Finally, the source of hostsEqual:

    protected boolean hostsEqual(URL u1, URL u2) {
        InetAddress a1 = getHostAddress(u1);
        InetAddress a2 = getHostAddress(u2);
        // if we have internet address for both, compare them
        if (a1 != null && a2 != null) {
            return a1.equals(a2);
        // else, if both have host names, compare them
        } else if (u1.getHost() != null && u2.getHost() != null)
            return u1.getHost().equalsIgnoreCase(u2.getHost());
         else
            return u1.getHost() == null && u2.getHost() == null;
    }

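With a network connection, a1 and a2 are both non-null, so return a1.equals(a2) is executed and returns true. Without a network, return u1.getHost().equalsIgnoreCase(u2.getHost()); is executed instead, that is, the second branch. The host of url1 (vimerzhao.github.io) and the host of url2 (zhanglanqing.github.io) are obviously different, so it returns false, which makes if (!hostsEqual(u1, u2)) true and executes return false.
Clearly, the equals method of URL is not only counterintuitive but also inconsistent: it produces different results in different environments, which is very dangerous!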
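For contrast, java.net.URI compares addresses component by component without resolving host names, so its result does not depend on the network. The sketch below is only an illustration: the class CompareWithUri and its method sameAddress are names invented here, and it reuses the three addresses from the test above.

```java
import java.net.URI;

// Hypothetical helper for illustration.
class CompareWithUri {
    // Compares two addresses by their parsed components; no DNS lookup is involved.
    static boolean sameAddress(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Different hosts: false, with or without a network connection.
        System.out.println(sameAddress("https://vimerzhao.github.io/", "https://zhanglanqing.github.io/"));
        // Different scheme and host: false.
        System.out.println(sameAddress("https://vimerzhao.github.io/", "http://vimerzhao.top/"));
        // Host names are compared without regard to case: true.
        System.out.println(sameAddress("https://vimerzhao.github.io/", "https://VIMERZHAO.github.io/"));
    }
}
```

URI.equals still knows nothing about redirects, so vimerzhao.github.io and vimerzhao.top remain unequal; the gain is that the answer is the same in every environment.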

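A Time-Consuming equals Method

In addition, equals is a time-consuming operation, because with a network connection it has to perform DNS resolution. The same holds for hashCode(), which is used as the example here. The source of hashCode() in the URL class: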

    public synchronized int hashCode() {
        if (hashCode != -1)
            return hashCode;

        hashCode = handler.hashCode(this);
        return hashCode;
    }

The hashCode() method of the handler object:

    protected int hashCode(URL u) {
        int h = 0;

        // Generate the protocol part.
        String protocol = u.getProtocol();
        if (protocol != null)
            h += protocol.hashCode();

        // Generate the host part.
        InetAddress addr = getHostAddress(u);
        if (addr != null) {
            h += addr.hashCode();
        } else {
            String host = u.getHost();
            if (host != null)
                h += host.toLowerCase().hashCode();
        }

        // Generate the file part.
        String file = u.getFile();
        if (file != null)
            h += file.hashCode();

        // Generate the port part.
        if (u.getPort() == -1)
            h += getDefaultPort();
        else
            h += u.getPort();

        // Generate the ref part.
        String ref = u.getRef();
        if (ref != null)
            h += ref.hashCode();

        return h;
    }

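Here getHostAddress() can take a long time. So storing URL objects in a hash-table-based container is simply a disaster. The following code compares how URL and URI perform when each is added 50 times: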

import java.net.*;
import java.util.*;

public class TestHash {
    public static void main(String args[]) {
        HashSet<URL> list1 = new HashSet<>();
        HashSet<URI> list2 = new HashSet<>();
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URI url2 = new URI("https://zhanglanqing.github.io/");
            long cur = System.currentTimeMillis();
            int cnt = 50;
            for (int i = 0; i < cnt; i++) {
                list1.add(url1);
            }
            System.out.println(System.currentTimeMillis() - cur);
            cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                list2.add(url2);
            }
            System.out.println(System.currentTimeMillis() - cur);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output (in milliseconds) is:

271
0

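So it is best not to use URL in hash-table-based containers.

The Role of the Trailing Slash

The trailing slash is the slash at the end of a domain name. For example, the browser shows vimerzhao.top, but copying and pasting it yields http://vimerzhao.top/. First, test with the following code: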
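A common alternative is to key the container on URI and convert to URL only at the point where a connection is opened. The following is a minimal sketch under that assumption; the class UriSetDemo and its method dedupe are hypothetical names, and the addresses are the ones used in the tests above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration.
class UriSetDemo {
    // Deduplicates addresses using URI keys; hashing and equality never touch the network.
    static Set<URI> dedupe(String... addresses) {
        Set<URI> seen = new LinkedHashSet<>(); // keeps first-seen order
        for (String address : addresses) {
            seen.add(URI.create(address));
        }
        return seen;
    }

    public static void main(String[] args) throws MalformedURLException {
        Set<URI> feeds = dedupe(
                "https://vimerzhao.github.io/",
                "https://zhanglanqing.github.io/",
                "https://vimerzhao.github.io/");
        System.out.println(feeds.size()); // 2: the repeated address is stored once
        for (URI feed : feeds) {
            URL url = feed.toURL(); // convert only where a connection would be opened
            System.out.println(url);
        }
    }
}
```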

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL url1 = new URL("https://vimerzhao.github.io/");
            URL url2 = new URL("https://vimerzhao.github.io");
            System.out.println(url1.equals(url2));
            outputInfo(url1);
            outputInfo(url2);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void outputInfo(URL url) {
        System.out.println("------" + url.toString() + "----------");
        System.out.println(url.getRef());
        System.out.println(url.getFile());
        System.out.println(url.getHost());
        System.out.println("----------------");
    }
}

The result is as follows:

false
------https://vimerzhao.github.io/----------
null
/
vimerzhao.github.io
----------------
------https://vimerzhao.github.io----------
null

vimerzhao.github.io
----------------

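In fact, whether you read them with the read() method above or type them into the address bar, url1 and url2 return the same content. But a trailing / indicates a directory, while its absence indicates a file, so getFile() differs for the two and equals returns false. In the address bar you would not even notice the trailing slash, and the returned content is identical, yet equals says false. It really is hard to guard against!
This raises another question: if one is a file and the other a directory, why do both return the same result?
After some investigation: when the request ends with /, the server looks for index.html in that directory. Without it, taking vimerzhao.top/tags as an example, the server first looks for a file named tags; if none is found, it appends a / automatically and then looks for index.html in the tags directory.

Here is an interesting test. Write the following two programs: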
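If the two root forms should compare equal, one option is to normalize an empty path to / before comparing. This is only a sketch under assumptions: the class RootPath and its method withRootPath are hypothetical, it uses URI rather than URL so that no DNS lookup happens, and it deliberately leaves non-root paths such as /tags alone, because in general /tags and /tags/ can be different resources.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration.
class RootPath {
    // Treats an empty path as "/", so "https://host" and "https://host/" compare equal.
    static URI withRootPath(URI uri) {
        String path = uri.getPath();
        if (uri.getHost() == null || (path != null && !path.isEmpty())) {
            return uri; // no host, or a path is already present: leave untouched
        }
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), "/",
                    uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI a = withRootPath(URI.create("https://vimerzhao.github.io/"));
        URI b = withRootPath(URI.create("https://vimerzhao.github.io"));
        System.out.println(a.equals(b)); // true after normalization
        System.out.println(withRootPath(URI.create("http://vimerzhao.top/tags"))); // unchanged
    }
}
```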

import java.net.*;
import java.io.*;

public class TestTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithSlash = new URL("http://vimerzhao.top/tags/");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

import java.net.*;
import java.io.*;

public class TestWithoutTrailingSlash {
    public static void main(String args[]) {
        try {
            URL urlWithoutSlash = new URL("http://vimerzhao.top/tags");
            int cnt = 5;
            long cur = System.currentTimeMillis();
            for (int i = 0; i < cnt; i++) {
                read(urlWithoutSlash);
            }
            System.out.println(System.currentTimeMillis() - cur);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void read(URL url) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                //System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Test with the following script:

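#!/bin/bash
# {1..20} is a bash feature, so run this with bash rather than plain sh.
for i in {1..20}; do
    # Append (>>) so that every run's timing is kept, not only the last one.
    java TestTrailingSlash >> out1
    java TestWithoutTrailingSlash >> out2
done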

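The output times were compiled into a table.

It turns out that the version with the trailing / is faster, because it skips the lookup for a file named tags. The takeaway: it is best to include the / at the end of a URL!

Those are some of the pitfalls I ran into this weekend.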

