Better Encoding and Newline Support In The Diff Algorithms

[NOTE: This is work-in-progress.]

Currently, the diff handling routines in libsvn_diff know nothing
about character encodings and eol characters. They assume an
ASCII-based encoding and LF as line separator. This leads to a lot of
problems:

* Diff output will be inconsistently encoded.
* Files with different line endings cause unexpected results (i.e. CR
* Diff output gets inconsistent line endings.
* Non-ASCII-based encodings, such as UTF16, aren't supported at all by
  Subversion out-of-the-box.

Solving this situation seems to be a lot of work. The motivation for
starting this was issue #1533 'diff output doesn't use correct
encoding'. That issue was solved by making the diff code assume the
locale encoding for file contents rather than UTF8, but the problems
discussed in this file are still present.

Currently, the headers are written using the locale encoding, which
is not always what's wanted. If the encoding of the files is known
(via svn:mime-type, for example), the headers should probably be
written using that encoding.

Note that this applies to property change information and property
values in the svn: namespace as well. For other properties, we can't
do anything but treat them as opaque.

According to the GNU diff documentation, on systems with newline
separators other than just LF, the newlines are normalized to the
system markers, except when --binary is used.

Currently, our diff library understands nothing but LF as newline.
Making it accept CRLF and CR as well is not hard.
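
To illustrate why this is not hard, here is a minimal sketch of a
tokenizer that accepts all three markers. It is written in Python
rather than the C of libsvn_diff, it assumes an ASCII-compatible
encoding, and the name split_lines is hypothetical, not an existing
function.

```python
import re

# Try CRLF first so that it is not split into a CR line and an LF line.
_EOL = re.compile(rb"\r\n|\r|\n")


def split_lines(data: bytes) -> list[tuple[bytes, bytes]]:
    """Split data into (line, eol) pairs.

    The eol part is b"" for a last line that has no terminator.
    """
    lines = []
    pos = 0
    for match in _EOL.finditer(data):
        lines.append((data[pos:match.start()], match.group()))
        pos = match.end()
    if pos < len(data):
        # Trailing text without a newline marker.
        lines.append((data[pos:], b""))
    return lines
```

Keeping the marker next to each line lets the caller decide whether
two lines that differ only in their eol marker should compare equal.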

Since we know the newline marker used in the file via the
svn:eol-style property, we can handle this quite well. If
svn:eol-style is not set, I suggest we output newlines as-is, and use
APR_EOL_STR to output newlines in headers. That's consistent with how
GNU diff behaves with the --binary option.

When svn:eol-style is set, we should use that style for the headers.
The values might be different for the original and the new file; it
seems logical to use the value from the modified file. Note that in
this case, newlines will be inconsistent anyway. Also, libsvn_client
should make sure the files are translated into their newline style
before comparing them (this is necessary since working files don't
have their newlines normalized if svn:eol-style is changed in the
working revision). In the usual case, when svn:eol-style is not
changed, this will give consistent newlines for the whole diff. If
svn:eol-style is changed, the diff will show every line in the file
as changed, because its eol marker changes. This is what happens
currently if you do a repos_to_repos diff with svn:eol-style changed.
If svn:eol-style is set to native, then APR_EOL_STR should be used,
as usual.
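
The translation step could look roughly like the following Python
sketch. The helper name translate_eols is hypothetical and the code
assumes an ASCII-compatible encoding; it only shows the idea of
rewriting every marker to the style the file is supposed to have
before the comparison starts.

```python
import re

# CRLF must be tried before CR and LF so it is replaced as one marker.
_ANY_EOL = re.compile(rb"\r\n|\r|\n")


def translate_eols(data: bytes, eol: bytes) -> bytes:
    """Rewrite every newline marker in data to eol."""
    # A function is used as the replacement so that eol is inserted
    # literally, without any escape processing.
    return _ANY_EOL.sub(lambda match: eol, data)
```

After both sides are translated with the same marker, a working file
that still carries stale newlines compares equal, line by line, to
its normalized counterpart.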

This requires that the svn_client_diff* functions read the
svn:eol-style property of the modified file and pass that information
to svn_diff_file_output_unified. svn_diff_file_output_unified needs
an eolstr argument, giving the newline marker to use for headers.
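
As a rough illustration of the rule proposed here, the following
Python sketch picks the header newline from the svn:eol-style value
of the modified file. The names header_eol and unified_header are
hypothetical, and os.linesep merely stands in for APR_EOL_STR; this
is not the actual svn_diff_file_output_unified interface.

```python
import os
from typing import Optional

# The fixed svn:eol-style values and the markers they denote.
_FIXED_EOLS = {"LF": "\n", "CR": "\r", "CRLF": "\r\n"}


def header_eol(modified_eol_style: Optional[str]) -> str:
    """Choose the newline marker for diff headers.

    An unset property and "native" both fall back to the platform
    marker (os.linesep here, APR_EOL_STR in the C library).
    """
    if modified_eol_style is None or modified_eol_style == "native":
        return os.linesep
    return _FIXED_EOLS[modified_eol_style]


def unified_header(old_label: str, new_label: str, eolstr: str) -> str:
    """Format the two unified diff header lines using eolstr."""
    return f"--- {old_label}{eolstr}+++ {new_label}{eolstr}"
```

For example, unified_header("foo.c", "foo.c", header_eol("CRLF"))
terminates both header lines with CRLF regardless of the platform.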

To support encodings that aren't ASCII-based (ASCII-based meaning
that the first 128 byte values always mean the same as in ASCII),
Subversion needs to know the encodings of the files being diffed. We
don't currently have a canonical way of detecting the encoding. It
has been suggested to use the charset parameter of svn:mime-type for
this purpose. Whatever method we choose, we need to cope with the
fact that not all files have this information available. In this
case, we might assume the locale/console encoding.
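
If the charset parameter of svn:mime-type were used, the lookup with
its fallback might be sketched as below. This is Python, the function
names are hypothetical, and the parsing is deliberately simple (it
does not handle every quoting form that MIME allows).

```python
import locale
from typing import Optional


def charset_from_mime_type(mime_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a MIME type value, if any."""
    if not mime_type:
        return None
    # Parameters follow the media type, separated by semicolons.
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip('"') or None
    return None


def file_encoding(mime_type: Optional[str]) -> str:
    """Use the declared charset, else assume the locale encoding."""
    declared = charset_from_mime_type(mime_type)
    return declared or locale.getpreferredencoding(False)
```

A file without svn:mime-type, or with a value such as "text/plain"
that carries no charset, ends up with the locale encoding.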

When the encodings of the files are known, the diff tokenizer should
use that to decide what newline separator it expects. A simple
solution is to just recode "\n", "\r\n" and "\r" into the file
encodings and search for that. Beware that to support UTF16 and other
forms of Unicode, we need to support null bytes in these strings.
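
The recoding idea can be sketched in Python as follows. The function
names are hypothetical. The sketch also adds one assumption that the
text above does not spell out: for encodings with fixed-width code
units, the search has to stay aligned to the code unit size,
otherwise the bytes of two neighbouring characters can look like a
newline marker.

```python
from typing import Optional


def encoded_eols(encoding: str) -> list[bytes]:
    """Recode the three markers; CRLF comes first so it wins over CR."""
    # Use an explicit byte order such as "utf-16-le"; Python's plain
    # "utf-16" codec would prepend a BOM to every marker.
    return [marker.encode(encoding) for marker in ("\r\n", "\r", "\n")]


def find_eol(data: bytes, encoding: str,
             start: int = 0) -> Optional[tuple[int, bytes]]:
    """Return (offset, marker) of the next newline marker, or None.

    start is assumed to be aligned to a code unit boundary.
    """
    markers = encoded_eols(encoding)
    unit = len("\n".encode(encoding))  # 1 for UTF-8, 2 for UTF-16
    for pos in range(start, len(data), unit):
        for marker in markers:
            if data.startswith(marker, pos):
                return pos, marker
    return None
```

In UTF-16 little-endian, the characters U+0A41 and U+0100 encode to
the bytes 41 0A 00 01, so a plain byte search for the encoded "\n"
(0A 00) finds a false match at an odd offset; the aligned search
above skips it.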

NOTE: Supporting non-byte-oriented encodings such as UTF16 will
require work in other parts of the client libraries as well. I'm
discussing it here to not design a solution where we can't support
it.

To support this, svn_diff_file_diff will need arguments for the
encodings of the original and modified files.

Merging (i.e. diff3) can be handled in similar ways to diff. The
eol-style of the .mine file should be used for the conflict markers
and the files should be translated to their newline styles if needed.

The encoding part is a bit trickier. If the encoding of all three
files is the same, then conflict markers should use that encoding as
well.
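
For the simple case where all three files share one encoding, writing
the conflict markers could be sketched like this in Python. The
function name and the default labels are hypothetical; the point is
only that the marker lines go through the same encoding and use the
same eol marker as the content lines around them.

```python
def conflict_block(mine_lines: list[bytes], theirs_lines: list[bytes],
                   encoding: str, eol: str,
                   mine_label: str = ".mine",
                   theirs_label: str = ".theirs") -> bytes:
    """Wrap two conflicting line groups in encoded conflict markers.

    The content lines are passed as already-encoded bytes, including
    their own newline markers, and are copied through unchanged.
    """
    def marker(text: str) -> bytes:
        # Markers are recoded so they match the file, null bytes and all.
        return (text + eol).encode(encoding)

    return b"".join([
        marker("<<<<<<< " + mine_label),
        *mine_lines,
        marker("======="),
        *theirs_lines,
        marker(">>>>>>> " + theirs_label),
    ])
```

With a UTF16 file, the markers then contain the same null bytes as
the surrounding text, so the result still decodes cleanly as UTF16.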

NOTE: For UTF16 and UTF32, the BOM might be problematic. We need to
be careful not to add extra BOMs inside the file. One idea is to
strip the BOMs before merging and ensure that the resulting file has
a BOM after the merge. I'm not sure how much encoding-specific code
we want to add to our diff library. Maybe UTF16 would be considered
common enough to not handle it like "just another encoding". For
UTF8, we may need to handle the BOM as well, since that's allowed.
We need to be careful not to add BOMs that aren't in the files, since
that will break applications (and we don't want to silently change
the contents of users' files!).
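
The strip-and-restore idea might be sketched as follows, in Python.
All names are hypothetical, the merge itself is passed in as a
callable, and the rule that the result gets a BOM only when the .mine
file had one is my assumption about how to avoid adding BOMs that
were not there.

```python
import codecs
from typing import Callable

# BOM byte sequences for encodings with an explicit byte order.
_BOM_BY_ENCODING = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def split_bom(data: bytes, encoding: str) -> tuple[bytes, bytes]:
    """Return (bom, body); bom is b"" if data does not start with one."""
    bom = _BOM_BY_ENCODING.get(encoding.lower(), b"")
    if bom and data.startswith(bom):
        return bom, data[len(bom):]
    return b"", data


def merge_keeping_bom(mine: bytes, older: bytes, yours: bytes,
                      encoding: str,
                      merge: Callable[[bytes, bytes, bytes], bytes]) -> bytes:
    """Merge BOM-free bodies, then restore the BOM of the .mine file."""
    mine_bom, mine_body = split_bom(mine, encoding)
    _, older_body = split_bom(older, encoding)
    _, yours_body = split_bom(yours, encoding)
    merged = merge(mine_body, older_body, yours_body)
    # Assumption: only re-add a BOM that the working file already had.
    return mine_bom + merged
```

Because the bodies reach the merge without BOMs, no BOM can end up in
the middle of the merged file.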

If the encodings are different for the three files, merging could
easily lead to an inconsistent mess, unless the encodings share some
subset (like when changing from US-ASCII to UTF-8). I think we should
leave those rare cases to the user, who can recode and merge by hand
or use some other tool.