Author Topic: How to stream.Read UTF16 files?  (Read 3449 times)

paskal

  • Full Member
  • ***
  • Posts: 178
How to stream.Read UTF16 files?
« on: May 18, 2012, 09:05:35 am »
Here is a function:
Code: [Select]
function ReadFile(FileName: string): string;
var
  MyString: AnsiString;
  stream: TFileStream;
  StringLenth: Integer;
begin
  StringLenth := FileSize(FileName);
  stream := TFileStream.Create(UTF8ToSys(FileName), fmShareDenyNone );
  try
    stream.Seek(0, soFromBeginning);
    SetLength(MyString, StringLenth);
    stream.Read(MyString[1], StringLenth);
  finally
    stream.Free();
  end;
  form1.Memo1.lines.Text := ConvertEncoding (MyString,'ansi', 'cp1200');
end;
I am trying to read a file saved as UTF-16LE (code page 1200).
I tried different combinations of ConvertEncoding parameters, but got no result.
Memo1 displays only the first three characters of the file: the first two are shown as question marks (they are the BOM bytes), and the third is shown as it should be.
What is odd is that when I read a UTF-8 file, it is displayed properly, despite ConvertEncoding (MyString,'ansi', 'cp1200');
Lazarus 1,1; build 40379; FPC2,6,1

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1679
Re: How to stream.Read UTF16 files?
« Reply #1 on: May 18, 2012, 10:41:08 am »
UTF-16LE uses two bytes per array item. Use a string type with 2-byte elements, such as WideString or UnicodeString.

Note that in that case you need to divide the file size by two to get the value for SetLength.
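
A minimal sketch of that suggestion, assuming a little-endian machine (so the UTF-16LE bytes can be copied straight into the string) and plain FPC with only Classes and SysUtils. The helper name ReadUtf16LEFile and the demo file name are made up for illustration; in a Lazarus application the result could be passed through UTF8Encode before assigning it to Memo1.Lines.Text.

```pascal
program ReadUtf16Demo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // Hypothetical file name, used only by this demo.
  DemoFile = 'utf16le_demo.txt';
  // BOM (FF FE) followed by 'H' and 'i' as UTF-16LE code units.
  DemoBytes: array[0..5] of Byte = ($FF, $FE, $48, $00, $69, $00);

// Hypothetical helper: reads a whole UTF-16LE file into a UnicodeString.
function ReadUtf16LEFile(const FileName: string): UnicodeString;
var
  Stream: TFileStream;
  CharCount: Integer;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    // Two bytes per element, so the string length is half the file size.
    CharCount := Integer(Stream.Size div SizeOf(WideChar));
    SetLength(Result, CharCount);
    if CharCount > 0 then
      Stream.ReadBuffer(Result[1], CharCount * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
  // Drop the byte order mark if the file starts with one.
  if (Length(Result) > 0) and (Result[1] = WideChar($FEFF)) then
    Delete(Result, 1, 1);
end;

var
  Stream: TFileStream;
  S: UnicodeString;
begin
  // Write a tiny UTF-16LE file so the demo is self-contained.
  Stream := TFileStream.Create(DemoFile, fmCreate);
  try
    Stream.WriteBuffer(DemoBytes, SizeOf(DemoBytes));
  finally
    Stream.Free;
  end;

  S := ReadUtf16LEFile(DemoFile);
  WriteLn('Code units read: ', Length(S));
  WriteLn(UTF8Encode(S));
  DeleteFile(DemoFile);
end.
```

A file with an odd byte count loses its last byte here, because the size is rounded down to whole 2-byte elements.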

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1842
Re: How to stream.Read UTF16 files?
« Reply #2 on: May 18, 2012, 10:50:52 am »
You could try charencstreams.pas
http://wiki.freepascal.org/Theodp

paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #3 on: May 18, 2012, 02:01:29 pm »
charencstreams.pas seems to work fine (at least for reading, so far).
Does it guess the encoding for files without a BOM, or does it find it some other way? With a very short string it gets the wrong encoding, but with a longer one it is fine.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1842
Re: How to stream.Read UTF16 files?
« Reply #4 on: May 18, 2012, 03:56:27 pm »
Yes, it guesses as far as possible.
If you know the encoding, you can disable this behaviour.

fCES.ForceType:=true;

See also demo/charenc/



paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #5 on: May 19, 2012, 12:46:55 pm »
I have some problems with longer files (100 kB+).
Meanwhile, I found something odd in the charenc example. In unit1.pas there is:



procedure TForm1.btnOpenClick(Sender: TObject);
begin
  if OpenDialog1.Execute then

  begin
    if not cbForceType.Checked then fCES.Reset;
    fCES.LoadFromFile(UTF8ToSys(OpenDialog1.FileName));
    Memo1.text := fCES.UTF8Text;
    cbBOM.Checked := fCES.HasBOM;
    cbForceType.Checked := fCES.ForceType;
    UniEnc.ItemIndex := Ord(fCES.UniStreamType);
    ANSIEnc.Enabled := fCES.UniStreamType = ufANSI;
    if ANSIEnc.Enabled then ANSIEnc.ItemIndex := ANSIEnc.Items.IndexOf(fCES.ANSIEnc) else ANSIEnc.ItemIndex := 0;
  end;
end;   


It works okay.
But if I make a small change:


procedure TForm1.btnOpenClick(Sender: TObject);
var
  mystring: string;
begin
  if OpenDialog1.Execute then

  begin
    if not cbForceType.Checked then fCES.Reset;
    fCES.LoadFromFile(UTF8ToSys(OpenDialog1.FileName));
    mystring:=fCES.UTF8Text;
    Memo1.text := fCES.UTF8Text;
    cbBOM.Checked := fCES.HasBOM;
    cbForceType.Checked := fCES.ForceType;
    UniEnc.ItemIndex := Ord(fCES.UniStreamType);
    ANSIEnc.Enabled := fCES.UniStreamType = ufANSI;
    if ANSIEnc.Enabled then ANSIEnc.ItemIndex := ANSIEnc.Items.IndexOf(fCES.ANSIEnc) else ANSIEnc.ItemIndex := 0;
  end;
end;   


then a wrong result is displayed. It looks as if each call to the function gives a different result. Is it supposed to work this way?

Also, I managed to crash the example.
Here is how it happens:
1. I open a file (UTF16BE)
2. I open another file (UTF8)
3. I open the first file again -> Exception class external....

Opening the first file many times causes no problem.
Then I tried something else:
1. I open a file (UTF16BE)
2. I open a file (UTF16BE again) -> Exception class external....

By the way, there is a bug in the sample code at http://wiki.freepascal.org/Theodp
fCES.LoadFromFile(OpenDialog1.FileName); should be fCES.LoadFromFile( UTF8ToSys( OpenDialog1.FileName)); because currently it can only open files with basic Latin names.
« Last Edit: May 19, 2012, 01:04:05 pm by paskal »

paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #6 on: June 25, 2012, 10:27:05 am »
Can anyone else reproduce the problem?
I have attached a ZIP containing two files: one of them is in UTF8 and the other one in UTF16BE encoding (as indicated in the filenames).
With the example application from UTF8Tools/Demo/charenc/, when the UTF8 file (TestFileUTF8.txt) is opened and the UTF16BE file (TestFileUTF16BE.txt) is opened right after it, the application crashes.
I have tried it on two different computers with WinXP, probably with slightly different Lazarus versions.
The last test is done on Lazarus #1.1; Date: 2012-05-29; FPC 2.6.1; Svn 37447.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: How to stream.Read UTF16 files?
« Reply #7 on: June 25, 2012, 01:01:55 pm »
charencstreams calls a procedure called WideSwapEndian. This assumes the text is null terminated.

WideSwapEndian ideally needs another parameter for maximum length.

For now though you could just fake it by adding a Null.

eg. In unit1 of the charenc demo project change the btnOpenClick event to->

Code: [Select]
procedure TForm1.btnOpenClick(Sender: TObject);
var
  nullword:word = 0;
begin
  if OpenDialog1.Execute then
  begin
    if not cbForceType.Checked then fCES.Reset;
    fCES.LoadFromFile(OpenDialog1.FileName);
    fCES.Seek(0,soEnd);
    fCES.WriteBuffer(Nullword,sizeof(nullword));
    Memo1.text := fCES.UTF8Text;
    cbBOM.Checked := fCES.HasBOM;
    cbForceType.Checked := fCES.ForceType;
    UniEnc.ItemIndex := Ord(fCES.UniStreamType);
    ANSIEnc.Enabled := fCES.UniStreamType = ufANSI;
    if ANSIEnc.Enabled then ANSIEnc.ItemIndex := ANSIEnc.Items.IndexOf(fCES.ANSIEnc) else ANSIEnc.ItemIndex := 0;
  end;
end;
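
The padding step on its own can be sketched like this. It assumes nothing about charencstreams: a plain TMemoryStream stands in for fCES, and the sample bytes are made up. It only shows that Seek(0, soEnd) followed by WriteBuffer appends a 2-byte zero terminator after the existing data.

```pascal
program NullPadDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // 'A' and 'B' as UTF-16BE, deliberately without a terminator.
  Sample: array[0..3] of Byte = ($00, $41, $00, $42);

var
  MS: TMemoryStream;   // stand-in for the fCES stream in the demo project
  NullWord: Word = 0;
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Sample, SizeOf(Sample));
    WriteLn('Size before padding: ', MS.Size);

    // Move to the end and append two zero bytes, so that code scanning
    // for a 16-bit null stops inside the buffer.
    MS.Seek(0, soEnd);
    MS.WriteBuffer(NullWord, SizeOf(NullWord));
    WriteLn('Size after padding: ', MS.Size);
  finally
    MS.Free;
  end;
end.
```

The stream grows by SizeOf(NullWord) bytes, so anything that later uses the stream size sees the terminator as part of the data.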

paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #8 on: June 25, 2012, 01:35:12 pm »
Did you try if this solution works for you?

So what you do is add two lines:
Code: [Select]
    fCES.Seek(0,soEnd);
    fCES.WriteBuffer(Nullword,sizeof(nullword));
At first Lazarus did not find NullWord, so I replaced it with NullStr.
After running the app, it behaves the same way :(

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: How to stream.Read UTF16 files?
« Reply #9 on: June 25, 2012, 01:39:38 pm »
It's not NullStr,  it's NullWord.
And it's defined at the beginning of the procedure I posted.

Code: [Select]
var
  nullword:word = 0;

IOW: I'm adding two extra 0 bytes to the end of the stream to prevent the WideSwapEndian from overflowing.

Quote
Did you try if this solution works for you?

Yes..

paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #10 on: June 25, 2012, 02:05:47 pm »
Quote
It's not NullStr,  it's NullWord.
And it's defined at the beginning of the procedure I posted.

Code: [Select]
var
  nullword:word = 0;
Ooops, I did not see it.

Quote
IOW: I'm adding two extra 0 bytes to the end of the stream to prevent the WideSwapEndian from overflowing.


Do you mean that I should either
Code: [Select]
    fCES.WriteBuffer(NullWord,sizeof(NullWord));
    fCES.WriteBuffer(NullWord,sizeof(NullWord));
or
Code: [Select]
var
  nullword:Dword = 0;

Quote
Did you try if this solution works for you?

Yes..
Now it works for me, too. Thanks!

Anyway, won't these NullWords result in an extra string, which I should remove?
« Last Edit: June 25, 2012, 02:07:57 pm by paskal »

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: How to stream.Read UTF16 files?
« Reply #11 on: June 25, 2012, 02:14:46 pm »
Quote
Do you mean that I should either

As well as the var, you just need ->
Code: [Select]
    fCES.Seek(0,soEnd);
    fCES.WriteBuffer(Nullword,sizeof(nullword));
You don't need two, as NullWord is 2 bytes, and we only need two bytes to prevent WideSwapEndian from overflowing. But using a DWord (4 bytes) won't do any harm; you never know, your UTF16 file might be corrupt and have an odd size.

For the long term, ideally WideSwapEndian wants fixing, as adding 0 bytes to the end of the memory stream is a bit of a hack.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: How to stream.Read UTF16 files?
« Reply #12 on: June 25, 2012, 02:27:20 pm »
Quote
Anyway, won't these NullWords result in an extra string, which I should remove?

Possibly, that's why it's a bit of a hack.

For a proper fix try this in charencstreams ->
Change WideSwapEndian to ->
Code: [Select]
procedure WideSwapEndian(PWC: PWideChar;size:integer);
begin
  while size >= sizeof(widechar) do
  begin
    PWC^ := WideChar(SwapEndian(Word(PWC^)));
    inc(PWC);
    dec(size,sizeof(widechar));
  end;
end;

And in the two places where this is called, pass the size.
e.g.
Code: [Select]
function TUniStream.GetUTF8Text: AnsiString;
........
        if fUniStreamType = ufUtf16be then WideSwapEndian(PWC,size);
and
Code: [Select]
procedure TUniStream.SetUTF8Text(AString: AnsiString);   
.......
          if fUniStreamType = ufUtf16be then WideSwapEndian(@WS[1],size);
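
To try the size-bounded version outside charencstreams, here is a small stand-alone sketch. The procedure body is the one posted above; the test buffer is made up, has no null terminator, and has an odd length on purpose.

```pascal
program BoundedSwapDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Size-bounded swap as posted above: stops when fewer than two bytes remain.
procedure WideSwapEndian(PWC: PWideChar; Size: Integer);
begin
  while Size >= SizeOf(WideChar) do
  begin
    PWC^ := WideChar(SwapEndian(Word(PWC^)));
    Inc(PWC);
    Dec(Size, SizeOf(WideChar));
  end;
end;

var
  // 'A' and 'B' as UTF-16BE plus one stray byte; no terminator anywhere.
  Buf: array[0..4] of Byte = ($00, $41, $00, $42, $7F);
  I: Integer;
begin
  WideSwapEndian(PWideChar(@Buf[0]), SizeOf(Buf));

  // Print the buffer as hex to inspect the swapped byte pairs.
  for I := 0 to High(Buf) do
    Write(IntToHex(Buf[I], 2), ' ');
  WriteLn;
end.
```

Because the loop is driven by the size instead of a terminator, it never reads past the buffer, and the stray fifth byte is left untouched.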

paskal

  • Full Member
  • ***
  • Posts: 178
Re: How to stream.Read UTF16 files?
« Reply #13 on: June 26, 2012, 01:12:54 pm »
Okay, it seems to work fine now. :)
One more thing is unclear about this package: is it appropriate for use in GPL and LGPL applications?

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: How to stream.Read UTF16 files?
« Reply #14 on: June 26, 2012, 01:25:34 pm »
Quote
including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,

The term sublicense to me would indicate you can choose.
