Help: Setup segmentation rule for Chinese
Thread poster: Jessicaliu
Jessicaliu (Hong Kong; Chinese to English)
Jan 18, 2011

Hi! I find there is no segmentation rule for Chinese in OmegaT, so my source text is segmented by paragraph.

I tried to set up my own segmentation rule as follows.

Intention: break the segment after "。" (the Chinese period)
Before: 。
After:

But it's not working. All the sentences still stick together. Could anyone tell me what to do? Many thanks.


 
Didier Briel (France; English to French)
Did you check Break/Exception? Jan 18, 2011

Is the box Break/Exception in your rule checked?

Is your project set to sentence segmentation (check in Project Properties)?

Didier


 
Jessicaliu (Hong Kong; Chinese to English), topic starter
box checked (break) Jan 19, 2011

Thank you for your reply. I checked the box.

The version I use is 2.2.0_2.

I also tried several source texts and moved the rule up to the top of the list, but OmegaT does not seem to recognize my segmentation rule.


 
Didier Briel (France; English to French)
A simple test: set your source language to Japanese Jan 19, 2011

You can do a simple test: set your source language to Japanese (just temporarily for the test), as the end of sentence character is the same.

If it works, then there's something wrong in your rule (for instance, the end of sentence character is not the right one).

If it doesn't work, then there's another issue.

For instance, you did not answer my other question:
Is your project set to sentence segmentation (check in Project Properties)?

Didier
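
One way to rule out the "wrong character" possibility mentioned above is to look at the code point of the character pasted into the Before field. The sketch below is not part of OmegaT; the class name `PeriodCheck` and its helpers are made up for illustration. It assumes the intended character is U+3002 (IDEOGRAPHIC FULL STOP), the sentence-final period in both Chinese and Japanese text, and shows that the halfwidth look-alike U+FF61 is a different character.

```java
import java.util.ArrayList;
import java.util.List;

class PeriodCheck {
    // Assumed target: U+3002 IDEOGRAPHIC FULL STOP.
    static final int IDEOGRAPHIC_FULL_STOP = 0x3002;

    // Lists every code point of the input in U+XXXX notation.
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    // True only if the string consists of exactly one U+3002 character.
    static boolean isIdeographicFullStop(String s) {
        return s.codePointCount(0, s.length()) == 1
            && s.codePointAt(0) == IDEOGRAPHIC_FULL_STOP;
    }

    public static void main(String[] args) {
        String fromRule = "\u3002";   // replace with the character copied from the rule
        String lookalike = "\uFF61";  // halfwidth variant, easy to confuse by eye
        System.out.println(codePoints(fromRule) + " " + isIdeographicFullStop(fromRule));
        System.out.println(codePoints(lookalike) + " " + isIdeographicFullStop(lookalike));
    }
}
```

If the character taken from the rule does not report U+3002, retyping it with a Chinese input method is probably the quickest fix.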


 
Jessicaliu (Hong Kong; Chinese to English), topic starter
It works. Thank you a lot. Jan 19, 2011

Thank you, Didier.

I had checked sentence-level segmenting.

But your reply reminded me that I had forgotten to change the "language pattern" from the default to a specific Chinese language code that OmegaT is able to recognize. I changed it to ZH-HK, and it works!
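
For later readers, the fix can be pictured as follows. This is only a sketch under two assumptions that the thread does not spell out: that OmegaT treats the language pattern as a regular expression tested against the project's source language code, and that the test covers the whole code and ignores case. The class and method names are hypothetical. The second helper shows what a break rule with Before `。` and an empty After amounts to, using a lookbehind split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LanguagePatternSketch {
    // Assumption: the pattern must match the whole language code, ignoring case.
    static boolean applies(String languagePattern, String sourceLanguageCode) {
        return Pattern.compile(languagePattern, Pattern.CASE_INSENSITIVE)
                      .matcher(sourceLanguageCode)
                      .matches();
    }

    // Splits after every ideographic full stop (U+3002) and keeps it in the segment.
    static List<String> splitAfterPeriod(String text) {
        return Arrays.asList(text.split("(?<=\u3002)"));
    }

    public static void main(String[] args) {
        System.out.println(applies("ZH-HK", "ZH-HK")); // exact code
        System.out.println(applies("ZH.*", "ZH-HK"));  // any Chinese variant
        System.out.println(applies("JA.*", "ZH-HK"));  // Japanese rules do not apply
        System.out.println(splitAfterPeriod("A\u3002B\u3002"));
    }
}
```

A pattern such as `ZH.*` would cover every Chinese variant rather than Hong Kong only, provided the matching works as assumed here.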


 
Didier Briel (France; English to French)
Glad it works Jan 19, 2011

Thank you for the feedback.
It might help another user find the same issue in the future.

Didier


 
Neirda (China; Chinese to French)
Related question Mar 26, 2013

I am trying to set up some more segmentation rules for Chinese, and this is what I'd like to get:

No segmentation when a period 。 is followed by a closing quotation mark ” (the period being set up to trigger a break in all other instances).

Example: “我想此句不要分开。”译员说。 ("I don't want this sentence to be split," said the translator.) This shouldn't be segmented.
Under the current rules, it gets segmented as follows:
“我想此句不要分开。[segment]”译员说。[segment]

My setup:
Break/Exception: unchecked
Pattern Before: [。?!]
Pattern After: [’”"]

But it does not seem to work, even though all the other rules set up for this language work fine. Any clue?

Thank you!

[Edited at 2013-03-26 05:47 GMT]


 
Didier Briel (France; English to French)
Move your non-breaking rules to the top Mar 26, 2013

The only thing I can think of right now (unless you're not using the right quotation marks) is that perhaps your non-breaking rule is below the breaking rule.
You have to move your non-breaking rule above all breaking rules.
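
Why the order matters can be illustrated with a small model of rule evaluation. It assumes that, at each position in the text, the rules are tried from top to bottom and the first one whose Before and After patterns both match decides whether to break. This is an illustrative simplification, not OmegaT's actual code, and `RuleOrderSketch` and `Rule` are invented names. The input stands in for “我想此句不要分开。”译员说。 with Latin letters replacing the Chinese words and Unicode escapes for the punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RuleOrderSketch {
    // One row of the rule table: the Break/Exception checkbox plus the two patterns.
    static final class Rule {
        final boolean isBreak;
        final Pattern before;
        final Pattern after;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // Before must match at the very end of the text preceding the position.
            this.before = Pattern.compile("(?:" + before + ")\\z");
            this.after = Pattern.compile(after);
        }

        boolean matchesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).find()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    // Assumed evaluation model: the first matching rule in list order wins.
    static List<String> segment(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.matchesAt(text, pos)) {
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos));
                        start = pos;
                    }
                    break; // later rules are not consulted for this position
                }
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        // Exception: sentence-ending mark followed by a closing quotation mark.
        Rule exception = new Rule(false, "[\u3002?!]", "[\u2019\u201D\"]");
        // Break: sentence-ending mark followed by anything.
        Rule breakRule = new Rule(true, "[\u3002?!]", "");
        String text = "\u201CAB\u3002\u201DCD\u3002EF\u3002";

        System.out.println(segment(text, Arrays.asList(exception, breakRule)));
        System.out.println(segment(text, Arrays.asList(breakRule, exception)));
    }
}
```

With the exception rule first, the quoted sentence and its attribution stay in one segment. With the break rule first, the text is cut right after the period inside the quotation marks, which is the behaviour reported above.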

If it still doesn't work, I recommend asking the question in the Yahoo support group:
http://tech.groups.yahoo.com/group/OmegaT/
where there are knowledgeable people in your time zone (so you would get faster answers), and where you could express yourself in Chinese or French if needed.

Didier

[Edited at 2013-03-26 15:35 GMT]


 
Neirda (China; Chinese to French)
Look no more Mar 27, 2013

Yes, my non-breaking rule was placed under my breaking rule. I didn't know it mattered. Problem solved, thank you.

By the way, is there a place where we can consult a comprehensive list of the symbols used in writing segmentation rules?

I understood that [] means "any one of these characters", but what about + or {}? I couldn't get the hang of them, and the documentation doesn't mention them as far as I checked.

Anyway, I'll be sure to check out that Yahoo group. Thank you, Didier.


 
Didier Briel (France; English to French)
The documentation is a good starting point Mar 27, 2013

Chapter 16, "Regular expressions", of the documentation is a good starting point, although it doesn't cover everything. The same chapter gives a link to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html which contains a comprehensive list of all expressions used. For instance, for '+', look at 'Greedy quantifiers'.

For a beginner's approach to regular expressions, searching for 'regular expressions tutorial' in a search engine gives plenty of links. Note that OmegaT uses Java regular expressions (as mentioned above), whose syntax may vary slightly from other dialects.
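
As a quick illustration of the symbols asked about, the following sketch uses `java.util.regex` directly. The class name is hypothetical and the character class is the one from the rule quoted earlier in the thread. `[]` matches one character from a set, `+` repeats the preceding item one or more times, and `{n}` or `{n,m}` repeats it a counted number of times.

```java
import java.util.regex.Pattern;

class QuantifierSketch {
    // True if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String stop = "\u3002"; // ideographic full stop

        // [] : exactly one character out of the listed ones
        System.out.println(matches("[\u3002?!]", stop));          // true
        System.out.println(matches("[\u3002?!]", stop + stop));   // false, two characters

        // + : one or more of the preceding item
        System.out.println(matches("[\u3002?!]+", stop + "!?"));  // true

        // {n} and {n,m} : a counted number of repetitions
        System.out.println(matches("[\u3002?!]{2}", stop + stop)); // true
        System.out.println(matches("[\u3002?!]{2,3}", stop));      // false, only one
    }
}
```

In a segmentation rule, a Before pattern such as `[。?!]+` would therefore treat a run of closing marks as a single sentence end.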

Didier


 
Neirda (China; Chinese to French)
Very helpful Mar 29, 2013

Thank you, your link is actually helpful enough so I can improve Chinese segmentation a bit (I'll give you my feedback should you be interested in including it in a future release).

 
Weedy Tan (Taiwan; Chinese to English)
Chinese sentence segmentation rules Jan 7, 2014

Pierret Adrien wrote:

Thank you, your link is actually helpful enough so I can improve Chinese segmentation a bit (I'll give you my feedback should you be interested in including it in a future release).



Hi Pierret,

I am trying to write my own Chinese sentence segmentation rules in OmegaT and just read about your effort here.

Would it be too much to ask you to share the more comprehensive and detailed segmentation rules you have written so far?

Thanks in advance,

Weedy


 

